@nockieboy,
I'M GONNA MURDER YOU!!!!!!!
Ok, you are god damn lucky I still have a full Quartus 9.1 install, with its built-in high-speed logic simulator.
Let's begin.
Step 1. Simulate the 'gpu_dual_port_ram_INTEL.v'.
I made a project with only the ram with the .mif file like so:
I set up a simulation feeding the above inputs to see what the outputs would look like when the read address starts just before the ASCII text begins, where you have stored 0,1,2,3,4,5,...:
So far, the read of the data looks fine.
Step 2. Simulate the 'multiport_gpu_ram.v'.
I made the project feed the clock, ena, all 4 addresses, all 4 cmds, even the 'host' address & its read results. (I used a 484-pin Cyclone III to get the IOs.)
See in green the 'host' address wiring is taken from the second addr_in.
This is what the simulator gave me.
WTF? Addr ports 1 and 2 seem to read to data out properly, yet the read on port addr4's data out returns all 0s. Also, a weird thing: the 'HOST' data out, which is also fed addr2, reads the right data, BUT it suddenly goes 'UU' (undefined), then goes back to 0.
Without that 'HOST' data out showing the data being erased, I would never have found the bug.
In your code, you have:
// ****************************************************************************************************************************
// Dual-port GPU RAM
//
// Port A - read only by GPU
// Port B - read/writeable by host system
// Data buses - 8 bits / 1 byte wide
// Address buses - ADDR_SIZE wide (14 bits default)
// Memory word size - NUM_WORDS (16384 bytes default)
// ****************************************************************************************************************************
altsyncram altsyncram_component (
.clock0 (clk),
.wren_a (1'b1), // ************************ F--K ******************
.address_b (addr_b[ADDR_SIZE - 1:0]),
.clock1 (clk_b),
.data_b (data_in_b),
.wren_b (wr_en_b),
.address_a (addr_a[ADDR_SIZE - 1:0]),
.data_a (8'b00000000),
.q_a (data_out_a),
.q_b (data_out_b),
.aclr0 (1'b0),
.aclr1 (1'b0),
.addressstall_a (1'b0),
.addressstall_b (1'b0),
.byteena_a (1'b1),
.byteena_b (1'b1),
.clocken0 (1'b1),
.clocken1 (1'b1),
.clocken2 (1'b1),
.clocken3 (1'b1),
.eccstatus (),
.rden_a (1'b1),
.rden_b (1'b1));
defparam
.wren_a (1'b1), // ********** F--K *********** You forced the Write Enable on for the read-only port A!!!!!!!!
Every address we sent to be read was instead overwritten with all 0s, since data_a is tied to 8'b00000000.
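To make the failure mode concrete, here is a minimal behavioral sketch of what a port with its write enable stuck high does to the memory. It is NOT Intel's altsyncram: the module name port_a_model, the 16-byte depth, the fill pattern and the "read returns the old data during a write" behavior are all assumptions made for illustration.

```verilog
`timescale 1ns/1ps
// Hypothetical behavioral stand-in for port A (not the real altsyncram).
module port_a_model #(parameter ADDR_SIZE = 4) (
    input  wire                 clk,
    input  wire                 wren_a,
    input  wire [ADDR_SIZE-1:0] addr_a,
    input  wire [7:0]           data_a,
    output reg  [7:0]           q_a
);
    reg [7:0] mem [0:(1<<ADDR_SIZE)-1];
    integer i;

    // Stand-in for the .mif contents: byte at address n holds 8'h30 + n.
    initial
        for (i = 0; i < (1<<ADDR_SIZE); i = i + 1)
            mem[i] = 8'h30 + i;

    always @(posedge clk) begin
        q_a <= mem[addr_a];                 // assumed: read sees the old data
        if (wren_a) mem[addr_a] <= data_a;  // ...and the cell is overwritten
    end
endmodule

module tb;
    reg        clk    = 1'b0;
    reg        wren_a = 1'b1;   // the bug: write enable forced on
    reg  [3:0] addr_a = 4'd5;
    wire [7:0] q_a;

    port_a_model dut (
        .clk    (clk),
        .wren_a (wren_a),
        .addr_a (addr_a),
        .data_a (8'b00000000),  // same tie-off as in the original instance
        .q_a    (q_a)
    );

    always #5 clk = ~clk;

    initial begin
        @(posedge clk); #1 $display("first read  of addr 5: %h", q_a);
        @(posedge clk); #1 $display("second read of addr 5: %h", q_a);
        $finish;
    end
endmodule
```

In this model the first read still returns the stored byte, but the same clock edge wipes the cell, so the second read of the same address comes back as 00. Changing wren_a to 1'b0 in the testbench makes both reads return the stored byte, which is the tie-off the read-only port should have had.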
It gets worse. In your mux algorithm in the 'multiport_gpu_ram.v' module, you did this:
// perform 5:1 mux for all inputs to the dual-port RAM
case (pc_ena_in[2:0])
3'b000 : begin //******** Excellent, this is state 0 and you made the case 3'b000 which equals 0.
addr_in_mux <= addr_in_0;
cmd_mux_in <= cmd_in_0;
end
3'b001 : begin //******** Excellent, this is state 1 and you made the case 3'b001 which equals 1.
addr_in_mux <= addr_in_1;
cmd_mux_in <= cmd_in_1;
end
3'b011 : begin //******** Huh? What? This is state 2 and you made the case 3'b011 which equals 3?
addr_in_mux <= addr_in_2;
cmd_mux_in <= cmd_in_2;
end
3'b100 : begin //******** Huh? What? This is state 3 and you made the case 3'b100 which equals 4?
addr_in_mux <= addr_in_3;
cmd_mux_in <= cmd_in_3;
end
3'b101 : begin //******** Huh? What? This is state 4 and you made the case 3'b101 which equals 5?
addr_in_mux <= addr_in_4;
cmd_mux_in <= cmd_in_4;
end
endcase
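For reference, a sketch of the same mux with the selectors simply counting 0 to 4. The wrapper module name mux5_sketch, the clocked always block, the CMD_SIZE parameter and the bus widths are assumptions for illustration; only the signal names come from the code above. If pc_ena_in counts 0,1,2,3,4 and the case sits in a clocked block, the original encodings have no branch for state 2 (3'b010), so the mux just holds the previous address there, and the 3'b101 branch carrying addr_in_4 is never reached.

```verilog
// Hypothetical wrapper: widths and the clocked process are assumptions.
module mux5_sketch #(
    parameter ADDR_SIZE = 14,
    parameter CMD_SIZE  = 16
) (
    input  wire                 clk,
    input  wire [2:0]           pc_ena_in,
    input  wire [ADDR_SIZE-1:0] addr_in_0, addr_in_1, addr_in_2, addr_in_3, addr_in_4,
    input  wire [CMD_SIZE-1:0]  cmd_in_0, cmd_in_1, cmd_in_2, cmd_in_3, cmd_in_4,
    output reg  [ADDR_SIZE-1:0] addr_in_mux,
    output reg  [CMD_SIZE-1:0]  cmd_mux_in
);
    // 5:1 mux: selector value N picks input N, for N = 0..4.
    always @(posedge clk) begin
        case (pc_ena_in[2:0])
            3'b000 : begin
                addr_in_mux <= addr_in_0;
                cmd_mux_in  <= cmd_in_0;
            end
            3'b001 : begin
                addr_in_mux <= addr_in_1;
                cmd_mux_in  <= cmd_in_1;
            end
            3'b010 : begin
                addr_in_mux <= addr_in_2;
                cmd_mux_in  <= cmd_in_2;
            end
            3'b011 : begin
                addr_in_mux <= addr_in_3;
                cmd_mux_in  <= cmd_in_3;
            end
            3'b100 : begin
                addr_in_mux <= addr_in_4;
                cmd_mux_in  <= cmd_in_4;
            end
            default : ; // selectors 5..7 are unused: hold the previous outputs
        endcase
    end
endmodule
```

The explicit default branch changes nothing functionally in a clocked block, but it documents that selector values 5 to 7 are deliberately ignored.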
After these 2 fixes, the simulation looks like this:
This is a zoom out of the simulation. As you can see, I'm reading 5 different ports at 5 different addresses simultaneously, with all the outputs coming in parallel.
Now, for the last little bit. When performing the 'mux', the only addition I wanted to make was to 'snap' all the addr_in_# and cmd_in_# inputs at (pc_ena_in==0), then feed those latched results to the RAM. To do this, here is the simple addition I made to your 'case' statement in the 'mux'.
case (pc_ena_in[2:0])
3'b000 : begin
addr_in_mux <= addr_in_0; // Send the first, #0 addr & cmd to the memory module.
cmd_mux_in <= cmd_in_0;
addr_lat_1 <= addr_in_1; // latch all addr_in_# in parallel
addr_lat_2 <= addr_in_2;
addr_lat_3 <= addr_in_3;
addr_lat_4 <= addr_in_4;
cmd_lat_1 <= cmd_in_1; // latch all cmd_in_# in parallel
cmd_lat_2 <= cmd_in_2;
cmd_lat_3 <= cmd_in_3;
cmd_lat_4 <= cmd_in_4;
end
3'b001 : begin
addr_in_mux <= addr_lat_1; // Send the latched, #1 addr & cmd to the memory module.
cmd_mux_in <= cmd_lat_1;
end
3'b010 : begin
addr_in_mux <= addr_lat_2; // Send the latched, #2 addr & cmd to the memory module.
cmd_mux_in <= cmd_lat_2;
end
3'b011 : begin
addr_in_mux <= addr_lat_3; // Send the latched, #3 addr & cmd to the memory module.
cmd_mux_in <= cmd_lat_3;
end
3'b100 : begin
addr_in_mux <= addr_lat_4; // Send the latched, #4 addr & cmd to the memory module.
cmd_mux_in <= cmd_lat_4;
end
endcase
I've attached the latest Verilog source files. Setting up 2 different Quartus installs and preparing the simulation stimulus, plus finding that "Write Enable = 1", was a good 5 hours out of my day.
If you don't get a picture now, I will have no choice but to simulate your entire OSD project.