After instantiating "gpu_dual_port_ram_INTEL gpu_RAM(.....);", you need the following:
---------------------------
defparam
gpu_RAM.MAX_ADDR_BITS = MAX_ADDR_BITS ;
---------------------------
This passes the MAX_ADDR_BITS parameter of module multiport_gpu_ram into the MAX_ADDR_BITS parameter of gpu_dual_port_ram_INTEL. It may also be useful to pass 'altsyncram_component.numwords_a' and 'numwords_b', since the FPGA may have enough memory to allocate 24kb but not 32kb.
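The same override can also be written with Verilog-2001 named parameter assignment at the point of instantiation, instead of a separate defparam. The following is only a minimal sketch: ram_stub and parent_sketch are hypothetical stand-ins invented for illustration, not modules from this project.

```verilog
// Hypothetical stand-in for gpu_dual_port_ram_INTEL, reduced to its parameters.
module ram_stub #(
    parameter MAX_ADDR_BITS = 14,
    parameter WORDS         = 2 ** MAX_ADDR_BITS
) (
    input  wire                     clk,
    input  wire [MAX_ADDR_BITS-1:0] addr,
    output reg  [7:0]               q
);
    reg [7:0] mem [0:WORDS-1];

    // simple registered read so the stub has some behaviour
    always @(posedge clk) q <= mem[addr];
endmodule

// Hypothetical parent: forwards its own parameters to the child instance.
module parent_sketch #(
    parameter MAX_ADDR_BITS = 15,
    parameter WORDS         = 24576   // 24k words: needs 15 address bits, not a power of two
) (
    input  wire                     clk,
    input  wire [MAX_ADDR_BITS-1:0] addr,
    output wire [7:0]               q
);
    ram_stub #(
        .MAX_ADDR_BITS (MAX_ADDR_BITS),
        .WORDS         (WORDS)
    ) gpu_RAM (
        .clk  (clk),
        .addr (addr),
        .q    (q)
    );
endmodule
```

Both forms set the child's parameters at elaboration time. The named form keeps the override next to the instance it applies to, which can be easier to follow once several parameters are involved.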
Okay, stupid question - altsyncram specifies altsyncram_component.numwords_a (and b) - I had 2 ** MAXSIZE in there, but if that's the number of words, will I need to divide it by the word size (8)? Otherwise the RAM will (try to) be 8 times larger than what I think I'm specifying?
So, for example, this:
// define the memory size (number of words) - this allows RAM sizes other than powers of 2
// but defaults to power-of-two sizing based on MAX_ADDR_BITS if not otherwise specified
parameter WORDS = 2 ** MAX_ADDR_BITS;
..needs to be this:
// define the memory size (number of words) - this allows RAM sizes other than powers of 2
// but defaults to power-of-two sizing based on MAX_ADDR_BITS if not otherwise specified
parameter WORDS = (2 ** MAX_ADDR_BITS) / 8;
??
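For what it's worth, numwords_a and numwords_b count words of width_a and width_b bits each, so with 8-bit words every word is already one byte and the division by 8 should not be needed. The sketch below just spells out that arithmetic; the module and localparam names are illustrative only.

```verilog
// Sizing sketch: all names here are illustrative, not taken from the project.
module ram_size_sketch;
    localparam ADDR_BITS   = 14;                      // address bus width
    localparam WORD_WIDTH  = 8;                       // matches width_a / width_b = 8
    localparam NUM_WORDS   = 2 ** ADDR_BITS;          // value handed to numwords_a / numwords_b
    localparam TOTAL_BITS  = NUM_WORDS * WORD_WIDTH;  // block RAM bits consumed
    localparam TOTAL_BYTES = TOTAL_BITS / 8;          // equals NUM_WORDS when WORD_WIDTH is 8

    initial begin
        $display("words=%0d bits=%0d bytes=%0d", NUM_WORDS, TOTAL_BITS, TOTAL_BYTES);
    end
endmodule
```

With 14 address bits this works out to 16384 words of 8 bits, i.e. 16384 bytes. Dividing the word count by 8 would shrink the RAM to 2048 bytes while the address bus stayed 14 bits wide.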
// address pass-thru bus (output)
output reg [19:0] addr_out,
There are 5 of these, to match read address inputs 0 through 4.
// auxilliary read command buses (input)
input [7:0] aux_read_0,
input [7:0] aux_read_1,
input [7:0] aux_read_2,
input [7:0] aux_read_3,
input [7:0] aux_read_4,
change all these to cmd_in[15:0]. (global search and replace)
// auxilliary read command buses (pass-thru output)
output reg [7:0] auxRdPT_0,
output reg [7:0] auxRdPT_1,
output reg [7:0] auxRdPT_2,
output reg [7:0] auxRdPT_3,
output reg [7:0] auxRdPT_4,
change these to cmd_out[15:0]
reg [MAX_ADDR_BITS - 1:0] address_mux;
change to reg [19:0] address_mux;
reg [7:0] aux_read_mux;
change to reg [15:0] cmd_read_mux (global search and replace)
These should all be present and correct now... I think. Got a little confused earlier with all the changes, so I'll be double-checking it all, but I think it would benefit from a close look.
You're missing a few of the new ports for 'gpu_dual_port_ram_INTEL gpu_RAM(...);'
They should all be present and correct now.
Almost done. Next you will re-sort the RAM read contents and the piped-through address & cmds into their output registers, and sync those to your new delayed 'pc_ena_out[3:0]' coming out of the Intel RAM module.
Have made a bit of a start on this - the 5:1 mux code is modified according to my present understanding. The read address is passed through to the ram module, the pass-through address is passed out to the appropriate address bus according to the current mux step, as is the data read from memory.
I'm a little unsure about the command bus, though. It's piped into the memory via cmd_read_mux, but that seems like an unnecessary step as I only have one cmd_in bus (and one cmd_out bus) - should these be increased to 5 as well? It's possible I've misunderstood your instruction to 'change all these to cmd_in[15:0]'...
Note that we forgot to wire the 'pc_ena_out[3:0]' coming out of the Intel RAM module through to the multiport_gpu_ram ( ...) ports, so that the rest of our graphics pipe heading to the output pins will incorporate the delay shift generated by the memory. (Though we could work around this by re-syncing all the RAM outputs back to the next pc_ena_in==0 cycle, this ena signal is beginning to drive so much logic in the FPGA that it limits our FMAX, so this is an opportune point to D-clock pipe the signals for the second half of our graphics pipe.)
Okay, I think I understand - but pc_ena passes through the gpu_dual_port_ram_INTEL module via a register pipe, which will fulfil the need to D-clock the signal, right?
gpu_dual_port_ram_INTEL.v:
module gpu_dual_port_ram_INTEL (
// inputs
input clk,
input [3:0] pc_ena_in,
input clk_b,
input wr_en_b,
input [19:0] addr_a,
input [19:0] addr_b,
input [7:0] data_in_b,
input [15:0] cmd_in,
// registered outputs
output reg [19:0] addr_out_a,
output reg [3:0] pc_ena_out,
output reg [15:0] cmd_out,
// direct outputs
output wire [7:0] data_out_a,
output wire [7:0] data_out_b
);
// define the address width in bits
parameter ADDR_SIZE = 14; // **********************************************************
// define the memory size (number of words) - this allows RAM sizes other than powers of 2
// but defaults to power-of-two sizing based on ADDR_SIZE if not otherwise specified
parameter NUM_WORDS = 2 ** ADDR_SIZE; // **********************************************************
// define delay pipe registers
reg [19:0] rd_addr_pipe; // name matches its use in the always block below
reg [15:0] cmd_pipe;
reg [3:0] pc_ena_pipe;
// ****************************************************************************************************************************
// Dual-port GPU RAM
//
// Port A - read only by GPU
// Port B - read/writeable by host system
// Data buses - 8 bits / 1 byte wide
// Address buses - ADDR_SIZE wide (14 bits default)
// Memory size - NUM_WORDS (16384 bytes default)
// ****************************************************************************************************************************
altsyncram altsyncram_component (
.clock0 (clk),
.wren_a (1'b0), // port A is read-only, so its write enable is tied low
.address_b (addr_b[ADDR_SIZE-1:0]), // ***************************************************************
.clock1 (clk_b),
.data_b (data_in_b),
.wren_b (wr_en_b),
.address_a (addr_a[ADDR_SIZE-1:0]), // ****************************************************************************
.data_a (8'b00000000),
.q_a (data_out_a),
.q_b (data_out_b),
.aclr0 (1'b0),
.aclr1 (1'b0),
.addressstall_a (1'b0),
.addressstall_b (1'b0),
.byteena_a (1'b1),
.byteena_b (1'b1),
.clocken0 (1'b1),
.clocken1 (1'b1),
.clocken2 (1'b1),
.clocken3 (1'b1),
.eccstatus (),
.rden_a (1'b1),
.rden_b (1'b1));
defparam
altsyncram_component.address_reg_b = "CLOCK1",
altsyncram_component.clock_enable_input_a = "BYPASS",
altsyncram_component.clock_enable_input_b = "BYPASS",
altsyncram_component.clock_enable_output_a = "BYPASS",
altsyncram_component.clock_enable_output_b = "BYPASS",
altsyncram_component.indata_reg_b = "CLOCK1",
altsyncram_component.init_file = "../osd_mem.mif",
altsyncram_component.intended_device_family = "Cyclone IV E",
altsyncram_component.lpm_type = "altsyncram",
altsyncram_component.numwords_a = NUM_WORDS,
altsyncram_component.numwords_b = NUM_WORDS,
altsyncram_component.operation_mode = "BIDIR_DUAL_PORT",
altsyncram_component.outdata_aclr_a = "NONE",
altsyncram_component.outdata_aclr_b = "NONE",
altsyncram_component.outdata_reg_a = "CLOCK0",
altsyncram_component.outdata_reg_b = "CLOCK1",
altsyncram_component.power_up_uninitialized = "FALSE",
altsyncram_component.read_during_write_mode_port_a = "OLD_DATA",
altsyncram_component.read_during_write_mode_port_b = "OLD_DATA",
altsyncram_component.widthad_a = ADDR_SIZE, // ********************************************************************
altsyncram_component.widthad_b = ADDR_SIZE, // *********************************************************************
altsyncram_component.width_a = 8,
altsyncram_component.width_b = 8,
altsyncram_component.width_byteena_a = 1,
altsyncram_component.width_byteena_b = 1,
altsyncram_component.wrcontrol_wraddress_reg_b = "CLOCK1";
// ****************************************************************************************************************************
always @(posedge clk) begin
// **************************************************************************************************************************
// *** Create a serial pipe where the PIPE_DELAY parameter selects the pixel count delay for the xxx_in to the xxx_out ports
// **************************************************************************************************************************
rd_addr_pipe <= addr_a;
addr_out_a <= rd_addr_pipe;
cmd_pipe <= cmd_in;
cmd_out <= cmd_pipe;
pc_ena_pipe <= pc_ena_in;
pc_ena_out <= pc_ena_pipe;
// **************************************************************************************************************************
end
endmodule
multiport_gpu_ram.v:
module multiport_gpu_ram (
input clk, // Primary clk input (125 MHz)
input [3:0] pc_ena_in, // Pixel clock enable
input clk_b, // Host (Z80) clock input
input write_ena_b, // Host (Z80) write enable
// address buses (input)
input [19:0] address_0,
input [19:0] address_1,
input [19:0] address_2,
input [19:0] address_3,
input [19:0] address_4,
input [19:0] addr_host,
// auxilliary read command buses (input)
input [15:0] cmd_in,
// outputs
output wire [3:0] pc_ena_out,
// address pass-thru bus (output)
output reg [19:0] addr_passthru_0,
output reg [19:0] addr_passthru_1,
output reg [19:0] addr_passthru_2,
output reg [19:0] addr_passthru_3,
output reg [19:0] addr_passthru_4,
output reg [19:0] addr_host_passthru,
// auxilliary read command bus (pass-thru output)
output reg [15:0] cmd_out, // ************************************* NEED 5x cmd_out0/1/2/3/4 and we also need 5x cmd_in#
// data buses (output)
output reg [7:0] dataOUT_0,
output reg [7:0] dataOUT_1,
output reg [7:0] dataOUT_2,
output reg [7:0] dataOUT_3,
output reg [7:0] dataOUT_4,
output [7:0] data_host
);
// dual-port GPU RAM handler
// define the maximum address bits - effectively the RAM size
parameter ADDR_SIZE = 14; // *******************************************
parameter NUM_WORDS = 2 ** ADDR_SIZE ; // *******************************************
reg [19:0] address_mux;
reg [15:0] cmd_read_mux;
wire [19:0] addr_passthru_mux;
wire [7:0] data_mux;
// create a GPU RAM instance
gpu_dual_port_ram_INTEL gpu_RAM(
.clk(clk),
.pc_ena_in(pc_ena_in),
.clk_b(clk_b),
.wr_en_b(write_ena_b), // the host write enable port of this module is named write_ena_b
.addr_a(address_mux),
.addr_b(),
.data_in_b(),
.cmd_in(cmd_read_mux),
.addr_out_a(addr_passthru_mux),
.pc_ena_out(pc_ena_out),
.cmd_out(cmd_out),
.data_out_a(data_mux),
.data_out_b()
);
// pass ADDR_SIZE and NUM_WORDS into the gpu_RAM instance
defparam gpu_RAM.ADDR_SIZE = ADDR_SIZE, // *************************************************************************
gpu_RAM.NUM_WORDS = NUM_WORDS ; // ************** Actual word count
always @(posedge clk) begin
// route non-muxed pass-throughs
cmd_read_mux <= cmd_in;
// perform 5:1 mux for all inputs to the dual-port RAM
case (pc_ena_in[2:0]) // pc_ena counts 0 through 4, one slot per read port
3'b000 : begin
address_mux <= address_0;
addr_passthru_0 <= addr_passthru_mux;
dataOUT_0 <= data_mux;
end
3'b001 : begin
address_mux <= address_1;
addr_passthru_1 <= addr_passthru_mux;
dataOUT_1 <= data_mux;
end
3'b010 : begin
address_mux <= address_2;
addr_passthru_2 <= addr_passthru_mux;
dataOUT_2 <= data_mux;
end
3'b011 : begin
address_mux <= address_3;
addr_passthru_3 <= addr_passthru_mux;
dataOUT_3 <= data_mux;
end
3'b100 : begin
address_mux <= address_4;
addr_passthru_4 <= addr_passthru_mux;
dataOUT_4 <= data_mux;
end
endcase
end // always @clk
endmodule
The one command bus is inside the INTEL dual port RAM module. Just like the read addresses, it should be piped through in single-file fashion.
On the multiport GPU ram, there should be 5 groups going in, grouped with the 5 read addresses going in, and 5 grouped 16 bit cmd coming out, just like the 5 read datas, 5 read addresses, 5 cmd_outs, all in parallel...
Pipe it just like a read address and the auxiliary 16 bit cmd, delayed by two 125MHz clocks. The difference is when it comes back through the GPU multiport ram module, there it is not muxed, it's just wired through without delay.
Read all my ********************************************** markers in the two code listings above.
Next, re-assemble all the outputs of the INTEL dualport ram into 5 addresses, 5 datas, 5 cmds.
Helpful hint:
Since we want all 5 outputs to appear in parallel, each with the right contents, when the input (pc_ena[2:0] == 0)...
...and you have a bunch of delays through this module where you can easily lose count of clock cycles, especially if you need to make your mux take 2 or 3 clocks instead of 1 to help improve FMAX, make these localparams and I'll leave it up to you to figure out how to implement them:
localparam CLK_CYCLES_MUX = 1; // Adjust this parameter to the number of 'clk' cycles it takes to select 1 of 5 muxed inputs
localparam CLK_CYCLES_RAM = 2; // Adjust this figure to the number of clock cycles the DP_ram takes to retrieve a valid data from the read address in.
I'm not sure about this one. We will need to send this parameter back to the OSD generator so it knows how many pixels to delay the H&V ena and OSD ena to align the picture.
localparam CLK_CYCLES_PIXEL= ; // Adjust this figure to the number of PIXEL clock cycles it takes the demuxed output data to be ready.
module gpu_dual_port_ram_INTEL (
// inputs
input clk,
input [3:0] pc_ena_in,
input clk_b,
input wr_en_b,
input [19:0] addr_a,
input [19:0] addr_b,
input [7:0] data_in_b,
input [15:0] cmd_in,
// registered outputs
output reg [19:0] addr_out_a,
output reg [3:0] pc_ena_out,
output reg [15:0] cmd_out,
// direct outputs
output wire [7:0] data_out_a,
output wire [7:0] data_out_b
);
// define the maximum address bit
parameter ADDR_SIZE = 14;
// define the memory size (number of words) - this allows RAM sizes other than multiples of 2
// but defaults to power-of-two sizing based on ADDR_SIZE if not otherwise specified
parameter NUM_WORDS = 2 ** ADDR_SIZE;
// define delay pipe registers
reg [19:0] rd_addr_pipe; // name matches its use in the always block below
reg [15:0] cmd_pipe;
reg [3:0] pc_ena_pipe;
// ****************************************************************************************************************************
// Dual-port GPU RAM
//
// Port A - read only by GPU
// Port B - read/writeable by host system
// Data buses - 8 bits / 1 byte wide
// Address buses - ADDR_SIZE wide (14 bits default)
// Memory word size - NUM_WORDS (16384 bytes default)
// ****************************************************************************************************************************
altsyncram altsyncram_component (
.clock0 (clk),
.wren_a (1'b0), // port A is read-only, so its write enable is tied low
.address_b (addr_b[ADDR_SIZE-1:0]),
.clock1 (clk_b),
.data_b (data_in_b),
.wren_b (wr_en_b),
.address_a (addr_a[ADDR_SIZE-1:0]),
.data_a (8'b00000000),
.q_a (data_out_a),
.q_b (data_out_b),
.aclr0 (1'b0),
.aclr1 (1'b0),
.addressstall_a (1'b0),
.addressstall_b (1'b0),
.byteena_a (1'b1),
.byteena_b (1'b1),
.clocken0 (1'b1),
.clocken1 (1'b1),
.clocken2 (1'b1),
.clocken3 (1'b1),
.eccstatus (),
.rden_a (1'b1),
.rden_b (1'b1));
defparam
altsyncram_component.address_reg_b = "CLOCK1",
altsyncram_component.clock_enable_input_a = "BYPASS",
altsyncram_component.clock_enable_input_b = "BYPASS",
altsyncram_component.clock_enable_output_a = "BYPASS",
altsyncram_component.clock_enable_output_b = "BYPASS",
altsyncram_component.indata_reg_b = "CLOCK1",
altsyncram_component.init_file = "../osd_mem.mif",
altsyncram_component.intended_device_family = "Cyclone IV E",
altsyncram_component.lpm_type = "altsyncram",
altsyncram_component.numwords_a = NUM_WORDS,
altsyncram_component.numwords_b = NUM_WORDS,
altsyncram_component.operation_mode = "BIDIR_DUAL_PORT",
altsyncram_component.outdata_aclr_a = "NONE",
altsyncram_component.outdata_aclr_b = "NONE",
altsyncram_component.outdata_reg_a = "CLOCK0",
altsyncram_component.outdata_reg_b = "CLOCK1",
altsyncram_component.power_up_uninitialized = "FALSE",
altsyncram_component.read_during_write_mode_port_a = "OLD_DATA",
altsyncram_component.read_during_write_mode_port_b = "OLD_DATA",
altsyncram_component.widthad_a = ADDR_SIZE,
altsyncram_component.widthad_b = ADDR_SIZE,
altsyncram_component.width_a = 8,
altsyncram_component.width_b = 8,
altsyncram_component.width_byteena_a = 1,
altsyncram_component.width_byteena_b = 1,
altsyncram_component.wrcontrol_wraddress_reg_b = "CLOCK1";
// ****************************************************************************************************************************
always @(posedge clk) begin
// **************************************************************************************************************************
// *** Create a serial pipe where the PIPE_DELAY parameter selects the pixel count delay for the xxx_in to the xxx_out ports
// **************************************************************************************************************************
rd_addr_pipe <= addr_a;
addr_out_a <= rd_addr_pipe;
cmd_pipe <= cmd_in;
cmd_out <= cmd_pipe;
pc_ena_pipe <= pc_ena_in;
pc_ena_out <= pc_ena_pipe;
// **************************************************************************************************************************
end
endmodule
module multiport_gpu_ram (
input clk, // Primary clk input (125 MHz)
input [3:0] pc_ena_in, // Pixel clock enable
input clk_b, // Host (Z80) clock input
input write_ena_b, // Host (Z80) write enable
// address buses (input)
input [19:0] addr_in_0,
input [19:0] addr_in_1,
input [19:0] addr_in_2,
input [19:0] addr_in_3,
input [19:0] addr_in_4,
input [19:0] addr_host_in,
// auxilliary read command buses (input)
input [15:0] cmd_in_0,
input [15:0] cmd_in_1,
input [15:0] cmd_in_2,
input [15:0] cmd_in_3,
input [15:0] cmd_in_4,
// outputs
output wire [3:0] pc_ena_out,
// address pass-thru bus (output)
output reg [19:0] addr_out_0,
output reg [19:0] addr_out_1,
output reg [19:0] addr_out_2,
output reg [19:0] addr_out_3,
output reg [19:0] addr_out_4,
output reg [19:0] addr_host_out,
// auxilliary read command bus (pass-thru output)
output reg [15:0] cmd_out_0,
output reg [15:0] cmd_out_1,
output reg [15:0] cmd_out_2,
output reg [15:0] cmd_out_3,
output reg [15:0] cmd_out_4,
// data buses (output)
output reg [7:0] data_out_0,
output reg [7:0] data_out_1,
output reg [7:0] data_out_2,
output reg [7:0] data_out_3,
output reg [7:0] data_out_4,
output [7:0] data_host_out
);
// dual-port GPU RAM handler
// define the maximum address bits and number of words - effectively the RAM size
parameter ADDR_SIZE = 14;
parameter NUM_WORDS = 2 ** ADDR_SIZE;
localparam CLK_CYCLES_MUX = 1; // adjust this parameter to the number of 'clk' cycles it takes to select 1 of 5 muxed outputs
localparam CLK_CYCLES_RAM = 2; // adjust this figure to the number of clock cycles the DP_ram takes to retrieve valid data from the read address in
localparam CLK_CYCLES_PIX = 5; // adjust this figure to the number of PIXEL clock cycles it takes the demuxed output data to be ready
reg [19:0] addr_in_mux;
reg [15:0] cmd_mux_in;
wire [15:0] cmd_mux_out; // driven by the gpu_RAM instance, so it must be a net
wire [19:0] addr_out_mux;
wire [7:0] data_mux_out;
// create a GPU RAM instance
gpu_dual_port_ram_INTEL gpu_RAM(
.clk(clk),
.pc_ena_in(pc_ena_in),
.clk_b(clk_b),
.wr_en_b(write_ena_b), // the host write enable port of this module is named write_ena_b
.addr_a(addr_in_mux),
.addr_b(),
.data_in_b(),
.cmd_in(cmd_mux_in),
.addr_out_a(addr_out_mux),
.pc_ena_out(pc_ena_out),
.cmd_out(cmd_mux_out),
.data_out_a(data_mux_out),
.data_out_b()
);
defparam gpu_RAM.ADDR_SIZE = ADDR_SIZE, // pass ADDR_SIZE into the gpu_RAM instance
gpu_RAM.NUM_WORDS = NUM_WORDS; // set non-default word size for the RAM (16 KB)
always @(posedge clk) begin
// perform 5:1 mux for all inputs to the dual-port RAM
case (pc_ena_in[2:0]) // pc_ena counts 0 through 4, one slot per read port
3'b000 : begin
addr_in_mux <= addr_in_0;
cmd_mux_in <= cmd_in_0;
addr_out_0 <= addr_out_mux;
cmd_out_0 <= cmd_mux_out;
data_out_0 <= data_mux_out;
end
3'b001 : begin
addr_in_mux <= addr_in_1;
cmd_mux_in <= cmd_in_1;
addr_out_1 <= addr_out_mux;
cmd_out_1 <= cmd_mux_out;
data_out_1 <= data_mux_out;
end
3'b010 : begin
addr_in_mux <= addr_in_2;
cmd_mux_in <= cmd_in_2;
addr_out_2 <= addr_out_mux;
cmd_out_2 <= cmd_mux_out;
data_out_2 <= data_mux_out;
end
3'b011 : begin
addr_in_mux <= addr_in_3;
cmd_mux_in <= cmd_in_3;
addr_out_3 <= addr_out_mux;
cmd_out_3 <= cmd_mux_out;
data_out_3 <= data_mux_out;
end
3'b100 : begin
addr_in_mux <= addr_in_4;
cmd_mux_in <= cmd_in_4;
addr_out_4 <= addr_out_mux;
cmd_out_4 <= cmd_mux_out;
data_out_4 <= data_mux_out;
end
endcase
end // always @clk
endmodule
Ooookay... so all five outputs should be valid when pc_ena[2:0] == 0? At the moment, addr_out_0, cmd_out_0, data_out_0 will all be valid two or three clock cycles (at least) before addr_out_1, cmd_out_1 and data_out_1, etc. with the delays compounding up to the 5th set of outputs? Not to mention pc_ena needing to be delayed as well until the 5th outputs of the mux are ready?
Would it work to just route the results of the first 4 mux cycles into registers and then assign all 5 sets of results to the outputs at the end of the 5th mux cycle? Am I even understanding the issue?
Currently, the mux code is just putting the results onto the multiport outputs as soon as they come in.
Ok, here lies the trick/headache (at least until you figure out how to do it).
Yes, the way you have it written, the outputs are scrambled into the wrong demuxed ports.
Also, remember, I said I want all the demuxed outputs from the gpu_ram module to all become properly valid during the next valid pixel clock (pc_ena[3:0]==0) time slot, and, to hold their contents and all switch during the next valid pixel clock once again.
// declare registers to hold data until pc_ena[3:0] == 0 and
// it can be passed to the output IOs
reg [19:0] addr_buf_out_0,
addr_buf_out_1,
addr_buf_out_2,
addr_buf_out_3,
addr_buf_out_4;
reg [15:0] cmd_buf_out_0,
cmd_buf_out_1,
cmd_buf_out_2,
cmd_buf_out_3,
cmd_buf_out_4;
reg [7:0] data_buf_out_0,
data_buf_out_1,
data_buf_out_2,
data_buf_out_3,
data_buf_out_4;
always @(posedge clk) begin
// perform 5:1 mux for all inputs to the dual-port RAM
case (pc_ena[2:0]) // pc_ena counts 0 through 4, one slot per read port
3'b000 : begin
addr_in_mux <= addr_in_0;
cmd_mux_in <= cmd_in_0;
addr_buf_out_0 <= addr_mux_out;
cmd_buf_out_0 <= cmd_mux_out;
data_buf_out_0 <= data_mux_out;
end
3'b001 : begin
addr_in_mux <= addr_in_1;
cmd_mux_in <= cmd_in_1;
addr_buf_out_1 <= addr_mux_out;
cmd_buf_out_1 <= cmd_mux_out;
data_buf_out_1 <= data_mux_out;
end
3'b010 : begin
addr_in_mux <= addr_in_2;
cmd_mux_in <= cmd_in_2;
addr_buf_out_2 <= addr_mux_out;
cmd_buf_out_2 <= cmd_mux_out;
data_buf_out_2 <= data_mux_out;
end
3'b011 : begin
addr_in_mux <= addr_in_3;
cmd_mux_in <= cmd_in_3;
addr_buf_out_3 <= addr_mux_out;
cmd_buf_out_3 <= cmd_mux_out;
data_buf_out_3 <= data_mux_out;
end
3'b100 : begin
addr_in_mux <= addr_in_4;
cmd_mux_in <= cmd_in_4;
addr_buf_out_4 <= addr_mux_out;
cmd_buf_out_4 <= cmd_mux_out;
data_buf_out_4 <= data_mux_out;
end
endcase
if (pc_ena[3:0] == 0)
begin
addr_out_0 <= addr_buf_out_0;
cmd_out_0 <= cmd_buf_out_0;
data_out_0 <= data_buf_out_0;
addr_out_1 <= addr_buf_out_1;
cmd_out_1 <= cmd_buf_out_1;
data_out_1 <= data_buf_out_1;
addr_out_2 <= addr_buf_out_2;
cmd_out_2 <= cmd_buf_out_2;
data_out_2 <= data_buf_out_2;
addr_out_3 <= addr_buf_out_3;
cmd_out_3 <= cmd_buf_out_3;
data_out_3 <= data_buf_out_3;
addr_out_4 <= addr_buf_out_4;
cmd_out_4 <= cmd_buf_out_4;
data_out_4 <= data_buf_out_4;
end
end // always @clk
Here is hint #1:
------------------------------------
localparam CLK_CYCLES_MUX = 1; // Adjust this parameter to the number of 'clk' cycles it takes to select 1 of 5 muxed inputs
localparam CLK_CYCLES_RAM = 2; // Adjust this figure to the number of clock cycles the DP_ram takes to retrieve a valid data from the read address in.
-------------------------------------
Additional variable-sized pipe registers will be used to generate the desired output.
Here is hint #2 (make a bus version of this single bit/wire example):
-----------------------
bla_pipe[0] <= bla_in;
bla_pipe[7:1] <= bla_pipe[6:0];
out <= bla_pipe[PIPE_DELAY-2];
-----------------------------------
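One way to read hint #2 as a bus is a flat shift register of DEPTH words with a selectable tap. This is only a sketch under assumptions: the module name bus_delay_tap and its parameters WIDTH, DEPTH and TAP are made up for illustration, DEPTH must be at least 2, and TAP must be less than DEPTH.

```verilog
// Hypothetical helper: delays a WIDTH-bit bus and exposes one selectable tap.
module bus_delay_tap #(
    parameter WIDTH = 8,    // bits per word (8 for data, 16 for cmd, 20 for addr)
    parameter DEPTH = 10,   // number of words held in the pipe (must be >= 2)
    parameter TAP   = 3     // which word (0 = newest) drives the output (must be < DEPTH)
) (
    input  wire             clk,
    input  wire [WIDTH-1:0] bus_in,
    output reg  [WIDTH-1:0] bus_out
);
    // DEPTH words packed into one flat register, word 0 in the low bits
    reg [DEPTH*WIDTH-1:0] pipe;

    always @(posedge clk) begin
        // shift every word up by one position and load the newest word into word 0;
        // the topmost word is dropped
        pipe    <= {pipe[(DEPTH-1)*WIDTH-1:0], bus_in};
        // registered tap: bus_out follows bus_in by TAP + 2 clocks
        bus_out <= pipe[TAP*WIDTH +: WIDTH];
    end
endmodule
```

The indexed part-select `pipe[TAP*WIDTH +: WIDTH]` picks WIDTH bits starting at bit TAP*WIDTH, so changing the delay means changing one parameter rather than rewriting bit ranges by hand.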
Ok, not quite. Your outputs would still be scrambled. Not only that, but now some of your outputs will be ahead by 1 pixel while the rest have the correct pixel.
What you have set up has no configuration/adjustment capabilities.
Currently, at pc_ena 0, you may be switching to addr0, but the memory is seeing the previous addr4, while what is coming out of the RAM is the previous addr2. But you are currently snapping that into addr_buf0/data_buf0, and then feeding the previous holdings of every addr_buf#/data_buf# into addr_out#/data_out#.
At pc_ena 1, you may be switching to addr1, but the memory is seeing the previous addr0, while what is coming out of the RAM is the previous addr3. You are currently snapping that into addr_buf1/data_buf1, though the addr_out#/data_out# all hold the new state they were given during the pc_ena 0 phase.
Now you can see the headache here. Say you weed everything out and set the delays fixed, and your code appears to work well. What happens when, after adding some features, you need to expand your mux for speed by making it collapse in 3 clock stages instead of the current 1 clock? What will sorting out the mess look like?

Step 1: Get rid of any and all demuxing in the case statement.
always @(posedge clk) begin
// perform 5:1 mux for all inputs to the dual-port RAM
case (pc_ena[2:0]) // pc_ena counts 0 through 4, one slot per read port
3'b000 : begin
addr_in_mux <= addr_in_0;
cmd_mux_in <= cmd_in_0;
end
3'b001 : begin
addr_in_mux <= addr_in_1;
cmd_mux_in <= cmd_in_1;
end
3'b010 : begin
addr_in_mux <= addr_in_2;
cmd_mux_in <= cmd_in_2;
end
3'b011 : begin
addr_in_mux <= addr_in_3;
cmd_mux_in <= cmd_in_3;
end
3'b100 : begin
addr_in_mux <= addr_in_4;
cmd_mux_in <= cmd_in_4;
end
endcase
end // always @clk
Step 2: Make an easily variable clock delay pipe tap coming out of the gpu_ram.
Example:
reg [9*8+7:0] data_pipe; // make a large enough register to store 10 words (words 0 through 9). In the case of ram data, the width of each word is 8 bits.
Now, inside the always @(posedge clk) place
data_pipe[7:0] <= data_mux_out[7:0]; // fill the first 8 bit word in the register pipe
data_pipe[9*8+7:1*8] <= data_pipe[8*8+7:0*8]; // shift over the next 9 words in this 10 word 8 bit wide pipe
if (pc_ena[3:0] == 0)
begin
data_out_0 <= data_pipe[MUX_0_POS*8+7:MUX_0_POS*8];
data_out_1 <= data_pipe[MUX_1_POS*8+7:MUX_1_POS*8];
data_out_2 <= data_pipe[MUX_2_POS*8+7:MUX_2_POS*8];
data_out_3 <= data_pipe[MUX_3_POS*8+7:MUX_3_POS*8];
data_out_4 <= data_pipe[MUX_4_POS*8+7:MUX_4_POS*8];
....
end
Now, the key is to work out the values for the 5 'MUX_#_POS' localparams, which are fixed values based on the delays incurred by both CLK_CYCLES_MUX = 1 and CLK_CYCLES_RAM = 2, and, the tricky part, on where each piece of data is in the pipe when the next pc_ena==0 comes around, so that all 5 output regs take the correct pipe position.
localparam MUX_0_POS = (CLK_CYCLES_RAM + CLK_CYCLES_MUX) * 1;
localparam MUX_1_POS = (CLK_CYCLES_RAM + CLK_CYCLES_MUX) * 2;
localparam MUX_2_POS = (CLK_CYCLES_RAM + CLK_CYCLES_MUX) * 3;
localparam MUX_3_POS = (CLK_CYCLES_RAM + CLK_CYCLES_MUX) * 4;
localparam MUX_4_POS = (CLK_CYCLES_RAM + CLK_CYCLES_MUX) * 5;
always @(posedge clk) begin
data_pipe[7:0] <= data_mux_out[7:0]; // fill the first 8-bit word in the register pipe with data from RAM
data_pipe[9*8+7:1*8] <= data_pipe[8*8+7:0*8]; // shift over the next 9 words in this 10 word, 8-bit wide pipe
if (pc_ena[3:0] == 0)
begin
data_out_0 <= data_pipe[MUX_0_POS*8+7:MUX_0_POS*8];
data_out_1 <= data_pipe[MUX_1_POS*8+7:MUX_1_POS*8];
data_out_2 <= data_pipe[MUX_2_POS*8+7:MUX_2_POS*8];
data_out_3 <= data_pipe[MUX_3_POS*8+7:MUX_3_POS*8];
data_out_4 <= data_pipe[MUX_4_POS*8+7:MUX_4_POS*8];
end
end // always @clk
Resolve MUX_#_POS using the 2 CLK_CYCLES parameters, with the knowledge that pc_ena[] counts from 0-4. In the future, if you make any changes which add additional clocks in the preparation for the memory data, or use different clock cycles in the FPGA memory core, or wire up external static memory, just adjusting the 'CLK_CYCLES' parameters will allow your design to continue to function without re-doing all the individually written hardwired logic.
Calculating the MUX_#_POS may also mean adding a full additional pixel delay if one or more pixels come in earlier than the rest, or if 1 pixel comes in later than the rest.
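One possible way to derive the five positions from the two CLK_CYCLES parameters is sketched below. It rests on an assumed timing model that is not confirmed anywhere in this thread: port k is selected in the pc_ena == k slot, and its word reaches pipe word 0 after CLK_CYCLES_MUX + CLK_CYCLES_RAM + 1 clocks, the extra 1 being the register that loads word 0. The names PORTS, FILL, LATCH and mux_pos_sketch are invented for this sketch.

```verilog
// Sketch only: derives tap positions from the delay parameters under an assumed timing model.
module mux_pos_sketch;
    localparam PORTS          = 5;  // pc_ena slots 0..4, one per read port
    localparam CLK_CYCLES_MUX = 1;  // clocks spent selecting 1 of 5 inputs
    localparam CLK_CYCLES_RAM = 2;  // clocks from read address in to valid data out

    // ASSUMPTION: clocks from the pc_ena == k slot until port k's word sits in pipe word 0
    // (mux delay + RAM delay + the register that loads pipe word 0).
    localparam FILL = CLK_CYCLES_MUX + CLK_CYCLES_RAM + 1;

    // First pc_ena == 0 slot (a multiple of PORTS, counted from the slot where port 0
    // was selected) by which the last port's word has entered the pipe.
    localparam LATCH = ((PORTS - 1 + FILL + PORTS - 1) / PORTS) * PORTS;

    // At clock LATCH, port k's word has moved (LATCH - FILL - k) words up the pipe.
    localparam MUX_0_POS = LATCH - FILL - 0;
    localparam MUX_1_POS = LATCH - FILL - 1;
    localparam MUX_2_POS = LATCH - FILL - 2;
    localparam MUX_3_POS = LATCH - FILL - 3;
    localparam MUX_4_POS = LATCH - FILL - 4;

    initial begin
        $display("MUX_POS 0..4 = %0d %0d %0d %0d %0d",
                 MUX_0_POS, MUX_1_POS, MUX_2_POS, MUX_3_POS, MUX_4_POS);
    end
endmodule
```

Under this particular counting, the defaults give positions 6, 5, 4, 3 and 2. Counting the register stages differently shifts all five values by the same amount, so the result should be checked against a simulation of the real RAM before it is relied on.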
Yeah, I'm going to have to walk away and come back to this later in the hope that I can get my head around it from a different angle, because this just isn't going in.
Ok, this should really help, however, you will still need to concentrate a little and print out my little 2 tables on some paper to make what's happening clear.

Slide paper #2 vertically across paper #1, 5 steps at a time, (paper 2 starts at the bottom, then you slide it vertically upwards) to see where each data is in your output pipe.
Remember, the numbers in red which all have the same pixel number in green are your 5 correct functional MUX_#_POS. (Use the smallest possible valid pipe.) And every time you shift the paper vertically 5 steps, the next 5 pixels should line up to the same 5 MUX_#_POS.
Subtracting the pixel number at the reg output from the bottom line on page 1, plus 1, gives you the total number of 25MHz screen pixels it takes from address# in on your gpu multiport RAM to data out#.

I hope this helps. Note that this is the most difficult part of your project. There may be a few simpler tricks to do this, but, this offers maximum flexibility and upgrade paths as the 5 ports can address anything and all results are given in parallel.

Remember, the 5 'MUX_#_POS' are 5 localparams you will set.
The 5 localparams are identical for the addr_out# & cmd_out# pipe selection. It's only the *8 which will change to *20 for the address_pipe reg and *16 for the cmd_pipe reg.
module multiport_gpu_ram (
input clk, // Primary clk input (125 MHz)
input [3:0] pc_ena_in, // Pixel clock enable
input clk_b, // Host (Z80) clock input
input write_ena_b, // Host (Z80) write enable
// address buses (input)
input [19:0] addr_in_0,
input [19:0] addr_in_1,
input [19:0] addr_in_2,
input [19:0] addr_in_3,
input [19:0] addr_in_4,
input [19:0] addr_host_in,
// auxilliary read command buses (input)
input [15:0] cmd_in_0,
input [15:0] cmd_in_1,
input [15:0] cmd_in_2,
input [15:0] cmd_in_3,
input [15:0] cmd_in_4,
// outputs
output wire [3:0] pc_ena_out,
// address pass-thru bus (output)
output reg [19:0] addr_out_0,
output reg [19:0] addr_out_1,
output reg [19:0] addr_out_2,
output reg [19:0] addr_out_3,
output reg [19:0] addr_out_4,
output reg [19:0] addr_host_out,
// auxilliary read command bus (pass-thru output)
output reg [15:0] cmd_out_0,
output reg [15:0] cmd_out_1,
output reg [15:0] cmd_out_2,
output reg [15:0] cmd_out_3,
output reg [15:0] cmd_out_4,
// data buses (output)
output reg [7:0] data_out_0,
output reg [7:0] data_out_1,
output reg [7:0] data_out_2,
output reg [7:0] data_out_3,
output reg [7:0] data_out_4,
output [7:0] data_host_out
);
// dual-port GPU RAM handler
// define the maximum address bits and number of words - effectively the RAM size
parameter ADDR_SIZE = 14;
parameter NUM_WORDS = 2 ** ADDR_SIZE;
localparam CLK_CYCLES_MUX = 1; // adjust this parameter to the number of 'clk' cycles it takes to select 1 of 5 muxed outputs
localparam CLK_CYCLES_RAM = 2; // adjust this figure to the number of clock cycles the DP_ram takes to retrieve valid data from the read address in
localparam CLK_CYCLES_PIX = 5; // adjust this figure to the number of PIXEL clock cycles it takes the demuxed output data to be ready
reg [19:0] addr_in_mux;
reg [15:0] cmd_mux_in;
wire [15:0] cmd_mux_out; // driven by the gpu_RAM instance, so it must be a net
wire [19:0] addr_mux_out;
wire [7:0] data_mux_out;
// create a GPU RAM instance
gpu_dual_port_ram_INTEL gpu_RAM(
.clk(clk),
.pc_ena_in(pc_ena_in),
.clk_b(clk_b),
.wr_en_b(write_ena_b), // the host write enable port of this module is named write_ena_b
.addr_a(addr_in_mux),
.addr_b(),
.data_in_b(),
.cmd_in(cmd_mux_in),
.addr_out_a(addr_mux_out),
.pc_ena_out(pc_ena_out),
.cmd_out(cmd_mux_out),
.data_out_a(data_mux_out),
.data_out_b()
);
defparam gpu_RAM.ADDR_SIZE = ADDR_SIZE, // pass ADDR_SIZE into the gpu_RAM instance
gpu_RAM.NUM_WORDS = NUM_WORDS; // set non-default word size for the RAM (16 KB)
reg [9*8+7:0] data_pipe;
reg [9*20+19:0] addr_pipe;
reg [9*16+15:0] cmd_pipe;
localparam MUX_0_POS = 5;
localparam MUX_1_POS = 4;
localparam MUX_2_POS = 3;
localparam MUX_3_POS = 2;
localparam MUX_4_POS = 1;
always @(posedge clk) begin
data_pipe[7:0] <= data_mux_out[7:0]; // fill the first 8-bit word in the register pipe with data from RAM
data_pipe[9*8+7:1*8] <= data_pipe[8*8+7:0*8]; // shift over the next 9 words in this 10 word, 8-bit wide pipe
// this moves the data up one word at a time, dropping the top most 8 bits
addr_pipe[19:0] <= addr_mux_out;
addr_pipe[9*20+19:1*20] <= addr_pipe[8*20+19:0*20];
cmd_pipe[15:0] <= cmd_mux_out[15:0];
cmd_pipe[9*16+15:1*16] <= cmd_pipe[8*16+15:0*16];
if (pc_ena[3:0] == 0)
begin
addr_out_0 <= addr_pipe[MUX_0_POS*20+19:MUX_0_POS*20];
addr_out_1 <= addr_pipe[MUX_1_POS*20+19:MUX_1_POS*20];
addr_out_2 <= addr_pipe[MUX_2_POS*20+19:MUX_2_POS*20];
addr_out_3 <= addr_pipe[MUX_3_POS*20+19:MUX_3_POS*20];
addr_out_4 <= addr_pipe[MUX_4_POS*20+19:MUX_4_POS*20];
cmd_out_0 <= cmd_pipe[MUX_0_POS*16+15:MUX_0_POS*16];
cmd_out_1 <= cmd_pipe[MUX_1_POS*16+15:MUX_1_POS*16];
cmd_out_2 <= cmd_pipe[MUX_2_POS*16+15:MUX_2_POS*16];
cmd_out_3 <= cmd_pipe[MUX_3_POS*16+15:MUX_3_POS*16];
cmd_out_4 <= cmd_pipe[MUX_4_POS*16+15:MUX_4_POS*16];
data_out_0 <= data_pipe[MUX_0_POS*8+7:MUX_0_POS*8];
data_out_1 <= data_pipe[MUX_1_POS*8+7:MUX_1_POS*8];
data_out_2 <= data_pipe[MUX_2_POS*8+7:MUX_2_POS*8];
data_out_3 <= data_pipe[MUX_3_POS*8+7:MUX_3_POS*8];
data_out_4 <= data_pipe[MUX_4_POS*8+7:MUX_4_POS*8];
end
// perform 5:1 mux for all inputs to the dual-port RAM
case (pc_ena[2:0])
3'b000 : begin
addr_in_mux <= addr_in_0;
cmd_mux_in <= cmd_in_0;
end
3'b001 : begin
addr_in_mux <= addr_in_1;
cmd_mux_in <= cmd_in_1;
end
3'b010 : begin
addr_in_mux <= addr_in_2;
cmd_mux_in <= cmd_in_2;
end
3'b011 : begin
addr_in_mux <= addr_in_3;
cmd_mux_in <= cmd_in_3;
end
3'b100 : begin
addr_in_mux <= addr_in_4;
cmd_mux_in <= cmd_in_4;
end
endcase
end // always @clk
endmodule
parameter PIPE_DELAY = 4; // This parameter selects the number of pixel clocks which the output VDE and syncs are delayed
// The disp_x is the X coordinate counter. It runs from 0 to 512 and stops there
// The disp_y is the Y coordinate counter. It runs from 0 to 256 and stops there
assign disp_pos[4:0] = disp_x[8:4] ; // The disp_pos[4:0] is the lower address for the 32 characters wide display ascii text.
assign disp_pos[8:5] = disp_y[7:4] ; // the disp_pos[8:5] is the upper address for the 16 lines of text
// The result from the ascii memory component 'altsyncram_component_osd_mem' is called letter[7:0]
// Since disp_pos[8:0] has entered the read address, it takes 2 pixel clock cycles for the resulting letter[7:0] to come out.
// Now, font_pos[12:0] is the read address for the memory block containing the font memory
assign font_pos[12:6] = letter[6:0] ; // Select the upper font address with the 7 bit letter, note the atari font has only 128 characters.
assign font_pos[2:0] = dly2_disp_x[3:1] ; // select the font X coordinate with a 2 pixel clock DELAYED disp_x address. [3:1] is used so that every 2 x pixels are repeats
assign font_pos[5:3] = dly2_disp_y[3:1] ; // select the font y coordinate with a 2 pixel clock DELAYED disp_y address. [3:1] is used so that every 2 y lines are repeats
// The resulting font image, 2 bits since I made a 2 bit color atari font, is assigned to the OSD[1:0] output
// Also, since there is an 8th bit in the ascii text memory, I use that as a third OSD output color bit.
assign osd_image[1:0] = osd_img[1:0];
assign osd_image[2] = dly2_letter[7]; // Remember, it takes 2 pixel clocks for osd_img[1:0] data to be valid from read address letter[6:0]
// **********************************************************************************************
// AND
// **********************************************************************************************
osd_ena_out <= dly2_dena; // This is used to drive a graphics A/B switch which tells when the OSD graphics should be shown
// It needs to be delayed by the number of pixel clocks required for the above memories
Currently, in your code, your first mux takes 1 clock and the INTEL altsyncram megafunction takes 2 clocks:
----------------------------
into data_pipe[0*8+7:0*8]
inside ram 2        <- 2 clock cycles here for INTEL's altsyncram function.
inside ram 1
addr to addr-mux    <- 1 clock cycle here for your current MUX code.
PC_ENA pos0
-------------------------------
Now, also knowing that PC_ENA has 5 positions per pixel, and using those 3 reference 'CLK_CYCLES_xxxx' parameters, which describe the number of clocks at each step in your delay pipe, write a formula which fills in all 5 'localparam MUX_#_POS' numbers.
localparam CLK_CYCLES_MUX = 1; // adjust this parameter to the number of 'clk' cycles it takes to select 1 of 5 muxed outputs
localparam CLK_CYCLES_RAM = 2; // adjust this figure to the number of clock cycles the DP_ram takes to retrieve valid data from the read address in
localparam CLK_CYCLES_PIX = 5; // adjust this figure to the number of PIXEL clock cycles it takes the demuxed output data to be ready
localparam MUX_0_POS = 6; // pixel offset positions in their respective synchronisation
localparam MUX_1_POS = 5; // pipelines (where the pixels will be found in the pipeline
localparam MUX_2_POS = 4; // when pc_ena[3:0]==0).
localparam MUX_3_POS = 3; //
localparam MUX_4_POS = 2; //MUX_POS = 2*CLK_CYCLES_MUX + 2*CLK_CYCLES_RAM + 2*CLK_CYCLES_PIX - (POS + 10);
Also, you should realize trying to properly unmux the data stream coming out of the ram could never be done properly any other way without pure luck. And with luck, if you ever had to increase a number of clock steps in for example the addr-mux stage, or, the FPGA's altsyncram dual port ram function, everything would fall apart and luck again would be needed to hope you get the right output all in parallel.
Ok, next....
You will need to create a 16 kilobyte "gpu_ram_init.mif" file. This file should contain my Atari font beginning at byte 16'h0800 and the test memory text file beginning at byte 16'h1000.
Next, remove my 2 old altsyncrams from the OSD generator and wire in your new "multiport_gpu_ram.v".
You will need to cross-wire my old addresses into the new 20-bit address, hard-wiring the upper address bits to the 2 bases 16'h0800 & 16'h1000.


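wire [12:0] disp_addr;
assign disp_addr[12] = 1'b1; // set bit 12 to 1 to start the address at 0x1000
assign disp_addr[11:9] = 3'b000; // unused bits between the base and the display position
assign disp_addr[8:0] = disp_pos[8:0]; // map display position address into new memory address
wire [11:0] font_addr;
assign font_addr[11] = 1'b1; // set bit 11 to 1 to start the address at 0x800
assign font_addr[10] = 1'b0; // unused bit between the base and the font position
assign font_addr[9:0] = font_pos[9:0]; // map font position address into new memory address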

// ****************************************************************************************************************************
// create a multiport GPU RAM handler instance
// ****************************************************************************************************************************
multiport_gpu_ram gpu_RAM(
.clk(clk),
.pc_ena_in(pc_ena),
.clk_b(),
.write_ena_b(),
.addr_in_0(),
.addr_in_1(),
.addr_in_2(),
.addr_in_3(),
.addr_in_4(),
.addr_host_in(),
.cmd_in_0(),
.cmd_in_1(),
.cmd_in_2(),
.cmd_in_3(),
.cmd_in_4(),
.pc_ena_out(),
.addr_out_0(),
.addr_out_1(),
.addr_out_2(),
.addr_out_3(),
.addr_out_4(),
.addr_host_out(),
.cmd_out_0(),
.cmd_out_1(),
.cmd_out_2(),
.cmd_out_3(),
.cmd_out_4(),
.data_out_0(),
.data_out_1(),
.data_out_2(),
.data_out_3(),
.data_out_4(),
.data_host_out()
);
Also, in my old code, each read took 2 pixel clocks; now each new RAM read takes 3 pixel clocks. (I might even have counted wrong here, but I'm pretty sure it's 3...)
This means touching up inside the OSD:
assign osd_image[1:0] = osd_img[1:0];
assign osd_image[2] = dly2_letter[7]; // Remember, it takes 2 pixel clocks for osd_img[1:0] data to be valid from read address letter[6:0]
// **********************************************************************************************
// AND
// **********************************************************************************************
osd_ena_out <= dly2_dena; // This is used to drive a graphics A/B switch which tells when the OSD graphics should be shown
// It needs to be delayed by the number of pixel clocks required for the above memories
You should be able to recreate your last OSD image perfectly. If so, congratulations, you passed the 50% mark of completing version 1.0!.

If you have time, you should now think about getting 12bit color working as this is coming up next.
Ah okay - will see what I can do.
The fun part is coming... You will soon need to think about videos, not just pictures, but you will need to fill the memory_init file with data for things to happen, and you will need to get a Z80 connected with software to achieve anything interesting.
Looking forward to getting the host connected and getting a working video console. That'll be a big win for me.
To get the MUX_x_POS values from the CLK_CYCLES_xxx parameters, I've come up with this:
MUX_POS = 2*CLK_CYCLES_MUX + 2*CLK_CYCLES_RAM + 2*CLK_CYCLES_PIX - (POS + 10);
It's not elegant, but seems to do the job and works with parameter changes as far as I can tell. I've had to add another parameter - POS - which replaces the _x_ in MUX_x_POS... And I'm not 100% happy with the constant '10' being in there... so I guess what I'm trying to say is that I'm not 100% happy with the formula...
I know Verilog isn't a programming language, but is there any chance we can replace all those MUX_x_POS parameters with a single array of values and access them using POS as the index?
parameter PIXEL_PIPE = 3; // This externally set parameter defines the number of 25MHz pixels it takes to receive a new pixel from a presented address
localparam CLK_CYCLES_MUX = 1; // adjust this parameter to the number of 'clk' cycles it takes to select 1 of 5 muxed outputs
localparam CLK_CYCLES_RAM = 2; // adjust this figure to the number of clock cycles the DP_ram takes to retrieve valid data from the read address in
localparam CLK_CYCLES_PIX = 5; // adjust this figure to the number of 125MHz clocks there are for each pixel, IE number of muxed inputs for each pixel
// This parameter begins with the wanted top number of 125Mhz pixel clock headroom for the pixel pipe, then subtracts the additional 125MHz clocks used by the _MUX and _RAM cycles used to arrive at the first pixel out, DEMUX_PIPE_TOP position.
localparam DEMUX_PIPE_TOP = (( (PIXEL_PIPE - 1) * CLK_CYCLES_PIX ) - 1) - CLK_CYCLES_MUX - CLK_CYCLES_RAM;
localparam MUX_0_POS = DEMUX_PIPE_TOP - 0; // pixel offset positions in their respective synchronisation
localparam MUX_1_POS = DEMUX_PIPE_TOP - 1; // pipelines (where the pixels will be found in the pipeline
localparam MUX_2_POS = DEMUX_PIPE_TOP - 2; // when pc_ena[3:0]==0).
localparam MUX_3_POS = DEMUX_PIPE_TOP - 3; //
localparam MUX_4_POS = DEMUX_PIPE_TOP - 4; //
// Now that we know the DEMUX_PIPE_TOP, we can assign the top size of the 3 pipe regs
reg [DEMUX_PIPE_TOP*8+7:0] data_pipe;
reg [DEMUX_PIPE_TOP*20+19:0] addr_pipe;
reg [DEMUX_PIPE_TOP*16+15:0] cmd_pipe;
// We also need to limit the pipe in the 3 ' <= '
data_pipe[7:0] <= data_mux_out[7:0]; // fill the first 8-bit word in the register pipe with data from RAM
data_pipe[DEMUX_PIPE_TOP*8+7:1*8] <= data_pipe[ (DEMUX_PIPE_TOP-1) *8+7:0*8]; // shift the remaining DEMUX_PIPE_TOP words up by one in this 8-bit wide pipe
// this moves the data up one word at a time, dropping the top most 8 bits
addr_pipe[19:0] <= addr_mux_out;
addr_pipe[DEMUX_PIPE_TOP*20+19:1*20] <= addr_pipe[ (DEMUX_PIPE_TOP-1) *20+19:0*20];
cmd_pipe[15:0] <= cmd_mux_out[15:0];
cmd_pipe[DEMUX_PIPE_TOP*16+15:1*16] <= cmd_pipe[ (DEMUX_PIPE_TOP-1) *16+15:0*16];
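On the earlier question about replacing the five MUX_x_POS parameters with something indexable: Verilog-2001 has no array ports, but an indexed part-select ('+:') inside a for loop gets close. The following is only a rough sketch of that idea for the data pipe. The module name, the packed 'data_out' bus and the 'pix_stb' strobe are assumptions made for illustration, not part of the posted design.

```verilog
// Hypothetical sketch: demux the data pipe with a loop instead of MUX_0_POS..MUX_4_POS.
// Assumes DEMUX_PIPE_TOP >= CLK_CYCLES_PIX - 1 so every slot index stays non-negative.
module demux_pipe_sketch #(
    parameter PIXEL_PIPE     = 3, // pixel clocks from address in to data out
    parameter CLK_CYCLES_MUX = 1, // clk cycles spent in the input mux
    parameter CLK_CYCLES_RAM = 2, // clk cycles spent in the RAM
    parameter CLK_CYCLES_PIX = 5  // clk cycles (mux slots) per pixel
) (
    input  wire                        clk,
    input  wire                        pix_stb,      // assumed strobe, high when pc_ena == 0
    input  wire [7:0]                  data_mux_out, // byte coming back from the RAM
    output reg  [CLK_CYCLES_PIX*8-1:0] data_out      // port i occupies bits [i*8 +: 8]
);
    // same formula as in the post above
    localparam DEMUX_PIPE_TOP =
        (((PIXEL_PIPE - 1) * CLK_CYCLES_PIX) - 1) - CLK_CYCLES_MUX - CLK_CYCLES_RAM;

    reg [DEMUX_PIPE_TOP*8+7:0] data_pipe;
    integer i;

    always @(posedge clk) begin
        // shift the pipe up by one byte and load the newest RAM byte at the bottom
        data_pipe <= {data_pipe[DEMUX_PIPE_TOP*8-1:0], data_mux_out};
        if (pix_stb)
            for (i = 0; i < CLK_CYCLES_PIX; i = i + 1)
                // port i sits i slots below the top of the pipe
                data_out[i*8 +: 8] <= data_pipe[(DEMUX_PIPE_TOP - i)*8 +: 8];
    end
endmodule
```

With the default parameters DEMUX_PIPE_TOP works out to 6, so port 0 reads slot 6 and port 4 reads slot 2, the same positions as the hand-written MUX_0_POS to MUX_4_POS values. The addr and cmd pipes could follow the same pattern with widths of 20 and 16.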
.address_b (addr_b[ (ADDR_SIZE-1) :0]),
.clock1 (clk_b),
.data_b (data_in_b),
.wren_b (wr_en_b),
.address_a (addr_a[ (ADDR_SIZE-1) :0]),
Changed the 'ADDR_SIZE)' to '(ADDR_SIZE-1)'.
altsyncram_component.widthad_a = ADDR_SIZE,
altsyncram_component.widthad_b = ADDR_SIZE,
Changed the 'ADDR_SIZE - 1' to 'ADDR_SIZE'.
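The two changes follow the same rule: a bus range is written MSB down to 0, so it needs ADDR_SIZE-1, while a width parameter such as widthad_a is a plain count of bits. A minimal sketch of that relationship, using a generic inferred RAM rather than the altsyncram megafunction (the module name and ports here are made up for illustration):

```verilog
// Hypothetical inferred RAM, only to show how ADDR_SIZE relates to bus ranges and depth.
module ram_width_sketch #(
    parameter ADDR_SIZE = 14,             // number of address bits (a count)
    parameter NUM_WORDS = 2 ** ADDR_SIZE  // number of 8-bit words
) (
    input  wire                 clk,
    input  wire [ADDR_SIZE-1:0] addr,     // ADDR_SIZE bits wide, so the MSB index is ADDR_SIZE-1
    output reg  [7:0]           q
);
    reg [7:0] mem [0:NUM_WORDS-1];        // word indices run from 0 to NUM_WORDS-1

    always @(posedge clk)
        q <= mem[addr];                   // registered read, one clock of latency
endmodule
```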
There is a bug in your mux (partially my bad...) in 'multiport_gpu_ram.v'; however, we will fix it after you get your garbled text with the current version. It's one of those 'gotcha' things which would leave many helpless without running a simulator, and even then it could take days to fix depending on the rest of the design. Don't fret, the real solution is easy if you know what you are doing; we will fix it next.
Insert the following into the OSD code:
wire [19:0] read_text_adr;
wire [19:0] read_font_adr;
assign read_text_adr[8:0] = disp_pos[8:0];
assign read_text_adr[9] = 1'b0;
assign read_text_adr[19:10] = 10'h4; // sets address bit 12, base 0x1000
assign read_font_adr[9:0] = font_pos[9:0];
assign read_font_adr[19:10] = 10'h2; // sets address bit 11, base 0x0800
Now, pass read_font_adr & read_text_adr into 2 memory address 'addr_in_#()' ports of your liking.
And in the appropriate 'data_out_#()' ports, place the 'letter[7:0]' and 'char_line[7:0]'.
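If the bit-by-bit assigns feel error-prone, the same mapping can be written as a base address ORed with the zero-extended offset. This is just a sketch under the assumption that the bases stay at 0x1000 and 0x0800 as described above. The module name and the TEXT_BASE / FONT_BASE localparams are made up for illustration.

```verilog
// Hypothetical helper: builds the two 20-bit read addresses from the OSD positions.
module osd_addr_map_sketch (
    input  wire [8:0]  disp_pos,       // 32x16 character position
    input  wire [9:0]  font_pos,       // position inside the 1 KB font area
    output wire [19:0] read_text_adr,
    output wire [19:0] read_font_adr
);
    localparam [19:0] TEXT_BASE = 20'h01000; // text memory base (assumed from the post above)
    localparam [19:0] FONT_BASE = 20'h00800; // font memory base (assumed from the post above)

    // The offsets are narrower than the base alignment, so OR behaves like an add here.
    assign read_text_adr = TEXT_BASE | {11'b0, disp_pos};
    assign read_font_adr = FONT_BASE | {10'b0, font_pos};
endmodule
```

Moving either table later would then only mean changing one localparam, as long as the new base stays aligned to the size of its area.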
parameter PIPE_DELAY = 6; // This parameter selects the number of pixel clocks to delay the VDE and sync outputs. Only use 2 through 9.
wire [9:0] font_pos;
wire [8:0] disp_pos;
wire [2:0] osd_image;
wire [19:0] read_text_adr;
wire [19:0] read_font_adr;
assign read_text_adr[8:0] = disp_pos[8:0];
assign read_text_adr[9] = 1'b0;
assign read_text_adr[19:10] = 10'h4; // sets address bit 12, base 0x1000
assign read_font_adr[9:0] = font_pos[9:0];
assign read_font_adr[19:10] = 10'h2; // sets address bit 11, base 0x0800
// ****************************************************************************************************************************
// create a multiport GPU RAM handler instance
// ****************************************************************************************************************************
multiport_gpu_ram gpu_RAM(
.clk(clk),
.pc_ena_in(pc_ena),
.clk_b(),
.write_ena_b(),
.addr_in_0(read_font_adr),
.addr_in_1(read_text_adr),
.addr_in_2(),
.addr_in_3(),
.addr_in_4(),
.addr_host_in(),
.cmd_in_0(),
.cmd_in_1(),
.cmd_in_2(),
.cmd_in_3(),
.cmd_in_4(),
.pc_ena_out(),
.addr_out_0(),
.addr_out_1(),
.addr_out_2(),
.addr_out_3(),
.addr_out_4(),
.addr_host_out(),
.cmd_out_0(),
.cmd_out_1(),
.cmd_out_2(),
.cmd_out_3(),
.cmd_out_4(),
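.data_out_0(char_line[7:0]), // port 0 reads the font address, so it returns the font line
.data_out_1(letter[7:0]), // port 1 reads the text address, so it returns the letter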
.data_out_2(),
.data_out_3(),
.data_out_4(),
.data_host_out()
);
Without any other changes, this should generate a messed up text display, as my old pipe delays were designed for 2 pixel clocks on each read, not the current 3. To fix this, you need to make the changes I listed in red:
parameter PIPE_DELAY = 6; // This parameter selects the number of pixel clocks to delay the VDE and sync outputs. Only use 2 through 9.
assign font_pos[12:6]= letter[6:0] ; // Select the upper font address with the 7 bit letter, note the atari font has only 128 characters.
assign font_pos[2:0] = dly3_disp_y[3:1] ; // select the font x coordinate with a 2 pixel clock DELAYED disp_x address. [3:1] is used so that every 2 x lines are repeats
assign font_pos[5:3] = dly3_disp_y[3:1] ; // select the font y coordinate with a 2 pixel clock DELAYED disp_y address. [3:1] is used so that every 2 y lines are repeats
assign osd_image[1:0] = osd_image[1:0];
You are working with the wrong version of the OSD generator. You already changed this line to convert an 8-bit font line into 8 individual pixels (1-bit color, B&W font), remember?
assign osd_image[2] = dly3_letter[7]; // Remember, it takes 2 pixel clocks for osd_img[1:0] data to be valid from read address letter[6:0]
osd_ena_out <= dly4_dena; // This is used to drive a graphics A/B switch which tells when the OSD graphics should be shown
Done. I've also added the extra dly lines.
module vid_osd_generator ( clk, pc_ena, hde_in, vde_in, hs_in, vs_in, osd_ena_out, osd_image, hde_out, vde_out, hs_out, vs_out,
wren_disp, wren_font, wr_addr, wr_data );
// To write contents into the display and font memories, the wr_addr[15:0] selects the address
// the wr_data[7:0] contains a byte which will be written
// the wren_disp is the write enable for the ascii text ram. Only the wr_addr[8:0] are used as the character display is 32x16.
// the wren_font is the write enable for the font memory. Only 2 bits are used of the wr_data[1:0] and wr_addr[12:0] are used.
// tie these ports to GND for now disabling them
input clk;
input [3:0] pc_ena;
input hde_in, vde_in, hs_in, vs_in;
output osd_ena_out;
reg osd_ena_out;
output [2:0] osd_image;
output hde_out, vde_out, hs_out, vs_out;
reg hde_out, vde_out, hs_out, vs_out;
input wren_disp, wren_font;
input [15:0] wr_addr;
input [7:0] wr_data;
reg [9:0] disp_x,dly1_disp_x,dly2_disp_x,dly3_disp_x,dly4_disp_x;
reg [8:0] disp_y,dly1_disp_y,dly2_disp_y,dly3_disp_y;
reg dena,dly1_dena,dly2_dena,dly3_dena,dly4_dena;
reg [7:0] dly1_letter, dly2_letter, dly3_letter;
reg [7:0] hde_pipe, vde_pipe, hs_pipe, vs_pipe;
parameter PIPE_DELAY = 6; // This parameter selects the number of pixel clocks to delay the VDE and sync outputs. Only use 2 through 9.
wire [9:0] font_pos;
wire [8:0] disp_pos;
wire [2:0] osd_image;
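wire [19:0] read_text_adr;
wire [19:0] read_font_adr;
wire [7:0] letter; // character byte read back from the text area of the GPU RAM
wire [7:0] char_line; // font line byte read back from the font area of the GPU RAM
assign read_text_adr[8:0] = disp_pos[8:0];
assign read_text_adr[9] = 1'b0;
assign read_text_adr[19:10] = 10'h4; // sets address bit 12, base 0x1000
assign read_font_adr[9:0] = font_pos[9:0];
assign read_font_adr[19:10] = 10'h2; // sets address bit 11, base 0x0800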
// ****************************************************************************************************************************
// create a multiport GPU RAM handler instance
// ****************************************************************************************************************************
multiport_gpu_ram gpu_RAM(
.clk(clk),
.pc_ena_in(pc_ena),
.clk_b(),
.write_ena_b(),
.addr_in_0(read_font_adr),
.addr_in_1(read_text_adr),
.addr_in_2(),
.addr_in_3(),
.addr_in_4(),
.addr_host_in(),
.cmd_in_0(),
.cmd_in_1(),
.cmd_in_2(),
.cmd_in_3(),
.cmd_in_4(),
.pc_ena_out(),
.addr_out_0(),
.addr_out_1(),
.addr_out_2(),
.addr_out_3(),
.addr_out_4(),
.addr_host_out(),
.cmd_out_0(),
.cmd_out_1(),
.cmd_out_2(),
.cmd_out_3(),
.cmd_out_4(),
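.data_out_0(char_line[7:0]), // port 0 reads the font address, so it returns the font line
.data_out_1(letter[7:0]), // port 1 reads the text address, so it returns the letter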
.data_out_2(),
.data_out_3(),
.data_out_4(),
.data_host_out()
);
// The disp_x is the X coordinate counter. It runs from 0 to 512 and stops there
// The disp_y is the Y coordinate counter. It runs from 0 to 256 and stops there
// Get the character at the current x, y position
assign disp_pos[4:0] = disp_x[8:4] ; // The disp_pos[4:0] is the lower address for the 32 characters for the ascii text.
assign disp_pos[8:5] = disp_y[7:4] ; // the disp_pos[8:5] is the upper address for the 16 lines of text
// The result from the ascii memory component 'altsyncram_component_osd_mem' is called letter[7:0]
// Since disp_pos[8:0] has entered the read address, it takes 2 pixel clock cycles for the resulting letter[7:0] to come out.
// Now, font_pos[12:0] is the read address for the memory block containing the character specified in letter[]
assign font_pos[12:6]= letter[6:0] ; // Select the upper font address with the 7 bit letter, note the atari font has only 128 characters.
assign font_pos[2:0] = dly3_disp_y[3:1] ; // select the font x coordinate with a 2 pixel clock DELAYED disp_x address. [3:1] is used so that every 2 x lines are repeats
assign font_pos[5:3] = dly3_disp_y[3:1] ; // select the font y coordinate with a 2 pixel clock DELAYED disp_y address. [3:1] is used so that every 2 y lines are repeats
// The resulting 2-bit font image at x is assigned to the OSD[1:0] output
// Also, since there is an 8th bit in the ascii text memory, I use that as a third OSD output color bit
assign osd_image[1:0] = char_line[(~dly4_disp_x[3:1])];
assign osd_image[2] = dly3_letter[7]; // Remember, it takes 2 pixel clocks for osd_img[1:0] data to be valid from read address letter[6:0]
always @ ( posedge clk ) begin
if (pc_ena[2:0] == 0) begin
// **************************************************************************************************************************
// *** Create a serial pipe where the PIPE_DELAY parameter selects the pixel count delay for the xxx_in to the xxx_out ports
// **************************************************************************************************************************
hde_pipe[0] <= hde_in;
hde_pipe[7:1] <= hde_pipe[6:0];
hde_out <= hde_pipe[PIPE_DELAY-2];
vde_pipe[0] <= vde_in;
vde_pipe[7:1] <= vde_pipe[6:0];
vde_out <= vde_pipe[PIPE_DELAY-2];
hs_pipe[0] <= hs_in;
hs_pipe[7:1] <= hs_pipe[6:0];
hs_out <= hs_pipe[PIPE_DELAY-2];
vs_pipe[0] <= vs_in;
vs_pipe[7:1] <= vs_pipe[6:0];
vs_out <= vs_pipe[PIPE_DELAY-2];
// **********************************************************************************************
// This OSD generator's window is only 512 pixels by 256 lines.
// Since the disp_X&Y counters are the screens X&Y coordinates, I'm using an extra most
// significant bit in the counters to determine if the OSD ena flag should be on or off.
if (disp_x[9] || disp_y[8])
dena <= 0; // When disp_x > 511 or disp_y > 255, then turn off the OSD's output enable flag
else
dena <= 1; // otherwise, turn on the OSD output enable flag
if (~vde_in)
disp_y[8:0] <= 9'b111111111; // preset the disp_y counter to max while the vertical display is disabled
else if (hde_in && ~hde_pipe[0])
begin // isolate a single event at the beginning of the active display area
disp_x[9:0] <= 10'b0000000000; // clear the disp_x counter
if (!disp_y[8] | (disp_y[8:7] == 2'b11))
disp_y <= disp_y + 1; // only increment the disp_y counter if it hasn't reached its end
end
else if (!disp_x[9])
disp_x <= disp_x + 1; // keep on adding to the disp_x counter until it reaches its end.
// **********************************************************************************************
// *** These delay pipes registers are explained in the 'assign's above
// **********************************************************************************************
dly1_disp_x <= disp_x;
dly2_disp_x <= dly1_disp_x;
dly3_disp_x <= dly2_disp_x;
dly4_disp_x <= dly3_disp_x;
dly1_disp_y <= disp_y;
dly2_disp_y <= dly1_disp_y;
dly3_disp_y <= dly2_disp_y;
dly1_letter <= letter;
dly2_letter <= dly1_letter;
dly3_letter <= dly2_letter;
dly1_dena <= dena;
dly2_dena <= dly1_dena;
dly3_dena <= dly2_dena;
dly4_dena <= dly3_dena;
// **********************************************************************************************
osd_ena_out <= dly4_dena; // This is used to drive a graphics A/B switch which tells when the OSD graphics should be shown
// It needs to be delayed by the number of pixel clocks required for the above memories
end // ena
end // always@clk
endmodule
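Since every change in RAM latency means adding more dlyN registers by hand, a parameterised delay line might save some typing. This is only a sketch of the idea, not what the module above uses. The module name and ports are hypothetical, and it assumes DEPTH is at least 2.

```verilog
// Hypothetical generic delay line: q is d delayed by DEPTH enabled clocks.
module delay_line_sketch #(
    parameter WIDTH = 1,  // bits per sample
    parameter DEPTH = 3   // number of enabled clocks of delay, assumed >= 2
) (
    input  wire             clk,
    input  wire             ena,  // e.g. the once-per-pixel enable
    input  wire [WIDTH-1:0] d,
    output wire [WIDTH-1:0] q
);
    reg [WIDTH*DEPTH-1:0] pipe;

    always @(posedge clk)
        if (ena)
            // shift up by one sample and load the new sample at the bottom
            pipe <= {pipe[WIDTH*(DEPTH-1)-1:0], d};

    assign q = pipe[WIDTH*DEPTH-1 -: WIDTH]; // oldest sample sits at the top
endmodule
```

For example, an instance with WIDTH = 8 and DEPTH = 3 fed from letter would give the equivalent of dly3_letter, and changing the delay later becomes a one-parameter edit.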
reg [15:0] cmd_mux_out; // **** I APPLIED FIX HERE - CHANGED TO 'WIRE'
wire [19:0] addr_mux_out;
wire [7:0] data_mux_out;
// create a GPU RAM instance
gpu_dual_port_ram_INTEL gpu_RAM(
.clk(clk),
.pc_ena_in(pc_ena_in),
.clk_b(clk_b),
.wr_en_b(wr_en_b),
.addr_a(addr_in_mux),
.addr_b(),
.data_in_b(),
.cmd_in(cmd_mux_in),
.addr_out_a(addr_mux_out),
.pc_ena_out(pc_ena_out),
.cmd_out(cmd_mux_out), // *********** THIS LINE IS WHERE THE ERROR POINTS TO
.data_out_a(data_mux_out),
.data_out_b()
);
multiport_gpu_ram gpu_RAM(
.clk(clk),
.pc_ena_in(pc_ena),
.clk_b(clk),
.write_ena_b(1'b1),
.addr_in_0(read_font_adr),
.addr_in_1(read_text_adr),
.addr_in_2(20'b0),
.addr_in_3(20'b0),
.addr_in_4(20'b0),
.addr_host_in(20'b0),
.cmd_in_0(16'b0),
.cmd_in_1(16'b0),
.cmd_in_2(16'b0),
.cmd_in_3(16'b0),
.cmd_in_4(16'b0),
.pc_ena_out(),
.addr_out_0(),
.addr_out_1(),
.addr_out_2(),
.addr_out_3(),
.addr_out_4(),
.addr_host_out(),
.cmd_out_0(),
.cmd_out_1(),
.cmd_out_2(),
.cmd_out_3(),
.cmd_out_4(),
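.data_out_0(char_line[7:0]), // port 0 reads the font address, so it returns the font line
.data_out_1(letter[7:0]), // port 1 reads the text address, so it returns the letter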
.data_out_2(),
.data_out_3(),
.data_out_4(),
.data_host_out()
);
Yeah, getting it to compile was problematic - there were a few errors, but nothing insurmountable.
I'm not sure about the fix for this one, though:
Error (10663): Verilog HDL Port Connection error at multiport_gpu_ram.v(75): output or inout port "cmd_out" must be connected to a structural net expression
Does your altsyncram memory have anything in it?
Was just testing. 
Error (10663): Verilog HDL Port Connection error at multiport_gpu_ram.v(75): output or inout port "cmd_out" must be connected to a structural net expression
Something else is off, otherwise the "addr_out" would have a similar error.
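For reference, this error usually means the signal hooked up to an instance's output port is declared as 'reg' in the parent module. In the code posted earlier, addr_mux_out is declared as 'wire' while cmd_mux_out is declared as 'reg', which may be why only cmd_out is flagged. A minimal sketch of the rule, with made-up module names:

```verilog
// Hypothetical child: declaring the output as 'reg' is fine here,
// because it is driven from an always block inside this module.
module child_sketch (
    input  wire        clk,
    input  wire [15:0] cmd_in,
    output reg  [15:0] cmd_out
);
    always @(posedge clk)
        cmd_out <= cmd_in;
endmodule

// Hypothetical parent: whatever connects to the instance's output must be a net.
module parent_sketch (
    input  wire        clk,
    input  wire [15:0] cmd_in,
    output reg  [15:0] cmd_reg
);
    wire [15:0] cmd_mux_out; // 'wire', not 'reg': the instance below drives it

    child_sketch u_child (
        .clk    (clk),
        .cmd_in (cmd_in),
        .cmd_out(cmd_mux_out)
    );

    // The parent can still register the value in its own always block.
    always @(posedge clk)
        cmd_reg <= cmd_mux_out;
endmodule
```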