Author Topic: Home made SystemVerilog 3 word, Zero latency FIFO, documented for beginners.  (Read 5909 times)

BrianHG (Topic starter)

  • Super Contributor
  • ***
  • Posts: 7727
  • Country: ca
Here is my homemade 3 word, plus 1 extra reserve word (4 words total), Zero Latency FIFO with 'look ahead' data and status flag outputs, written in SystemVerilog.

Zero latency means that while the FIFO is empty, 'shift_in' and 'data_in' are wired directly to 'fifo_not_empty' and 'data_out', incurring 0 clock cycles of delay when the FIFO is used as a small data command buffer.

The outputs are also 'look ahead', which means the data and flag outputs already hold valid data before you assert a clocked 'shift_out'.  When you assert 'shift_out', you retrieve the output data on that cycle and the FIFO presents the next word on the next clock.  The fifo_not_empty flag likewise tells you ahead of time whether more data is available.  This is also the reason for the +1 extra word in the FIFO: the fifo_full flag goes high 1 word before the FIFO is truly full.  This gives your other logic an extra clock cycle to halt, since it may take 1 clock for those modules to respond to the fifo_full flag.
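
As a rough sketch of why a single reserve word is enough, assume a producer that registers its outputs, so its 'shift_in' reacts to 'fifo_full' one clock late.  The module name 'producer_example' and its incrementing test data are hypothetical and not part of the FIFO source:

```systemverilog
// Hypothetical producer (illustration only): streams an incrementing count
// into the FIFO and pauses when fifo_full is seen.
module producer_example #(
    parameter bits = 8
)(
    input  wire            clk,
    input  wire            reset,
    input  wire            fifo_full,  // from the FIFO
    output reg             shift_in,   // to the FIFO (registered)
    output reg  [bits-1:0] data_in     // to the FIFO (registered)
);

always @(posedge clk) begin
    if (reset) begin
        shift_in <= 1'b0;
        data_in  <= 0;
    end else begin
        // fifo_full is sampled on the clock edge, so shift_in responds
        // one clock after the flag changes.
        shift_in <= ~fifo_full;

        // Only advance to the next word once the current one was offered.
        if (shift_in) data_in <= data_in + 1'b1;
    end
end

endmodule
```

The word offered during the clock in which fifo_full first goes high lands in the reserve word.  After that, shift_in stays low until fifo_full has dropped again.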

The 'FIFO_3word_0_latency.sv' has 2 parameters: 'bits', which sets the width of the FIFO, and 'zero_latency', which enables/disables the 0 clock cycle delay on the output data.
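
A minimal instantiation sketch showing how the two parameters might be overridden.  The wrapper 'cmd_buffer_example' and all of its 'cmd_*' port names are made up for illustration; only the FIFO's own port and parameter names come from the source:

```systemverilog
// Hypothetical wrapper (illustration only): a 16 bit wide command buffer.
module cmd_buffer_example (
    input  wire        clk,
    input  wire        reset,
    input  wire        cmd_valid,  // producer strobes a new command
    input  wire [15:0] cmd_in,
    input  wire        cmd_take,   // consumer accepts the command on cmd_out
    output wire        cmd_ready,  // a command is available on cmd_out
    output wire        cmd_busy,   // 3 or more words stored, producer should pause
    output wire [15:0] cmd_out
);

FIFO_3word_0_latency #(
    .bits         (16),  // width of each FIFO word
    .zero_latency (1)    // 1 = pass data_in straight through while the FIFO is empty
) cmd_fifo (
    .clk            (clk),
    .reset          (reset),
    .shift_in       (cmd_valid),
    .shift_out      (cmd_take),
    .data_in        (cmd_in),
    .fifo_not_empty (cmd_ready),
    .fifo_full      (cmd_busy),
    .data_out       (cmd_out)
);

endmodule
```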

See this image for functionality:

[Image attachment 1046066-0]

The source code has all the inputs, outputs & parameters well documented in the next post.
« Last Edit: August 13, 2020, 06:30:14 am by BrianHG »
 

BrianHG (Topic starter)
Get final version 3.1 here: V3.1 with 7 FIFO word mode and underflow/overflow protection.


Obsolete V1.0 here:
Code: [Select]

// *****************************************************************
// *** FIFO_3word_0_latency.sv V1.0, July 6, 2020
// ***
// *** This 3 word + 1 reserve word Zero latency FIFO
// *** with look ahead data and status flags outputs was
// *** written by Brian Guralnick.
// ***
// *** See the included 'FIFO_0_latency.png' simulation for functionality.
// ***
// *** Using System Verilog code which only uses synchronous logic.
// *** Well commented for educational purposes.
// *****************************************************************

module FIFO_3word_0_latency #(

parameter bits = 8,              // sets the width of the fifo
parameter zero_latency = 1       // When set to 1, if the FIFO is empty, the data_out and fifo_not_empty flag will
                                 // immediately reflect the state of the inputs data_in and shift_in, 0 clock cycle delay.
                                 // When set to 0, like a normal synchronous FIFO, it will take 1 clock cycle before the
                                 // fifo_not_empty flag goes high and the data_out will have a copy of the data_in after a 'shift_in'.

)(

input wire clk,                  // CLK input
input wire reset,                // reset FIFO

input  wire shift_in,            // load a word into the FIFO.
input  wire shift_out,           // shift data out of the FIFO.
input  wire [bits-1:0] data_in,  // data word input.

output wire fifo_not_empty,      // High when there is data available.
output wire fifo_full,           // High when the FIFO is 3 words full.
                                 // *** Note the FIFO has 1 extra word of free space after
                                 // *** the fifo_full flag goes high, so it's actually a 4 word FIFO.

output wire [bits-1:0] data_out  // FIFO data word output
);

reg  [bits-1:0] fifo_data_reg[3:0] ;                  // FIFO memory
reg  [2:0]      fifo_wr_pos, fifo_rd_pos, fifo_size ; // write/read positions and the number of words stored in the fifo


assign data_out       = ( fifo_size == 0 ) && (zero_latency ) ?  data_in : fifo_data_reg[fifo_rd_pos[1:0]] ; // When FIFO is empty and parameter zero_latency = 1,
                                                                                    // bypass the memory registers and wire the output directly to the input.
                                                                                    // Otherwise show the correct FIFO memory register once it is latched.

assign fifo_not_empty = (fifo_size != 0) || (zero_latency && shift_in);             // While the FIFO is empty and zero_latency = 1, directly wire
                                                                                    // 'fifo_not_empty' output to the 'shift_in' input.  Otherwise,
                                                                                    // only set high once there is data in the FIFO.
assign fifo_full      = (fifo_size >= 3'd3);                                        // High once 3 or more words are stored.

always @ (posedge clk) begin

if (reset) begin

fifo_data_reg[0] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[1] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[2] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[3] <= 0 ;  // clear the FIFO's memory contents

fifo_rd_pos       <= 3'd0;      // reset the FIFO memory counter
fifo_wr_pos       <= 3'd0;      // reset the FIFO memory counter
fifo_size         <= 3'd0;      // reset the FIFO memory counter

end else begin

            if (  shift_in && ~shift_out ) fifo_size <= fifo_size + 1'b1; // Calculate the number of words stored in the FIFO
       else if ( ~shift_in &&  shift_out ) fifo_size <= fifo_size - 1'b1;

                 if ( shift_in ) begin
                      fifo_wr_pos                     <= fifo_wr_pos + 1'b1 ;
                      fifo_data_reg[fifo_wr_pos[1:0]] <= data_in ;
                      end
                 if ( shift_out ) begin
                      fifo_rd_pos                     <= fifo_rd_pos + 1'b1 ;
                      end
       
   end // ~reset


end // always @ (posedge clk) begin
endmodule


« Last Edit: July 12, 2020, 02:35:52 am by BrianHG »
 
hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Nice work!

Must play havoc with your timing closure though????
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

BrianHG (Topic starter)
     Only if you use the FIFO wired directly to and from IO pins.  If the FIFO is internal, the compiler would consider the 'WIRING' from its input port to the output port as just part of the logic gates, or a mux selection logic within your design.  I still get an fmax in the 350MHz range on the slowest CycloneIV.  This is the reason for making the FIFO only 4 words deep.  Any width up to 36 bits will give me that theoretical 350MHz range.  In other words, inserting such a FIFO anywhere inside my existing designs won't lower my design's current fmax.

     However, wired to IO pins, you need to take into account the input's tsu and the un-latched data outputs behaving like a mux or gates, not clocked registers.  Especially on the output side, FPGAs don't handle this very well.  Quartus gave me a 'restricted FMAX' of 250MHz, but the data is mush, or rather, the data is only valid for the last few picoseconds.  To be useful like this with a valid data output window of, say, 10ns, you could not run this FIFO faster than around 75MHz.  I have not considered smaller PLD devices like the MAX3000/MAX7000 series.  With optimally chosen IOs, they might actually perform better, as they were designed for glue logic / gate driven IO pins and the fabric is tiny compared to an FPGA.
« Last Edit: July 07, 2020, 12:39:59 pm by BrianHG »
 

NorthGuy

  • Super Contributor
  • ***
  • Posts: 3143
  • Country: ca
Xilinx 7-series chips have this built-in in their hardware FIFOs. They call it "First Word Fall Through" (FWFT) mode.
 
asmi

  • Super Contributor
  • ***
  • Posts: 2731
  • Country: ca
They have quite high minimum depth (16 IIRC), if you need less you will need to improvise.
 
BrianHG (Topic starter)
Same with Altera's internal IP FIFO, with an alleged warning that some of these mode settings may impact performance, however, my code has no vendor specific IP/module tied to it.
« Last Edit: July 07, 2020, 02:24:40 pm by BrianHG »
 

asmi
Using blocking assignments inside clocked blocks is not a good idea in my opinion. I would also use actual SV constructs instead of legacy Verilog ones: "always_ff", "logic" instead of regs (though I do still use wires to underscore that they are not FFs where this difference is important), SV-style parameter declarations ("#( parameter X = 42)"), etc. But these things are mostly a matter of taste.
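
For illustration, this is roughly what those constructs look like on a small occupancy counter similar to the fifo_size counter in V1.  The module name 'fifo_size_counter_sv' and the 'SIZE_BITS' parameter are made up for this sketch:

```systemverilog
// Hypothetical example (illustration only): SV-style parameter header,
// 'logic' for the flip-flop output, and 'always_ff' with non-blocking assignments.
module fifo_size_counter_sv #(
    parameter int SIZE_BITS = 3
)(
    input  wire                  clk,
    input  wire                  reset,
    input  wire                  shift_in,
    input  wire                  shift_out,
    output logic [SIZE_BITS-1:0] fifo_size
);

always_ff @(posedge clk) begin
    if (reset)                         fifo_size <= '0;
    else if ( shift_in && !shift_out ) fifo_size <= fifo_size + 1'b1;
    else if (!shift_in &&  shift_out ) fifo_size <= fifo_size - 1'b1;
    // simultaneous shift_in and shift_out leaves the count unchanged
end

endmodule
```

With 'always_ff', a compliant tool should flag the block if it does not describe flip-flops, which a plain 'always' will not do.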

hamster_nz
Doesn't this cause trouble when zero_latency is 1, linking the timing paths leading up to "data_in" with those downstream from "data_out", with a couple of levels of logic added in?

Code: [Select]
assign data_out       = ( fifo_size == 0 ) && (zero_latency ) ?  data_in : fifo_data_reg[fifo_rd_pos[1:0]] ;

As I count it, a bit of data_out depends on 9 bits of input (or maybe fewer, depending on whether fifo_data_reg is inferred as RAM)...



BrianHG (Topic starter)
With zero_latency set to 0, the downstream logic will always be a 4 position mux selecting 1 of the 4 fifo_data_regs selected by the 2 bit address fifo_rd_pos.

With zero_latency set to 1, we now have a 3 bit address.  The MSB address is tied to the output of a 3 input OR gate whose inputs are tied to the 3 bit counter fifo_size.  The first 4 inputs of that now 8:1 mux are tied in parallel to the data coming from data_in (basically a register from somewhere in your existing design), while the top 4 inputs of that 8:1 mux are tied to the 4 fifo_data_regs.

Since you will be feeding the data in from another register in your FPGA anyway, just like the 4 registers fifo_data_reg, the penalty of going from zero_latency off to on is equivalent to switching from a 4:1 mux memory register bank to an 8:1 mux memory register bank, with the first 4 of those inputs tied together to your data_in (again, still another reg in your FPGA design) and the MSB address of the mux selector tied to the output of a 3 input OR gate whose inputs come from the 3 bit counter fifo_size.
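
Written out explicitly, the V1 output path with zero_latency = 1 would look something like the sketch below.  The module 'v1_output_mux_view' and the names 'reg0'..'reg3', 'mux_in' and 'mux_sel' are hypothetical; the sketch is meant to be functionally equivalent to the V1 'assign data_out' line, not a copy of what the compiler generates:

```systemverilog
// Hypothetical view (illustration only) of the V1 data_out logic as an 8:1 mux.
module v1_output_mux_view #(
    parameter bits = 8
)(
    input  wire [bits-1:0] data_in,
    input  wire [bits-1:0] reg0, reg1, reg2, reg3,  // the 4 fifo_data_reg words
    input  wire [1:0]      fifo_rd_pos,
    input  wire [2:0]      fifo_size,
    output wire [bits-1:0] data_out
);

wire [bits-1:0] mux_in [7:0];
wire [2:0]      mux_sel;

assign mux_in[0] = data_in;  // first 4 inputs: FIFO empty, pass data_in through
assign mux_in[1] = data_in;
assign mux_in[2] = data_in;
assign mux_in[3] = data_in;
assign mux_in[4] = reg0;     // top 4 inputs: the FIFO memory registers
assign mux_in[5] = reg1;
assign mux_in[6] = reg2;
assign mux_in[7] = reg3;

assign mux_sel[2]   = |fifo_size;   // the 3 input OR gate: high when the FIFO is not empty
assign mux_sel[1:0] = fifo_rd_pos;  // selects 1 of the 4 words

assign data_out = mux_in[mux_sel];

endmodule
```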

Damn it hamster_nz (I don't know why I didn't realize it myself), I just thought of a way to eliminate that 3 input OR gate, making it a single register bit just like the first 2 bits of fifo_rd_pos, hence shaving off a nanosecond.  As fast as the current design is, it would still be an improvement for slower FPGAs or old PLDs.

Updated V2 code coming with timing comparison if possible (meaning the current code is already so fast).
Note that I will code it so that the 8:1 mux becomes obvious & I will adapt my simulation test bench to reflect the FIFO being inside a design where it is being fed by a set of registers, and the outputs will be latched by another set of registers.  The compiler will then provide a proper valid FMAX reading and penalty between zero_latency on and off when using the FIFO buried inside a design in this manner.

Note: zero_latency doesn't count as an input as it's a fixed parameter removed at compile time.
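
A registered test wrapper of the kind described above might look like the following sketch.  The module name 'FIFO_fmax_wrapper' and the '_r' / '_w' signal names are assumptions and this is not the actual test bench.  The extra registers change the cycle timing, so it is only meant for reading a register-to-register FMAX out of the timing analyzer:

```systemverilog
// Hypothetical FMAX test wrapper (illustration only): every FIFO input is fed
// from a register and every FIFO output is captured by a register.
module FIFO_fmax_wrapper #(
    parameter bits         = 36,
    parameter zero_latency = 1
)(
    input  wire            clk,
    input  wire            reset,
    input  wire            shift_in,
    input  wire            shift_out,
    input  wire [bits-1:0] data_in,
    output reg             fifo_not_empty,
    output reg             fifo_full,
    output reg  [bits-1:0] data_out
);

reg             shift_in_r, shift_out_r;
reg  [bits-1:0] data_in_r;
wire            fifo_not_empty_w, fifo_full_w;
wire [bits-1:0] data_out_w;

always @(posedge clk) begin
    // input side registers
    shift_in_r     <= shift_in;
    shift_out_r    <= shift_out;
    data_in_r      <= data_in;
    // output side registers
    fifo_not_empty <= fifo_not_empty_w;
    fifo_full      <= fifo_full_w;
    data_out       <= data_out_w;
end

FIFO_3word_0_latency #(
    .bits         (bits),
    .zero_latency (zero_latency)
) dut (
    .clk            (clk),
    .reset          (reset),
    .shift_in       (shift_in_r),
    .shift_out      (shift_out_r),
    .data_in        (data_in_r),
    .fifo_not_empty (fifo_not_empty_w),
    .fifo_full      (fifo_full_w),
    .data_out       (data_out_w)
);

endmodule
```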
« Last Edit: July 07, 2020, 11:46:01 pm by BrianHG »
 

BrianHG (Topic starter)
Get final version 3.1 here: V3.1 with 7 FIFO word mode and underflow/overflow protection.

Here is obsolete version 2.0.  I've also added an overflow and underflow protection feature to prevent FIFO corruption if too many reads or writes are requested.

This version has only a single 2 input OR gate driving the 'fifo_not_empty' flag when in zero_latency mode, also known as "First Word Fall Through" (FWFT) mode, and the output data mux selector no longer needs a 3 input OR gate to select data_in in FWFT mode.  The timing improvement will be shown in the next post.
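
To see how the 8 bit 'fifo_position' register replaces the counter, here is a small simulation-only sketch that rotates the reset pattern left once per stored word and prints the bits V2 taps for its flags.  The module name 'fifo_position_demo' and the variable 'pos' are made up for illustration:

```systemverilog
// Hypothetical simulation-only demo (illustration only, not synthesizable intent).
module fifo_position_demo;

logic [7:0] pos = 8'b11100001;  // same pattern V2 loads on reset (FIFO empty)

initial begin
    for (int words = 0; words <= 4; words++) begin
        // bit 1 = data stored, bit 3 = fifo_full, bit 4 = reserve word used
        $display("words=%0d position=%b stored=%b full=%b reserve_used=%b",
                 words, pos, pos[1], pos[3], pos[4]);
        pos = {pos[6:0], pos[7]};  // rotate left, as V2 does on a shift_in
    end
end

endmodule
```

Stepping through by hand, the pattern goes 11100001 (empty), 11000011, 10000111, 00001111 (bit 3 set, fifo_full) and 00011110 (bit 4 set, reserve word used), so each flag is a single register bit with no compare or OR logic in front of it.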

Code: [Select]

// *****************************************************************
// *** FIFO_3word_0_latency.sv V2.0, July 7, 2020
// ***
// *** This 3 word + 1 reserve word Zero latency FIFO
// *** with look ahead data and status flags outputs was
// *** written by Brian Guralnick.
// ***
// *** See the included 'FIFO_0_latency.png' simulation for functionality.
// ***
// *** Using System Verilog code which only uses synchronous logic.
// *** Well commented for educational purposes.
// *****************************************************************

module FIFO_3word_0_latency #(

//*************************************************************************************************************************************
parameter bits = 8,                 // sets the width of the fifo
parameter zero_latency = 1,         // When set to 1, if the FIFO is empty, the data_out and fifo_not_empty flag will
                                    // immediately reflect the state of the inputs data_in and shift_in, 0 clock cycle delay.
                                    // When set to 0, like a normal synchronous FIFO, it will take 1 clock cycle before the
                                    // fifo_not_empty flag goes high and the data_out will have a copy of the data_in after a 'shift_in'.

                                    // Enabling the overflow/underflow protection features may lower top FMAX.
parameter overflow_protection  = 0, // Prevents write position increment and writing if the fifo is full past the 1 extra reserve word
parameter underflow_protection = 0  // Prevents read position increment if the fifo is empty
//*************************************************************************************************************************************

)(

input wire clk,                  // CLK input
input wire reset,                // reset FIFO

input  wire shift_in,            // load a word into the FIFO.
input  wire shift_out,           // shift data out of the FIFO.
input  wire [bits-1:0] data_in,  // data word input.

output wire fifo_not_empty,      // High when there is data available.
output wire fifo_full,           // High when the FIFO is 3 words full.
                                 // *** Note the FIFO has 1 extra word of free space after
                                 // *** the fifo_full flag goes high, so it's actually a 4 word FIFO.

output wire [bits-1:0] data_out  // FIFO data word output
);

reg  [bits-1:0] fifo_data_reg[3:0] ;                  // FIFO memory
reg  [1:0]      fifo_wr_pos, fifo_rd_pos;             // read and write memory pointers
reg  [7:0]      fifo_position = 8'b11100001 ;         // The fifo empty location

wire [2:0]      read_pointer ;                        // read data mux pointer
wire [bits-1:0] fifo_data_mux[7:0] ;                  // the mux data inputs

assign          fifo_data_mux[3:0] = fifo_data_reg[3:0]; // mux selection from fifo register data.
assign          fifo_data_mux[4]   = data_in ;           // mux selection from data input
assign          fifo_data_mux[5]   = data_in ;           // mux selection from data input
assign          fifo_data_mux[6]   = data_in ;           // mux selection from data input
assign          fifo_data_mux[7]   = data_in ;           // mux selection from data input

assign   read_pointer[1:0] =  fifo_rd_pos[1:0];                  // address the 4 fifo words
assign   read_pointer[2]   = ~fifo_position[1] && zero_latency ; // when high, address the data_in on the top 4 mux positions.
assign   data_out          =  fifo_data_mux[read_pointer];       // mux select the data output.

//assign data_out        = ( ~fifo_position[1] && zero_latency ) ?  data_in : fifo_data_reg[fifo_rd_pos[1:0]] ; // When FIFO is empty and parameter zero_latency = 1,
                                                                                       // bypass the memory registers and wire the output directly to the input.
                                                                                       // Otherwise show the correct FIFO memory register once it is latched.

assign fifo_not_empty    =  fifo_position[1] || (zero_latency && shift_in);            // While the FIFO is empty and zero_latency = 1, directly wire
                                                                                       // 'fifo_not_empty' output to the 'shift_in' input.  Otherwise,
                                                                                       // only set high once there is data in the FIFO.
assign fifo_full         =  fifo_position[3] ;                                         // High when 3 or 4 words are stored.

wire   shift_in_protect, shift_out_protect;
assign shift_in_protect  =  shift_in  && ( ~fifo_position[4] || (overflow_protection  !=1) );// Do not allow a shift_in if the FIFO is full including the fourth (reserve) word
assign shift_out_protect =  shift_out && ( fifo_not_empty    || (underflow_protection !=1) );// Do not allow a shift_out if the FIFO is empty

always @ (posedge clk) begin

if (reset) begin

fifo_data_reg[0] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[1] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[2] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[3] <= 0 ;  // clear the FIFO's memory contents

fifo_rd_pos       <= 2'd0;             // reset the FIFO memory counter
fifo_wr_pos       <= 2'd0;             // reset the FIFO memory counter
fifo_position     <= 8'b11100001;      // The fifo empty location

end else begin

            if (  shift_in_protect && ~shift_out_protect ) fifo_position[7:0] <= {fifo_position[6:0],fifo_position[7]}; // Rotate the FIFO position left
       else if ( ~shift_in_protect &&  shift_out_protect ) fifo_position[7:0] <= {fifo_position[0],fifo_position[7:1]}; // Rotate the FIFO position right

                 if ( shift_in_protect  ) begin
                      fifo_wr_pos                     <= fifo_wr_pos + 1'b1 ;
                      fifo_data_reg[fifo_wr_pos[1:0]] <= data_in ;
                      end
                 if ( shift_out_protect ) begin
                      fifo_rd_pos                     <= fifo_rd_pos + 1'b1 ;
                      end
       
   end // ~reset


end // always @ (posedge clk) begin
endmodule


« Last Edit: July 12, 2020, 02:35:13 am by BrianHG »
 

BrianHG (Topic starter)
This simulation shows the response improvement of the 'fifo_not_empty' flag in FWFT mode with a shift in:



Here is a view of the improvement of the data_out timing when shifting out the last word in the FIFO as it enters FWFT mode, passing through the data_in:



These timing simulations were done from IO pins to IO pins on a CycloneIII -8 FPGA.  They do not represent the authentic performance when using the FIFO internally in a system design, as the compiler simplifies out and re-times registers to achieve full timing closure.  However, embedded in a design, the V1 had an FMAX of 308MHz & the V2 had an FMAX of 290MHz.  Though it sounds off that the V2 has a lower FMAX, it is because the V2 has an 8 bit bi-directional shift register in place of the 3 bit counter in V1.  If you do not need performance above 290MHz, V2's improved FWFT MUX response will make it easier for your compiler to achieve its timing closure goals with less register re-timing during fitting.
« Last Edit: July 12, 2020, 02:53:27 am by BrianHG »
 

asmi
Here is a fifo I use in my 64bit RISC-V core (it's a FWFT fifo which I called elastic_fifo for some reason which I don't remember anymore ::) ):
Code: [Select]
module elastic_fifo #(
    parameter type data_type = bit[31:0],
    parameter MAX_DEPTH = 4,
    parameter ALMOST_FULL_THRESHOLD = 2
)(
    input sysclk,
    input reset,

    elastic_fifo_write_intf.fifo producer,
    elastic_fifo_read_intf.fifo consumer
);

localparam PTR_WIDTH = $clog2(MAX_DEPTH);
localparam DEFAULT_PTR = {PTR_WIDTH{1'b0}};

//http://billauer.co.il/reg_fifo.html

data_type mem[0:MAX_DEPTH-1];

bit [PTR_WIDTH:0] write_ptr = 0;
bit [PTR_WIDTH:0] read_ptr = 0;

wire empty_int = (write_ptr[PTR_WIDTH] ==  read_ptr[PTR_WIDTH]);
wire full_or_empty = (write_ptr[PTR_WIDTH-1:0] == read_ptr[PTR_WIDTH-1:0]);

assign producer.is_full = full_or_empty & ~empty_int;
wire is_empty = full_or_empty & empty_int;

wire [PTR_WIDTH:0] diff = write_ptr - read_ptr;
assign producer.is_almost_full = diff > (MAX_DEPTH - ALMOST_FULL_THRESHOLD);

bit dout_valid = 0;
wire fifo_rd_en = ~is_empty && (!dout_valid || consumer.read_en);
assign consumer.is_empty = ~dout_valid;
assign consumer.data_valid = dout_valid;

data_type dout;

assign consumer.data = dout;

always_ff @(posedge sysclk) begin
    if(reset) begin
        dout_valid <= 1'b0;
    end else begin
        if (fifo_rd_en)
            dout_valid <= 1'b1;
        else if (consumer.read_en)
            dout_valid <= 1'b0;
    end
end

always_ff @(posedge sysclk) begin
    if(reset) begin
        write_ptr <= DEFAULT_PTR;
        read_ptr <= DEFAULT_PTR;
        dout <= '{default:'0};
    end else begin
        if (producer.write_en  && ~producer.is_full ) begin
            mem[write_ptr[PTR_WIDTH-1 -:PTR_WIDTH]] <= producer.data;
            write_ptr <= write_ptr + 1;
        end

        if (fifo_rd_en) begin
            dout <= mem[read_ptr[PTR_WIDTH-1 -:PTR_WIDTH]];
            read_ptr <= read_ptr + 1;
        end
    end
end

endmodule
Interfaces are defined as follows:
Code: [Select]
//  Interface: elastic_fifo_write_intf
//
interface elastic_fifo_write_intf #(
    parameter type data_type = logic
);
    data_type data;
    logic write_en;
    logic is_full;
    logic is_almost_full;

modport producer(input is_full, is_almost_full, output data, write_en);
modport fifo(output is_full, is_almost_full, input data, write_en);

endinterface: elastic_fifo_write_intf

//  Interface: elastic_fifo_read_intf
//
interface elastic_fifo_read_intf #(
    parameter type data_type = logic
);
    data_type data;
    logic data_valid;
    logic read_en;
    logic is_empty;

modport consumer(input is_empty, data, data_valid, output read_en);
modport fifo(output is_empty, data, data_valid, input read_en);

endinterface: elastic_fifo_read_intf 
I use it in fetch module to store code data that comes from the memory, it stores 130-bit wide structures (64bit address, 64bit data, 2bit epoch). Works fine at least at 181 MHz clock (it's limited by a different block of CPU, so not sure how much higher can it go) on Artix-7/Spartan-7 fabric. On Kintex-7 fabric the same code goes above 300 MHz IIRC.
« Last Edit: July 08, 2020, 04:26:03 am by asmi »
 

BrianHG (Topic starter)
Get final version 3.1 here: V3.1 with 7 FIFO word mode and underflow/overflow protection.


Obsolete 7 word version:
Code: [Select]

// *****************************************************************
// *** FIFO_7word_0_latency.sv V2.0, July 8, 2020
// ***
// *** This 7 word + 1 reserve word Zero latency FIFO
// *** with look ahead data and status flags outputs was
// *** written by Brian Guralnick.
// ***
// *** See the included 'FIFO_0_latency.png' simulation for functionality.
// ***
// *** Using System Verilog code which only uses synchronous logic.
// *** Well commented for educational purposes.
// *****************************************************************

module FIFO_7word_0_latency #(

//*************************************************************************************************************************************
parameter bits = 8,                 // sets the width of the fifo
parameter zero_latency = 1,         // When set to 1, if the FIFO is empty, the data_out and fifo_not_empty flag will
                                    // immediately reflect the state of the inputs data_in and shift_in, 0 clock cycle delay.
                                    // When set to 0, like a normal synchronous FIFO, it will take 1 clock cycle before the
                                    // fifo_not_empty flag goes high and the data_out will have a copy of the data_in after a 'shift_in'.

                                    // Enabling the overflow/underflow protection features may lower top FMAX.
parameter overflow_protection  = 0, // Prevents write position increment and writing if the fifo is full past the 1 extra reserve word
parameter underflow_protection = 0  // Prevents read position increment if the fifo is empty
//*************************************************************************************************************************************

)(

input wire clk,                  // CLK input
input wire reset,                // reset FIFO

input  wire shift_in,            // load a word into the FIFO.
input  wire shift_out,           // shift data out of the FIFO.
input  wire [bits-1:0] data_in,  // data word input.

output wire fifo_not_empty,      // High when there is data available.
output wire fifo_full,           // High when the FIFO is 7 words full.
                                 // *** Note the FIFO has 1 extra word of free space after
                                 // *** the fifo_full flag goes high, so it's actually an 8 word FIFO.

output wire [bits-1:0] data_out  // FIFO data word output
);

reg  [bits-1:0] fifo_data_reg[7:0] ;                   // FIFO memory
reg  [2:0]      fifo_wr_pos, fifo_rd_pos;              // read and write memory pointers
reg  [15:0]     fifo_position = 16'b1111111000000001 ; // The fifo empty location

wire [3:0]      read_pointer ;                        // read data mux pointer
wire [bits-1:0] fifo_data_mux[15:0] ;                 // the mux data inputs

assign          fifo_data_mux[7:0] = fifo_data_reg[7:0]; // mux selection from fifo register data.
assign          fifo_data_mux[8]   = data_in ;           // mux selection from data input
assign          fifo_data_mux[9]   = data_in ;           // mux selection from data input
assign          fifo_data_mux[10]  = data_in ;           // mux selection from data input
assign          fifo_data_mux[11]  = data_in ;           // mux selection from data input
assign          fifo_data_mux[12]  = data_in ;           // mux selection from data input
assign          fifo_data_mux[13]  = data_in ;           // mux selection from data input
assign          fifo_data_mux[14]  = data_in ;           // mux selection from data input
assign          fifo_data_mux[15]  = data_in ;           // mux selection from data input

assign   read_pointer[2:0] =  fifo_rd_pos[2:0];                  // address the 8 fifo words
assign   read_pointer[3]   = ~fifo_position[1] && zero_latency ; // when high, address the data_in on the top 8 mux positions.
assign   data_out          =  fifo_data_mux[read_pointer];       // mux select the data output.

//assign data_out        = ( ~fifo_position[1] && zero_latency ) ?  data_in : fifo_data_reg[fifo_rd_pos[2:0]] ; // When FIFO is empty and parameter zero_latency = 1,
                                                                                       // bypass the memory registers and wire the output directly to the input.
                                                                                       // Otherwise show the correct FIFO memory register once it is latched.

assign fifo_not_empty    =  fifo_position[1] || (zero_latency && shift_in);            // While the FIFO is empty and zero_latency = 1, directly wire
                                                                                       // 'fifo_not_empty' output to the 'shift_in' input.  Otherwise,
                                                                                       // only set high once there is data in the FIFO.
assign fifo_full         =  fifo_position[7] ;                                         // High when 7 or 8 words are stored.

wire   shift_in_protect, shift_out_protect;
assign shift_in_protect  =  shift_in  && ( ~fifo_position[8] || (overflow_protection  !=1) );// Do not allow a shift_in if the FIFO is full including the eighth (reserve) word
assign shift_out_protect =  shift_out && ( fifo_not_empty    || (underflow_protection !=1) );// Do not allow a shift_out if the FIFO is empty

always @ (posedge clk) begin

if (reset) begin

fifo_data_reg[0] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[1] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[2] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[3] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[4] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[5] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[6] <= 0 ;  // clear the FIFO's memory contents
fifo_data_reg[7] <= 0 ;  // clear the FIFO's memory contents

fifo_rd_pos       <= 3'd0;                  // reset the FIFO memory counter
fifo_wr_pos       <= 3'd0;                  // reset the FIFO memory counter
fifo_position     <= 16'b1111111000000001;  // The fifo empty location

end else begin

            if (  shift_in_protect && ~shift_out_protect ) fifo_position[15:0] <= {fifo_position[14:0],fifo_position[15]  }; // Rotate the FIFO position left
       else if ( ~shift_in_protect &&  shift_out_protect ) fifo_position[15:0] <= {fifo_position[0]   ,fifo_position[15:1]}; // Rotate the FIFO position right

                 if ( shift_in_protect  ) begin
                      fifo_wr_pos                     <= fifo_wr_pos + 1'b1 ;
                      fifo_data_reg[fifo_wr_pos[2:0]] <= data_in ;
                      end
                 if ( shift_out_protect ) begin
                      fifo_rd_pos                     <= fifo_rd_pos + 1'b1 ;
                      end
       
   end // ~reset


end // always @ (posedge clk) begin
endmodule




CycloneIV-8 3word x 36bit FMAX 245MHz, 7word x 36bit FMAX 222MHz.
« Last Edit: July 12, 2020, 02:35:32 am by BrianHG »
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3143
  • Country: ca
The performance loss will not be seen until you put it into operation. All the logic which the writer uses to produce shift_in and data_in, all the logic within fifo which produces fifo_not_empty and data_out, all the logic the reader uses to analyze fifo_not_empty, process data and set shift_out, will have to be evaluated consecutively during the same clock cycle. Thus, it may be fast when writer and reader are simple test entities, but if they become more complex, the max clock frequency will drop.
 

Offline Sal Ammoniac

  • Super Contributor
  • ***
  • Posts: 1668
  • Country: us
SystemVerilog, but you're still using antique reg and wire types and not using always_ff? Why?
Complexity is the number-one enemy of high-quality code.
 
The following users thanked this post: BrianHG

Offline BrianHGTopic starter

  • Super Contributor
  • ***
  • Posts: 7727
  • Country: ca
The performance loss will not be seen until you put it into operation. All the logic which the writer uses to produce shift_in and data_in, all the logic within fifo which produces fifo_not_empty and data_out, all the logic the reader uses to analyze fifo_not_empty, process data and set shift_out, will have to be evaluated consecutively during the same clock cycle. Thus, it may be fast when writer and reader are simple test entities, but if they become more complex, the max clock frequency will drop.
Yes, that performance drop, now with V2, is the addition of 1 additional input to the preceding gates which derive the control/enable of your next stage.  And for the data, it is the speed of the mux pass-through.  Depending on your compiler's ability to restructure multiplexers and feed data through them, enabling zero_latency may slow down your FMAX if this FIFO module is the most complex bottleneck part of your design.

Enabling "First Word Fall Through" (FWFT) mode in Intel's internal FIFO has an equivalent performance hit (Worse than my version, but they can go up to kilobyte size FIFOs using ram blocks).  I cannot speak for Xilinx.

Example not empty in zero_latency mode:
fifo_not_empty    =  fifo_position[1] || shift_in;

That is 1 clocked synchronous register output OR'd with the shift_in.  And if you know your compiler, it will push that shift_in back into your previous logic, which would meet the next part of your design just as if you did not have the FIFO and what you would call the shift in went transparently on to the next stage in your design.

Like I said, V2 got rid of all the gates involved in comparing a 3 bit number to a size quantity, then ORing that compare result with your fed shift_in input, which would then feed your next stage.  You cannot go faster than my current V2 design.  The official FWFT FIFOs from Xilinx and Intel aren't as efficient, though my trick to get rid of counters and the associated logic to determine the amount of data inside the FIFO becomes absurd beyond an 8 word FIFO.  Still, with some smart coding, the trick can be implemented in a different way.
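
For beginners following along, the idea of tracking the fill level without a counter and comparator can be sketched roughly as below. This is only an illustration under assumptions: the module name `onehot_occupancy`, the `DEPTH` and `ZERO_LATENCY` parameters, and the plain one-hot encoding are made up for this example and are simpler than the rotating `fifo_position` pattern in the actual listing. The point it shows is that each status flag is a single registered bit, so the only gate after the register is the OR with shift_in.

```systemverilog
// Hypothetical sketch: one-hot occupancy tracker (not the thread's actual module).
module onehot_occupancy #(
    parameter int DEPTH        = 4,  // assumed number of storage words
    parameter bit ZERO_LATENCY = 1   // mirrors the 'zero_latency' idea from the thread
) (
    input  logic clk,
    input  logic reset,
    input  logic shift_in,
    input  logic shift_out,
    output logic not_empty,
    output logic full
);
    // position[n] is high when exactly n words are stored; position[0] means empty.
    logic [DEPTH:0] position;

    always_ff @(posedge clk) begin
        if (reset)
            position <= {{DEPTH{1'b0}}, 1'b1};        // start at "0 words stored"
        else if (shift_in && !shift_out && !position[DEPTH])
            position <= {position[DEPTH-1:0], 1'b0};  // one more word: shift left
        else if (!shift_in && shift_out && !position[0])
            position <= {1'b0, position[DEPTH:1]};    // one less word: shift right
    end

    // Each flag is one register bit; no counter compare is needed.
    assign not_empty = !position[0] || (ZERO_LATENCY && shift_in);
    assign full      = position[DEPTH];
endmodule
```

The cost is one register bit per fill level instead of log2 bits for a counter, which is why this style gets wasteful as the FIFO grows.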
« Last Edit: July 08, 2020, 05:00:49 pm by BrianHG »
 

Offline BrianHGTopic starter

  • Super Contributor
  • ***
  • Posts: 7727
  • Country: ca
SystemVerilog, but you're still using antique reg and wire types and not using always_ff? Why?
Sorry, bad habit.  I started this in regular Verilog, then, wanting to address arrays, .sv just made it easy to add the [x:x] after the registers.

Fixed in version 3.0 here: V3.0 with 7 FIFO word mode and underflow/overflow protection.
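
For anyone wondering what the `logic` / `always_ff` style looks like, here is a minimal sketch of the storage and pointer update from the listing above, restyled. It is not the V3.0 code: the module name `fifo_storage_sketch`, the port list and the `stored_word` output are assumptions made for illustration, and the status flags and zero latency bypass are left out.

```systemverilog
// Hypothetical sketch: storage/pointer update restyled with 'logic' and 'always_ff'.
// Module name and ports are assumptions; this is not the thread's V3.0 source.
module fifo_storage_sketch #(
    parameter int bits = 8
) (
    input  logic            clk,
    input  logic            reset,
    input  logic            shift_in_protect,   // already-qualified write strobe
    input  logic            shift_out_protect,  // already-qualified read strobe
    input  logic [bits-1:0] data_in,
    output logic [bits-1:0] stored_word         // word at the read pointer
);
    logic [bits-1:0] fifo_data_reg [8];         // 'logic' replaces 'reg'
    logic [2:0]      fifo_wr_pos, fifo_rd_pos;

    // always_ff tells the tools this block must infer flip-flops only.
    always_ff @(posedge clk) begin
        if (reset) begin
            for (int i = 0; i < 8; i++) fifo_data_reg[i] <= '0;
            fifo_wr_pos <= '0;
            fifo_rd_pos <= '0;
        end else begin
            if (shift_in_protect) begin
                fifo_data_reg[fifo_wr_pos] <= data_in;
                fifo_wr_pos                <= fifo_wr_pos + 1'b1;
            end
            if (shift_out_protect)
                fifo_rd_pos <= fifo_rd_pos + 1'b1;
        end
    end

    assign stored_word = fifo_data_reg[fifo_rd_pos];
endmodule
```

The practical gain is that a tool can flag a mistake such as an accidental latch or a second driver on the same variable, which plain `reg` and `always` will accept silently.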
« Last Edit: July 11, 2020, 08:10:04 pm by BrianHG »
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3143
  • Country: ca
Depending on your compiler's ability to restructure multiplexers and feed data through them, enabling zero_latency may slow down your FMAX if this FIFO module is the most complex bottleneck part of your design.

There will be entities which use your fifo. They will have their own logic, which may or may not be complex. With zero latency you get the following combinatory chain:

fifo_full -> writer's logic outside of your FIFO -> shift_in -> fifo_not_empty -> reader's logic outside of your FIFO -> shift_out

Without zero latency you get two parallel chains:

fifo_full -> writer's logic outside of your FIFO -> shift_in -> fifo_not_empty

fifo_not_empty -> reader's logic outside of your FIFO -> shift_out

FMax decreases not because you add a mux inside FIFO, but because you combine writer's and reader's logic dealing with FIFO into a single combinatory chain. Without zero latency, the FIFO pipelines this chain into two - you get the data a clock later, but you can run the clock faster. Which one is better depends on the circumstances.

 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2731
  • Country: ca
  The official FWFT FIFOs from Xilinx and Intel aren't as efficient, though my trick to get rid of counters and the associated logic to determine the amount of data inside the FIFO becomes absurd beyond an 8 word FIFO.
I can't speak for Intel, but Xilinx has FWFT mode implemented in silicon, so zero logic is required to use it. And it works in asynchronous mode (when read and write side are clocked by two unrelated clocks) too. What about yours?

Offline BrianHGTopic starter

  • Super Contributor
  • ***
  • Posts: 7727
  • Country: ca
  The official FWFT FIFOs from Xilinx and Intel aren't as efficient, though my trick to get rid of counters and the associated logic to determine the amount of data inside the FIFO becomes absurd beyond an 8 word FIFO.
I can't speak for Intel, but Xilinx has FWFT mode implemented in silicon, so zero logic is required to use it. And it works in asynchronous mode (when read and write side are clocked by two unrelated clocks) too. What about yours?
My design is a single clock FIFO for linking 2 parts of your design together where 1 may build up data or commands faster than the execution unit on the output side.  It should work predictably across Intel, Lattice, Xilinx, or even a custom ASIC you might be considering.  It was not designed to take up any dedicated ram blocks or eat excessive amounts of logic other than the data registers.

A true FWFT mode still needs to pass data from the input to output.  Does Xilinx specify this transmission delay in ns?  What happens when you cross clock domains?

At least with mine, your compiler's reported FMAX means that the data driven through is ready by the next clock cycle.

Q: Is the Xilinx FWFT mode a functional feature regardless of the delay, so that if you use the same CLK in and CLK out, the output data and flags will all be valid before the next clock cycle?  Since you say this is a hardware block, it should have a predictable fixed delay.  I am not questioning whether it is implemented in dedicated hardware or not.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3143
  • Country: ca
Q: Is the Xilinx FWFT mode a functional feature regardless of the delay, so that if you use the same CLK in and CLK out, the output data and flags will all be valid before the next clock cycle?  Since you say this is a hardware block, it should have a predictable fixed delay.  I am not questioning whether it is implemented in dedicated hardware or not.

The FWFT mode is for dual clocks. The writer uses his clock to write to the FIFO. The data will appear on the reader's end without waiting for the reader's read strobe. This adds 1 to the depth of the FIFO. The asynchronous mode requires internal synchronization, so the latency is rather high.

They have a synchronous mode too. It requires a special EN_SYNC flag, so it probably ensures better latency. I don't know what the latency through the synchronous FIFO is.

It is not combinatorial like yours (if I understand Verilog of your V2 correctly), where the reader can process the data and capture them on the very first clock edge - the same clock edge which is used by writer to write the data into the memory.
 
The following users thanked this post: BrianHG

Offline BrianHGTopic starter

  • Super Contributor
  • ***
  • Posts: 7727
  • Country: ca
You understand mine fine.  The combinational logic structure is taken into account by the compiler during compile and for many apps, you can completely shave off a clock cycle.

This can only work with a good FMAX because the data is going through a small 5:1 mux, and the MUX selector is a registered 2 bit counter plus a 1 bit registered fifo_not_empty flag, without any combinational logic gates to derive that MUX selection.  This is unlike my V1, which had an added penalty of almost 1ns when switching the MUX into fall through mode, as seen in my second timing diagram.

Thinking things out further, there is 1 last improvement I can make.  If it improves FMAX, I'll update to V3 code.
« Last Edit: July 08, 2020, 09:04:18 pm by BrianHG »
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3143
  • Country: ca
Thinking things out further, there is 1 last improvement I can make.  If it improves FMAX, I'll update to V3 code.

Sure you can improve your combinatorial delay by a lot. All you need on your critical path (if you can call it that) is a single mux which selects either data_in or some temporary register prefetched from the FIFO's memory on the previous clock. The selector for the mux can also be pre-calculated on the previous clock. So, it's only one logic level you need.

I don't think Fmax is a meaningful term here. The path for Fmax calculation is measured from flop to flop. Both launching and receiving flops are outside of your module. If you just put the registers before and after your FIFO, the path from the in register to the out register will be faster than internal paths inside your module, so the tools will not even try to optimize it. Your Fmax will rather reflect the performance of the internal paths (which start and end within your FIFO). On the other hand, when you connect your FIFO module to other modules (or to package IO pins for that matter, as you have already tried), the in-to-out path will be much slower than your internal paths, so the performance of internal paths will not be important any more. Thus, what you want to optimize is the combinatorial delay from data_in to data_out, not the Fmax of your internal paths.
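
The single-mux arrangement described above could look roughly like the following sketch. It is not code from this thread: the module name `fwft_bypass_sketch`, the `AW` parameter, `head_reg`, `bypass` and the counter-based bookkeeping are all assumptions chosen to keep the example short. What it is meant to show is that the next head word and the mux select are both worked out on the register input side, so only one 2:1 mux sits between data_in and data_out.

```systemverilog
// Hypothetical sketch of the single-mux idea; names and sizes are assumptions.
module fwft_bypass_sketch #(
    parameter int bits = 8,
    parameter int AW   = 2                  // address width, depth = 2**AW words
) (
    input  logic            clk,
    input  logic            reset,
    input  logic            shift_in,
    input  logic            shift_out,
    input  logic [bits-1:0] data_in,
    output logic [bits-1:0] data_out,
    output logic            fifo_not_empty
);
    logic [bits-1:0] mem [2**AW];
    logic [AW-1:0]   wr_pos, rd_pos;
    logic [AW:0]     count;                 // number of stored words
    logic [bits-1:0] head_reg;              // prefetched copy of the oldest stored word
    logic            bypass;                // registered: high while nothing is stored
    logic            do_read, do_write;

    // The only logic between data_in and data_out: a 2:1 mux with a registered select.
    assign data_out       = bypass ? data_in : head_reg;
    assign fifo_not_empty = !bypass || shift_in;

    assign do_read  = shift_out && !bypass;               // consume a stored word
    assign do_write = shift_in && !(bypass && shift_out)  // a bypassed word is never stored
                               && (count != 2**AW || do_read);

    always_ff @(posedge clk) begin
        if (reset) begin
            wr_pos   <= '0;
            rd_pos   <= '0;
            count    <= '0;
            head_reg <= '0;
            bypass   <= 1'b1;
        end else begin
            if (do_write) begin
                mem[wr_pos] <= data_in;
                wr_pos      <= wr_pos + 1'b1;
            end
            if (do_read) rd_pos <= rd_pos + 1'b1;
            count <= count + do_write - do_read;

            // Pre-calculate next cycle's head word and mux select.
            if (bypass) begin
                if (do_write) begin
                    head_reg <= data_in;     // first stored word becomes the head
                    bypass   <= 1'b0;
                end
            end else if (do_read) begin
                if (count == 1) begin
                    head_reg <= data_in;     // only meaningful when do_write is high
                    bypass   <= !do_write;   // last word leaves: back to bypass
                end else begin
                    head_reg <= mem[rd_pos + 1'b1];
                end
            end
        end
    end
endmodule
```

The extra `head_reg` costs one more word of registers, and all of the bookkeeping now lands on internal flop to flop paths, which is the trade being discussed here.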
 

Offline BrianHGTopic starter

  • Super Contributor
  • ***
  • Posts: 7727
  • Country: ca
Thinking things out further, there is 1 last improvement I can make.  If it improves FMAX, I'll update to V3 code.

Sure you can improve your combinatorial delay by a lot. All you need on your critical path (if you can call it that) is a single mux which selects either data_in or some temporary register prefetched from the FIFO's memory on the previous clock. The selector for the mux can also be pre-calculated on the previous clock. So, it's only one logic level you need.

I don't think Fmax is a meaningful term here.....
Agreed with point #1, as I was also going to dedicate separate pre-calculated registers to feed that mux and the fifo_not_empty flag.

As for the FMAX, I did not cheat here.  I did not get Quartus to generate that figure from the simulation of tying the FIFO to IO pins.  Doing so made Quartus give me an FMAX of 372MHz, not the 260MHz I reported where I placed the FIFO in an actual circuit being fed by logic module A, through FIFO, to logic module B.  When compiling with this test fixture, Quartus must take into account the complete combinational path from logic module A  through the FIFO to logic module B ensuring data setup & hold integrity which is why the FMAX suffered to such a degree.  I understand the issue well and made sure my report offered an actual usage case scenario.

The IO pin timing simulation was strictly to evaluate the performance difference, since trying to SignalTap my actual use case scenario post-fitting could not tell me what's going on inside the FPGA fabric: all my net names vanished and were broken down and changed into many signals like multiple oe_product terms and LUTs.
« Last Edit: July 08, 2020, 10:54:33 pm by BrianHG »
 

