Topic: Home made SystemVerilog 3 word, Zero latency FIFO, documented for beginners.


Online asmi

  • Super Contributor
  • ***
  • Posts: 2797
  • Country: ca
I think what he's trying to say is that since you don't have a register right at the input and output, the Fmax is meaningless, because you can have over 9000 logic levels on the input and over 9000 at the output, and this will send your Fmax down the toilet. If you had your inputs and outputs registered (which is the gold standard I try to follow whenever I can), then your module's combinatorial paths would not be affected in any way by whatever connects to your module, because all of them would be 100% internal to the module.

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3246
  • Country: ca
Quote from: BrianHG
As for the FMAX, I did not cheat here.

Sorry, I didn't mean to accuse you of cheating. I think the opposite concern is more important. As you improve your data_in to data_out path, you will get to the point where the Fmax will not show any further improvements because other things inside the FIFO (such as fetching from memory/array) may be slower than the path you're working on.
 

Online BrianHG (Topic starter)

  • Super Contributor
  • ***
  • Posts: 8088
  • Country: ca
Quote from: asmi
If you'd have inputs and outputs registered (which is the gold standard I'm trying to follow whenever I can), then your module's combinatorial paths are not going to be affected in any way by whatever connects to your module because all of them are going to be 100% internal to the module.

If you want that true 0 latency un-clocked fall through, no matter what I do, the data coming into my module needs to get out.  The fastest way I can implement this is by narrowing down that side to an AB mux between data_in and a registered fifo ram word.  The same goes for fifo_not_empty, though the question here is what's faster, using an AB mux, or a single 2 input OR gate.  Everything else in my design is already registered.  My current output is a 5:1 mux which selects 1 of the 4 registered fifo memory words or has the data_in fall through to the data_out.  I'll be improving that to an AB mux design which only selects between 1 registered word and the data_in.  This change will also vastly improve the 7 word fifo.

Even with a so-called hardware implemented FIFO, that fall through word needs to get from in to out somehow, whether through a static ram cell or a mux selection at the output.  Even if that mux is implemented as part of the dedicated fifo hardware, it still incurs a delay.

The purpose of my fifo is not to create a bridge between 2 clock domains, it's to be small and fast, all within 1 system clock domain.  Now, I do have another 4 word FIFO designed to switch between 2 clock domains and avoid metastability issues with only 4 words, but that is a different thread topic.
« Last Edit: July 09, 2020, 03:15:20 am by BrianHG »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Possibly dumb question.... Did you consider in the original code that:
Code: [Select]
assign fifo_full      = (fifo_size >= 3'd3);

Could be:

Code: [Select]
assign fifo_full      = (fifo_size >= 3'd3) || shift_out;

...as when the consumer is pulling data from the FIFO you should always be able to insert a new item?

Also, it occurred to me that with such a  small FIFO you could consider just a 6-input lookup table finite state machine - four bits of state with the shift_in and shift_out signals indexing a table of the next state (so 4 x LUT6s) and seven or so LUT4s to MUX control signals and outputs, along with the regs to hold the data? With a little bit of care in selecting the state so that two of the bits corresponded with the read index it seems could be implemented in two levels of logic on a LUT6 architecture.
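
A rough sketch of what that lookup-table idea could look like for the 3 word case is below. The module name `fifo_ctrl_lut6`, the state packing `{count, rd_idx}`, the derived `wr_idx`, and the choice to simply hold state on an illegal push or pop are all assumptions made for illustration; none of it is taken from BrianHG's code. The point is only that every bit of `next_state` depends on the 4 state bits plus shift_in and shift_out, i.e. 6 inputs, which is what should let a synthesizer map it to one LUT6 per bit.

```systemverilog
// Hypothetical control-only sketch: 4 state bits + 2 inputs -> 4 next-state bits.
module fifo_ctrl_lut6 (
    input  logic       clk,
    input  logic       reset,
    input  logic       shift_in,
    input  logic       shift_out,
    output logic [1:0] rd_idx,   // read index into a 4-entry register array
    output logic [1:0] wr_idx,   // write index, derived as rd_idx + count
    output logic [1:0] count     // number of stored words, 0..3
);

    logic [3:0] state, next_state;   // state = {count, rd_idx}

    always_comb begin : next_state_table
        logic [1:0] c, r;
        c = state[3:2];
        r = state[1:0];
        case ({shift_in, shift_out})
            2'b10:   if (c != 2'd3) c = c + 2'd1;   // push only, ignored when full
            2'b01:   if (c != 2'd0) begin           // pop only, ignored when empty
                         c = c - 2'd1;
                         r = r + 2'd1;
                     end
            2'b11:   if (c != 2'd0) r = r + 2'd1;   // push and pop: count unchanged
            default: ;                              // idle: hold state
        endcase
        next_state = {c, r};
    end

    always_ff @(posedge clk) begin
        if (reset) state <= 4'd0;
        else       state <= next_state;
    end

    assign count  = state[3:2];
    assign rd_idx = state[1:0];
    assign wr_idx = state[1:0] + state[3:2];   // wraps modulo 4

endmodule
```

In this sketch a simultaneous push and pop on an empty FIFO leaves the state untouched, on the assumption that the word falls straight through and never needs storing. Whether that is acceptable depends on how data_out is expected to behave afterwards.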
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 
The following users thanked this post: BrianHG

Online BrianHG (Topic starter)

  • Super Contributor
  • ***
  • Posts: 8088
  • Country: ca
Quote from: hamster_nz
Possibly dumb question.... Did you consider in the original code that:
Code: [Select]
assign fifo_full      = (fifo_size >= 3'd3);

Could be:

Code: [Select]
assign fifo_full      = (fifo_size >= 3'd3) || shift_out;

...as when the consumer is pulling data from the FIFO you should always be able to insert a new item?

Also, it occurred to me that with such a  small FIFO you could consider just a 6-input lookup table finite state machine - four bits of state with the shift_in and shift_out signals indexing a table of the next state (so 4 x LUT6s) and seven or so LUT4s to MUX control signals and outputs, along with the regs to hold the data? With a little bit of care in selecting the state so that two of the bits corresponded with the read index it seems could be implemented in two levels of logic on a LUT6 architecture.

Correction:
Code: [Select]
assign fifo_full      = (fifo_size >= 3'd3) && ~shift_out;

Thanks.  I completely missed that.  With that 1 extra word in my fifo, I was expecting the module feeding the fifo to have a delayed response to the full flag, however, you are absolutely right.

Will be added to V3.
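
For anyone following along, the difference between the two forms of the flag can be checked with a tiny self-checking sketch. Here `fifo_size` is simply driven by hand, and the names `full_or` and `full_and` are made up for the comparison; this is not part of the FIFO itself.

```systemverilog
// Compares the first suggestion (OR) with the corrected form (AND NOT).
module tb_fifo_full_flag;
    logic [2:0] fifo_size;
    logic       shift_out;
    logic       full_or, full_and;

    assign full_or  = (fifo_size >= 3'd3) ||  shift_out;  // first suggestion
    assign full_and = (fifo_size >= 3'd3) && !shift_out;  // corrected form

    initial begin
        // 3 words stored and the consumer is taking one: a slot frees up this cycle.
        fifo_size = 3'd3; shift_out = 1'b1; #1;
        assert (full_and == 1'b0) else $error("AND form should release the producer");
        assert (full_or  == 1'b1) else $error("OR form is expected to stay full here");

        // Only 1 word stored and the consumer is taking it: the OR form wrongly reports full.
        fifo_size = 3'd1; shift_out = 1'b1; #1;
        assert (full_and == 1'b0) else $error("AND form should not be full");
        assert (full_or  == 1'b1) else $error("OR form is expected to report full here");

        // 3 words stored and nobody is reading: both forms report full.
        fifo_size = 3'd3; shift_out = 1'b0; #1;
        assert (full_and == 1'b1) else $error("AND form should be full");
        assert (full_or  == 1'b1) else $error("OR form should be full");

        $display("fifo_full flag checks done");
        $finish;
    end
endmodule
```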

As for your second half, my V3 will use a single 2 input mux for the data_out, which will be the fastest possible.  All other regs have solely a 1 or 2 input source control, while how the 4/8 word array feeds my new 1 register before that final output 2:1 AB mux will be left to the compiler.  The output flags will each be a single logic cell with no additional fanout, other than fifo_not_empty being a 2 input OR gate, with 1 input coming again from a single logic cell with no additional fanout (now for fifo_full as well).

I'm writing it in a way in which the fifo memory still appears as a normal array, but with that array separated from the output by an additional register, like what you see in dual-port rams where you get a performance boost by registering the data outs.

The final 2:1 AB mux will only have 1 registered control logic cell with no additional fanouts.  1 side of the mux, connected to my fifo memory's new register output, will have no additional fanouts on that reg's data output.  The only thing I cannot control is where I'm receiving the data_in from, i.e. the user who is feeding me.
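
As a rough illustration of the output stage described above, a 2:1 AB mux fed by one dedicated register might be sketched as follows. The module name `fifo_ab_mux_out` and the signals `load_head`, `next_head` and `sel_data_in` are hypothetical, and the logic that would generate them from the FIFO pointers is deliberately left out, so treat this as a sketch of the structure only, not as the V3 implementation.

```systemverilog
// Hypothetical output stage: one registered FIFO word and a 2:1 mux to data_out.
module fifo_ab_mux_out #(
    parameter int bits = 8
)(
    input  logic            clk,
    input  logic            load_head,    // assumed strobe: refresh the head register
    input  logic [bits-1:0] next_head,    // assumed: word selected from the FIFO array
    input  logic            sel_data_in,  // assumed: single registered control bit, high while empty
    input  logic [bits-1:0] data_in,
    output logic [bits-1:0] data_out
);

    logic [bits-1:0] head_r;              // the extra register in front of the mux

    always_ff @(posedge clk) begin
        if (load_head) head_r <= next_head;
    end

    // The only combinational logic between data_in and data_out is this 2:1 mux.
    assign data_out = sel_data_in ? data_in : head_r;

endmodule
```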

The end goal is to eliminate every bottleneck possible.  Just drop in my fifo and forget, as there would be no faster way to get that 0 clk first word fall through capability without any detriment to the user's existing design.

As for the finite state machine, if your logic has few enough bits that it will fit, the compiler will convert your logic to a state machine if it can and if that improves design efficiency.
« Last Edit: July 09, 2020, 06:33:48 pm by BrianHG »
 

Online BrianHG (Topic starter)

  • Super Contributor
  • ***
  • Posts: 8088
  • Country: ca
Here is Version 3.1.
My attempt at the AB-MUX trick didn't improve performance, so here is what I've done:
1. Converted the code to proper SystemVerilog, removing the old fashioned Verilog block/wire/reg assignments.
2. Added a parameter 'size7_ena' which will change the FIFO from 3 words to 7 words.
3. Added hamster_nz's patch to the 'fifo_full' flag.

Code: [Select]

// *****************************************************************
// *** FIFO_3word_0_latency.sv V3.1, July 11, 2020
// ***
// *** This 3 word + 1 reserve word Zero latency FIFO
// *** with look ahead data and status flags outputs was
// *** written by Brian Guralnick.
// ***
// *** See the included 'FIFO_0_latency.png' simulation for functionality.
// ***
// *** Using System Verilog code which only uses synchronous logic.
// *** Well commented for educational purposes.
// *****************************************************************

module FIFO_3word_0_latency #(

//*************************************************************************************************************************************
// Parameters are declared in the parameter port list so that 'bits' exists before the ports that use it.
parameter  int bits = 8 ,                // sets the width of the fifo
parameter  bit zero_latency = 1 ,        // When set to 1, if the FIFO is empty, the data_out and fifo_not_empty flag will
                                         // immediately reflect the state of the inputs data_in and shift_in, 0 clock cycle delay.
                                         // When set to 0, like a normal synchronous FIFO, it will take 1 clock cycle before the
                                         // fifo_not_empty flag goes high and the data_out will have a copy of the data_in after a 'shift_in'.

                                         // Enabling the overflow/underflow protection features may lower top FMAX.
parameter  bit overflow_protection  = 0, // Prevents the internal write position increment and writing if the fifo is full past the 1 extra reserve word
parameter  bit underflow_protection = 0, // Prevents the internal read position increment if the fifo is empty
parameter  bit size7_ena            = 0  // Set to 0 for 3 words, set to 1 for 7 words.
//*************************************************************************************************************************************

)(

input  logic clk,                 // CLK input
input  logic reset,               // reset FIFO

input  logic shift_in,            // load a word into the FIFO.
input  logic shift_out,           // shift data out of the FIFO.
input  logic [bits-1:0] data_in,  // data word input.

output logic fifo_not_empty,      // High when there is data available.
output logic fifo_full,           // High when the FIFO is 3 words full.
                                  // *** Note the FIFO has 1 extra word of free space after
                                  // *** the fifo_full flag goes high, so it's actually a 4 word FIFO.

output logic [bits-1:0] data_out  // FIFO data word output
);

localparam int add_words            = size7_ena * 4;

logic  [bits-1:0]           fifo_data_reg[(3 + add_words):0] ; // FIFO memory
logic  [(1 + size7_ena):0]  fifo_wr_pos, fifo_rd_pos;          // read and write memory pointers
logic  [(2 + size7_ena):0]  fifo_words ;                       // The number of words in the FIFO
logic                       fifo_not_empty_r = 1'b0 ;          // The fifo is not empty register
logic                       fifo_full_r      = 1'b0 ;          // The fifo is the normal +3 word full register
logic                       fifo_full_exr    = 1'b0 ;          // The fifo is at the true +4 word full register

logic [(2 + size7_ena):0]   read_pointer ;                          // read data mux pointer
logic [bits-1:0]            fifo_data_mux[(7 + (add_words * 2)):0]; // the mux data inputs
logic                       shift_in_protect, shift_out_protect;


always_comb begin

for (int i=0               ; i<(4 + add_words)       ; i++) fifo_data_mux[i] = fifo_data_reg[i]; // mux selection from fifo register data.
for (int i=(4 + add_words) ; i<(8 + (add_words * 2)) ; i++) fifo_data_mux[i] = data_in ;         // mux selection from data input

read_pointer[(1 + size7_ena):0] =  fifo_rd_pos[(1 + size7_ena):0];                 // address the 4 fifo words
read_pointer[(2 + size7_ena)]   =  !fifo_not_empty_r && zero_latency ;             // when high, address the data_in on the top 4 mux positions.
data_out                        =  fifo_data_mux[read_pointer];                    // mux select the data output.

fifo_not_empty                  =  fifo_not_empty_r || (zero_latency && shift_in); // While the FIFO is empty and zero_latency = 1, directly wire
                                                                                   // 'fifo_not_empty' output to the 'shift_in' input.  Otherwise,
                                                                                   // only set high once there is data in the FIFO.

fifo_full                       =  fifo_full_r && !(shift_out && zero_latency) ;   // Goes high when the FIFO has 3 words in storage and shift_out
                                                                                   // isn't currently being requested while in zero_latency mode.

shift_in_protect  =  shift_in  && ( !(fifo_full_exr && !(shift_out && zero_latency)) || !overflow_protection  ); // Do not allow a shift_in if the FIFO is full past the fourth reserve word
shift_out_protect =  shift_out && (   fifo_not_empty                                 || !underflow_protection ); // Do not allow a shift_out if the FIFO is empty

end // always_comb

always_ff @(posedge clk) begin

if (reset) begin

for (int i=0 ; i<(4 + add_words) ; i++) fifo_data_reg[i] <= 0 ;  // clear the FIFO's memory contents

    fifo_rd_pos      <= 0;  // reset the FIFO read position
    fifo_wr_pos      <= 0;  // reset the FIFO write position
    fifo_words       <= 0;  // The fifo's number of stored words

    fifo_not_empty_r <= 0 ; // The fifo is not empty register
    fifo_full_r      <= 0 ; // The fifo is the normal +3 word full register
    fifo_full_exr    <= 0 ; // The fifo is at the true +4 word full register

end else begin

                if (  shift_in_protect && !shift_out_protect ) begin
                                                                                                  fifo_words       <= fifo_words + 1'b1; // increment the fifo's number of stored words
                                                                /*if (fifo_words==0)*/            fifo_not_empty_r <= 1;
                                                                 if (fifo_words==(2 + add_words)) fifo_full_r      <= 1;
                                                                 if (fifo_words==(3 + add_words)) fifo_full_exr    <= 1;
       end else if ( !shift_in_protect &&  shift_out_protect ) begin
                                                                                                  fifo_words       <= fifo_words - 1'b1; // decrement the fifo's number of stored words
                                                                 if (fifo_words==1)               fifo_not_empty_r <= 0;
                                                                 if (fifo_words==(3 + add_words)) fifo_full_r      <= 0;
                                                                 if (fifo_words==(4 + add_words)) fifo_full_exr    <= 0;
       end

                 if ( shift_in_protect  ) begin
                      fifo_wr_pos                                   <= fifo_wr_pos + 1'b1 ;
                      fifo_data_reg[fifo_wr_pos[(1 + size7_ena):0]] <= data_in ;
                      end
                 if ( shift_out_protect ) begin
                      fifo_rd_pos                                   <= fifo_rd_pos + 1'b1 ;
                      end
       
   end // !reset

end // always_ff
endmodule
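
For beginners wondering how to drop this into a design, an instantiation of the 7 word variant might look like the sketch below. The wrapper name `fifo7_example` and its port names (`wr_en`, `wr_data`, `rd_en`, `rd_data`, `rd_valid`, `wr_full`) are hypothetical, as is the choice of a 16 bit width with both protection options enabled; only the FIFO's own port and parameter names come from the listing above.

```systemverilog
// Hypothetical wrapper showing parameter overrides by name.
module fifo7_example (
    input  logic        clk,
    input  logic        reset,
    input  logic        wr_en,      // assumed producer strobe
    input  logic [15:0] wr_data,
    input  logic        rd_en,      // assumed consumer strobe
    output logic [15:0] rd_data,
    output logic        rd_valid,
    output logic        wr_full
);

    FIFO_3word_0_latency #(
        .bits                 (16),
        .zero_latency         (1'b1),
        .overflow_protection  (1'b1),
        .underflow_protection (1'b1),
        .size7_ena            (1'b1)   // 7 words + 1 reserve word
    ) fifo_i (
        .clk            (clk),
        .reset          (reset),
        .shift_in       (wr_en),
        .shift_out      (rd_en),
        .data_in        (wr_data),
        .fifo_not_empty (rd_valid),
        .fifo_full      (wr_full),
        .data_out       (rd_data)
    );

endmodule
```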



SystemVerilog, but you're still using antique reg and wire types and not using always_ff? Why?
Fixed!  Including 'kfnight's comments in the next post.
« Last Edit: July 12, 2020, 02:42:03 am by BrianHG »
 

Offline kfnight

  • Regular Contributor
  • *
  • Posts: 71
Some last items to embrace SV. Get rid of

Code: [Select]
integer i;
and 1) use int 2) declare it in the for-loop

Code: [Select]
for (int i=(4 + add_words)
Also add types for your parameters. For example you have some boolean parameters (can only be 0 or 1) but they'll probably default to ints if you don't provide a type. So you can have

Code: [Select]
parameter bit overflow_protection  = 0;
and

Code: [Select]
shift_in_protect  =  shift_in  && ( !(fifo_full_exr && !(shift_out && zero_latency)) || !overflow_protection );
Note also that you are mixing logical and bit-wise boolean operators, so all of your ~'s should be !'s.
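
A small self-checking sketch of why the distinction matters is below. On single bit signals `~` and `!` give the same answer, which is presumably why the mixed version still worked, but on anything wider they differ. The testbench name `tb_not_operators` and the value chosen for `v` are just for illustration.

```systemverilog
// Shows that ~ (bit-wise NOT) and ! (logical NOT) differ on multi-bit values.
module tb_not_operators;
    logic [1:0] v;

    initial begin
        v = 2'b01;

        // Bit-wise NOT flips every bit: ~2'b01 is 2'b10, which is non-zero, so it tests as true.
        assert ((~v) === 2'b10) else $error("unexpected ~v");

        // Logical NOT reduces to a single bit: 2'b01 is non-zero, so !v is 1'b0.
        assert ((!v) === 1'b0) else $error("unexpected !v");

        if (~v) $display("~v tests as true  (v = %b, ~v = %b)", v, ~v);
        if (!v) $display("this line is not reached");
        else    $display("!v tests as false (v = %b)", v);

        $finish;
    end
endmodule
```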


 
The following users thanked this post: BrianHG

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Is data_out expected to remain valid when shift_out is not asserted?  (by valid, I mean presenting the last value that passed through the FIFO).

Just wondering if values that shortcut through the FIFO need to be written to the registers at all...
 

Offline ali_asadzadeh

  • Super Contributor
  • ***
  • Posts: 1930
  • Country: ca
Thanks for sharing, your V3.1 would get 410MHz on a Gowin FPGA.

Also it would use these LUT and FF resources. I wonder how fast Quartus or Vivado or ISE would synthesize and place and route this! Gowin is surprisingly fast.

The worst thing about Gowin is the lack of simulation for its IP cores.

ASiDesigner, Stands for Application specific intelligent devices
I'm a Digital Expert from 8-bits to 64-bits
 
The following users thanked this post: BrianHG

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3246
  • Country: ca
Quote from: hamster_nz
Is data_out expected to remain valid when shift_out is not asserted?  (by valid, I mean presenting the last value that passed through the FIFO).

Just wondering if values that shortcut through the FIFO need to be written to the registers at all...

data_in is written into the array anyway (at least how it was in V2). If shift_out is not asserted, next clock you'll get the same data, but from the array. Presumably, BrianHG has tested everything before posting.
 

Online BrianHG (Topic starter)

  • Super Contributor
  • ***
  • Posts: 8088
  • Country: ca
Quote from: hamster_nz
Is data_out expected to remain valid when shift_out is not asserted?  (by valid, I mean presenting the last value that passed through the FIFO).

Just wondering if values that shortcut through the FIFO need to be written to the registers at all...

Quote from: NorthGuy
data_in is written into the array anyway (at least how it was in V2). If shift_out is not asserted, next clock you'll get the same data, but from the array. Presumably, BrianHG has tested everything before posting.

Remember, when starting with an empty FIFO, the data falls through.  On a 'shift_in', the data is committed to the array at the clock and the data_out is automatically swapped from the data_in to the appropriate array word, retaining a solid output.  This is exactly what happens in my simulation in post #1.

If you 'shift_out', it means you have taken the current word and are requesting that the next word be ready by the next clock.  If the FIFO is on its last word, then on the next clock the FIFO will switch back to fall through mode, passing the data_in.

Without any 'shift_in', no data will be written to any FIFO registers.
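
The sequence described above can be sketched as a minimal self-checking testbench. It assumes the V3.1 module is compiled alongside with its default parameters (bits = 8, zero_latency = 1); the testbench name `tb_fall_through`, the 10 time unit clock and the test value 8'hA5 are arbitrary choices, not taken from the simulation in post #1.

```systemverilog
// Minimal sketch: fall-through on an empty FIFO, then hold from the array, then empty again.
module tb_fall_through;
    logic       clk       = 1'b0;
    logic       reset     = 1'b1;
    logic       shift_in  = 1'b0;
    logic       shift_out = 1'b0;
    logic [7:0] data_in   = 8'h00;
    logic [7:0] data_out;
    logic       fifo_not_empty, fifo_full;

    FIFO_3word_0_latency dut (.*);   // default parameters

    always #5 clk = ~clk;

    initial begin
        repeat (2) @(negedge clk);   // hold reset across two rising edges
        reset = 1'b0;

        // Empty FIFO: data_in and shift_in fall straight through with no clock edge.
        data_in = 8'hA5; shift_in = 1'b1; #1;
        assert (fifo_not_empty && data_out == 8'hA5) else $error("no fall-through");

        // After the rising edge the word sits in the array, so data_in may change freely.
        @(negedge clk);
        shift_in = 1'b0; data_in = 8'h00; #1;
        assert (fifo_not_empty && data_out == 8'hA5) else $error("output not held");

        // Take the word: after the next rising edge the FIFO is empty again.
        shift_out = 1'b1;
        @(negedge clk);
        shift_out = 1'b0; #1;
        assert (!fifo_not_empty) else $error("FIFO should be empty");

        $display("fall-through checks done");
        $finish;
    end
endmodule
```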

Yes, we are currently using the FIFO in a functional design right now as well.
« Last Edit: July 12, 2020, 05:56:30 pm by BrianHG »
 

Online BrianHG (Topic starter)

  • Super Contributor
  • ***
  • Posts: 8088
  • Country: ca
Quote from: ali_asadzadeh
Thanks for sharing, your V3.1 would get 410MHz on a Gowin FPGA.

Also it would use these LUT and FF resources. I wonder how fast Quartus or Vivado or ISE would synthesize and place and route this! Gowin is surprisingly fast.

The worst thing about Gowin is the lack of simulation for its IP cores.

To get an FMAX for a typical use scenario, you should bury all the FIFO's I/Os inside D flip-flops running on the same clock.  That 410MHz may drop, as the compiler will now need to consider the combinational path from the D flip-flops feeding the inputs, through the fifo, to the D flip-flops you have at the outputs.

Disabling the 0 latency will give you a true FMAX in almost any scenario.  This means the data still falls through, but the 'shift_in' will require 1 clock to affect the data_out (which will now always be a register) and the fifo_not_empty flag.
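
One way to set up that kind of FMAX measurement is a thin wrapper that registers every input and output around the FIFO, sketched below. The wrapper name `FIFO_fmax_wrapper` and its internal signal names are hypothetical. Note that the added registers delay the handshake by a clock in each direction, so this is meant only as a timing analysis harness, not as a usable FIFO.

```systemverilog
// Hypothetical timing harness: every FIFO port sits between flip-flops on the same clock.
module FIFO_fmax_wrapper #(
    parameter int bits = 8
)(
    input  logic            clk,
    input  logic            reset,
    input  logic            shift_in,
    input  logic            shift_out,
    input  logic [bits-1:0] data_in,
    output logic            fifo_not_empty,
    output logic            fifo_full,
    output logic [bits-1:0] data_out
);

    logic            reset_r, shift_in_r, shift_out_r;
    logic [bits-1:0] data_in_r;
    logic            not_empty_c, full_c;
    logic [bits-1:0] data_out_c;

    always_ff @(posedge clk) begin
        // input registers
        reset_r        <= reset;
        shift_in_r     <= shift_in;
        shift_out_r    <= shift_out;
        data_in_r      <= data_in;
        // output registers
        fifo_not_empty <= not_empty_c;
        fifo_full      <= full_c;
        data_out       <= data_out_c;
    end

    FIFO_3word_0_latency #(
        .bits (bits)
    ) fifo_i (
        .clk            (clk),
        .reset          (reset_r),
        .shift_in       (shift_in_r),
        .shift_out      (shift_out_r),
        .data_in        (data_in_r),
        .fifo_not_empty (not_empty_c),
        .fifo_full      (full_c),
        .data_out       (data_out_c)
    );

endmodule
```

With this in place the timing analyzer sees the data_in to data_out fall-through path as a register to register path, which is the case the reported FMAX should then reflect.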

« Last Edit: July 12, 2020, 03:21:17 pm by BrianHG »
 

