Xilinx/AMD Artix-7 Memory and FIFO Behavior
trossin:
I recently updated from a 2023 version of Vivado to 2024.1 and got complaints when converting projects because of a new memory generator tool. This required manually regenerating all my memories and FIFOs, but things still seem to work. I have found that the Vivado simulation environment is very slow on my 11-year-old laptop, so I have been using Icarus instead on some tiny PWM/UART projects. My goal is to produce high-level behavioral models of the memories and FIFOs so that I can use Icarus for more interesting projects.
I read UG473 "7 Series FPGAs Memory Resources", July 3, 2019 (which seems to be the latest version). That document explains the RAMs and sync FIFOs OK but leaves a lot out about the asynchronous FIFOs. So I coded up some experiments, simulated them with the default simulator, built projects, and observed the behavior with my cheap dsPIC-based logic analyzer, and thought I would share the results in case others are interested. I have built parameterized models that seem to match but want to do more testing before putting them on my web site. Here is what I learned using a xc7a35tcpg236-1 on a Digilent CmodA7 board (punch line: for async FIFOs, use the built-in type to minimize latency!):
-----------------------------------------------------------------------------------------------------
True dual port RAM with WRITE_FIRST mode for both ports a and b (common clock):
If addra==addrb (held) and port a is written (wea=1), the new data will appear on both douta and doutb the next cycle.
If instead the write is to port b (web=1), then the next cycle doutb will have the new data but douta will keep the previous data
and will update a cycle later (2 cycle delay). The access collision table says the port that is not written is undefined, so I guess this is OK.
If addra==addrb, dina!=dinb, and both ports write (wea=web=1), the memory is not written (the write is
ignored) and both output ports keep their previous values. Again, the access collision table says memory and
outputs are undefined, so this meets the spec.
I see this in simulation and on hardware (7A35T).
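For reference, here is a minimal behavioral sketch (my own naming, not the generated core) of a common-clock true dual-port RAM with WRITE_FIRST on both ports, the kind of starting point an Icarus-friendly model needs. It reproduces the single-port-write cases above (including the 2-cycle update on the non-writing port, which falls out of the non-blocking read), but it does not guarantee the same result as hardware for the both-ports-write collision, which UG473 leaves undefined anyway.

    module tdp_ram_wf #(
        parameter WIDTH = 8,
        parameter DEPTH = 1024
    )(
        input                       clk,
        input                       wea, web,
        input  [$clog2(DEPTH)-1:0]  addra, addrb,
        input  [WIDTH-1:0]          dina, dinb,
        output reg [WIDTH-1:0]      douta, doutb
    );
        reg [WIDTH-1:0] mem [0:DEPTH-1];

        // Port A, WRITE_FIRST: a write shows up on douta the next cycle.
        always @(posedge clk) begin
            if (wea) begin
                mem[addra] <= dina;
                douta      <= dina;
            end else
                douta <= mem[addra];   // non-blocking read: old data first, new data a cycle later
        end

        // Port B, WRITE_FIRST
        always @(posedge clk) begin
            if (web) begin
                mem[addrb] <= dinb;
                doutb      <= dinb;
            end else
                doutb <= mem[addrb];
        end
    endmodule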
Block RAM FIFOs (Common Clock):
Standard (requires a read to get the first word written after the FIFO goes empty):
One cycle of latency for an empty FIFO.
The empty output cannot be used directly as a valid signal. Valid can be created from ~empty & rd_en
delayed one clock cycle (see the snippet below), or by asking the generator to create a valid output.
For a 16-entry FIFO, data_count is just 4 bits, so the true data count is {full, data_count[3:0]}.
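A minimal sketch of that valid generation, assuming a common clock named clk and the core's rd_en/empty signals:

    reg rd_valid = 1'b0;
    // Register the accepted read so rd_valid lines up with dout one cycle later.
    always @(posedge clk)
        rd_valid <= rd_en & ~empty;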
First Word Fall Through (the first word written after the FIFO goes empty shows up at the output without a read):
Adds an extra cycle of latency for an empty FIFO, for a total of two cycles.
The max data count is two more than the FIFO size, so a FIFO size of 16 has a 5-bit data_count (0 to 18).
Both types:
Depths are limited to 2^N where N is 4 to 17 for 7A35T.
prog_full asserts one cycle after data_count equals the threshold and clears one cycle
after data_count equals threshold-1 (modeled in the sketch below).
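That prog_full timing can be modeled as a plain registered compare. This is a sketch of the observed behavior only, not the actual core logic; data_count and PROG_FULL_THRESH stand in for whatever names your generated core and design use:

    reg prog_full = 1'b0;
    // Asserts the cycle after data_count reaches the threshold,
    // clears the cycle after it drops to threshold-1.
    always @(posedge clk)
        prog_full <= (data_count >= PROG_FULL_THRESH);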
Builtin FIFOs (Common Clock):
Seems the same as Block RAM FIFOs but the depths are limited to 2^N where N is 9 to 17 for 7A35T.
No data counts are supported.
Reading from an empty Standard FIFO will put the FIFO into an unknown state in simulation which can only
be fixed with a reset. Drive the core's rd_en port with ~empty & rd_en to protect against this (see the snippet below).
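The gating is just one AND gate; a sketch, assuming user-side rd_en and the core's empty output:

    // Never present a read to the core while it is empty, so the Built-in
    // FIFO cannot be driven into its unrecoverable unknown state.
    wire fifo_rd_en = rd_en & ~empty;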
Shift Register FIFOs (Common Clock):
Same as a Block RAM FIFO but uses flip-flops instead of Block RAM. Only Standard mode is allowed.
First Word Fall Through is not supported. Depth is limited to 2^N where N is 4 to 17.
Distributed RAM FIFOs (Common Clock):
Same as a Block RAM FIFO but uses distributed (LUT) RAM instead of Block RAM.
Async Block RAM FIFOs (Independent Clocks):
Standard:
Behaves similarly to the synchronous Standard FIFO, except that with 2 synchronization stages
the latency is 1 write clock plus 5 read clocks after the first write is clocked in (a measurement
testbench sketch follows this section). Also, selecting a size of 16 results in a FIFO with a depth of 15.
First Word Fall Through:
Behaves similarly to the sync version, except that with 2 synchronization stages the latency
is 1 write clock plus 6 read clocks after the first write; like the sync version, it adds an extra
cycle compared to the Standard version. Selecting a depth of 16 results in a FIFO of
depth 17, but the read and write counts do not have an extra bit and only go from 0 to 15, with
full asserted when the write count reaches 15. There are output or input storage registers that
hold the extra two elements. If a programmable full threshold is used and set to 12, the
prog_full output asserts the same cycle the write count reaches 10, presumably because
the write count does not include the extra 2 storage registers. The prog_full output negates
one cycle after the write count drops to 9. This is different behavior from the sync/common-clock FIFO.
The wr_data_count increases one full cycle after wr_en is captured by the rising edge of the clock.
The full output asserts right after wr_en is captured, but wr_data_count will be one less than
the full count (15 for a FIFO selected to have a size of 16). Full negates the same cycle
wr_data_count drops one below the full count.
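Here is a skeleton of the kind of testbench used to measure the write-to-not-empty latency (run against the generated core in the Vivado simulator, or against a behavioral model in Icarus). The DUT module name, instance ports, and clock periods are placeholders; the port names shown are the usual FIFO Generator defaults, so adjust to whatever your core actually exposes:

    `timescale 1ns/1ps
    module tb_async_fifo_latency;
        reg        wr_clk = 0, rd_clk = 0, rst = 1;
        reg        wr_en  = 0, rd_en  = 0;
        reg  [7:0] din    = 8'h00;
        wire [7:0] dout;
        wire       full, empty;
        integer    rd_cycles;

        always #5 wr_clk = ~wr_clk;   // 100 MHz write clock (placeholder)
        always #7 rd_clk = ~rd_clk;   // ~71 MHz read clock (placeholder)

        // Placeholder DUT: substitute your generated core or behavioral model.
        fifo_async_16x8 dut (
            .rst(rst), .wr_clk(wr_clk), .rd_clk(rd_clk),
            .din(din), .wr_en(wr_en), .full(full),
            .dout(dout), .rd_en(rd_en), .empty(empty)
        );

        initial begin
            repeat (20) @(posedge wr_clk);      // hold reset, then let rst_busy clear
            rst = 0;
            repeat (20) @(posedge wr_clk);

            @(posedge wr_clk);                  // clock in a single write
            din   <= 8'hA5;
            wr_en <= 1'b1;
            @(posedge wr_clk);
            wr_en <= 1'b0;

            rd_cycles = 0;                      // count read clocks until empty drops
            while (empty) begin
                @(posedge rd_clk);
                #1;                             // let non-blocking updates settle
                rd_cycles = rd_cycles + 1;
            end
            $display("empty deasserted after %0d read clock(s)", rd_cycles);
            $finish;
        end
    endmodule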
Async Distributed RAM FIFOs (Independent Clocks):
Standard and First Word Fall Through:
Same as Block RAM but async reset does not generate wr_rst_busy nor rd_rst_busy outputs.
Async Builtin FIFOs (Independent Clocks):
Standard and First Word Fall Through:
Same as the Block RAM FIFO except:
-Latency from a write into an empty FIFO to data valid is 0 write clocks plus 4 read clocks, which is much faster
than the non-built-in types.
-Async reset does not generate wr_rst_busy nor rd_rst_busy outputs.
-Minimum size is 512.
-Data counts are not supported.
-The clock frequencies have to be specified when using full/empty thresholds, and the threshold limits
change based on the frequencies used.
-The actual FIFO size of the Standard FIFO is not one less than requested; it is the requested size.
Reading from an empty Standard or FWFT FIFO will put the FIFO into an unknown state which can only be
fixed with a reset. Drive the core's rd_en port with ~empty & rd_en to protect against this (same gating as the snippet above).
Async latency (W = write clock cycles, R = read clock cycles):
             Standard    First Word Fall Through
Block RAM    1W + 5R     1W + 6R
Dist RAM     1W + 5R     1W + 6R
Built In     0W + 4R     0W + 4R
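As an illustration of the modeling approach (not the parameterized models I will eventually post), here is a minimal Standard-mode sketch that approximates the 1W + 5R empty latency with a plain read-clock delay chain on the write pointer instead of the real gray-code synchronizers. Module and parameter names are mine, and it does not reproduce the depth-15 quirk of the real Block RAM Standard async FIFO:

    module async_fifo_std_model #(
        parameter WIDTH = 8,
        parameter DEPTH = 16,   // power of two
        parameter R_LAT = 5     // read clocks before a new write is visible
    )(
        input                  wr_clk,
        input                  wr_en,
        input  [WIDTH-1:0]     din,
        output                 full,
        input                  rd_clk,
        input                  rd_en,
        output reg [WIDTH-1:0] dout,
        output                 empty
    );
        localparam AW = $clog2(DEPTH);

        reg [WIDTH-1:0] mem [0:DEPTH-1];
        reg [AW:0]      wr_ptr = 0;   // one extra bit so full and empty can be told apart
        reg [AW:0]      rd_ptr = 0;

        // Write domain: the "1W" part of the latency is simply wr_ptr being registered on wr_clk.
        always @(posedge wr_clk)
            if (wr_en && !full) begin
                mem[wr_ptr[AW-1:0]] <= din;
                wr_ptr              <= wr_ptr + 1'b1;
            end

        // Read domain: delay the write pointer R_LAT read clocks to mimic the synchronizer latency.
        reg [AW:0] wr_ptr_dly [1:R_LAT];
        integer    i;
        initial for (i = 1; i <= R_LAT; i = i + 1) wr_ptr_dly[i] = 0;

        always @(posedge rd_clk) begin
            wr_ptr_dly[1] <= wr_ptr;
            for (i = 2; i <= R_LAT; i = i + 1)
                wr_ptr_dly[i] <= wr_ptr_dly[i-1];
        end

        assign empty = (wr_ptr_dly[R_LAT] == rd_ptr);

        // Standard (non-FWFT) read: dout updates one cycle after rd_en.
        always @(posedge rd_clk)
            if (rd_en && !empty) begin
                dout   <= mem[rd_ptr[AW-1:0]];
                rd_ptr <= rd_ptr + 1'b1;
            end

        // The real core crosses rd_ptr into the write domain the same way;
        // here it is used directly, which is optimistic on full deassertion.
        assign full = (wr_ptr == {~rd_ptr[AW], rd_ptr[AW-1:0]});
    endmodule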
asmi:
I have never used this tool in any of my projects. Instead I use (and recommend to everyone) inference whenever I can, and Xilinx parameterized macros (XPMs) when I need something that can't be inferred directly (like FIFOs or CDC primitives). The latter are described in UG953 for 7 series devices. All of them have simulation models shipped with Vivado. The advantage of using these instead of the wizard is that everything is defined directly in HDL, so making changes (or even making those parameters parameterizable) is trivial.
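As an example of what that looks like, here is a trimmed xpm_fifo_sync instantiation typed from memory of the UG953 template; remaining parameters are left at their defaults, clk/rst/wr_en/rd_en/din come from the surrounding design, and you should pull the authoritative template from Vivado's Language Templates before relying on it:

    wire       full, empty, wr_rst_busy, rd_rst_busy;
    wire [7:0] dout;

    xpm_fifo_sync #(
        .FIFO_MEMORY_TYPE ("block"),   // "auto", "block", or "distributed"
        .FIFO_WRITE_DEPTH (16),
        .WRITE_DATA_WIDTH (8),
        .READ_DATA_WIDTH  (8),
        .READ_MODE        ("fwft"),    // "std" or "fwft"
        .FIFO_READ_LATENCY(0)          // 0 is required for fwft mode
    ) u_fifo (
        .wr_clk        (clk),
        .rst           (rst),
        .wr_en         (wr_en),
        .din           (din),
        .full          (full),
        .rd_en         (rd_en),
        .dout          (dout),
        .empty         (empty),
        .wr_rst_busy   (wr_rst_busy),
        .rd_rst_busy   (rd_rst_busy),
        .sleep         (1'b0),
        .injectsbiterr (1'b0),
        .injectdbiterr (1'b0)
    );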
ejeffrey:
++ on using XPMs for core functions.
We just went through an issue where we had a working design using the Xilinx block memory generator wizard, but it wasn't meeting timing at the desired frequency. So we added a pipeline delay to the memory and the whole thing broke. Not only did it break, but synthesis and the simulation models broke in different ways. After being jerked around by Xilinx support for weeks we got escalated to someone who just told us that the block memory generator is buggy, has always been buggy, and they have no intention of fixing it. In fact, they say they have removed the block memory generator tool from their Versal FPGAs. They said to just use the XPM macros, which indeed worked fine.
It's pretty unbelievable to me that they would ship such a broken tool for so long, but that is straight from Xilinx. There is a warning buried in the documentation not to use the BMG with UltraRAM, but we saw (different!) problems with both BRAM and UltraRAM.