Is this expected behavior?
Thanks.
If you are using the multiport module, the read and write channels are on the same CMD_xxx[ # ] bus, and smart cache is enabled, then you should receive the new data as long as there is 1 spare clock between the two. ...
Writes to the DDR3 are held off until either a new write is sent outside the current cached address, or the write cache timer has reached 0 due to no additional writes on that port. The current 'PORT_W_CACHE_TOUT' parameter default is set to 255 CMD_CLKS. ...
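The flush rule described above can be sketched as a behavioral model (Python, not the actual RTL; the class and variable names here are made up for illustration): writes to the cached address just re-arm the timer, and the data only goes out to the DDR3 when a write lands outside the cached address or the timer expires.

```python
# Illustrative behavioral model of the write-cache flush rule -- not the RTL.

class WriteCacheModel:
    def __init__(self, tout=255):          # PORT_W_CACHE_TOUT default
        self.tout = tout
        self.addr = None                   # currently cached write address
        self.timer = 0
        self.flushed = []                  # addresses actually sent to the DDR3

    def write(self, addr):
        if self.addr is not None and addr != self.addr:
            self.flushed.append(self.addr) # write outside cached address: flush
        self.addr = addr
        self.timer = self.tout             # any write re-arms the timeout

    def tick(self):                        # one CMD_CLK with no write on this port
        if self.addr is not None:
            if self.timer == 0:
                self.flushed.append(self.addr)  # timer hit 0: flush to DDR3
                self.addr = None
            else:
                self.timer -= 1
```

With the default of 255, a burst of writes to one address produces a single DDR3 write 255 CMD_CLKs after the last one; a write to a different address flushes immediately.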
You should not see this behavior. Please check that the setup time for the write command is placed ahead of the CMD_CLK. It looks as if my module didn't see your write, or it took the write at this address (see attached photo).
To be sure when simulating and sending commands, try offsetting the commands you send by 1/2 CMD_CLK phase so that you can see clearly what is being accepted during the 'rise' of the source clock.
Also, given the way you are accessing the RAM with the 'write mask' you have set, make sure you have the port width set to 128 bits, otherwise nothing will write: you have only bits 96 through 127 write-enabled.
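To illustrate the write-mask point (a made-up byte-enable sketch, assuming 1 mask bit per byte lane, which is a common convention but not confirmed here): with only bits 96-127 enabled, a 128-bit port writes just the top 4 byte lanes, while a narrower port never sees an enabled lane at all.

```python
# Hypothetical sketch of a per-byte write mask on a wide port.
# Mask bit i enables data bits [8*i+7 : 8*i].

def masked_write(old: int, new: int, mask: int, width_bits: int = 128) -> int:
    result = old
    for byte in range(width_bits // 8):
        if (mask >> byte) & 1:
            lane = 0xFF << (8 * byte)                  # this byte lane
            result = (result & ~lane) | (new & lane)   # take new data here
    return result

# Enabling only byte lanes 12..15 touches data bits 96..127:
mask = 0b1111 << 12
```

On a port narrower than 128 bits, lanes 12-15 simply don't exist, so this mask writes nothing.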
I'm only using one element in the CMD_* array. Relevant parameters are:
- PORT_PRIORITY = '{default:0}
- PORT_READ_STACK = '{default:4}
- PORT_W_CACHE_TOUT = '{default:0}
- PORT_CACHE_SMART = '{default:0}
- PORT_MAX_BURST = '{default:256}
- SMART_BANK = 0
Everything else is the default for the DECA example at 400 MHz.
Warning: if 'PORT_CACHE_SMART' is not set to '{default:1}, then you will be reading old stale data from the last read.
Enabling PORT_CACHE_SMART means that whenever a write has been done, if there is a matching read address cached, that read cache data will immediately reflect what was written to the write cache, even before the write data has been sent to the DDR3. This parameter should always be on unless you are trying to scrounge up 1 last logic cell on a full FPGA, or get that last MHz of FMAX.
Even with the 'PORT_W_CACHE_TOUT = '{default:0}', meaning a write will go out to the DDR3 ASAP, the DDR3 always operates at a delay since there is a ton of setup involved. My controller is trying to prevent unnecessary DDR3 access whenever possible.
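The forwarding behavior described above can be sketched behaviorally (Python; an illustrative model, not the RTL, and the names are invented): with the smart cache on, a read whose address matches the pending write cache returns the fresh write data; with it off, the read cache can keep returning stale data until the write finally reaches the DDR3.

```python
# Illustrative model of PORT_CACHE_SMART read-after-write forwarding.

class PortModel:
    def __init__(self, smart: bool):
        self.smart = smart
        self.ddr3 = {}            # "real" memory contents
        self.wcache = {}          # pending writes not yet sent to the DDR3
        self.rcache = {}          # last data cached per read address

    def write(self, addr, data):
        self.wcache[addr] = data  # held off from the DDR3 for now
        if self.smart and addr in self.rcache:
            self.rcache[addr] = data   # forward into the matching read cache

    def read(self, addr):
        if addr not in self.rcache:
            self.rcache[addr] = self.ddr3.get(addr, 0)  # fill from DDR3
        return self.rcache[addr]
```

A read, a write to the same address, then another read returns the new data with smart=True, but the stale cached value with smart=False.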
In your comment for 'PORT_CACHE_SMART' you list disabling it for memory testing. I wanted to see each request go to the DDR3 without extra logic surrounding it. 'PORT_W_CACHE_TOUT' was disabled for a similar reason. I'm starting my development dumb & slow, then improving it once the basics work.
I enabled the smart cache and it solved the particular test case, however the behavior was unexpected. I'll just leave it on for now.
I'm trying to use the PHY_SEQ only, connected to my custom code. I'm finding that sometimes the CMD_busy signal ends up sticking at 1 and locking up all my upstream logic, but not the downstream logic (your block), which ends up performing the same write/read over and over again. This is all with TOGGLE_CONTROLS = 0.
Could the behavior I'm encountering be because 'CMD_ena' and 'refresh_in_progress' assert at the same time? See the red highlight in pt1.png for that. In pt2.png you can see the busy signal get stuck, with hopefully some useful surrounding info.
Also, I just need to make sure that when TOGGLE_CONTROLS=0, the CMD_ena and CMD_busy signals are analogous to something like AXI-stream tvalid and tready. It seems like TOGGLE_CONTROLS=1 is your preferred style; would it be better to use that for driving the PHY?
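For reference, the TOGGLE_CONTROLS=0 rule as I understand it can be written down the same way the AXI-stream rule is (this is the interpretation being asked about, not a confirmed spec): a command transfers on every clock where CMD_ena is high and CMD_busy is low.

```python
# Assumed TOGGLE_CONTROLS=0 semantics (valid/ready style) -- the
# interpretation under question here, not a confirmed specification.

def accepted_commands(ena_trace, busy_trace):
    """Given per-clock 0/1 traces, return the clock indices where a
    command transfers (ena high while busy low)."""
    return [i for i, (ena, busy) in enumerate(zip(ena_trace, busy_trace))
            if ena and not busy]
```

Under this rule, ena held high through a busy clock transfers exactly once, on the clock where busy drops.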
Ohh, 1 other thing about the refresh. After a power-on reset, or a reset pulse, the DDR3 will run for around 15 milliseconds before the first initial refresh commands come in. This is a one-time thing after power-up and can be seen in some simulations. It does not generate any lost or missing data, as the CMD_BUSY flag will properly run if needed. If no CMD_ENA commands are being sent, a small train of sequential refresh commands may run through, but these additional ones may be interrupted by any CMD_ENA command you send, since after the first one the others are low priority.
// Assert cmd_ena while in the transfer state; drop it once the
// last transfer is accepted by the cmd_ena && !cmd_busy handshake.
if (state)
  cmd_ena <= 1;
if (cmd_ena && !cmd_busy)
  if (last_xfer)
    cmd_ena <= 0;
state <= next_state;
Hey,
After developing my Avalon bridge to wrap your PHY+PLL, I decided some benchmarks were in order to compare against the Altera UniPHY IP.
All results below were obtained with my synthesizable Avalon memory tester, which writes the entire DDR3 with random data and then verifies it. It was configured with a 64-bit data bus and a max burst of 256 to match what the UniPHY IP wanted. The memory tester is controlled via a separate JTAG-Avalon IP, and a TCL script measures how long the transaction takes. Both DDR3 instances were clocked at 300 MHz with a half-rate Avalon interface.
Altera UniPHY:

*** Build Summary ***
Total logic elements : 7,290 / 49,760 ( 15 % )
Total combinational functions : 6,391 / 49,760 ( 13 % )
Dedicated logic registers : 3,779 / 49,760 ( 8 % )
Total memory bits : 14,304 / 1,677,312 ( < 1 % )
Embedded Multiplier 9-bit elements : 0 / 288 ( 0 % )
Total PLLs : 1 / 4 ( 25 % )
Total pins : 65 / 360 ( 18 % )
*** Memory Test ***
/devices/10M50DA(.|ES)|10M50DC@1#1-2#Arrow MAX 10 DECA/(link)/JTAG/(110:132 v1 #0)/phy_0/master
Started memory test
Finished memory test
Microseconds recorded: 1011776
Number of passes : 0x20000000
Number of failures : 0x00000000
Number of ticks : 0x000f700d
Dave's Bridge + BHG PHY/PLL:

*** Build Summary ***
Total logic elements : 6,241 / 49,760 ( 13 % )
Total combinational functions : 3,264 / 49,760 ( 7 % )
Dedicated logic registers : 4,974 / 49,760 ( 10 % )
Total memory bits : 5,792 / 1,677,312 ( < 1 % )
Embedded Multiplier 9-bit elements : 0 / 288 ( 0 % )
Total PLLs : 1 / 4 ( 25 % )
Total pins : 63 / 360 ( 18 % )
*** Memory Test ***
/devices/10M50DA(.|ES)|10M50DC@1#1-2#Arrow MAX 10 DECA/(link)/JTAG/(110:132 v1 #0)/phy_0/master
Started memory test
Finished memory test
Microseconds recorded: 1030280
Number of passes : 0x20000000
Number of failures : 0x00000000
Number of ticks : 0x000fb950
Final throughputs are 506 MB/s for UniPHY and 497 MB/s for your core. It's entirely possible there is some loss of throughput from my bridge having to buffer commands, so I'd be interested in hearing if you've ever done a similar type of test (how much performance does the full controller give over just the PHY+PLL?).
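Those figures can be back-checked from the logged microseconds (a sketch assuming the DECA's 512 MB DDR3 and that the throughput figure counts one full pass of the memory, which matches the quoted numbers):

```python
# Back-of-the-envelope check of the quoted throughput numbers.
# Assumes the DECA's 512 MB of DDR3 and that throughput is counted as
# one full pass of the memory -- assumptions, not confirmed by the test.

MEM_MB = 512  # DECA DDR3 size, assumed

def throughput_mb_s(microseconds: int) -> float:
    """MB per second for one full pass of the memory."""
    return MEM_MB / (microseconds / 1_000_000)

print(round(throughput_mb_s(1_011_776)))  # UniPHY run
print(round(throughput_mb_s(1_030_280)))  # BHG PHY run
```

Both logged times round to the quoted 506 MB/s and 497 MB/s under these assumptions.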
I'm going to call you the winner, based on:
- the UniPHY core often fails timing unless you massage map/fit options into your build and watch the fitter spin for 4x as long
- your core can run faster than 300 MHz
- easier to simulate and include in a design
It got a bit slower with BANK_ROW_ORDER = "BANK_ROW_COL": 1052606 microseconds for the whole RAM. My initial testing was with "ROW_BANK_COL". I'm not sure which is the more appropriate setting for doing only large, sequentially ascending bursts.
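For context on why the ordering matters for large sequential bursts, here is a decode sketch with made-up bit widths (8 banks, 1024 columns, 32768 rows; the real widths depend on the DDR3 part): with ROW_BANK_COL the bank bits sit just above the column bits, so a linear sweep moves to a new bank every time it crosses a column boundary, letting row activates overlap across banks, while BANK_ROW_COL keeps the sweep in one bank through every row.

```python
# Illustrative decode of a linear address under the two orderings.
# Bit widths are invented for the example; real parts differ.

COL_BITS, BANK_BITS, ROW_BITS = 10, 3, 15

def decode(addr: int, order: str):
    col = addr & ((1 << COL_BITS) - 1)
    rest = addr >> COL_BITS
    if order == "ROW_BANK_COL":               # address = {row, bank, col}
        bank = rest & ((1 << BANK_BITS) - 1)
        row = rest >> BANK_BITS
    else:                                     # "BANK_ROW_COL" = {bank, row, col}
        row = rest & ((1 << ROW_BITS) - 1)
        bank = rest >> ROW_BITS
    return bank, row, col
```

Stepping from address 1023 to 1024: under ROW_BANK_COL the sweep moves to the next bank (same row index), so the next activate can overlap the previous bank's precharge; under BANK_ROW_COL the same step opens a new row in the same bank.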