Is this expected behavior?
Thanks.
If you are using the multiport module, the read and write channels are on the same CMD_xxx[ # ] bus, and smart cache is enabled, then you should receive the new data as long as there is 1 spare clock between the two. ...
Writes to the DDR3 are held off until either a new write is sent outside the current cached address, or the write cache timer has reached 0 due to no additional writes on that port. The current 'PORT_W_CACHE_TOUT' parameter default is set to 255 CMD_CLKS. ...
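The flush rule described above can be sketched as a behavioral model (Python, not the actual RTL; the class and variable names here are made up for illustration): writes to the cached address just re-arm the timer, and the data only goes out to the DDR3 when a write lands outside the cached address or the timer expires.

```python
# Illustrative behavioral model of the write-cache flush rule -- not the RTL.

class WriteCacheModel:
    def __init__(self, tout=255):          # PORT_W_CACHE_TOUT default
        self.tout = tout
        self.addr = None                   # currently cached write address
        self.timer = 0
        self.flushed = []                  # addresses actually sent to the DDR3

    def write(self, addr):
        if self.addr is not None and addr != self.addr:
            self.flushed.append(self.addr) # write outside cached address: flush
        self.addr = addr
        self.timer = self.tout             # any write re-arms the timeout

    def tick(self):                        # one CMD_CLK with no write on this port
        if self.addr is not None:
            if self.timer == 0:
                self.flushed.append(self.addr)  # timer hit 0: flush to DDR3
                self.addr = None
            else:
                self.timer -= 1
```

With the default of 255, a burst of writes to one address produces a single DDR3 write 255 CMD_CLKs after the last one; a write to a different address flushes immediately.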
You should not see this behavior. Please check that the setup time for the write command is placed ahead of the CMD_CLK. It looks as if my module didn't see your write, or it took the write at this address (see attached photo).
To be sure when simulating and sending commands, try offsetting the commands you send by 1/2 CMD_CLK phase so that you can see clearly what is being accepted during the 'rise' of the source clock.
Also, given the way you are accessing the RAM with the 'write mask' you have set, make sure you have the port width set to 128 bits, otherwise nothing will write: you have only bits 96 through 127 write-enabled.
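To illustrate the write-mask point (a made-up byte-enable sketch, assuming 1 mask bit per byte lane, which is a common convention but not confirmed here): with only bits 96-127 enabled, a 128-bit port writes just the top 4 byte lanes, while a narrower port never sees an enabled lane at all.

```python
# Hypothetical sketch of a per-byte write mask on a wide port.
# Mask bit i enables data bits [8*i+7 : 8*i].

def masked_write(old: int, new: int, mask: int, width_bits: int = 128) -> int:
    result = old
    for byte in range(width_bits // 8):
        if (mask >> byte) & 1:
            lane = 0xFF << (8 * byte)                  # this byte lane
            result = (result & ~lane) | (new & lane)   # take new data here
    return result

# Enabling only byte lanes 12..15 touches data bits 96..127:
mask = 0b1111 << 12
```

On a port narrower than 128 bits, lanes 12-15 simply don't exist, so this mask writes nothing.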
I'm only using one element in the CMD_* array. Relevant parameters are:
- PORT_PRIORITY = '{default:0}
- PORT_READ_STACK = '{default:4}
- PORT_W_CACHE_TOUT = '{default:0}
- PORT_CACHE_SMART = '{default:0}
- PORT_MAX_BURST = '{default:256}
- SMART_BANK = 0
Everything else is the default for the DECA example at 400 MHz.
Warning: if 'PORT_CACHE_SMART' is not set to '{default:1}, then you will be reading old stale data from the last read.
Enabling PORT_CACHE_SMART means that whenever a write has been done, if there is a matching read address cached, that read cache data will immediately reflect what was written to the write cache, even before the write data has been sent to the DDR3. This parameter should always be on unless you are trying to scrounge up 1 last logic cell on a full FPGA, or get that last MHz of FMAX.
Even with the 'PORT_W_CACHE_TOUT = '{default:0}', meaning a write will go out to the DDR3 ASAP, the DDR3 always operates at a delay since there is a ton of setup involved. My controller is trying to prevent unnecessary DDR3 access whenever possible.
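The forwarding behavior described above can be sketched behaviorally (Python; an illustrative model, not the RTL, and the names are invented): with the smart cache on, a read whose address matches the pending write cache returns the fresh write data; with it off, the read cache can keep returning stale data until the write finally reaches the DDR3.

```python
# Illustrative model of PORT_CACHE_SMART read-after-write forwarding.

class PortModel:
    def __init__(self, smart: bool):
        self.smart = smart
        self.ddr3 = {}            # "real" memory contents
        self.wcache = {}          # pending writes not yet sent to the DDR3
        self.rcache = {}          # last data cached per read address

    def write(self, addr, data):
        self.wcache[addr] = data  # held off from the DDR3 for now
        if self.smart and addr in self.rcache:
            self.rcache[addr] = data   # forward into the matching read cache

    def read(self, addr):
        if addr not in self.rcache:
            self.rcache[addr] = self.ddr3.get(addr, 0)  # fill from DDR3
        return self.rcache[addr]
```

A read, a write to the same address, then another read returns the new data with smart=True, but the stale cached value with smart=False.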
In your comment for 'PORT_CACHE_SMART' you list disabling it for memory testing. I wanted to see each request go to the DDR3 without extra logic surrounding it. 'PORT_W_CACHE_TOUT' was disabled for a similar reason. I'm starting my development dumb & slow, then improving it once the basics work.
I enabled the smart cache and it solved the particular test case, however the behavior was unexpected. I'll just leave it on for now.
I'm trying to use the PHY_SEQ only, connected to my custom code. I'm finding that sometimes the CMD_busy signal ends up sticking at 1 and locking up all my upstream logic, but not the downstream logic (your block), which ends up performing the same write/read over and over again. This is all with TOGGLE_CONTROLS = 0.
Could the behavior I'm encountering be because 'CMD_ena' and 'refresh_in_progress' assert at the same time? See the red highlight in pt1.png for that. In pt2.png you can see the busy signal get stuck, with hopefully some useful surrounding info.
Also, I just need to make sure that when TOGGLE_CONTROLS=0, the CMD_ena and CMD_busy signals are analogous to something like AXI-stream tvalid and tready. It seems like TOGGLE_CONTROLS=1 is your preferred style; would it be better to use that for driving the PHY?
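For reference, the TOGGLE_CONTROLS=0 rule as I understand it can be written down the same way the AXI-stream rule is (this is the interpretation being asked about, not a confirmed spec): a command transfers on every clock where CMD_ena is high and CMD_busy is low.

```python
# Assumed TOGGLE_CONTROLS=0 semantics (valid/ready style) -- the
# interpretation under question here, not a confirmed specification.

def accepted_commands(ena_trace, busy_trace):
    """Given per-clock 0/1 traces, return the clock indices where a
    command transfers (ena high while busy low)."""
    return [i for i, (ena, busy) in enumerate(zip(ena_trace, busy_trace))
            if ena and not busy]
```

Under this rule, ena held high through a busy clock transfers exactly once, on the clock where busy drops.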
Ohh, 1 other thing about the refresh. After a power-on reset, or a reset pulse, the DDR3 will run for around 15 milliseconds before the first initial refresh commands come in. This is a one-time thing after power-up and can be seen in some simulations. It does not generate any lost or missing data, as the CMD_BUSY flag will properly run if needed. If no CMD_ENA commands are being sent, a small train of sequential refresh commands may run through, but these additional ones may be interrupted by any CMD_ENA command you send, since after the first one the others are low priority.
// Assert cmd_ena while in the transfer state; drop it once the
// last transfer is accepted by the cmd_ena && !cmd_busy handshake.
if (state)
  cmd_ena <= 1;
if (cmd_ena && !cmd_busy)
  if (last_xfer)
    cmd_ena <= 0;
state <= next_state;
Hey,
After developing my Avalon bridge to wrap your PHY+PLL, I decided some benchmarks were in order to compare against the Altera UniPHY IP.
All results below were obtained with my synthesizable Avalon memory tester, which writes the entire DDR3 with random data and then verifies it. It was configured with a 64-bit data bus and a max burst of 256 to match what the UniPHY IP wanted. The memory tester is controlled via a separate JTAG-Avalon IP, and a TCL script measures how long the transaction takes. Both DDR3 instances were clocked at 300 MHz with a half-rate Avalon interface.
Altera UniPHY:

*** Build Summary ***
Total logic elements : 7,290 / 49,760 ( 15 % )
Total combinational functions : 6,391 / 49,760 ( 13 % )
Dedicated logic registers : 3,779 / 49,760 ( 8 % )
Total memory bits : 14,304 / 1,677,312 ( < 1 % )
Embedded Multiplier 9-bit elements : 0 / 288 ( 0 % )
Total PLLs : 1 / 4 ( 25 % )
Total pins : 65 / 360 ( 18 % )
*** Memory Test ***
/devices/10M50DA(.|ES)|10M50DC@1#1-2#Arrow MAX 10 DECA/(link)/JTAG/(110:132 v1 #0)/phy_0/master
Started memory test
Finished memory test
Microseconds recorded: 1011776
Number of passes : 0x20000000
Number of failures : 0x00000000
Number of ticks : 0x000f700d
Dave's Bridge + BHG PHY/PLL:

*** Build Summary ***
Total logic elements : 6,241 / 49,760 ( 13 % )
Total combinational functions : 3,264 / 49,760 ( 7 % )
Dedicated logic registers : 4,974 / 49,760 ( 10 % )
Total memory bits : 5,792 / 1,677,312 ( < 1 % )
Embedded Multiplier 9-bit elements : 0 / 288 ( 0 % )
Total PLLs : 1 / 4 ( 25 % )
Total pins : 63 / 360 ( 18 % )
*** Memory Test ***
/devices/10M50DA(.|ES)|10M50DC@1#1-2#Arrow MAX 10 DECA/(link)/JTAG/(110:132 v1 #0)/phy_0/master
Started memory test
Finished memory test
Microseconds recorded: 1030280
Number of passes : 0x20000000
Number of failures : 0x00000000
Number of ticks : 0x000fb950
Final throughputs are 506 MB/s for UniPHY and 497 MB/s for your core. It's entirely possible there is some loss of throughput from my bridge having to buffer commands, so I'd be interested in hearing if you've ever done a similar type of test (how much performance does the full controller give over just the PHY+PLL?).
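Those figures can be back-checked from the logged microseconds (a sketch assuming the DECA's 512 MB DDR3 and that the throughput figure counts one full pass of the memory, which matches the quoted numbers):

```python
# Back-of-the-envelope check of the quoted throughput numbers.
# Assumes the DECA's 512 MB of DDR3 and that throughput is counted as
# one full pass of the memory -- assumptions, not confirmed by the test.

MEM_MB = 512  # DECA DDR3 size, assumed

def throughput_mb_s(microseconds: int) -> float:
    """MB per second for one full pass of the memory."""
    return MEM_MB / (microseconds / 1_000_000)

print(round(throughput_mb_s(1_011_776)))  # UniPHY run
print(round(throughput_mb_s(1_030_280)))  # BHG PHY run
```

Both logged times round to the quoted 506 MB/s and 497 MB/s under these assumptions.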
I'm going to call you the winner, based on:
- the UniPHY core often fails timing unless you massage map/fit options into your build and watch the fitter spin for 4x as long
- your core can run faster than 300 MHz
- easier to simulate and include in a design
It got a bit slower with BANK_ROW_ORDER = "BANK_ROW_COL": 1052606 microseconds for the whole RAM. My initial testing was with "ROW_BANK_COL". I'm not sure which is the more appropriate setting for doing only large, sequentially ascending bursts.
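For context on why the ordering matters for large sequential bursts, here is a decode sketch with made-up bit widths (8 banks, 1024 columns, 32768 rows; the real widths depend on the DDR3 part): with ROW_BANK_COL the bank bits sit just above the column bits, so a linear sweep moves to a new bank every time it crosses a column boundary, letting row activates overlap across banks, while BANK_ROW_COL keeps the sweep in one bank through every row.

```python
# Illustrative decode of a linear address under the two orderings.
# Bit widths are invented for the example; real parts differ.

COL_BITS, BANK_BITS, ROW_BITS = 10, 3, 15

def decode(addr: int, order: str):
    col = addr & ((1 << COL_BITS) - 1)
    rest = addr >> COL_BITS
    if order == "ROW_BANK_COL":               # address = {row, bank, col}
        bank = rest & ((1 << BANK_BITS) - 1)
        row = rest >> BANK_BITS
    else:                                     # "BANK_ROW_COL" = {bank, row, col}
        row = rest & ((1 << ROW_BITS) - 1)
        bank = rest >> ROW_BITS
    return bank, row, col
```

Stepping from address 1023 to 1024: under ROW_BANK_COL the sweep moves to the next bank (same row index), so the next activate can overlap the previous bank's precharge; under BANK_ROW_COL the same step opens a new row in the same bank.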