Author Topic: OpenHBMC. Open-source AXI4-based HyperBus memory controller  (Read 641 times)


Offline VGN

  • Regular Contributor
  • *
  • Posts: 79
  • Country: am
OpenHBMC. Open-source AXI4-based HyperBus memory controller
« on: September 23, 2020, 04:05:17 am »
Hello everyone!

I'm developing a high performance AXI4-based HyperBus memory controller for Xilinx 7-series FPGAs.
This memory controller is a small part of another project of mine: https://www.eevblog.com/forum/thermal-imaging/openirv-isc0901b0-(autoliv-nv3-flir-e4568)-based-opensource-thermal-camera/
Anyway, I thought it was probably worth publishing as a separate project.

This IP-core is packed for Vivado 2020.1 for easy block design integration, though you can use raw HDL files.

This is the first release. I haven't tested it thoroughly, but it successfully passed a continuous memory test at ~162MHz, i.e. 325MB/s, on a custom devboard with a Spartan-7 and a single HyperRAM. I'm going to push past the 200MHz (400MB/s) mark as soon as I get a board with a HyperRAM capable of 200MHz; the memory I have right now is limited to 166MHz. Soon I'm going to test long burst transfers on hardware; a DMA is needed for this, as MicroBlaze cannot initiate long burst transfers.
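As a sanity check on those throughput figures (the helper name below is mine, not part of the IP-core): HyperBus has an 8-bit data bus and is double data rate, so it moves 2 bytes per clock cycle.

```python
def hyperbus_bandwidth_mbps(clk_mhz: float) -> float:
    """Peak HyperBus throughput in MB/s: 8-bit bus, DDR -> 2 bytes per clock."""
    return clk_mhz * 2

# ~162 MHz -> ~324 MB/s (the ~325 MB/s above)
# 200 MHz  -> 400 MB/s
```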

Resource utilization: 565 LUT, 678 FF, 1.5 BRAM (1.5 x RAMB36E1 = 3 x RAMB18E1)

Feel free to ask questions, criticize the design, report bugs and donate if you like this IP-core)

Link to repo: https://github.com/OVGN/OpenHBMC
« Last Edit: September 23, 2020, 04:12:30 am by VGN »
 
The following users thanked this post: KE5FX, asmi

Online KE5FX

  • Super Contributor
  • ***
  • Posts: 1285
  • Country: us
    • KE5FX.COM
Re: OpenHBMC. Open-source AXI4-based HyperBus memory controller
« Reply #1 on: September 23, 2020, 06:50:30 am »
Good stuff.   :-+ I found it surprisingly hard to get my own HyperRAM state machine working.  The data sheets are pretty awful and the interface itself is more complicated than it has any right to be.  (What's up with all the fixed-versus-variable latency cruft, for instance... why can't I just poll RWDS to find out when the device is ready to read or write...?)
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 1413
  • Country: ca
Re: OpenHBMC. Open-source AXI4-based HyperBus memory controller
« Reply #2 on: September 23, 2020, 02:09:41 pm »
Good stuff.   :-+ I found it surprisingly hard to get my own HyperRAM state machine working.  The data sheets are pretty awful and the interface itself is more complicated than it has any right to be.  (What's up with all the fixed-versus-variable latency cruft, for instance... why can't I just poll RWDS to find out when the device is ready to read or write...?)
I found it very easy and the documentation very good and helpful. I had no problems implementing the interface. As for latency, by default the memory starts up in fixed double-latency mode, so you don't have to worry about it at all. This is also the only possible mode for dual-die 128M chips. The option to turn on variable latency is there in case you want to trade lower latency for increased controller complexity.
Fixed latency is also great when you want to scale your memory interface horizontally by using multiple chips in parallel lockstep, effectively getting a multiple of the bandwidth (2x for 2 chips, 4x for 4, etc.). This is necessary, for example, when you want to use these chips as a frame buffer: 720p@60Hz requires 1280*720*4*60 ~ 211 MB/s of bandwidth, and you will need at least double that (so that you can read and write at the same time). In my design I used a pair of chips in parallel for 720p; for 1080p@60Hz (~475 MB/s) you will want 3 or 4 chips.
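Those framebuffer numbers check out if you count in binary megabytes (the helper name is mine, just the arithmetic spelled out):

```python
def framebuffer_mibps(width: int, height: int, bytes_per_pixel: int, fps: int) -> float:
    """Raw framebuffer bandwidth in MiB/s (binary megabytes)."""
    return width * height * bytes_per_pixel * fps / 2**20

# 720p60 at 4 bytes/pixel  -> ~211 MiB/s
# 1080p60 at 4 bytes/pixel -> ~475 MiB/s
```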
That said, I gotta say I prefer using LPDDR1 modules for smaller designs when using MIG is not an option due to its rather high resource requirements. These chips are available in a variety of capacities from 128Mbit to 2Gbit with an x16 or x32 data bus, can go up to 200 MHz DDR at CL3, and their protocol is quite simple to implement (though not as simple as HyperBus).

Offline VGN

  • Regular Contributor
  • *
  • Posts: 79
  • Country: am
Re: OpenHBMC. Open-source AXI4-based HyperBus memory controller
« Reply #3 on: September 23, 2020, 04:10:34 pm »
Good stuff.   :-+ I found it surprisingly hard to get my own HyperRAM state machine working.  The data sheets are pretty awful and the interface itself is more complicated than it has any right to be.  (What's up with all the fixed-versus-variable latency cruft, for instance... why can't I just poll RWDS to find out when the device is ready to read or write...?)
Thanks! Yes, the interface is a bit complicated, but not too much for me. Ha-ha, polling RWDS to find out when the device is ready is next-generation DRAM; we'll have to wait another 10 years for that))

Had no problems implementing interface.
I'm very interested: how did you solve the problem of transferring data from the RWDS clock domain to the internal FPGA clock domain?
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 1413
  • Country: ca
Re: OpenHBMC. Open-source AXI4-based HyperBus memory controller
« Reply #4 on: September 23, 2020, 05:09:10 pm »
I'm very interested: how did you solve the problem of transferring data from the RWDS clock domain to the internal FPGA clock domain?
I used ISERDES in memory mode and fed the rwds clock to the CLK/CLKB pins of the SERDES, while feeding free-running clocks from the MMCM to the OCLK/OCLKB/CLKDIV pins as described in UG471. I also considered using IN_FIFO to help close timing (it can act as a 1:2 deserializer as well as a regular FIFO, which is how it's used in DDR2/3 controllers), but for a -2 device at 166 MHz this proved unnecessary. With this arrangement (no IN_FIFO) I was able to close timing at 200 MHz DDR for an LPDDR1 x16 controller on a -2 device by carefully choosing which IO pins to use for the dq and dqs signals, but I suspect that for a -1 device those FIFOs might be necessary. So I think it shouldn't be very hard to get the new 200 MHz HB2 chips to work. Also note that my design used two 64Mbit chips in parallel, so I had 2 parallel HB buses and thus 2 separate dqs clocks.
I love how you went straight to the crux of the problem, it shows that you invested quite a bit of time getting this to work! Commendable :-+ :clap:
« Last Edit: September 23, 2020, 05:12:11 pm by asmi »
 
The following users thanked this post: VGN

Online KE5FX

  • Super Contributor
  • ***
  • Posts: 1285
  • Country: us
    • KE5FX.COM
Re: OpenHBMC. Open-source AXI4-based HyperBus memory controller
« Reply #5 on: September 24, 2020, 12:42:01 am »
Ha-ha, polling the RWDS to find out when the device is ready is the next generation DRAM, we should wait for this another 10 years

My impression is that they could have done just that if they hadn't been forced to overload RWDS as a byte mask for writes.  In the simplest case, where I write a single 16-bit word, I ended up with this goofy logic:

   
Code:
RAM_M_WR:
            begin
               case (RAM_edge_cnt)
                  0:  RAMDQ_OUT_reg <= RAM_CA_words[0];   // Write addr [47:40] prior to first rising edge
                  1:  RAMDQ_OUT_reg <= RAM_CA_words[1];   // 5 ns after first rising edge (and 5 ns prior to first falling edge): write addr[39:32]
                  2:  RAMDQ_OUT_reg <= RAM_CA_words[2];   // 5 ns after first falling edge: write addr[31:24]
                  3:  RAMDQ_OUT_reg <= RAM_CA_words[3];   // 5 ns after second rising edge: write addr[23:16]
                  4:  RAMDQ_OUT_reg <= RAM_CA_words[4];   // 6-clock tACC initial latency period began 5 ns ago at edge 4, after row address is known   
                  5:  RAMDQ_OUT_reg <= RAM_CA_words[5];   // This period is 6 cycles long, ending at edge 16 when the fixed additional latency period begins
                  27: RAMRWDS_Z_reg <= 1'b0;              // Write-enable RWDS to unmask byte writes (RWDS_OUT_reg is 0)
                  28: RAMDQ_OUT_reg <= BYTE_1;            // End of additional 6-cycle latency period that began at edge 16: start writing 16-bit word
                  29:
                      begin
                        RAMDQ_OUT_reg  <= BYTE_0;         // Write rest of 16-bit word
                        RAM_state      <= RAM_END_CMD;
                      end
               endcase
            end
   
What would have made more sense:

Code:
RAM_M_WR:
            begin
               case (RAM_edge_cnt)
                  0:  RAMDQ_OUT_reg <= RAM_CA_words[0];   // Write addr [47:40] prior to first rising edge
                  1:  RAMDQ_OUT_reg <= RAM_CA_words[1];   // 5 ns after first rising edge (and 5 ns prior to first falling edge): write addr[39:32]
                  2:  RAMDQ_OUT_reg <= RAM_CA_words[2];   // 5 ns after first falling edge: write addr[31:24]
                  3:  RAMDQ_OUT_reg <= RAM_CA_words[3];   // 5 ns after second rising edge: write addr[23:16]
                  4:  RAMDQ_OUT_reg <= RAM_CA_words[4];   
                  5:  RAMDQ_OUT_reg <= RAM_CA_words[5];
                  6:  begin
                           RAMDQ_OUT_reg <= BYTE_1;       // Put first byte on data bus
                           RAM_state <= RAM_WRITE_B0;     // Wait until RWDS indicates that it was accepted
                      end
               endcase
            end

RAM_WRITE_B0:
            begin
               if (RWDS) begin
                  RAMDQ_OUT_reg <= BYTE_0;                // First byte accepted; now write second byte
                  RAM_state     <= RAM_END_CMD;           // (RWDS was asserted for only one cycle, no need to spin on it)
               end     
            end

I have to believe that masked byte writes are an obscure corner case with this type of RAM, which is targeted primarily at framebuffers and FIFOs and the like.  I certainly don't need them myself.  They should've been handled with an entirely separate mode, instead of balkanizing all memory operations into fixed and variable latency versions. 

The fact that the whole DCARS option was necessary is an indication that the bus interface design should have been thought through a little further.
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 1413
  • Country: ca
Re: OpenHBMC. Open-source AXI4-based HyperBus memory controller
« Reply #6 on: September 24, 2020, 04:33:27 pm »
Soon I'm going to test long burst transfers on hardware; a DMA is needed for this, as MicroBlaze cannot initiate long burst transfers.
Forgot to mention - if you turn on the data and instruction caches in MicroBlaze, it will issue burst transactions, as it tries to read/commit an entire cache line in a single operation. This obviously won't create a sustained load on memory (unless you write the code such that it forces constant cache evictions), but it will allow you to verify in hardware that bursts are working as they should. You can also use the AXI Traffic Generator IP to generate any AXI traffic you desire; it has presets that mimic typical use cases like video streaming, PCIe, and Ethernet traffic.
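To put rough numbers on those cache-line bursts (MicroBlaze line length is configurable; 4 or 8 words are the usual settings, and the helper below is my own arithmetic, not a Xilinx API): a line fill over a 32-bit AXI bus becomes a 4- or 8-beat burst.

```python
def line_fill_beats(line_words: int, axi_data_bits: int) -> int:
    """AXI beats needed to move one cache line (line length in 32-bit words)."""
    line_bits = line_words * 32
    assert line_bits % axi_data_bits == 0, "line must be a multiple of the bus width"
    return line_bits // axi_data_bits

# 8-word line, 32-bit AXI bus -> 8-beat burst
# 4-word line, 32-bit AXI bus -> 4-beat burst
```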

Offline VGN

  • Regular Contributor
  • *
  • Posts: 79
  • Country: am
Re: OpenHBMC. Open-source AXI4-based HyperBus memory controller
« Reply #7 on: September 24, 2020, 10:50:14 pm »
I have to believe that masked byte writes are an obscure corner case with this type of RAM, which is targeted primarily at framebuffers and FIFOs and the like.  I certainly don't need them myself.  They should've been handled with an entirely separate mode, instead of balkanizing all memory operations into fixed and variable latency versions.
To be honest, I don't see any problem with variable latency. Yes, it is a bit weird, but pretty easy to handle.

The fact that the whole DCARS option was necessary is an indication that the bus interface design should have been thought through a little further.
But DCARS is neither necessary nor mandatory. Moreover, I doubt you can even purchase parts that support DCARS mode.
 

Offline VGN

  • Regular Contributor
  • *
  • Posts: 79
  • Country: am
Re: OpenHBMC. Open-source AXI4-based HyperBus memory controller
« Reply #8 on: September 24, 2020, 11:21:08 pm »
Forgot to mention - if you turn on the data and instruction caches in MicroBlaze, it will issue burst transactions, as it tries to read/commit an entire cache line in a single operation. This obviously won't create a sustained load on memory (unless you write the code such that it forces constant cache evictions), but it will allow you to verify in hardware that bursts are working as they should. You can also use the AXI Traffic Generator IP to generate any AXI traffic you desire; it has presets that mimic typical use cases like video streaming, PCIe, and Ethernet traffic.
Thanks, Vivado provides an AXI Verification IP (VIP), I'm going to try it. BTW, initiating single 8/16/32-bit transfers manually from MicroBlaze is easy, but is there any way to initiate a wrapped cache transfer manually?


I used ISERDES in memory mode and fed the rwds clock to the CLK/CLKB pins of the SERDES, while feeding free-running clocks from the MMCM to the OCLK/OCLKB/CLKDIV pins as described in UG471.
You didn't mention the IDELAY. Am I right that you also used IDELAY to delay the RWDS strobe before feeding it to the CLK/CLKB pins of the SERDES?
Your implementation is quite interesting, though I don't understand how it works stably, as your free-running clocks are still not synchronous to the rwds domain. I don't understand how data is synchronized when going through the CLK->OCLK path inside the ISERDES in MEMORY mode...
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 1413
  • Country: ca
Re: OpenHBMC. Open-source AXI4-based HyperBus memory controller
« Reply #9 on: September 25, 2020, 04:32:04 am »
Thanks, vivado provides some AXI Verification IP (VIP), I'm going to try it.
I use it a lot, and it's great for simulating all kinds of AXI transactions!

BTW, initiating single transfers 8/16/32-bit manually over microblaze is easy, but is there any way to initiate wrapped cache transfer manually?
I honestly don't remember off the top of my head. Try skimming through the user guide for MB - I remember it being very detailed, so I'm sure it covers everything.

You didn't mention the IDELAY. Am I right that you also used IDELAY to delay the RWDS strobe before feeding it to the CLK/CLKB pins of the SERDES?
You may or may not need them depending on what your pinout looks like and which clock buffers you use. I don't have that design handy so I can't look it up, but I remember that clock buffers have a rather large insertion delay, such that you might need to add IDELAY to the data lines to compensate for it. In my 200 MHz LPDDR1 design I have IDELAY blocks on both the clock and data lines. BTW - in case you need more delay resolution, there is an undocumented IDELAYE2_FINEDELAY component which can provide finer delays (though that is only really useful at much higher speeds, because the documented version has a resolution of 78 ps at 200 MHz).
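That 78 ps figure follows from the IDELAYE2 tap formula in UG471: the 32-tap chain spans half a reference-clock period, so the average tap delay is 1 / (32 x 2 x F_ref). A quick check (the helper name is mine):

```python
def idelay_tap_ps(refclk_mhz: float) -> float:
    """IDELAYE2 average tap delay in ps: 32 taps cover half a REFCLK period."""
    period_ps = 1e6 / refclk_mhz    # REFCLK period in picoseconds
    return period_ps / 2 / 32

# 200 MHz REFCLK -> 78.125 ps per tap
# 300 MHz REFCLK -> ~52 ps per tap
```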

Your implementation is quite interesting, though I don't understand how it works stably, as your free-running clocks are still not synchronous to the rwds domain. I don't understand how data is synchronized when going through the CLK->OCLK path inside the ISERDES in MEMORY mode...
This mode was specifically designed to support strobe-based memory interfaces (which is what the DDRx interfaces are, and so is HB1/2 as far as reads are concerned). UG471 provides functional schematics of how it works (it's basically a version of classic double-flop synchronization, CLK->OCLK->CLKDIV, where the latter two are required to be phase-aligned), but you can just assume that it does work, since this is a hard silicon block, provided that you use it exactly as they suggest. Carefully read the entire section of that user guide on ISERDESE2; you might need to read it several times to "get it" - as was the case for me - because it's not exactly the cleanest explanation, and the component itself is rather complex and covers many different use cases.
As for stability - this is what IO timing constraints are for. If you constrain your design properly, timing analysis can guarantee proper functioning of the interface in all conditions, as Vivado will adjust placement and routing to satisfy the constraints (basically it can add quite a bit of delay by choosing a longer connection path inside the fabric's interconnect backbone, or move components around to ensure both setup and hold constraints are met). As you increase your interface speed, at some point the timing margins become so small that you won't be able to perform static capture using constraints alone, and then you will need to implement some sort of calibration to ensure that you sample data right in the middle of the data eye, but in my experience, with the right pinout and some effort on your part you can successfully achieve static capture at 200 MHz DDR and below.
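To see why the margins shrink with speed: at DDR each bit occupies only half a clock period, and every source of uncertainty (board skew, clock jitter, the flop's sampling window) eats into that interval. A rough budget sketch with made-up uncertainty numbers, purely illustrative:

```python
def static_capture_margin_ns(clk_mhz: float, uncertainties_ns: list) -> float:
    """Remaining eye margin for a DDR interface: unit interval minus summed uncertainties."""
    ui_ns = 1e3 / clk_mhz / 2    # DDR unit interval: half a clock period, in ns
    return ui_ns - sum(uncertainties_ns)

# 200 MHz DDR: UI = 2.5 ns. With e.g. 0.3 ns skew + 0.2 ns jitter + 0.5 ns
# sampling window, ~1.5 ns is left in which to place the capture edge.
```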
Finally - I'm sure you know this, but someone reading this might not - Cypress provides simulation models for these chips, so you can run full functional simulations to make sure it works like it should in all cases and scenarios.
« Last Edit: September 25, 2020, 04:33:48 am by asmi »
 

