Author Topic: MoSys Quazar QPR memory - RAM access via transceivers  (Read 1475 times)


Offline asmi (Topic starter)

  • Super Contributor
  • ***
  • Posts: 2860
  • Country: ca
MoSys Quazar QPR memory - RAM access via transceivers
« on: November 24, 2020, 04:19:58 pm »
Just came across this product: https://mosys.com/products/quazar-family/quazar-qpr-quad-partition-rate-memories/ (before anybody gets any ideas - I'm not affiliated with the company in any way, and I doubt they will even give me a few samples of these things 'cause they are expensive).
The basic idea is to tunnel all memory traffic through a bunch of 10G+ transceivers. This idea came to my mind before, but I didn't think there were any existing devices that were actually doing it.
Now, the reason that specific device is expensive is that it's super-high-speed dual-port SRAM, but I would like to discuss the basic idea. What if instead of SRAM there were a bunch of DDRx dies, or - better yet - an HBM2 stack or two? The advantages are quite obvious - routing 4-8-16 10G differential pairs is orders of magnitude easier than routing a x64 DDR3/4 memory interface, the entire solution will be much more compact, and it will likely be more power-efficient (DDRx wastes quite a bit of power on termination alone, never mind anything else!). I wonder if anyone has done any experiments in this area. I know in the Xilinx world there is a free IP to tunnel AXI4 bus requests between FPGAs using a variety of physical transports (including transceivers), so I wonder if anyone has any experience to share.
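As a very rough sanity check - all numbers below are my own assumptions, not from any datasheet - this is the kind of back-of-envelope math I have in mind, counting only raw line rates and ignoring protocol overhead on the serial side:

```python
# Raw bandwidth of 4/8/16 lanes of ~10G serdes vs. a x64 DDR4-3200 interface.
# Lane rate and coding are assumptions (10GBASE-R style), not a specific part.

LANE_RATE_GBPS = 10.3125          # assumed line rate per lane
CODING_EFF = 64 / 66              # 64b/66b encoding overhead
lane_gbps = LANE_RATE_GBPS * CODING_EFF   # ~10 Gb/s of payload per lane

ddr4_gbps = 3200 * 64 / 1000      # x64 DDR4-3200: 204.8 Gb/s raw

for lanes in (4, 8, 16):
    serial_gbps = lanes * lane_gbps
    # each lane = TX pair + RX pair = 4 traces
    print(f"{lanes:2d} lanes: {serial_gbps:6.1f} Gb/s over {lanes*4} traces "
          f"({serial_gbps/ddr4_gbps:4.0%} of DDR4 x64)")
print(f"DDR4 x64: {ddr4_gbps:.1f} Gb/s over ~120+ signals")
```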
 
The following users thanked this post: SiliconWizard

Offline Boscoe

  • Frequent Contributor
  • **
  • Posts: 285
Re: MoSys Quazar QPR memory - RAM access via transceivers
« Reply #1 on: November 24, 2020, 09:25:12 pm »
I’m no expert and I’m not entirely sure I understand your suggestion, but I imagine it’s because transceivers take up a lot of die area, increasing cost. I imagine custom-packaged memories with some logic and their own transceivers would be costly to develop and too specific to fit many applications. You can achieve the same thing with many DDRx memories and the internal logic of the FPGA, so why add a load of 10G transceivers as well? For ease of development? I think by that point you’ve already got seven-figure funding, so it’s not really a problem.
 

Online ejeffrey

  • Super Contributor
  • ***
  • Posts: 4032
  • Country: us
Re: MoSys Quazar QPR memory - RAM access via transceivers
« Reply #2 on: November 24, 2020, 10:38:10 pm »
I think the biggest problem would be latency.  Transceivers typically add a minimum of a few cycles of delay in the parallel clock domain on each end.  10 ns of latency added to a high-speed network interface is nothing, but for DRAM every nanosecond counts.  In addition, most transceiver-based interfaces are packetized, and packet processing would add additional latency and overhead for the small transactions common with RAM.  Rambus DRAM back in the P4 era used a fast, narrow (but parallel) bus with packetized traffic and suffered latency compared to the DDR SDRAM it competed with.

I'm not sure how power would come out.  10G+ Transceivers aren't exactly low power devices, and you would probably still have a smaller parallel bus at the DRAM side.  In principle each memory chip could have an on-chip controller and a dedicated lane, but that would be inflexible, and also maybe make latency worse depending on how you stripe your data.  I guess you would more likely end up with a memory controller on each memory module with ~4 lanes dedicated to it and a parallel array of DRAM chips connected to that controller.  My guess then is that any power savings on the DRAM parallel interface would be more than cancelled by the transceivers.

I do think it could be useful for some applications that aren't super latency sensitive, especially if you need only moderate bandwidth -- even down to a single lane.
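To make that concrete, here is a purely illustrative latency budget - every number below is an assumption picked for the sake of argument, not a measurement of any real serdes or DRAM part:

```python
# Illustrative read-latency budget: serdes-attached DRAM vs. a local DDR4
# controller. All figures are assumptions for illustration only.

def total(budget):
    return sum(budget.values())

serdes_read_ns = {
    "TX PCS/PMA + gearbox":      10,   # a few parallel-clock cycles
    "wire + package":             1,
    "RX CDR/PCS + elastic FIFO": 12,
    "packet decode at memory":    5,
    "DRAM tRCD + CL":            30,
    "response packet + link":    25,   # return trip through similar stages
}

local_ddr4_read_ns = {
    "controller + PHY":          15,
    "DRAM tRCD + CL":            30,
    "read return path":          10,
}

print(f"serdes-attached read : ~{total(serdes_read_ns)} ns")
print(f"local DDR4 read      : ~{total(local_ddr4_read_ns)} ns")
# The extra ~30 ns is lost in the noise for a NIC, but it roughly halves
# random-access performance for a latency-bound workload.
```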
 

Offline asmiTopic starter

  • Super Contributor
  • ***
  • Posts: 2860
  • Country: ca
Re: MoSys Quazar QPR memory - RAM access via transceivers
« Reply #3 on: November 24, 2020, 11:07:53 pm »
Quote
I think the biggest problem would be latency.  Transceivers typically add a minimum of a few cycles of delay in the parallel clock domain on each end.  10 ns of latency added to a high-speed network interface is nothing, but for DRAM every nanosecond counts.  In addition, most transceiver-based interfaces are packetized, and packet processing would add additional latency and overhead for the small transactions common with RAM.  Rambus DRAM back in the P4 era used a fast, narrow (but parallel) bus with packetized traffic and suffered latency compared to the DDR SDRAM it competed with.
Latency is rarely a problem for FPGA designs - as long as it's consistent, you can always pipeline your design to accommodate it. Bandwidth and throughput have always been more important in my designs than latency. As a matter of fact, the only time I can remember latency being critical was in the ALU of my RV64I core. In all other cases the limiting factor was usually bandwidth, not even FPGA resources.
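Put another way (Little's law, with made-up numbers): the price of consistent latency is just buffering and outstanding requests, not throughput:

```python
# How many requests must be in flight to keep a link saturated despite latency.
# Bandwidth, round-trip time and burst size below are illustrative assumptions.

link_bw_gbytes = 10.0        # assumed usable bandwidth, GB/s
round_trip_ns  = 100.0       # assumed request-to-data round-trip latency
burst_bytes    = 64          # assumed data returned per request

bytes_in_flight = link_bw_gbytes * round_trip_ns   # GB/s * ns = bytes
outstanding_requests = bytes_in_flight / burst_bytes
print(f"Need ~{bytes_in_flight:.0f} B in flight "
      f"= ~{outstanding_requests:.0f} outstanding {burst_bytes}-byte reads "
      f"to hide {round_trip_ns:.0f} ns of latency at {link_bw_gbytes} GB/s")
```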

Quote
I'm not sure how power would come out.  10G+ Transceivers aren't exactly low power devices, and you would probably still have a smaller parallel bus at the DRAM side.  In principle each memory chip could have an on-chip controller and a dedicated lane, but that would be inflexible, and also maybe make latency worse depending on how you stripe your data.  I guess you would more likely end up with a memory controller on each memory module with ~4 lanes dedicated to it and a parallel array of DRAM chips connected to that controller.  My guess then is that any power savings on the DRAM parallel interface would be more than cancelled by the transceivers.
The idea is to have a system-in-package, where DRAM dies are connected to the "controller" die via an interposer. This way you can have your DDRx dies running at quite high frequencies without the need for termination, because the lines are going to be extremely short (following the 1/10 rule).
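Just to put a number on that 1/10 rule - the rise time and propagation delay below are assumptions, not values for any particular die or interposer:

```python
# A trace can usually go unterminated if its propagation delay is well under
# ~1/10 of the signal rise time. Figures below are illustrative assumptions.

rise_time_ps = 100        # assumed driver rise time
prop_delay_ps_per_mm = 7  # assumed ~7 ps/mm on an interposer-like material

max_delay_ps = rise_time_ps / 10
max_len_mm = max_delay_ps / prop_delay_ps_per_mm
print(f"Unterminated length budget: ~{max_len_mm:.1f} mm")
# ~1.4 mm: plausible for dies sitting next to each other in a package,
# hopeless for a DIMM slot several centimetres away on FR-4 (~6-7 ps/mm too).
```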
So the conceptual solution will be the same as the product I linked in the OP - you will have the FPGA connected to a memory device via a handful of serial links, and so you will have 16 or 32 traces to route as opposed to 120+ for a typical DDR3/4 interface.

Offline Daixiwen

  • Frequent Contributor
  • **
  • Posts: 367
  • Country: no
Re: MoSys Quazar QPR memory - RAM access via transceivers
« Reply #4 on: November 25, 2020, 07:45:46 am »
Latency becomes a problem if you are doing a lot of random accesses in memory. However, if you are only doing continuous accesses, then using transceivers could give you better performance.
A similar idea was used in the late 90s with the Rambus interface on Pentium 4 machines (not actually using transceivers, but replacing the parallel bus with a half-serial protocol using fewer signals). Even though in theory the bandwidth was higher than SDR, in practice Rambus systems were a lot slower in most practical uses because of the added latency, and it was abandoned, at least in PCs.
 

Offline asmi (Topic starter)

  • Super Contributor
  • ***
  • Posts: 2860
  • Country: ca
Re: MoSys Quazar QPR memory - RAM access via transceivers
« Reply #5 on: November 25, 2020, 11:08:04 am »
Quote
Latency becomes a problem if you are doing a lot of random accesses in memory. However, if you are only doing continuous accesses, then using transceivers could give you better performance.
A similar idea was used in the late 90s with the Rambus interface on Pentium 4 machines (not actually using transceivers, but replacing the parallel bus with a half-serial protocol using fewer signals). Even though in theory the bandwidth was higher than SDR, in practice Rambus systems were a lot slower in most practical uses because of the added latency, and it was abandoned, at least in PCs.
And yet modern PCs use DDR4 memory with a best-case latency of about 15 cycles (if reading a row that's already open) and a typical latency of 30-40 cycles.
You've got to remember that Rambus competed in a very different era than what we have today. It was ahead of its time. But it looks like now its time is coming. And the reason I think so is the design of LPDDR and especially GDDR5/6, which both trade large (in the case of GDDR - huuuge) latency for extra bandwidth. If you look at an LPDDR2 datasheet (the most recent version of LPDDR that has publicly accessible datasheets), you will see that it uses the address/control bus in DDR mode, as opposed to SDR like in "normal" DDR2/3/4, and this way uses only 10 pins. As far as I know, more recent versions of LPDDR (3, 4 and 4X) take this to the next level by using even fewer address/control pins, such that a command now takes more than one cycle to clock in. So I think the only thing that prevents desktop-class CPUs from using these high-latency, high-bandwidth memories is the fact that they have to support connectors, which seriously limit the maximum achievable frequencies for SI reasons.
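To put those cycle counts into wall-clock terms (rough numbers, assuming a DDR4-3200 part with a 1600 MHz I/O clock):

```python
# Converting the cycle counts quoted above into nanoseconds at DDR4-3200.
# The cycle figures are the rough numbers from this thread, not a specific part.

io_clk_mhz = 1600
for label, cycles in [("open-row hit (~CL only)", 15),
                      ("typical access", 30),
                      ("worst of the typical range", 40)]:
    print(f"{label:28s}: {cycles:2d} cycles = {cycles / io_clk_mhz * 1e3:.1f} ns")
# Even an open-row hit costs ~9 ns, and ordinary accesses 19-25 ns -- latencies
# in the same ballpark as one extra serialized hop would add.
```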

Now here is the thing - those of you who are old enough will recall that this problem has already been solved in the past with the PCI bus. Back in the old days, the "classic" PCI bus had the exact same problem of the physical design (a multi-drop parallel bus) limiting the available bandwidth. How was that problem solved? Yep, by serializing the protocol and using several multi-gigabit serial lanes instead of a crapton of parallel ones. Meet PCI Express. That bus has proven that you can reach very high speeds even through connectors (16 GT/s per lane for PCI Express 4, 32 GT/s for PCI Express 5).
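Per-lane numbers, counting only line rate and encoding (link-layer and packet overhead ignored), just to show how far the serialized approach has scaled through connectors:

```python
# Usable per-lane throughput across PCIe generations, from line rate and
# encoding efficiency only.

gens = {            # generation: (GT/s per lane, encoding efficiency)
    "PCIe 1.x": (2.5,  8 / 10),
    "PCIe 2.0": (5.0,  8 / 10),
    "PCIe 3.0": (8.0,  128 / 130),
    "PCIe 4.0": (16.0, 128 / 130),
    "PCIe 5.0": (32.0, 128 / 130),
}
for gen, (gts, eff) in gens.items():
    gbytes = gts * eff / 8
    print(f"{gen}: {gts:4.1f} GT/s/lane -> ~{gbytes:.2f} GB/s usable per lane")
# Classic 32-bit/33 MHz PCI topped out around 0.13 GB/s for the whole shared bus.
```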

And this is why I think the future is with more "serialized" memories. Maybe they will never go fully serial like PCI Express, but looking at where the bleeding-edge memories (GDDR6) are, it seems fairly logical to think this way. As for random access - it is becoming less and less relevant with every further iteration of DDR/LPDDR/GDDR, because it's relatively easy to work around with cache, while insufficient bandwidth is a brick wall that can't be circumvented to any meaningful degree. A good case study is the current generation of graphics cards - NVidia solved the problem head-on by introducing GDDR6X memory that gives it the extra bandwidth it needs, while AMD tried to work around it with a massive on-chip cache. Guess what - at 1440p and below that seems to be working for AMD, but at 4K they suffer from insufficient bandwidth and lose to NVidia cards. And I suspect at 8K AMD cards will lose even more badly, because all the tricks and smarts with the cache will only get you so far, and at some point you just have to face the problem head-on instead of working around it.
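Here is a toy model of that trade-off - the hit rates and bandwidth figures below are invented purely for illustration, not measured numbers for any real card:

```python
# Effective bandwidth seen by the GPU as a function of on-chip cache hit rate.
# All numbers are illustrative assumptions.

dram_gbs  = 512    # assumed off-chip GDDR bandwidth, GB/s
cache_gbs = 2000   # assumed on-chip cache bandwidth, GB/s

def effective_bw(hit_rate):
    # Time to move 1 byte is a hit-rate-weighted mix of cache and DRAM time.
    t = hit_rate / cache_gbs + (1 - hit_rate) / dram_gbs
    return 1 / t

for res, hit in [("1080p", 0.75), ("1440p", 0.65), ("4K", 0.45)]:
    print(f"{res}: assumed hit rate {hit:.0%} -> effective ~{effective_bw(hit):.0f} GB/s")
# As the working set (frame buffers, G-buffers) outgrows the cache, the hit
# rate falls and effective bandwidth collapses back toward the raw DRAM figure.
```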

Offline Gribo

  • Frequent Contributor
  • **
  • Posts: 649
  • Country: ca
Re: MoSys Quazar QPR memory - RAM access via transceivers
« Reply #6 on: November 25, 2020, 12:41:52 pm »
https://en.wikipedia.org/wiki/Fully_Buffered_DIMM

For some reason, both Intel and AMD abandoned this concept. The problem with RAM is that the access is random; caches mitigate this a bit, but most general-computing access patterns are still random.
 

Offline asmi (Topic starter)

  • Super Contributor
  • ***
  • Posts: 2860
  • Country: ca
Re: MoSys Quazar QPR memory - RAM access via transceivers
« Reply #7 on: November 25, 2020, 01:43:57 pm »
Quote
https://en.wikipedia.org/wiki/Fully_Buffered_DIMM

For some reason, both Intel and AMD abandoned this concept. The problem with RAM is that the access is random; caches mitigate this a bit, but most general-computing access patterns are still random.
I think the reason is the same as for Rambus - it was ahead of its time. CPUs of that era didn't have massive L3 caches that could provide enough storage capacity for typical applications' memory access patterns. Outside of purpose-crafted benchmarks, typical applications' access patterns are not that random, so with enough cache you can successfully work with large memory latencies. Which is what's happening now, as DDR4 has quite large latency, and LPDDR4's latency is even higher.
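A quick average-memory-access-time sketch of what I mean, with assumed hit rates and latencies:

```python
# AMAT = hit_rate * cache_latency + miss_rate * DRAM_latency.
# Hit rates and latencies below are assumptions for illustration.

def amat(l3_hit, l3_ns, dram_ns):
    return l3_hit * l3_ns + (1 - l3_hit) * dram_ns

l3_ns = 12
for dram_ns in (60, 90):              # "fast" vs "high-latency" DRAM
    for l3_hit in (0.80, 0.95):       # small vs large L3
        print(f"L3 hit {l3_hit:.0%}, DRAM {dram_ns} ns -> "
              f"AMAT {amat(l3_hit, l3_ns, dram_ns):.1f} ns")
# At a 95% hit rate, 30 ns of extra DRAM latency costs only ~1.5 ns on average;
# at 80% it costs ~6 ns -- roughly the situation back in the FB-DIMM era.
```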

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15795
  • Country: fr
Re: MoSys Quazar QPR memory - RAM access via transceivers
« Reply #8 on: November 25, 2020, 03:25:38 pm »
Quote
Latency becomes a problem if you are doing a lot of random accesses in memory.

Yes, and that's why we use caches.
 

Online ejeffrey

  • Super Contributor
  • ***
  • Posts: 4032
  • Country: us
Re: MoSys Quazar QPR memory - RAM access via transceivers
« Reply #9 on: November 25, 2020, 04:39:19 pm »
Quote
And yet modern PCs use DDR4 memory with a best-case latency of about 15 cycles (if reading a row that's already open) and a typical latency of 30-40 cycles.
You've got to remember that Rambus competed in a very different era than what we have today. It was ahead of its time. But it looks like now its time is coming. And the reason I think so is the design of LPDDR and especially GDDR5/6, which both trade large (in the case of GDDR - huuuge) latency for extra bandwidth.

I don't think that is really correct.  DDR2/3/4 aren't trading latency for bandwidth.  They are increasing bandwidth at (roughly) constant latency.  Sure, the early versions of each new standard typically have a couple of ns worse first-word latency than the mature parts of the previous generation, but they generally recover that fairly quickly.  In some ways, you can think of DDR4 as a sort of weird SERDES already: a bunch of DRAM pages interleaved with phase-shifted clocks.  It just runs at "only" 3-4 GT/s rather than 10 GT/s.  But DDR5 will run at double that -- so it is already running at low-end transceiver speeds.  Moving to even 16 Gb/s transceivers would still take a ton of channels to match the bandwidth of a single DDR4/5 memory interface.
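Putting rough numbers on "a ton of channels" - counting only line rate and a 64b/66b-style coding overhead; packet and protocol overhead would make it worse:

```python
# 16 Gb/s lanes needed to match a single x64 DDR4-3200 or DDR5-6400 interface.

lane_gbps = 16 * 64 / 66          # ~15.5 Gb/s of payload per 16 Gb/s lane

for name, mts in [("DDR4-3200", 3200), ("DDR5-6400", 6400)]:
    iface_gbps = mts * 64 / 1000
    lanes = iface_gbps / lane_gbps
    # each lane = TX pair + RX pair = 4 traces
    print(f"{name} x64: {iface_gbps:6.1f} Gb/s -> "
          f"~{lanes:.0f} lanes (~{round(lanes)*4} traces)")
```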

Quote
Now here is the thing - those of you who are old enough will recall that this problem has already been solved in the past with PCI bus. Back in the old days "classic" PCI bus has the exact same problem of physical design (multi-drop parallel bus) limiting available bandwidth. How was that problem solved? Yep, by serializing the protocol and using several multi-gigabit serial lanes instead of a crapton of parallel ones. Meet the PCI Express. That bus has proven that you can reach very high speeds even through connectors (16 Gbps per lane for PCI Express 4, 32 Gbps for PCI Express 5).

True, but PCIe -- at least for a long time -- had worse latency than plain old PCI, or especially AGP (which didn't need to do the parallel bus arbitration and so on).  This didn't matter because the things you put on a PCIe interface have way worse latency anyway.  PCIe is especially bad at non-posted read transactions, which is usually solved by DMA, but CPU traffic is dominated by blocking reads since those can't be cached.  PCIe best-case latency also became much worse as soon as you had contention, since it was shared, although AGP didn't need to deal with that.

That said, just because PCIe has a lot of latency doesn't mean that it is completely intrinsic to serial transceivers.  I haven't worked at a low enough level to understand the details, but I am guessing a lot of PCIe latency has to do with the packet transaction model and the need to support switches and so on.  I could believe that a dedicated RAM interface could now reach low enough added latency to be acceptable for many applications, especially things like GPUs and many FPGA applications.  For CPUs I think it would be a really hard challenge to get acceptable performance.
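As a rough illustration of that packet overhead, here is a single 64-byte read over PCIe; the header and framing sizes below are the common textbook values, and the exact framing varies by implementation:

```python
# Bytes on the wire to move one 64-byte cache line over a packetized link.

def tlp_wire_bytes(payload):
    # ~3-4 DW TLP header plus physical framing (start/seq/LCRC/end) per packet
    header, framing = 16, 8
    return header + framing + payload

read_request = tlp_wire_bytes(0)      # non-posted read request, no payload
completion   = tlp_wire_bytes(64)     # completion carrying the 64-byte line
total        = read_request + completion
print(f"{total} bytes on the wire to move 64 bytes "
      f"-> ~{64 / total:.0%} efficiency, before any ACK/flow-control DLLPs")
# A DDR burst moves the same 64 bytes with a few command-bus cycles of overhead,
# which is part of why small random transactions hurt over a packetized link.
```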
 

Offline Lukas

  • Frequent Contributor
  • **
  • Posts: 412
  • Country: de
    • carrotIndustries.net
Re: MoSys Quazar QPR memory - RAM access via transceivers
« Reply #10 on: November 25, 2020, 10:11:15 pm »
They aren't the first to connect DRAM via serdes interfaces. Micron tried with their Hybrid Memory Cube products, but decided to drop them in favor of HBM.
 

Offline langwadt

  • Super Contributor
  • ***
  • Posts: 4857
  • Country: dk
Re: MoSys Quazar QPR memory - RAM access via transceivers
« Reply #11 on: November 26, 2020, 02:00:34 am »
So, like PCIe..?
 

