
MoSys Quazar QPR memory - RAM access via transceivers


asmi:
Just came across this product: https://mosys.com/products/quazar-family/quazar-qpr-quad-partition-rate-memories/ (before anybody gets any ideas - I'm not affiliated with the company in any way, and I doubt they would even give me a few samples of these things 'cause they are expensive).
The basic idea is to tunnel all memory traffic through a bunch of 10G+ transceivers. This idea came to my mind before, but I didn't think there were any existing devices that were actually doing it.
Now, the reason that specific device is expensive is that it's a super-high-speed dual-port SRAM, but I would like to discuss the basic idea. What if instead of SRAM there were a bunch of DDRx dies, or - better yet - an HBM2 stack or two? The advantages are quite obvious - routing 4-8-16 10G differential pairs is orders of magnitude easier than an x64 DDR3/4 memory interface, the entire solution will be much more compact, and it will likely be more power-efficient (DDRx wastes quite a bit of power on termination alone, never mind anything else!).

I wonder if anyone has done any experiments in this area. I know in the Xilinx world there is a free IP to tunnel AXI4 bus requests between FPGAs over a variety of physical transports (including transceivers), so I wonder if anyone has any experience to share.
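
For a rough sense of scale, here's a quick back-of-envelope Python sketch comparing lane counts and trace counts. The line rate, encoding and DDR4 speed grade are my own assumptions, not numbers from any particular device:

--- Code: ---
import math

# Back-of-envelope comparison (assumed figures, not vendor numbers):
# how many 10.3125 Gb/s lanes with 64b/66b encoding are needed to match
# the peak bandwidth of a 64-bit DDR4-2400 interface.

ddr_mt_s      = 2400                         # assumed DDR4-2400 speed grade
ddr_width     = 64                           # data bus width in bits
ddr_peak_gbps = ddr_mt_s * ddr_width / 1000  # 153.6 Gb/s peak

lane_line_rate = 10.3125                     # Gb/s, a common 10G line rate
lane_payload   = lane_line_rate * 64 / 66    # ~10.0 Gb/s after 64b/66b

lanes = math.ceil(ddr_peak_gbps / lane_payload)
print(f"DDR4 peak: {ddr_peak_gbps:.1f} Gb/s -> {lanes} lanes to match it")
print(f"Serial traces: {lanes * 4} (TX+RX diff pairs) vs 120+ signals for x64 DDR4")
--- End code ---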

Boscoe:
I'm no expert and I'm not entirely sure I understand your suggestion, but I imagine it's not done because transceivers take up a lot of die area, increasing cost. Custom-packaged memories with some logic and their own transceivers would presumably be costly to develop and too specific to fit many applications. You can achieve the same thing with several DDRx memories and the internal logic of the FPGA, so why add a load of 10G transceivers as well? For ease of development? By that point you've already got seven-figure funding, so it's not really a problem.

ejeffrey:
I think the biggest problem would be latency. Transceivers typically add a minimum of a few cycles of delay in the parallel clock domain on each end. 10 ns of latency added to a high-speed network interface is nothing, but for DRAM every nanosecond counts. In addition, most transceiver-based interfaces are packetized, and packet processing would add additional latency and overhead for the small transactions common with RAM. Rambus DRAM back in the P4 era used a fast, narrow (but parallel) bus with packetized traffic and suffered latency compared to the DDR SDRAM it competed with.
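
A crude Python stack-up of where that latency would come from (every figure below is an assumption for illustration, not a measurement):

--- Code: ---
# Crude latency stack-up for a single read over a serial link.
# Every figure here is an assumption for illustration, not a measurement.

parallel_clk_ns    = 3.2   # assumed 312.5 MHz parallel-side clock
tx_pipeline_cycles = 4     # assumed PCS/PMA pipeline depth, TX side
rx_pipeline_cycles = 4     # assumed PCS/PMA pipeline depth, RX side
packet_overhead_ns = 10    # assumed framing/CRC/arbitration per transaction
flight_time_ns     = 2     # ~30 cm of trace at roughly 6 ns/m

serdes_ns = (tx_pipeline_cycles + rx_pipeline_cycles) * parallel_clk_ns
# The request crosses the link once and the read data crosses it once:
added_ns = 2 * (serdes_ns + packet_overhead_ns + flight_time_ns)
print(f"Extra round-trip latency vs a local parallel bus: ~{added_ns:.0f} ns")
# For comparison, the DRAM's own CAS latency is on the order of 14 ns.
--- End code ---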

I'm not sure how power would come out. 10G+ transceivers aren't exactly low-power devices, and you would probably still have a smaller parallel bus on the DRAM side. In principle each memory chip could have an on-chip controller and a dedicated lane, but that would be inflexible, and it might also make latency worse depending on how you stripe your data. I guess you would more likely end up with a memory controller on each memory module with ~4 lanes dedicated to it and a parallel array of DRAM chips connected to that controller. My guess then is that any power savings on the DRAM parallel interface would be more than cancelled out by the transceivers.
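
A very rough sketch of that trade-off in Python. Every number below is a placeholder assumption, just to show the shape of the comparison:

--- Code: ---
# Very rough power comparison -- every number below is a placeholder
# assumption, just to show the shape of the trade-off.

lanes       = 16
mw_per_lane = 150                       # assumed per 10G SerDes lane (TX+RX)
serdes_mw   = lanes * mw_per_lane * 2   # transceivers exist at both link ends

ddr_io_mw      = 1500   # assumed x64 DDR3/4 I/O + ODT budget at one end
saved_fraction = 0.5    # the parallel bus still exists inside the module,
                        # just shorter and unterminated, so only part is saved

print(f"SerDes power added:  ~{serdes_mw} mW")
print(f"DDR I/O power saved: ~{ddr_io_mw * saved_fraction:.0f} mW")
--- End code ---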

I do think it could be useful for some applications that aren't super latency sensitive, especially if you need only moderate bandwidth -- even down to a single lane.

asmi:

--- Quote from: ejeffrey on November 24, 2020, 10:38:10 pm ---I think the biggest problem would be latency. Transceivers typically add a minimum of a few cycles of delay in the parallel clock domain on each end. 10 ns of latency added to a high-speed network interface is nothing, but for DRAM every nanosecond counts. In addition, most transceiver-based interfaces are packetized, and packet processing would add additional latency and overhead for the small transactions common with RAM. Rambus DRAM back in the P4 era used a fast, narrow (but parallel) bus with packetized traffic and suffered latency compared to the DDR SDRAM it competed with.
--- End quote ---
Latency is rarely a problem for FPGA designs - as long as it's consistent, you can always pipeline your design to accommodate it. Bandwidth and throughput have always been more important in my designs than latency. As a matter of fact, the only time I can remember latency being critical was in the ALU of my RV64I core. In all other cases the limiting factor was usually bandwidth, not even FPGA resources.
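
A Little's-law style Python sketch of what "pipeline your design to accommodate latency" costs in practice. The link width, encoding and round-trip latency are my assumptions:

--- Code: ---
# Little's-law style sketch: how many transactions must a pipelined design
# keep in flight to saturate the link despite the added latency?
# Link width, encoding and latency are assumptions.

lanes         = 16
payload_gbps  = lanes * 10.3125 * 64 / 66   # ~160 Gb/s of payload
round_trip_ns = 100                         # assumed request->data latency
burst_bytes   = 64                          # one cache-line-sized burst

bytes_per_ns    = payload_gbps / 8          # Gb/s -> bytes per ns
bytes_in_flight = bytes_per_ns * round_trip_ns
outstanding     = bytes_in_flight / burst_bytes
print(f"~{outstanding:.0f} outstanding {burst_bytes}-byte bursts needed")
# Keep that many requests in flight and the latency costs no bandwidth --
# it only costs FIFO depth and reorder buffering.
--- End code ---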


--- Quote from: ejeffrey on November 24, 2020, 10:38:10 pm ---I'm not sure how power would come out. 10G+ transceivers aren't exactly low-power devices, and you would probably still have a smaller parallel bus on the DRAM side. In principle each memory chip could have an on-chip controller and a dedicated lane, but that would be inflexible, and it might also make latency worse depending on how you stripe your data. I guess you would more likely end up with a memory controller on each memory module with ~4 lanes dedicated to it and a parallel array of DRAM chips connected to that controller. My guess then is that any power savings on the DRAM parallel interface would be more than cancelled out by the transceivers.

--- End quote ---
The idea is to have a system-in-package, where the DRAM dies are connected to the "controller" die via an interposer. This way you can have your DDRx dies running at quite high frequencies without the need for termination, because the lines are going to be extremely short (following the 1/10 rule).
So the conceptual solution will be the same as the product I linked in the OP - you will have an FPGA connected to a memory device via a handful of serial links, so you will have 16 or 32 traces to route as opposed to 120+ for a typical DDR3/4 interface.
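
A quick Python sketch of the 1/10 rule-of-thumb I mentioned. The rise time and propagation delay are assumed, typical-ish values:

--- Code: ---
# Sketch of the 1/10 rule-of-thumb: a line can usually go unterminated when
# its propagation delay is under ~1/10 of the signal rise time.
# Rise time and propagation delay below are assumed, typical-ish values.

rise_time_ps         = 100   # assumed driver rise time
prop_delay_ps_per_mm = 7     # roughly 7 ps/mm for a stripline in FR-4

critical_len_mm = (rise_time_ps / 10) / prop_delay_ps_per_mm
print(f"Unterminated line budget: ~{critical_len_mm:.1f} mm")
# A couple of millimetres is easy on an interposer or package substrate,
# but hopeless for DRAM routed across a board to a DIMM.
--- End code ---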

Daixiwen:
Latency becomes a problem if you are doing a lot of random accesses to memory. However, if you are only doing sequential accesses, then using transceivers could give you better performance.
A similar idea was used in the late 90s with the Rambus interface on Pentium 4 machines (not actually using transceivers, but replacing the parallel bus with a partially serialized protocol using fewer signals). Even though in theory the bandwidth was higher than SDR SDRAM, in practice Rambus systems were a lot slower in most real-world uses because of the added latency, and it was abandoned, at least in PCs.
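
A small Python sketch of why the access pattern matters so much, assuming a 150 ns effective random-access latency over the link, 20 GB/s of link bandwidth and one access outstanding at a time (all assumptions on my part):

--- Code: ---
# Effective throughput of a single access stream vs access size, to show
# why latency hurts random accesses but barely matters for streaming.
# Assumed: 150 ns effective random-access latency over the link,
# 20 GB/s of link bandwidth, one access outstanding at a time.

latency_ns = 150.0
bw_gb_s    = 20.0                    # GB/s, i.e. bytes per ns

for size_bytes in (64, 4096, 1 << 20):
    transfer_ns = size_bytes / bw_gb_s
    effective   = size_bytes / (latency_ns + transfer_ns)   # bytes/ns = GB/s
    print(f"{size_bytes:>8} B accesses: ~{effective:5.1f} GB/s effective")
--- End code ---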
