Author Topic: Multiported RAMs [SOLVED]  (Read 2256 times)


Offline SMB784Topic starter

  • Frequent Contributor
  • **
  • Posts: 421
  • Country: us
    • Tequity Surplus
Multiported RAMs [SOLVED]
« on: February 24, 2021, 05:44:39 pm »
I am looking to make a multiported RAM with a whole lot of ports.  The standard SLICEM RAM blocks for Xilinx parts cap out at about 16 ports, but I would like to make one that has up to 256 read ports.  The RAM width and depth are parametrizable, nominally 10 bits wide and 1024 elements deep respectively.  I am looking into LVT RAMs, but they seem to take up a large number of RAM blocks.  Does anyone have any reasonable suggestions?
« Last Edit: March 06, 2021, 09:10:22 pm by SMB784 »
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 27862
  • Country: nl
    • NCT Developments
Re: Multiported RAMs
« Reply #1 on: February 24, 2021, 06:57:33 pm »
What is the latency you can allow? I have solved this kind of problem using a round-robin arbiter.
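
A minimal behavioural sketch of the round-robin idea, assuming a single shared read port that serves one requester per clock. The function names and the list-based memory are illustrative only, not anything from the post. It shows why the latency question matters: n requesters take n clocks to serve.

```python
def round_robin_grant(requests, last_grant):
    """Return the index of the next requester to serve, or None if idle.

    requests   : list of bools, one per client port
    last_grant : index granted on the previous cycle (-1 if none yet)
    """
    n = len(requests)
    # Start searching one position past the previous winner so that
    # every requester is served within n cycles.
    for offset in range(1, n + 1):
        candidate = (last_grant + offset) % n
        if requests[candidate]:
            return candidate
    return None


def serve_all(memory, addresses):
    """Serve one read per cycle; return (cycles used, data per port)."""
    pending = [True] * len(addresses)
    data = [None] * len(addresses)
    last, cycles = -1, 0
    while any(pending):
        last = round_robin_grant(pending, last)
        data[last] = memory[addresses[last]]
        pending[last] = False
        cycles += 1
    return cycles, data
```

With 128 cores sharing one port this way, a full round of reads would take 128 clocks, so the approach only fits if that much latency is acceptable.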
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 
The following users thanked this post: SMB784

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: Multiported RAMs
« Reply #2 on: February 24, 2021, 07:00:45 pm »
For n-read, 1-write memory, each physical read port you add requires another block of RAM.

The only reasonable solution might be to run the memory at a multiple of the system speed, allowing multiple accesses using the same physical port.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 
The following users thanked this post: SMB784

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 8078
  • Country: ca
Re: Multiported RAMs
« Reply #3 on: February 24, 2021, 07:08:34 pm »
In the 8bit GPU thread, we got 15 read/write ports at 25MHz plus a turbo one at 125MHz, or it could have been 20 ports at 25MHz.  No arbitration; these are true all-parallel accesses.  Going down to 2.5MHz, you could have 200 parallel ports by expanding the code.  This is on a CycloneIV, with no added latency for the 20 ports @ 25MHz, i.e. reads take 2 clock cycles from the time the address is sent. (Note that the CycloneIV only has M9K 2-port RAM, where we used all 480k in 1 chunk; with 16-port RAM, your figures should be at least another 8-fold faster, make that 16-fold with a faster core.  Multiply that again and again if you want to use mirror images of the RAM with half the write ports.)

I would expect the faster Xilinx Artix series to run twice as fast or support twice as many ports.
« Last Edit: February 24, 2021, 07:15:43 pm by BrianHG »
 
The following users thanked this post: SMB784

Offline SMB784Topic starter

  • Frequent Contributor
  • **
  • Posts: 421
  • Country: us
    • Tequity Surplus
Re: Multiported RAMs
« Reply #4 on: February 24, 2021, 07:45:07 pm »
Here's the situation I'm dealing with:

I have a bunch of parallel compute cores (for the sake of argument, let's say there are 128 cores).  Each core reads one random entry per clock from a common memory that is 10 bits wide and 1024 entries deep.  This value is manipulated throughout the computational pipeline; the manipulated value (still 10 bits wide) is written to a second bank at the end of the pipeline.

I would like to keep my read & write latency to 1 clock cycle if possible (i.e. each core only takes 1 cycle to read or write from/to the memory).

One wrinkle here is that my read addresses are not truly random, only pseudorandom.  Specifically, I am using one 10-bit linear-feedback shift register to generate a 128-element array of pseudorandom addresses per clock.  Because LFSRs only repeat at the end of their cycle, there are no bank conflicts for these random addresses as long as I don't let the LFSR run past 8 clock cycles (1024 memory entries divided by 128 pseudorandom numbers per clock means the LFSR only cycles every 8 clocks).  So I was thinking about doing a banked multiport memory (one 1024-element memory, but divided into 128 memory segments 8 elements deep, thus providing 128 read/write ports), but I don't know how to make the banking MUX work so that it assigns the pseudorandom memory address to the correct bank port.  Any thoughts?
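
For anyone wanting to experiment with this scheme in software first, here is a rough model of the address generator. The post does not give the LFSR polynomial, so the taps (bits 10 and 7, XOR feedback) and the bank mapping (low 7 bits of the address select the bank) are both assumptions, and all names are made up for illustration.

```python
WIDTH = 10
MASK = (1 << WIDTH) - 1
N_CORES = 128   # addresses consumed per clock, as in the post
N_BANKS = 128   # assumed bank count for the banked scheme


def lfsr_step(state):
    """Advance a 10-bit Fibonacci LFSR (assumed taps: bits 10 and 7)."""
    feedback = ((state >> 9) ^ (state >> 6)) & 1
    return ((state << 1) | feedback) & MASK


def period(seed=1):
    """Number of steps before the LFSR returns to `seed`."""
    state, steps = lfsr_step(seed), 1
    while state != seed:
        state, steps = lfsr_step(state), steps + 1
    return steps


def address_batch(state, count=N_CORES):
    """Return (`count` successive addresses, LFSR state after the batch)."""
    batch = []
    for _ in range(count):
        batch.append(state)
        state = lfsr_step(state)
    return batch, state


def bank_conflicts(batch, n_banks=N_BANKS):
    """Count addresses in a batch that land in an already-used bank."""
    used = set()
    conflicts = 0
    for addr in batch:
        bank = addr % n_banks  # assumed mapping: low 7 bits pick the bank
        if bank in used:
            conflicts += 1
        used.add(bank)
    return conflicts
```

Two things are worth checking with a model like this. A maximal-length 10-bit XOR LFSR has 1023 states rather than 1024 and never produces address 0, so eight batches of 128 wrap back to the first address on the final read. Also, distinct addresses within a batch do not by themselves guarantee distinct banks; that depends on the address-to-bank mapping, which is what `bank_conflicts` is there to test.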
« Last Edit: February 24, 2021, 07:49:55 pm by SMB784 »
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2794
  • Country: ca
Re: Multiported RAMs
« Reply #5 on: February 24, 2021, 08:45:37 pm »
So far it sounds like an XY problem to me. Can you please tell us what are you trying to accomplish, not how are you trying to do it?

On the technical side: if that memory is only read from multiple ports but written to through a single port, you can just duplicate your memories like hamster_nz suggested. Also, if there is a way to somehow predict access addresses, you can go wide and read a bunch of values in a single cycle.
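
A small behavioural sketch of the duplication approach, assuming one full copy of the contents per read port and a single write port that updates every copy. The class and method names are illustrative, not a vendor primitive.

```python
class ReplicatedRam:
    """1-write, n-read memory built from n identical copies."""

    def __init__(self, n_read_ports, depth=1024, width=10):
        self.depth = depth
        self.width = width
        self.mask = (1 << width) - 1
        # One full copy of the contents per read port.
        self.copies = [[0] * depth for _ in range(n_read_ports)]

    def write(self, addr, value):
        # The single write port broadcasts to every copy,
        # which keeps all copies identical.
        for copy in self.copies:
            copy[addr] = value & self.mask

    def read_all(self, addrs):
        # Port i reads only copy i, so every read happens in the same cycle
        # regardless of which addresses are requested.
        return [self.copies[i][a] for i, a in enumerate(addrs)]

    def total_bits(self):
        """Storage cost: the data is held once per read port."""
        return len(self.copies) * self.depth * self.width
```

The cost scales linearly with the port count: for 128 read ports on a 1024 x 10 memory, `total_bits()` gives 1,310,720 bits of storage for 10,240 bits of actual data.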

Offline SMB784Topic starter

  • Frequent Contributor
  • **
  • Posts: 421
  • Country: us
    • Tequity Surplus
Re: Multiported RAMs
« Reply #6 on: February 24, 2021, 09:41:10 pm »
So far it sounds like an XY problem to me. Can you please tell us what are you trying to accomplish, not how are you trying to do it?

Sure, it's what I said in my previous post:

I need to read 128 data elements from a shared memory simultaneously, perform some math on each of them in parallel, and push the results to another memory at the end of the pipe.  I'm trying to reduce latency, so 1 read/write per clock is preferable. The memory addresses that I am using to read are deterministic, not random (pseudorandom really, but with a deterministic period), and never conflict.  The read side memory control at the moment is the more important one.
« Last Edit: February 24, 2021, 09:43:02 pm by SMB784 »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: Multiported RAMs
« Reply #7 on: February 25, 2021, 12:19:53 am »
What are the values you are reading? Constants? Samples? Magic numbers (for crypto) ?


Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 8078
  • Country: ca
Re: Multiported RAMs
« Reply #8 on: February 25, 2021, 12:23:31 am »
My memory addresses that I am using to read are deterministic, not random (pseudorandom really, but with a deterministic period), and never conflict.  The read side memory control at the moment is the more important one.
Read ahead if the address is deterministic... 0 clock delay reads.
 
The following users thanked this post: Someone, SMB784

Offline SMB784Topic starter

  • Frequent Contributor
  • **
  • Posts: 421
  • Country: us
    • Tequity Surplus
Re: Multiported RAMs
« Reply #9 on: February 25, 2021, 12:27:59 am »
What are the values you are reading? Constants? Samples? Magic numbers (for crypto) ?

I am reading 10 bit unsigned integers from the read memory bank, and writing 10 bit unsigned integers to the write memory bank.  These numbers represent measurement data taken by my system that I am manipulating mathematically via an equation implemented in hardware, and storing the results in the write memory bank.

Quote
Read ahead if the address is deterministic... 0 clock delay reads.
I'm a little new to this, what do you mean by read ahead?

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15258
  • Country: fr
Re: Multiported RAMs
« Reply #10 on: February 25, 2021, 12:48:42 am »
256 read ports for a 1024-word deep memory? That means that 1/4 of the memory should be readable on the same cycle (if you really want that), as worst case would be 256 different addresses. That will in effect take up a lot of resources. 256 ports for simultaneous access sound insane resource-wise.

Best bet would be to pipeline it as already suggested. This will additionally allow you to pipeline any processing you'll have to do on this data.
Any reason why latency here would be a problem?
 
The following users thanked this post: SMB784

Offline SMB784Topic starter

  • Frequent Contributor
  • **
  • Posts: 421
  • Country: us
    • Tequity Surplus
Re: Multiported RAMs
« Reply #11 on: February 25, 2021, 12:59:38 am »
256 read ports for a 1024-word deep memory? That means that 1/4 of the memory should be readable on the same cycle (if you really want that), as worst case would be 256 different addresses. That will in effect take up a lot of resources. 256 ports for simultaneous access sound insane resource-wise.

Best bet would be to pipeline it as already suggested. This will additionally allow you to pipeline any processing you'll have to do on this data.
Any reason why latency here would be a problem?

Fair enough, I figured this might be a bridge too far. I will look into pipelining strategies for this memory controller.

As regards latency, I need to complete an entire calculation before the next measurement is acquired, because the program determines whether the next measurement point is taken, and the faster I can do that, the higher bandwidth response I get.

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 8078
  • Country: ca
Re: Multiported RAMs
« Reply #12 on: February 25, 2021, 01:01:44 am »
Quote
Read ahead if the address is deterministic... 0 clock delay reads.
I'm a little new to this, what do you mean by read ahead?
Well, as an example, with Cyclones, the large M9K dual port rams operate at the highest possible FMAX with a read pipeline delay of 2 system clocks, so, I always use that 2 clock cycle delay and have to engineer around it.

Now in your case, if you know the read address ahead of time, just send that address the 1 or 2 clocks early and your 'read' data will be ready by the time you need it, IE: 0 clock delay.

Remember, just because the read takes 1 or 2 clocks until you get a result, this doesn't mean you have to wait for that result before you can send in another address.  You can send a new address on every single clock cycle.  It's only that the 'data' coming out of the RAM block is the result of the address you sent in 2 clock cycles earlier.  This is what we mean by pipelining.  This is akin to a look-ahead cache.  Learning how to deal with and code around this is a must with FPGAs if you are targeting pure throughput performance.

However, if you are waiting for the data to be written first, it already sounds like your processor is faster than the acquisition rate, so why use any RAM at all?  Process in real time.
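
The read-ahead behaviour described above can be modelled in a few lines. This is only a sketch, assuming a fixed 2-clock read latency and a read-only memory; the class and function names are invented for the example.

```python
from collections import deque


class PipelinedRam:
    """Read-only RAM model: data appears `latency` clocks after its address."""

    def __init__(self, contents, latency=2):
        self.contents = list(contents)
        # Reads currently in flight; None marks an empty pipeline slot.
        self.pipe = deque([None] * latency)

    def clock(self, addr):
        """Present one address; return data for the address sent `latency` clocks ago."""
        self.pipe.append(self.contents[addr])
        return self.pipe.popleft()


def read_stream(contents, addresses, latency=2):
    """Issue one address per clock and collect the results in order."""
    ram = PipelinedRam(contents, latency)
    # Pad with dummy addresses so the last real results drain out of the pipe.
    padded = list(addresses) + [0] * latency
    outputs = [ram.clock(a) for a in padded]
    # The first `latency` outputs are empty pipeline slots, not data.
    return outputs[latency:]
```

For example, `read_stream([10, 11, 12, 13, 14], [4, 1, 3, 0, 2])` returns `[14, 11, 13, 10, 12]`: five reads complete in seven clocks, not ten, because a new address goes in on every clock while earlier reads are still in flight.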
« Last Edit: February 25, 2021, 01:03:17 am by BrianHG »
 
The following users thanked this post: SMB784

Online Someone

  • Super Contributor
  • ***
  • Posts: 4914
  • Country: au
    • send complaints here
Re: Multiported RAMs
« Reply #13 on: February 25, 2021, 01:45:25 am »
This sounds like the classic jumbling/packing/byte alignment tasks that FPGAs are great at, but only when you stop thinking about linear address buffers. As above, if each data sink is retrieving a set pattern of addresses then it's a tiny FIFO (or single layer register) that only gets written to on those addresses. The tools won't magically infer efficient patterns for this when you have prior knowledge of things like all the n port addresses being sure to be unique. At that point it's time to refactor the algorithm to match the target architecture.
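
One way to read the suggestion above, sketched in software: stream the data past every sink in address order and let each sink capture only the addresses it cares about. This assumes each sink's address set is known at design time; the function and parameter names are hypothetical.

```python
from collections import deque


def distribute(stream, wanted):
    """Route a linear (address, value) stream into per-sink FIFOs.

    stream : iterable of (address, value) pairs in arrival order
    wanted : list with one set of addresses per sink (assumed known up front)
    """
    fifos = [deque() for _ in wanted]
    for addr, value in stream:
        for fifo, addresses in zip(fifos, wanted):
            # In hardware each sink's write enable would be an address compare.
            if addr in addresses:
                fifo.append(value)
    return fifos
```

Note that each FIFO receives its values in stream order, not in the order the sink would have requested them, so the consuming logic has to be written around that ordering.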
 
The following users thanked this post: SMB784

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3243
  • Country: ca
Re: Multiported RAMs
« Reply #14 on: February 25, 2021, 04:37:06 am »
Now in your case, if you know the read address ahead of time, just send that address the 1 or 2 clocks early and your 'read' data will be ready by the time you need it, IE: 0 clock delay.

I think the OP's problem is throughput, not latency.

You can read only so many times from the memory in a single clock through a single port, probably around 400 M/sec. If you need to read at every cycle at 400 MHz, you will get only one read per clock, whether you read ahead or not.

If your clock is 10 MHz, you can read 40 times per clock if the memory runs at 400 MHz. This has already been suggested, but the OP didn't like the idea. Reading this way saves resources, but lowers the system clock speed.

If you want a 400 MHz clock, every read must have its own port. This is the fastest, but will take many more resources.
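
The trade-off above is simple arithmetic. A quick sketch, assuming each memory copy has one read port that can be time-multiplexed at the memory clock rate; the 400 MHz figure is the rough estimate from this post, not a datasheet value, and the function names are made up.

```python
import math


def reads_per_system_clock(f_mem_hz, f_sys_hz):
    """How many reads one physical port can serve in one system clock."""
    if f_sys_hz > f_mem_hz:
        raise ValueError("system clock cannot exceed the memory clock")
    return f_mem_hz // f_sys_hz


def copies_needed(n_reads, f_mem_hz, f_sys_hz):
    """Memory copies needed when each copy's read port is time-multiplexed."""
    return math.ceil(n_reads / reads_per_system_clock(f_mem_hz, f_sys_hz))
```

With these assumptions, 128 reads per system clock need 4 copies at a 10 MHz system clock (`copies_needed(128, 400_000_000, 10_000_000)`) but 128 copies at a 400 MHz system clock.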

The OP doesn't show us the big picture. Only his memory-based design. Perhaps, a different approach would yield simpler design and better resource usage.
 
The following users thanked this post: SMB784

Online nctnico

  • Super Contributor
  • ***
  • Posts: 27862
  • Country: nl
    • NCT Developments
Re: Multiported RAMs
« Reply #15 on: February 26, 2021, 12:53:06 am »
So far it sounds like an XY problem to me. Can you please tell us what are you trying to accomplish, not how are you trying to do it?

Sure, it's what I said in my previous post:

I need to read 128 data elements from a shared memory simultaneously, perform some math on each of them in parallel, and push the results to another memory at the end of the pipe.  I'm trying to reduce latency, so 1 read/write per clock is preferable. My memory addresses that I am using to read are deterministic, not random (pseudorandom really, but with a deterministic period), and never conflict.  The read side memory control at the moment is the more important one.
If that is the case you need to create a very wide memory to have a lot of bandwidth and use that to fill FIFOs (or caches but FIFOs are simpler) with data to be processed. The memory width is dictated by the amount of sustained bandwidth you need & cycle time the memory can offer. Going for a multi-port memory design is the wrong way because your processing seems to consist of individual parallel units which don't interact. The FIFOs can be filled at full speed and read at a slower pace (and vice versa at the results side which gets written to memory).
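
A rough software model of the wide-memory-plus-FIFOs arrangement, assuming the data has been laid out so that lane i of every wide word belongs to processing unit i. The packing format (lane 0 in the least significant bits) and all names are assumptions made for the sketch.

```python
from collections import deque


def pack(values, width=10):
    """Pack a list of `width`-bit values into one wide word (lane 0 in the LSBs)."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & ((1 << width) - 1)) << (lane * width)
    return word


def unpack(word, lanes, width=10):
    """Split one wide word back into `lanes` values of `width` bits each."""
    mask = (1 << width) - 1
    return [(word >> (lane * width)) & mask for lane in range(lanes)]


def fill_fifos(wide_memory, lanes, width=10):
    """One wide read per clock pushes one value into every unit's FIFO."""
    fifos = [deque() for _ in range(lanes)]
    for word in wide_memory:
        for lane, value in enumerate(unpack(word, lanes, width)):
            fifos[lane].append(value)
    return fifos
```

For 128 units at 10 bits each, the wide word would be 1280 bits, and a 1024-entry data set would occupy only 8 such words. The catch is that the data must already be arranged per lane when it is written.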
« Last Edit: February 26, 2021, 12:54:57 am by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 
The following users thanked this post: SMB784

Offline SMB784Topic starter

  • Frequent Contributor
  • **
  • Posts: 421
  • Country: us
    • Tequity Surplus
Re: Multiported RAMs
« Reply #16 on: March 06, 2021, 08:56:26 pm »
I figured it out using a banked, multiport memory solution: an array of n partitions of my RAM, routed through n-to-1 MUXes on the n output ports.

Thanks for your help guys :)
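
The post does not include the final design, so the following is only a guess at its general shape: n banks, each serving one read per clock, with an n-to-1 mux on every output port selecting the bank that holds that port's address. The interleaved mapping (bank = address mod n) and the names are assumptions.

```python
class BankedRam:
    """n-bank memory with an n-to-1 output mux per read port."""

    def __init__(self, contents, n_banks):
        if len(contents) % n_banks != 0:
            raise ValueError("depth must be a multiple of the bank count")
        self.n_banks = n_banks
        # Assumed interleaving: address a lives in bank a % n, row a // n.
        self.banks = [list(contents[b::n_banks]) for b in range(n_banks)]

    def read_all(self, addrs):
        n = self.n_banks
        bank_of = [a % n for a in addrs]
        # Each bank has a single read port, so two addresses in the same
        # bank cannot be served in the same clock.
        if len(set(bank_of)) != len(addrs):
            raise ValueError("bank conflict")
        # Every bank that is addressed reads out one row this clock.
        bank_out = [None] * n
        for a in addrs:
            bank_out[a % n] = self.banks[a % n][a // n]
        # n-to-1 mux per output port: pick the bank holding this port's address.
        return [bank_out[b] for b in bank_of]
```

This only works when every clock's set of addresses falls in distinct banks, which is the property the address generator has to guarantee.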
« Last Edit: March 06, 2021, 09:09:56 pm by SMB784 »
 

