Topic: FPGA with 64 kByte en-bloc RAM?

ebastler (Topic starter)

  • Super Contributor
  • ***
  • Posts: 6840
  • Country: de
FPGA with 64 kByte en-bloc RAM?
« on: April 24, 2020, 09:04:56 am »
I am playing with a design which replicates an old 8-bit microprocessor and 64 kByte of RAM on an FPGA. I'd like this to run as fast as possible, preferably at 100 MHz.

I have currently implemented this on a Spartan-6 (LX9 size, -3 speed grade). Unfortunately the access times to the on-chip Block RAM, and specifically the network path delays, limit the clock rate to approx. 70 MHz. While the CPU core is small, all RAM blocks on the chip are required for the 64 kByte RAM. That results in painfully long signal paths for both the address and data buses, at least to the "outer" RAM blocks.

Since a CPU cycle requires the data to travel from RAM to CPU, be processed to determine the new address (among other things), and then the address to travel back to the RAM in preparation for the next cycle, the path delays enter the cycle time twice. They easily add up to 5 ns, eating up half of my target cycle time.

Is there an FPGA family in roughly the same price and performance class as the Spartan-6 which has larger, more "centralized" RAM blocks on-chip? Thanks!

(Any other suggestions on how things could be sped up are appreciated as well, of course! I have thought about caching, but that gets unwieldy very quickly...)
 

BrianHG

  • Super Contributor
  • ***
  • Posts: 7999
  • Country: ca
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #1 on: April 24, 2020, 09:20:39 am »
Just get a CPU with a 1-instruction pre-fetch cache. (It's actually a fake cache: it uses a block RAM with latched address inputs and latched data outputs, giving a 2-clock delay, plus 1 additional output register to test the opcode and see whether it's a call/goto/return.)  I made my own 16-bit PIC-compatible CPU on an Altera Cyclone, running at a full 200 MHz on the slow speed grades and 280 MHz on the higher ones.  Even though the RAM reads have a 2-clock-cycle delay, since I increment the read address on every CPU cycle, the processor gives me the true 200 MIPS.  The trick is that all jumps/gotos/calls take 3 clock cycles: after a call or goto, the cached next instruction already in the RAM read pipe is the wrong one, so it is forced to a NOP while the instruction at the new call/goto address takes an additional clock to become valid.  This is almost like a normal PIC, where call/goto/retlw/return take 2 CPU cycles.

As for CPU RAM access, writes also have a 1-byte cache, so that a re-read of the address just written overrides the read from block memory. That way a string of inc/dec/subwf/addwf/iorwf/andwf/xorwf always has the right source data ready, since a write to block memory takes additional clock cycles to pipe through.

All ALU functions happen 1 clock after a read instruction, as the instruction itself carries the required data RAM read/write address.

Note that this was 15 years ago.  I remember having a similar speed barrier and I believe this is how I broke it.
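
A minimal sketch of the prefetch-and-squash idea described above, not the original PIC core: the module name, port names, widths and the NOP encoding are all made up for illustration. It assumes a memory with a latched address and a latched data output (2 clocks of read latency) and a program counter that advances on every clock. When the execute stage signals a jump, the two words already in the read pipe are replaced by NOPs, so a taken jump costs 3 clocks in total.

```verilog
// Hypothetical sketch: 2-clock instruction read pipe with NOP squash after a jump.
module prefetch_sketch #(
    parameter AW = 10,                       // address width (assumed)
    parameter DW = 16,                       // instruction width (assumed)
    parameter [DW-1:0] NOP = {DW{1'b0}}      // encoding treated as "no operation" (assumed)
) (
    input  wire          clk,
    input  wire          reset,
    input  wire          jump,     // execute stage takes a jump in this clock
    input  wire [AW-1:0] target,   // jump destination
    output wire [DW-1:0] instr,    // instruction handed to the execute stage
    output wire          valid     // 0 while wrong-path words are being squashed
);
    reg [DW-1:0] mem [0:(1<<AW)-1]; // program memory, contents loaded elsewhere
    reg [AW-1:0] pc;
    reg [AW-1:0] addr_q;            // latched address input
    reg [DW-1:0] data_q;            // latched data output
    reg [1:0]    kill;              // one bit per word still in flight on the wrong path

    always @(posedge clk) begin
        addr_q <= pc;               // the read pipe advances on every clock
        data_q <= mem[addr_q];

        if (reset) begin
            pc   <= {AW{1'b0}};
            kill <= 2'b11;          // nothing useful in the pipe yet
        end else if (jump && valid) begin
            pc   <= target;
            kill <= 2'b11;          // the next two words were fetched from the old path
        end else begin
            pc   <= pc + 1'b1;
            kill <= {1'b0, kill[1]};
        end
    end

    assign valid = ~kill[0];
    assign instr = valid ? data_q : NOP;
endmodule
```

With this arrangement straight-line code still issues one instruction per clock; only taken jumps pay for the two squashed slots.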
« Last Edit: April 24, 2020, 09:32:55 am by BrianHG »
 

ebastler (Topic starter)
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #2 on: April 24, 2020, 10:01:41 am »
Thank you, Brian! I can see how the instruction pre-fetch would work round the bottleneck while accessing multi-byte instructions and multiple instructions in a row -- as long as they don't branch or access data from elsewhere in the RAM to break the flow.

I should probably mention that I am fiddling with the 6502 specifically (yeah, nostalgia...). Which means a minimal complement of registers, and hence frequent read and write access to data in the RAM. But still -- assuming that I can place the zero-page and stack RAM right next to the CPU to allow regular 1-cycle access to these, then only instructions with absolute memory addressing should incur an extra cycle penalty. (And of course branches, subroutine calls etc.)

Thanks for the idea! Unless someone comes up with a magic bullet in the shape of an FPGA architecture which is ideal for my (somewhat untypical) requirement, I'll move ahead with the Spartan-6 and think about tweaking the design via pre-fetch/caching in a second step.
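
A small sketch of the address split discussed above, assuming the standard 6502 memory map (zero page at $0000-$00FF, stack at $0100-$01FF). The module and signal names are hypothetical; the point is only that the "fast" region is selected by the top seven address bits being zero, so the decode itself is cheap.

```verilog
// Hypothetical sketch: pick out zero-page and stack accesses on a 6502 address bus.
module mem_region_decode (
    input  wire [15:0] addr,
    output wire        is_zero_page,   // $0000-$00FF
    output wire        is_stack,       // $0100-$01FF
    output wire        use_fast_ram    // either of the two pages above
);
    assign is_zero_page = (addr[15:8] == 8'h00);
    assign is_stack     = (addr[15:8] == 8'h01);
    assign use_fast_ram = (addr[15:9] == 7'd0);   // lowest 512 bytes
endmodule
```

use_fast_ram could then steer accesses to a 512-byte RAM placed next to the core, with everything else going to the slower bulk RAM.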
 

BrianHG
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #3 on: April 24, 2020, 10:52:13 am »
Quote from: ebastler on April 24, 2020, 10:01:41 am
Thank you, Brian! I can see how the instruction pre-fetch would work round the bottleneck while accessing multi-byte instructions and multiple instructions in a row -- as long as they don't branch or access data from elsewhere in the RAM to break the flow.

I should probably mention that I am fiddling with the 6502 specifically (yeah, nostalgia...). Which means a minimal complement of registers, and hence frequent read and write access to data in the RAM. But still -- assuming that I can place the zero-page and stack RAM right next to the CPU to allow regular 1-cycle access to these, then only instructions with absolute memory addressing should incur an extra cycle penalty. (And of course branches, subroutine calls etc.)


Thanks for the idea! Unless someone comes up with a magic bullet in the shape of an FPGA architecture which is ideal for my (somewhat untypical) requirement, I'll move ahead with the Spartan-6 and think about tweaking the design via pre-fetch/caching in a second step.

My program counter had a 'hardware fixed register' stack.  Though the stack was 24 or 32 branches long, since it was a simple first-in, last-out sequencer sitting right on the program counter, it had 0 instruction-cycle wait states.  E.g., here is my program counter (please be gentle, I was a beginner 20 years ago when I wrote it).
(This module generates a 'skip_op_out' flag, which is sent to the CPU/ALU and fed back into this module's 'skip_op' input when a goto/call/return or interrupt has taken place, so the rest of the system knows to ignore the cached instruction in the 1-instruction pre-fetch.  This module also supports paging in the program counter, so you can have 16 megabytes of addressable memory with a 16-bit data path.)

Code:
module pc24 ( reset, clk, ena, skip_op, op_code, work_reg, pc, skip_op_out, ppage_sel );

input  reset, clk, ena, skip_op;
input  [23:0] op_code;
input  [15:0] work_reg;

output [15:0] pc;
// pc_return1 is the top of the 24-entry hardware return stack, pc_returno the bottom
reg    [15:0] pc, last_pc, pc_return1, pc_return2, pc_return3, pc_return4, pc_return5, pc_return6, pc_return7, pc_return8;
reg    [15:0] pc_return9, pc_returna, pc_returnb, pc_returnc, pc_returnd, pc_returne, pc_returnf, pc_returng;
reg    [15:0] pc_returnh, pc_returni, pc_returnj, pc_returnk, pc_returnl, pc_returnm, pc_returnn, pc_returno;

output skip_op_out;
reg    skip_op_out;

input  [15:0] ppage_sel;

//******************************************
// PROGRAM COUNTER
//******************************************
integer opc_24;

always @ ( posedge clk ) begin
    // Jump target: low 16 bits from the opcode, page bits from ppage_sel
    // (only opc_24[15:0] is used further down)
    opc_24[15:0]  = op_code[15:0];
    opc_24[23:16] = ppage_sel[7:0];

    if (reset) pc <= 0;
    else if (ena) begin
        last_pc <= pc;
        if (op_code[15+8] && !skip_op) begin // flow-control opcode that is not being skipped
            skip_op_out <= 1;                // the instruction already pre-fetched must be ignored

            if ( (op_code[14+8] == 'b0 ) && op_code[12+8] ) begin // any call
                // push the return address; every entry moves down one slot
                pc_return1 <= last_pc + 1;
                pc_return2 <= pc_return1 ;
                pc_return3 <= pc_return2 ;
                pc_return4 <= pc_return3 ;
                pc_return5 <= pc_return4 ;
                pc_return6 <= pc_return5 ;
                pc_return7 <= pc_return6 ;
                pc_return8 <= pc_return7 ;
                pc_return9 <= pc_return8 ;
                pc_returna <= pc_return9 ;
                pc_returnb <= pc_returna ;
                pc_returnc <= pc_returnb ;
                pc_returnd <= pc_returnc ;
                pc_returne <= pc_returnd ;
                pc_returnf <= pc_returne ;
                pc_returng <= pc_returnf ;
                pc_returnh <= pc_returng ;
                pc_returni <= pc_returnh ;
                pc_returnj <= pc_returni ;
                pc_returnk <= pc_returnj ;
                pc_returnl <= pc_returnk ;
                pc_returnm <= pc_returnl ;
                pc_returnn <= pc_returnm ;
                pc_returno <= pc_returnn ;
            end

            if      (op_code[14+8:13+8] == 'b00) pc <= opc_24[15:0];   // goto or call a
            else if (op_code[14+8:13+8] == 'b01) pc <= opc_24[15:0] + work_reg[15:0]; // goto or call a + work_reg
            else if (op_code[14+8:13+8] == 'b11) begin   // return, or retlw
                // pop the return address; every entry moves up one slot
                pc         <= pc_return1 ;
                pc_return1 <= pc_return2 ;
                pc_return2 <= pc_return3 ;
                pc_return3 <= pc_return4 ;
                pc_return4 <= pc_return5 ;
                pc_return5 <= pc_return6 ;
                pc_return6 <= pc_return7 ;
                pc_return7 <= pc_return8 ;
                pc_return8 <= pc_return9 ;
                pc_return9 <= pc_returna ;
                pc_returna <= pc_returnb ;
                pc_returnb <= pc_returnc ;
                pc_returnc <= pc_returnd ;
                pc_returnd <= pc_returne ;
                pc_returne <= pc_returnf ;
                pc_returnf <= pc_returng ;
                pc_returng <= pc_returnh ;
                pc_returnh <= pc_returni ;
                pc_returni <= pc_returnj ;
                pc_returnj <= pc_returnk ;
                pc_returnk <= pc_returnl ;
                pc_returnl <= pc_returnm ;
                pc_returnm <= pc_returnn ;
                pc_returnn <= pc_returno ;
            end
        end else begin
            pc <= pc +1;
            skip_op_out <= 0;
        end // !(op_code[15+8] && !skip_op)
    end // (ena)
end // (always)

endmodule
« Last Edit: April 24, 2020, 10:54:11 am by BrianHG »
 

Chalcogenide

  • Regular Contributor
  • *
  • Posts: 165
  • Country: it
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #4 on: April 24, 2020, 11:00:52 am »
Regarding your original question, just look one generation ahead (Spartan 7): the native block RAM size doubled to 36Kb, so you will need fewer interconnects to obtain the 64 KB of RAM. This, together with the faster logic might make your original design feasible. However, to get enough block RAM you need to get at least the XC7S25 which is about twice as expensive as your Spartan 6 LX9.
 

BrianHG
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #5 on: April 24, 2020, 11:13:24 am »
Quote from: ebastler on April 24, 2020, 10:01:41 am
I should probably mention that I am fiddling with the 6502 specifically (yeah, nostalgia...). Which means a minimal complement of registers, and hence frequent read and write access to data in the RAM. But still -- assuming that I can place the zero-page and stack RAM right next to the CPU to allow regular 1-cycle access to these, then only instructions with absolute memory addressing should incur an extra cycle penalty. (And of course branches, subroutine calls etc.)

Like I said earlier, there is a second trick you haven't realized....

I'm using dual-port RAM.  One port exclusively reads the op-codes and is addressed exclusively by the program counter.
The second port, with its own address, exclusively reads/writes ALU data.

When writing data to an address, keep a single copy of that stored data AND its address in a register. If your CPU reads that exact same address in the next instruction, take the data from that register, since a write to the main block memory takes an additional 2 clocks before the data output for that address shows the newly written data.

(Unless you set your block RAM function to make the read data valid immediately after a write to the same address, but this feature in Quartus hits your FMAX with a penalty of almost 50%.)

Your job will be a little tricky adapting a 6502 core, but you should easily pass 200 MHz with 0 wait states except on jmp/call/return, especially if you use my dedicated hardware stack algorithm as an example.
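
The write read-back register described above could look roughly like the sketch below. It is not the original design: the module and signal names are invented, it models only the data side with separate read and write addresses (the instruction-fetch port is left out), and it assumes a RAM with 1 clock of read latency that returns the old byte when the same address is written in that clock.

```verilog
// Hypothetical sketch: remember the last write and forward it to a colliding read.
module write_forward_ram #(
    parameter AW = 16                   // 64 kByte address space (assumed)
) (
    input  wire          clk,
    input  wire          we,            // write strobe from the ALU write-back
    input  wire [AW-1:0] wr_addr,
    input  wire [7:0]    wr_data,
    input  wire [AW-1:0] rd_addr,       // operand address of the following instruction
    output wire [7:0]    rd_data        // valid one clock after rd_addr
);
    reg [7:0]    mem [0:(1<<AW)-1];
    reg [7:0]    ram_q;                 // registered RAM output
    reg          last_we;               // single copy of the last write ...
    reg [AW-1:0] last_wr_addr;          // ... its address ...
    reg [7:0]    last_wr_data;          // ... and its data
    reg [AW-1:0] last_rd_addr;

    always @(posedge clk) begin
        if (we) mem[wr_addr] <= wr_data;
        ram_q        <= mem[rd_addr];   // old byte if the same address is written in this clock
        last_we      <= we;
        last_wr_addr <= wr_addr;
        last_wr_data <= wr_data;
        last_rd_addr <= rd_addr;
    end

    // If the read just issued hit the address just written, use the saved copy instead
    assign rd_data = (last_we && (last_wr_addr == last_rd_addr)) ? last_wr_data : ram_q;
endmodule
```

The address comparison works on registered values only, so the RAM itself can stay in its plain (non-bypassed) read-during-write mode.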
« Last Edit: April 24, 2020, 11:18:44 am by BrianHG »
 

ebastler (Topic starter)
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #6 on: April 24, 2020, 11:24:15 am »
Quote from: BrianHG on April 24, 2020, 11:13:24 am
Like I said earlier, there is a second trick you haven't realized....

I'm using dual port ram.  1 port reads exclusively the op-codes fed exclusively by the program counter.
The second port with its own address exclusively reads/writes ALU data.

Ahaa!! Indeed, I had missed your dual-port approach entirely. That is pretty clever! Kind of a hybrid Harvard/von Neumann architecture, with separate address and data buses although it all ends up in the same memory.

Now that will give me some food for thought for the 6502...  ;)
 

ebastler (Topic starter)
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #7 on: April 24, 2020, 11:27:48 am »
Quote from: Chalcogenide on April 24, 2020, 11:00:52 am
Regarding your original question, just look one generation ahead (Spartan 7)

Thank you -- I had overlooked the larger RAM blocks. That's a nice fallback option indeed. So if I fail to get a handle on Brian's cache concepts in a 6502 version, I can always throw a bit of money and hardware at the problem...
 

SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15118
  • Country: fr
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #8 on: April 24, 2020, 02:35:51 pm »
One thing to look at - the Spartan 6 has RAMB8 and RAMB16 blocks. You may want to check in the report what exactly is used in your design.

But redesigning your code a bit, you should be able to get what you want. I've implemented a design on a Spartan 6 LX9 which was a small video encoder using almost all BRAM available (but with maybe a difference that I was using mainly two different dual-port memories of ~32KB each, and 12-bit width, instead of a big one, so that may have helped.) Managed to get it to run at ~150MHz.
 

rstofer

  • Super Contributor
  • ***
  • Posts: 9921
  • Country: us
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #9 on: April 24, 2020, 02:49:14 pm »
You might as well move up to the Artix 7 chips, they have a LOT of BlockRAM. See page 3:

https://www.xilinx.com/support/documentation/selection-guides/7-series-product-selection-guide.pdf

I tend to buy the 100T variant (future proof) but the 35T will cover a lot of projects.  Digilent makes boards with both chips.

https://store.digilentinc.com/fpga-programmable-logic/by-technology/artix/

I primarily use the Nexys A7 board (it used to be called the Nexys 4 DDR) because I like lots of gadgets on the board.
« Last Edit: April 24, 2020, 02:52:03 pm by rstofer »
 

SiliconWizard
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #10 on: April 24, 2020, 07:16:36 pm »
Sure, a 100T has 4860 Kb of BRAM - compared to 576 Kb for the LX9. But you can't compare an LX9 to an Artix 7-100T. This is a completely different league.

The LX9 is currently MUCH cheaper than a 100T, and if its specs fit the requirements in a given project, going for a 100T would be like using a hammer to crush a fly. Not that it's not a nice beast.

But the OP said same price and performance class range. LX9: about $20 on average depending on package on Digikey, a 100T is over $100. Even a 35T is over twice the price of an LX9, but with 1800Kb of BRAM, it's not too shabby indeed.


 

ebastler (Topic starter)
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #11 on: April 24, 2020, 07:22:28 pm »
Quote from: rstofer on April 24, 2020, 02:49:14 pm
You might as well move up to the Artix 7 chips, they have a LOT of BlockRAM. See page 3

Hmm; the smaller Artix 7s are not as expensive as I had thought!

But unfortunately Xilinx does not offer them in a suitable package for my intended application: I want to fit the FPGA onto a DIP-40 sized PCB. The Spartan-6 and 7 come in a 225-ball BGA package which is just small enough to fit. For the Artix 7 series, the only package with a small-enough footprint is a 0.5mm pitch BGA, but that one is beyond my (and the cheap Chinese fabs') capabilities.

I'll give the Artix a try next time I have a less space-constrained application. About time I move closer to the present day in my choice of FPGAs...
 

ebastler (Topic starter)
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #12 on: April 24, 2020, 07:29:23 pm »
Quote from: SiliconWizard on April 24, 2020, 02:35:51 pm
One thing to look at - the Spartan 6 has RAMB8 and RAMB16 blocks. You may want to check in the report what exactly is used in your design.

But redesigning your code a bit, you should be able to get what you want. I've implemented a design on a Spartan 6 LX9 which was a small video encoder using almost all BRAM available (but with maybe a difference that I was using mainly two different dual-port memories of ~32KB each, and 12-bit width, instead of a big one, so that may have helped.) Managed to get it to run at ~150MHz.

The synthesis tool (well, the IP core generator for the RAM, I guess) chose to use 18kb blocks. I assume that saves one bit of address decoding in external logic. But both 9kb and 18kb blocks use the same actual RAM blocks, and the choice should not have much impact on the dominant long path delays to the outer blocks, right?

I agree with you and BrianHG: I need to make the CPU a bit smarter, to avoid the dual path delay (data from RAM, address back to RAM) in a single clock period. Cache and/or pre-fetch seems the way to go.
 

SiliconWizard
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #13 on: April 24, 2020, 09:26:22 pm »
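Quote from: ebastler on April 24, 2020, 07:22:28 pm
Quote from: rstofer on April 24, 2020, 02:49:14 pm
You might as well move up to the Artix 7 chips, they have a LOT of BlockRAM. See page 3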

Hmm; the smaller Artix 7s are not as expensive as I had thought!

But unfortunately Xilinx does not offer them in a suitable package for my intended application: I want to fit the FPGA onto a DIP-40 sized PCB. The Spartan-6 and 7 come in a 225-ball BGA package which is just small enough to fit. For the Artix 7 series, the only package with a small-enough footprint is a 0.5mm pitch BGA, but that one is beyond my (and the cheap Chinese fabs') capabilities.

I'll give the Artix a try next time I have a less space-constrained application. About time I move closer to the present day in my choice of FPGAs...

Unfortunately for the Spartan-7 series, they have less BRAM than the 6-series for models with an equivalent number of LUTs. So in that regard, they are not much of a good deal.

 

BrianHG
Re: FPGA with 64 kByte en-bloc RAM?
« Reply #14 on: April 24, 2020, 10:49:41 pm »
Quote from: ebastler on April 24, 2020, 07:29:23 pm
I agree with you and BrianHG: I need to make the CPU a bit smarter, to avoid the dual path delay (data from RAM, address back to RAM) in a single clock period. Cache and/or pre-fetch seems the way to go.

If you are making the 6502 instruction decoder yourself, and some opcodes take a different number of bytes, you could use a simple parallel 4-byte latch coming off the RAM data output, from which you snap your 8/16/24-bit instruction in parallel depending on the first op-code byte.  In other words, my original Harvard architecture, which emulated a PIC, only needed a 1-instruction prefetch since there was always exactly 1 word of ROM per instruction.  Your 6502 would want a variable-size prefetch that sorts the bytes depending on the op-code.  If I remember correctly, the 6502 opcode has a bit standing out which defines the opcode size in bytes.

On the second port's data bus, you would only need a 1-byte fast read-back cache for a back-to-back 'write byte' followed by a 'read' of the same data byte on the next clock cycle, if that's even possible on a 6502.
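
Going by the documented NMOS 6502 opcode table, there does not seem to be one single bit that gives the instruction length; the length follows from the addressing mode, which can be read off the low five opcode bits plus a handful of special cases. A rough sketch of such a length decoder, to tell the prefetch latch how many bytes to take (the module name is hypothetical, and undocumented opcodes are not handled meaningfully):

```verilog
// Hypothetical sketch: instruction length in bytes for documented NMOS 6502 opcodes.
module instr_len_6502 (
    input  wire [7:0] opcode,
    output reg  [1:0] len          // 1, 2 or 3 bytes including the opcode itself
);
    always @* begin
        casez (opcode)             // first matching line wins
            8'b???1_0000: len = 2'd2; // relative branches (BPL, BMI, BNE, ...)
            8'b0010_0000: len = 2'd3; // JSR abs
            8'b0??0_0000: len = 2'd1; // BRK, RTI, RTS (BRK counted as 1 byte here)
            8'b1??0_0000: len = 2'd2; // LDY / CPY / CPX #imm
            8'b????_1000: len = 2'd1; // x8 column: implied (PHP, CLC, INX, ...)
            8'b????_1010: len = 2'd1; // xA column: accumulator / implied
            8'b???0_1001: len = 2'd2; // x9 column, even rows: #imm
            8'b???1_1001: len = 2'd3; // x9 column, odd rows: abs,Y
            8'b????_11??: len = 2'd3; // xC..xE: abs, abs,X, abs,Y, JMP (ind)
            default:      len = 2'd2; // zero page, zp,X / zp,Y, (zp,X), (zp),Y, LDX #imm
        endcase
    end
endmodule
```

For example, opcode $A9 (LDA #imm) gives 2, $AD (LDA abs) gives 3 and $EA (NOP) gives 1.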
 

