Author Topic: 5-stage pipelined CPU and synchronous memory access  (Read 3778 times)


Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15198
  • Country: fr
5-stage pipelined CPU and synchronous memory access
« on: November 01, 2020, 08:40:48 pm »
I worked on a 5-stage pipelined RISC-V core a while ago, and I'm now back at it to improve things, timing in particular. This thread will be best understood by people familiar with the concepts, as I'm not detailing them much here.

So here is something I'm currently working on. For a memory access (load), the relevant stages of the typical 5-stage pipeline are:
* Execute (EX) stage: compute memory address,
* Memory Access (MA) stage: access memory, sign-extend read value if required,
* Write-back (WB) stage: write result to destination register.

The problem is that with synchronous memory, the read happens during the MA stage, so the read value can only be registered on the next clock edge. It therefore can't be put into the MA-to-WB register unless you stall the MA stage for one cycle on ANY memory read. What I had implemented was stalling for one cycle ONLY when there is a load-use hazard, since my pipeline is fully bypassed (data forwarding). (Of course, keep in mind that the problem appears because this is a fully bypassed pipeline; if there were no bypassing, this would be a non-issue.)
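For reference, that single-stall check is just a comparison against the load's destination register. A minimal behavioral sketch (plain Python rather than RTL, with made-up signal names):

Code: [Select]
def load_use_stall(ex_is_load, ex_rd, id_rs1, id_rs2):
    # Stall the ID stage for one cycle when the instruction currently in EX
    # is a load whose destination is read by the instruction in ID.
    # x0 is hardwired to zero, so it never creates a real dependency.
    return ex_is_load and ex_rd != 0 and ex_rd in (id_rs1, id_rs2)

# lw x5, 0(x2) in EX, add x6, x5, x7 in ID -> one stall cycle
print(load_use_stall(True, 5, 5, 7))   # True
print(load_use_stall(True, 5, 6, 7))   # False (no dependency)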

This implementation is problematic because it doesn't register the read value from memory (plus the sign extension!), which causes timing issues. It's not clean.
I'm wondering how this is typically implemented. I've looked at a couple of soft cores, but coding style varies a lot and they are not always straightforward to figure out at this level of detail...

I could of course always stall for one cycle when reading from memory, and for two if there is a load-use hazard. Maybe this is what is commonly done? I had hoped we could avoid this extra stall cycle, and textbooks usually seem to overlook this issue.

Currently my idea to avoid stalling when there is no load-use hazard is to do the following:
* Postpone sign-extension (if required) and registering of the read value to the WB stage,
* Now there are two load-use cases instead of just one (the next instruction AND the instruction after that), with two and one stall cycle(s) respectively.

The benefit is that stalling is avoided entirely for memory reads when there is no load-use hazard, i.e. when the next two instructions don't depend on the load. That looks better than stalling for at least one cycle on ALL memory reads.
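Put differently, the stall count only depends on how far away the first consumer of the loaded value is. A quick sketch of the resulting rule (Python, purely illustrative):

Code: [Select]
def stall_cycles(distance):
    # Stall cycles inserted when the first consumer of a load's result comes
    # `distance` instructions after the load, assuming the read value (and its
    # sign extension) is only available for forwarding at the WB stage.
    if distance == 1:
        return 2    # consumer immediately follows the load
    if distance == 2:
        return 1    # one independent instruction in between
    return 0        # two or more independent instructions: no stall

print([stall_cycles(d) for d in (1, 2, 3)])   # [2, 1, 0]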

What are your thoughts? Is there something simpler that can be done that I overlooked (quite possible)?

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2812
  • Country: nz
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #1 on: November 01, 2020, 08:50:47 pm »
(Crazy thought time)

I am guessing you can't clock the memory on the other clock edge, because half a cycle isn't long enough to compute the address?

1/2 cycle - compute address & present to memory
1 cycle - memory access cycle
1/2 cycle - sign extend.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15198
  • Country: fr
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #2 on: November 01, 2020, 08:54:31 pm »
(Crazy thought time)

I am guessing you can't clock the memory on the other clock edge, because half a cycle isn't long enough to compute the address?

1/2 cycle - compute address & present to memory
1 cycle - memory access cycle
1/2 cycle - sign extend.

Well, I don't much like designing with rising *and* falling edges.
I'm also trying to figure out how that would fit in the pipeline. At first sight, it looks like memory access would have to be done on the opposite edge from the pipeline's clock edge, which doesn't look very good to me. It might not be that much of a problem if only dealing with a single memory block, but here "memory access" goes through an address and data bus and would equally need to reach data memory, instruction memory AND MMIO. Not really ideal IMO.
« Last Edit: November 01, 2020, 09:04:04 pm by SiliconWizard »
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 8031
  • Country: ca
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #3 on: November 02, 2020, 03:07:53 am »
Is the memory on-chip or off-chip?
If on-chip, does the OP-Code hint which address work data you need?
Are you completely trying to avoid wait-states, or, just minimize or erase existing ones when possible?
 

Offline Daixiwen

  • Frequent Contributor
  • **
  • Posts: 365
  • Country: no
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #4 on: November 02, 2020, 08:18:53 am »
Other than by having an asynchronous cache memory block close to the CPU, I don't really see how you can avoid this stall.
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15198
  • Country: fr
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #5 on: November 02, 2020, 04:12:41 pm »
Other than by having an asynchronous cache memory block close to the CPU, I don't really see how you can avoid this stall.

That's what I think; I was just curious to see if there was some clever idea out there...
My original design was not registering the memory's output and I was just relying on its access time to be fast enough (using block RAM here for instance), but that's very tricky, and to be avoided, especially on FPGAs. Doing this makes timing analysis unreliable in itself, so you don't actually know what's going to happen.

So anyway, I'm currently implementing what I said above, stalling only when there is a load-use hazard (as opposed to adding a wait state for any memory read). That should be OK.

 

Online asmi

  • Super Contributor
  • ***
  • Posts: 2782
  • Country: ca
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #6 on: November 02, 2020, 04:19:38 pm »
I have two stages for memory read; this allows using the BRAM output register to improve timing. I also have two stages for decoding, and a small 4+1 QWORD FWFT FIFO in the fetch stage (again, to enable using the BRAM's output register). So far my RV64I core closes timing at 177.8 MHz with post-route physical optimization. It's not feature-complete yet - it doesn't support interrupts, as I haven't had time to work on this project lately.
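To illustrate what the extra read stage corresponds to: with the BRAM output register enabled, read data shows up two clock edges after the address. A rough behavioral model (Python just to sketch the timing, obviously not the actual RTL):

Code: [Select]
class RegisteredBram:
    # Behavioral model of a block RAM read path with both the internal read
    # register and the optional output register enabled: read latency = 2.
    def __init__(self, depth):
        self.mem = [0] * depth
        self.ram_reg = 0   # captured by the RAM's internal read register
        self.out_reg = 0   # captured by the optional output register

    def clock(self, addr, we=False, wdata=0):
        self.out_reg = self.ram_reg     # second register in the read path
        self.ram_reg = self.mem[addr]   # first (internal) registered read
        if we:
            self.mem[addr] = wdata      # "read-first" on a same-cycle write
        return self.out_reg             # what the pipeline sees this cycle

ram = RegisteredBram(16)
ram.mem[3] = 0xDEAD
print([hex(ram.clock(3)) for _ in range(3)])   # ['0x0', '0xdead', '0xdead']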

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15198
  • Country: fr
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #7 on: November 02, 2020, 04:24:48 pm »
Is the memory on-chip or off-chip?
If on-chip, does the OP-Code hint which address work data you need?
Are you completely trying to avoid wait-states, or, just minimize or erase existing ones when possible?

The type of memory doesn't matter (or shouldn't), since I'm working on a generic enough core that should not rely on any specific kind of memory: a memory access on a RISC-V architecture can target any kind of memory within the address space. I also don't want to handle address decoding within the core itself. It makes sense to handle only synchronous memory here; even generic MMIO is by nature synchronous unless you want to jump through nasty hoops.

So yes, I am trying to avoid any kind of wait-state that isn't strictly required. Stalling due to load-use hazards, on the other hand, is strictly required. I have included a wait-state mechanism (there is a memory-ready input flag, which would allow slower memory and/or caches), but I don't want to use it if I can avoid it.
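To make that wait-state mechanism concrete, the control decision is basically this (a behavioral Python sketch with made-up signal names, not my actual RTL):

Code: [Select]
def ma_stage_controls(is_mem_access, mem_ready):
    # Hold the younger stages and feed a bubble into WB while a memory access
    # is outstanding and the memory has not asserted its ready flag.
    stall = is_mem_access and not mem_ready
    hold_fetch_to_ma = stall   # IF/ID/EX/MA keep their current contents
    bubble_into_wb = stall     # WB executes a NOP this cycle
    return hold_fetch_to_ma, bubble_into_wb

print(ma_stage_controls(True, False))   # (True, True): slow memory, wait
print(ma_stage_controls(True, True))    # (False, False): proceed normally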

Actually, adding two wait cycles for EVERY "load" would solve the issue and even avoid having to handle load-use hazards, but it's suboptimal.
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15198
  • Country: fr
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #8 on: November 02, 2020, 04:31:27 pm »
I have two stages for memory read; this allows using the BRAM output register to improve timing. I also have two stages for decoding, and a small 4+1 QWORD FWFT FIFO in the fetch stage (again, to enable using the BRAM's output register). So far my RV64I core closes timing at 177.8 MHz with post-route physical optimization. It's not feature-complete yet - it doesn't support interrupts, as I haven't had time to work on this project lately.

Thanks for your concrete input here. Is your pipeline fully bypassed? (Asking because the more stages you have, the 'heavier' the bypassing/data hazard control logic becomes...)

I was also thinking of adding more stages, but I don't want to do that just for memory access at this point. (But I admit if you want to run an RV64I core at this speed, 5 stages won't cut it for an FPGA implementation.)

So, I think my approach of postponing the data read and sign extension until the WB stage looks reasonable, and it only requires handling two cases of load-use hazards instead of one.
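For reference, the extension step that moves to WB is just this (sketched in Python with the standard RV32I load funct3 encodings; it assumes the addressed byte/halfword has already been shifted down into the low bits):

Code: [Select]
MASK32 = 0xFFFFFFFF

def wb_load_value(raw, funct3):
    # Sign-/zero-extend the loaded value at write-back instead of at MA.
    if funct3 == 0b000:                                    # LB
        b = raw & 0xFF
        return (b - 0x100 if b & 0x80 else b) & MASK32
    if funct3 == 0b001:                                    # LH
        h = raw & 0xFFFF
        return (h - 0x10000 if h & 0x8000 else h) & MASK32
    if funct3 == 0b010:                                    # LW
        return raw & MASK32
    if funct3 == 0b100:                                    # LBU
        return raw & 0xFF
    if funct3 == 0b101:                                    # LHU
        return raw & 0xFFFF
    raise ValueError("not a load funct3")

print(hex(wb_load_value(0x80, 0b000)))    # 0xffffff80 (LB sign-extends)
print(hex(wb_load_value(0x80, 0b100)))    # 0x80       (LBU zero-extends)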
 

Online asmi

  • Super Contributor
  • ***
  • Posts: 2782
  • Country: ca
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #9 on: November 02, 2020, 05:27:53 pm »
Thanks for your concrete input here. Is your pipeline fully bypassed? (Asking because the more stages you have, the 'heavier' the bypassing/data hazard control logic becomes...)
So far I only bypass the execute stage, as this covers the case I found to be the most common (lui x / aluop(x)). I found that gcc with the -O3 optimization switch tends to generate this style of "chained" operations. Another common occurrence is ld x / op(x), but in my case I only "win" 1 cycle by bypassing that operation, so I'm not sure it's really worth it. I haven't yet tried adding more bypassing, so I'm not sure how much it will really affect Fmax - on one hand the bypass logic becomes more complex, but on the other hand the stall logic becomes simpler, as there are fewer conditions that would cause a stall. But then again - I have implemented a peripheral AXI master bus, so accessing devices connected to this bus is going to cause more stalls whether or not the core is fully bypassed. I would rather have a higher Fmax with more occasional stalls than full bypassing at a lower frequency.
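That execute-stage bypass is just one comparator per source operand; roughly (a Python sketch with hypothetical names, not my RTL):

Code: [Select]
def ex_operand(regfile_value, rs, ex_writes_reg, ex_rd, ex_result):
    # Forward the value being produced in EX when it targets the register this
    # instruction is about to read; otherwise use the register file value.
    if ex_writes_reg and ex_rd != 0 and ex_rd == rs:
        return ex_result
    return regfile_value

# lui x5, 0x12345 in EX followed by addi x6, x5, 1: operand comes from EX
print(hex(ex_operand(0, 5, True, 5, 0x12345000)))   # 0x12345000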
Actually, I think I will need to add a couple of counters to find out how many stalls I'm getting versus instructions retired, just to see whether there is anything worth fighting. Obviously that will depend on the code running; right now I have some LED blink code running, which interfaces with a GPIO block implemented as an AXI slave device connected to the CPU's peripheral AXI bus.
I've also long been thinking about handling jumps early, so that fetching of the jump-target instructions starts sooner, as happens at the end of loops. This will likely have a significant positive impact on performance, since loops are very common in C code.
So many things to do... Sometimes I wish there were more than 24 hours in a day :)

I was also thinking of adding more stages, but I don't want to do that just for memory access at this point. (But I admit if you want to run an RV64I core at this speed, 5 stages won't cut it for an FPGA implementation.)
There is nothing wrong with adding more stages per se, as for a CPU, throughput is much more important than instruction latency. Also, adding more stages after execute does not increase the penalty for pipeline flushes due to control transfers (because all stages after execute always run to completion regardless of what instructions follow the current one), so the only downside is additional bypass logic.

Offline David Hess

  • Super Contributor
  • ***
  • Posts: 17076
  • Country: us
  • DavidH
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #10 on: November 03, 2020, 12:54:29 am »
Well, I don't much like designing with rising *and* falling edges.

That is what multi-phase clocks are for.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 8031
  • Country: ca
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #11 on: November 03, 2020, 10:28:11 am »
Is the memory on-chip or off-chip?
If on-chip, does the OP-Code hint which address work data you need?
Are you completely trying to avoid wait-states, or, just minimize or erase existing ones when possible?

The type of memory doesn't matter (or shouldn't), since I'm working on a generic enough core that should not rely on any specific kind of memory: a memory access on a RISC-V architecture can target any kind of memory within the address space. I also don't want to handle address decoding within the core itself. It makes sense to handle only synchronous memory here; even generic MMIO is by nature synchronous unless you want to jump through nasty hoops.

So yes, I am trying to avoid any kind of wait-state that isn't strictly required. Stalling due to load-use hazards, on the other hand, is strictly required. I have included a wait-state mechanism (there is a memory-ready input flag, which would allow slower memory and/or caches), but I don't want to use it if I can avoid it.

Actually, adding two wait cycles for EVERY "load" would solve the issue and even avoid having to handle load-use hazards, but it's suboptimal.

The idea was to read ahead into a small asynchronous 2-deep shift register built from logic cells.  A write going out does the same thing, except that if its address matches one of the 2 entries in the logic-cell read pipe, the written data replaces that '1-word cache' early read as well.

The reason for knowing whether you are accessing op-code or data is that you keep 2 of these read-word/write-word pipes, one for each.  The size of these pipes changes with bit width and, if you are accessing DRAM, with your burst size.  This is a bare-bones cache which should allow streaming access, except for branches or data hops where you are forced to wait for the new read data.  Keeping things this short and sweet - only comparing address equality for 2-4 tiny chunks of logic cells - can accelerate quite a bit of sequential code, code which loops within the bounds of the tiny cache, or code that runs in a straight line.

(Note I am assuming you can read ahead to fill this tiny cache at full speed, and that you are able to generate a read address ahead of time to do this.)
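Behaviorally it boils down to something like this (a Python sketch of the idea only; the real thing is just a couple of address comparators and registers in logic cells):

Code: [Select]
class TinyReadAheadPipe:
    # Two-entry read-ahead buffer: prefetched words are kept together with
    # their addresses, and a write to a matching address replaces the
    # buffered data, so a later hit still returns the up-to-date value.
    def __init__(self, backing):
        self.backing = backing     # the real memory (here just a Python list)
        self.pipe = []             # up to two (addr, data) pairs

    def prefetch(self, addr):
        self.pipe.append((addr, self.backing[addr]))
        if len(self.pipe) > 2:
            self.pipe.pop(0)       # oldest entry falls out of the 2-deep pipe

    def read(self, addr):
        for a, d in self.pipe:
            if a == addr:
                return d           # hit: no wait for the memory
        return self.backing[addr]  # miss (branch/data hop): pay full latency

    def write(self, addr, data):
        self.backing[addr] = data
        self.pipe = [(a, data if a == addr else d) for a, d in self.pipe]

mem = list(range(16))
c = TinyReadAheadPipe(mem)
c.prefetch(4)
c.write(4, 99)
print(c.read(4))   # 99: the buffered word was updated by the write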
« Last Edit: November 03, 2020, 10:38:28 am by BrianHG »
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15198
  • Country: fr
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #12 on: November 05, 2020, 05:11:50 pm »
The idea was to read ahead into a small asynchronous 2-deep shift register built from logic cells.  A write going out does the same thing, except that if its address matches one of the 2 entries in the logic-cell read pipe, the written data replaces that '1-word cache' early read as well.

The reason for knowing whether you are accessing op-code or data is that you keep 2 of these read-word/write-word pipes, one for each.  The size of these pipes changes with bit width and, if you are accessing DRAM, with your burst size.  This is a bare-bones cache which should allow streaming access, except for branches or data hops where you are forced to wait for the new read data.  Keeping things this short and sweet - only comparing address equality for 2-4 tiny chunks of logic cells - can accelerate quite a bit of sequential code, code which loops within the bounds of the tiny cache, or code that runs in a straight line.

(Note I am assuming you can read ahead to fill this tiny cache at full speed, and that you are able to generate a read address ahead of time to do this.)

Thanks for that suggestion. I'll have to give it some thought.
 

Online BrianHG

  • Super Contributor
  • ***
  • Posts: 8031
  • Country: ca
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #13 on: November 05, 2020, 06:01:59 pm »
The idea was to read ahead into a small asynchronous 2-deep shift register built from logic cells.  A write going out does the same thing, except that if its address matches one of the 2 entries in the logic-cell read pipe, the written data replaces that '1-word cache' early read as well.

The reason for knowing whether you are accessing op-code or data is that you keep 2 of these read-word/write-word pipes, one for each.  The size of these pipes changes with bit width and, if you are accessing DRAM, with your burst size.  This is a bare-bones cache which should allow streaming access, except for branches or data hops where you are forced to wait for the new read data.  Keeping things this short and sweet - only comparing address equality for 2-4 tiny chunks of logic cells - can accelerate quite a bit of sequential code, code which loops within the bounds of the tiny cache, or code that runs in a straight line.

(Note I am assuming you can read ahead to fill this tiny cache at full speed, and that you are able to generate a read address ahead of time to do this.)

Thanks for that suggestion. I'll have to give it some thought.
I used the concept in my home-made MCU and in the pixel writer & geometry plotter of the FPGA VGA Controller for 8-bit computer project.  For many memory-hungry or sequential DSP-like blitter copies, it almost completely saturates the RAM bandwidth, with a penalty re-read at the new address at the end of each loop.  Separate instruction and data cache channels, even at just 2 words each for write and read data, minimize the loop penalty, as moving & processing the huge chunk of image data matters more than the 1-instruction branch penalty when it's all mixed together.

I'm hoping the same algorithm/concept will also benefit the newly added HyperBus PSRAM, to the point where it will perform almost as well as direct DDR RAM on the new PCB.
 

Online asmi

  • Super Contributor
  • ***
  • Posts: 2782
  • Country: ca
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #14 on: November 05, 2020, 09:12:18 pm »
Have you tried enabling retiming and seeing if it allows it to close timing?

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15198
  • Country: fr
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #15 on: November 06, 2020, 04:38:07 pm »
Have you tried enabling retiming and seeing if it allows it to close timing?

I suppose you're talking about the initial state of my design in which I was basically using BRAM non-registered?
Believe me, I tried a lot and it gave pretty poor results. Thing is: as far as I know, non-registered BRAM is essentially just latches. It would probably be OK with a small amount of BRAM (I've certainly used BRAM this way before, but with a smaller number of blocks), but since I'm using a relatively large number of BRAM blocks, routing delay becomes extremely problematic. (To get an idea: 128 KBytes of instruction mem AND 64 KBytes of data mem.)

 

Online asmi

  • Super Contributor
  • ***
  • Posts: 2782
  • Country: ca
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #16 on: November 06, 2020, 06:56:07 pm »
I suppose you're talking about the initial state of my design in which I was basically using BRAM non-registered?
Both. I often use this mode to see if it will give me some ideas about how to better balance combinatorial logic across stages. If I see something worth doing, I then just manually move this logic as suggested. The reason is that it's not always obvious which logic chains will become a problem and which ones won't, so I like running retiming every once in a while as a sort of exploration pass. I would never rely on retiming for final designs though, because it's very application- and design-specific, whereas I prefer my submodules to be more like universal drop-in blocks, which register all inputs and outputs where possible to keep them independent of external connections.

Offline laugensalm

  • Regular Contributor
  • *
  • Posts: 129
  • Country: ch
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #17 on: November 12, 2020, 10:58:50 am »
Not sure if that helps: I ended up using a variable 4-5 stage pipeline for a simple rv32i implementation which calculates memory addresses at the DE stage instead of at EX. Since I've got memory responding immediately to READs (1 cycle delay) or with one wait state/BUSY condition, the CPU is in a 'shortcut' (4 stages) or 'default' (5 stages) state and switches between them depending on the load-use instruction sequence order (and branches). This saves one wait cycle on an immediate read in the load-use scenario. The variable pipe length of course requires some bypass/delay logic which I initially considered *nasty*. It was more of an experimental thing I ran into; eventually I just left it at that, as it performed well enough and didn't cost too many extra resources. Also, I didn't tweak anything on the compiler side to make it emit optimum code, as the stall penalties didn't appear too bad.

It's not much documented, but somewhat elaborated with the f_max and resource usage results here: https://section5.ch/index.php/2019/09/27/risc-v-experiments/. Corresponding GHDL-powered (thus wave-traceable) virtual machine as Docker environment: https://hub.docker.com/r/hackfin/masocist/
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3238
  • Country: ca
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #18 on: November 12, 2020, 03:23:46 pm »
Physically, there are different ways to register BRAM.

Xilinx 7-series BRAM requires one clock to change the BRAM output. It takes roughly 3 ns for the output to propagate to nearest CLBs.

Since their BRAM can run at 400-500 MHz, 3 ns is far too long. Therefore they have internal flops in their BRAM, which split that delay - most of the delay is inside the BRAM block, but there's still some delay left to route the output from internal registers to outside.

If you run at 400 MHz, there's no other choice than using internal BRAM registers.

If you run at 200 MHz, you'll be better off using CLB's flops rather than BRAM flops. This is because you still have time to deliver the BRAM output to CLB, but you'll save some time on the next cycle by using CLB's flip-flops which will have shorter launching delays and possibly better routing.

If you run at 100 MHz, you have plenty of time and you can add lots of combinatorial logic (such as your sign-extending logic) to the BRAM output before you register it.

I have no idea how to deal with this in general platform-independent way.
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15198
  • Country: fr
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #19 on: November 12, 2020, 05:38:55 pm »
Physically, there are different ways to register BRAM.

Xilinx 7-series BRAM requires one clock to change the BRAM output. It takes roughly 3 ns for the output to propagate to nearest CLBs.

Since their BRAM can run at 400-500 MHz, 3 ns is far too long. Therefore they have internal flops in their BRAM, which split that delay - most of the delay is inside the BRAM block, but there's still some delay left to route the output from internal registers to outside.

If you run at 400 MHz, there's no other choice than using internal BRAM registers.

If you run at 200 MHz, you'll be better off using CLB's flops rather than BRAM flops. This is because you still have time to deliver the BRAM output to CLB, but you'll save some time on the next cycle by using CLB's flip-flops which will have shorter launching delays and possibly better routing.

If you run at 100 MHz, you have plenty of time and you can add lots of combinatorial logic (such as your sign-extending logic) to the BRAM output before you register it.

Thanks for the pointers. I currently run my core at 100 MHz, but I still find some things problematic regarding using BRAM without output registers.
(I have to add that I'm using a Spartan-6 board at the moment, thus ISE, for convenience reasons - the board has all I need, and the boards I have with Artix-7 chips are all -35 models, which have less total BRAM than the Spartan-6 on this particular board...)

Timing analysis seems flawed (at least in ISE) when using BRAM without output registers. I haven't investigated much but basically it looks pretty unreliable. I'll have to test this with Vivado at some point.

I have no idea how to deal with this in general platform-independent way.

Yeah, as you got it, that's kind of my goal to keep my core "generic" enough.
 

Online asmi

  • Super Contributor
  • ***
  • Posts: 2782
  • Country: ca
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #20 on: November 12, 2020, 06:27:16 pm »
I have no idea how to deal with this in general platform-independent way.
Same way you deal with (unknown) peripherals and/or external memory - your code should be designed for the best-case latency (so that it won't choke on data in case the best case is actually achieved), but should always be prepared for stalls in case memory access ends up taking more time than that.

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 15198
  • Country: fr
Re: 5-stage pipelined CPU and synchronous memory access
« Reply #21 on: November 25, 2020, 03:59:49 pm »
Alright, just a quick follow-up on what I did. For data memory access, as I said above, I postponed the availability of reads until the write-back stage. That adds a load-use hazard case, which hinders performance slightly, but otherwise solves the issue without having to add wait states. When there is no load-use dependency, there is zero impact on throughput.

It turns out that there was potentially a similar problem with fetching instructions from instruction memory... and for this, I added a fetch stage, which makes it a 6-stage pipeline (I did this because of branch prediction, which is designed to produce a potentially predicted PC every clock cycle and thus lengthens the logic path). This has an impact on performance only for branches that are not predicted correctly, so that's not too bad.
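As a generic illustration of what "a predicted PC every clock cycle" involves, here is a textbook-style direct-mapped branch-target-buffer lookup (a Python sketch only, not my actual predictor):

Code: [Select]
class TinyBtb:
    # Direct-mapped branch target buffer: every cycle, the fetch stage either
    # gets a predicted-taken branch target or falls through to PC + 4.
    def __init__(self, entries=16):
        self.size = entries
        self.tag = [None] * entries
        self.target = [0] * entries

    def _index(self, pc):
        return (pc >> 2) % self.size

    def predict(self, pc):
        i = self._index(pc)
        return self.target[i] if self.tag[i] == pc else pc + 4

    def update(self, pc, taken, target):
        # Called when the branch actually resolves, later in the pipeline.
        i = self._index(pc)
        if taken:
            self.tag[i], self.target[i] = pc, target
        elif self.tag[i] == pc:
            self.tag[i] = None     # stop predicting a branch that fell through

btb = TinyBtb()
print(hex(btb.predict(0x100)))   # 0x104: fall-through, nothing learned yet
btb.update(0x100, True, 0x80)
print(hex(btb.predict(0x100)))   # 0x80: predicted-taken target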

There are other ways to deal with this of course, probably including implementing small local caches (data and instruction) with zero latency... but for now, the above works and adds very little complexity.
 

