I worked on a 5-stage pipelined RISC-V core a while ago, and I'm now back at it to improve things, timing in particular. This thread will be best understood by people familiar with the concepts, as I'm not detailing them much here.
So here is something I'm currently working on. For memory access (load), the typical 5-stage pipeline looks like this:
* Execute (EX) stage: compute memory address,
* Memory Access (MA) stage: access memory, sign-extend read value if required,
* Write-back (WB) stage: write result to destination register.
The problem is that with synchronous memory, the read is issued in the MA stage, so the read value can only be registered on the next clock cycle. It therefore can't be put into the MA-to-WB register unless you stall the MA stage for one cycle on ANY memory read. What I had implemented was stalling for one cycle ONLY when there is a load-use hazard, since my pipeline is fully bypassed (data forwarding). (Of course, keep in mind the problem only appears because this is a fully bypassed pipeline. Without bypassing, this would be a non-issue.)
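To make the timing concrete, here is a toy Python model (not RTL, just a cycle-by-cycle sketch). It assumes the memory samples its address from the EX-to-MA pipeline register on the clock edge that ends the MA cycle; `SyncRam` and `trace_load` are names made up for this illustration.

```python
class SyncRam:
    """Toy synchronous-read memory: the address is sampled on a clock edge
    and the data only shows up on the output after that edge."""

    def __init__(self, contents):
        self.mem = list(contents)
        self.rdata = None  # registered read output, undefined before the first read

    def rising_edge(self, addr):
        # addr is None when no read is requested on this edge
        if addr is not None:
            self.rdata = self.mem[addr]


def trace_load(ram, addr):
    """Follow one load through EX, MA and WB.

    Returns a list of (stage, value visible on the RAM output during that cycle).
    """
    timeline = []
    ex_ma_addr = None  # address held in the EX-to-MA pipeline register
    for stage in ("EX", "MA", "WB"):
        timeline.append((stage, ram.rdata))
        # Clock edge ending this cycle: the RAM samples the address that was
        # held in the EX-to-MA register during the cycle.
        ram.rising_edge(ex_ma_addr)
        # The same edge loads the EX-to-MA register with the address computed in EX.
        ex_ma_addr = addr if stage == "EX" else None
    return timeline


print(trace_load(SyncRam([10, 20, 30, 40]), 2))
```

In this model the read data is first visible during the WB cycle, i.e. after the edge that has already clocked the MA-to-WB register, which is exactly why it can't be captured there without a stall.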
This implementation is problematic because the value read from memory (plus sign extension!) has to be used unregistered, which causes timing issues. This is not clean.
I'm wondering how this is typically implemented. I looked at a couple of soft cores, but coding style varies a lot and they are not always straightforward to figure out at this level of detail...
I could of course always stall for one cycle when reading from memory, and for two if there is a load-use hazard. Maybe this is what is commonly done? I had hoped we could avoid this extra stall cycle, and textbooks usually seem to overlook this issue.
Currently my idea to avoid stalling when there is no load-use hazard is to do the following:
* Postpone sign-extension (if required) and registering of the read value to the WB stage,
* There are now two load-use cases instead of just one (the next instruction AND the one after it), with two and one stall cycle(s) respectively.
The benefit is that stalling would be avoided entirely for memory reads when there is no load-use hazard, i.e. when the next two instructions don't depend on the load. That looks better than stalling for one cycle on ALL memory reads, at least.
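To compare the three options, here is a rough Python sketch that counts stall cycles for a short instruction sequence. It is a simplified single-issue, in-order model with full forwarding for ALU results, and it is only meant to illustrate the stall counts described above. The scheme names, `Instr` and `count_stalls` are all made up for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    dest: Optional[int] = None   # destination register number, if any
    srcs: Tuple[int, ...] = ()   # source register numbers
    is_load: bool = False


# scheme -> (extra cycles every load holds the pipeline,
#            minimum EX-to-EX distance between a load and its consumer)
SCHEMES = {
    "stall_on_hazard_only": (0, 2),  # current design, read data used unregistered
    "always_stall_loads": (1, 3),    # one stall per load, two on a load-use hazard
    "deferred_to_wb": (0, 3),        # proposed: register + sign-extend at WB
}


def count_stalls(program, scheme):
    load_hold, use_distance = SCHEMES[scheme]
    ready = {}   # register -> earliest EX cycle at which a consumer may start
    cycle = 0    # EX cycle of the current instruction
    stalls = 0
    for ins in program:
        earliest = max((ready.get(r, 0) for r in ins.srcs), default=0)
        if earliest > cycle:             # operand not forwardable yet: stall
            stalls += earliest - cycle
            cycle = earliest
        if ins.dest:                     # skip None and x0
            ready[ins.dest] = cycle + (use_distance if ins.is_load else 1)
        cycle += 1
        if ins.is_load:                  # unconditional per-load stall, if any
            stalls += load_hold
            cycle += load_hold
    return stalls


load = Instr(dest=1, is_load=True)       # lw  x1, ...
use = Instr(dest=2, srcs=(1, 3))         # add x2, x1, x3
other = Instr(dest=4, srcs=(5, 6))       # add x4, x5, x6 (independent)

for name in SCHEMES:
    print(name,
          count_stalls([load, use], name),
          count_stalls([load, other, use], name),
          count_stalls([load, other, other, use], name))
```

Under these assumptions, the proposed scheme costs 2, 1 and 0 stall cycles when the consumer is 1, 2 and 3 instructions after the load, against 2, 1 and 1 for always stalling on loads, and 1, 0 and 0 for the current design with the unregistered path.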
What are your thoughts? Is there something simpler that can be done that I overlooked (quite possible)?