As I already said, my implementation does work at this point. It's just not optimal IMO, and that's what I'm working on. I'm sure I'll find a way to optimize it, and there will certainly be many more opportunities to optimize the whole core later on. I was already happy that it could run at 100 MHz on a Spartan-6, a bit less happy about the ~85 MHz with the branch predictor enabled, but that wasn't too bad either, considering it implements RV32IM_Zicsr, exceptions/traps and branch prediction. My goal, though, is for the branch predictor not to add significant overall delay.
This is good.
Just one point - it may have been obvious, but I'm not sure, so here it is. When I talk about "branch prediction", I'm actually talking about both branch prediction and branch target prediction. I know those are usually formally two different concepts - but I need both, and I have implemented both.
Without branch target prediction, what you suggested earlier poses no problem, but there will always be a one-cycle latency if you have to wait for the decode stage to figure out the next PC. I don't really see a way around that, and that's usually why branch target buffers are used. So, yeah, I have implemented both, and I consider both part of the "branch prediction unit" in my design. That may have been a bit confusing, so I'm making it clear here. That's also why I talked about PC tags: a BTB is a form of cache, and like any cache it needs some form of tags to be effective.
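To illustrate what I mean by PC tags, here's a rough C model of a BTB lookup. It's direct-mapped just to keep the sketch small, and the entry count and update policy are made up for illustration - they're not what my core actually does:

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 32          /* illustrative size only */
#define INDEX_BITS  5           /* log2(BTB_ENTRIES) */

/* One BTB entry: the tag identifies which PC the entry belongs to,
 * the target is the predicted next PC for that instruction. */
typedef struct {
    bool     valid;
    uint32_t tag;               /* PC bits above the index */
    uint32_t target;
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Lookup during fetch: returns true on a hit and writes the predicted
 * next PC. PCs are 4-byte aligned, so bits [1:0] are ignored. */
bool btb_lookup(uint32_t pc, uint32_t *next_pc)
{
    uint32_t index = (pc >> 2) & (BTB_ENTRIES - 1);
    uint32_t tag   = pc >> (2 + INDEX_BITS);

    if (btb[index].valid && btb[index].tag == tag) {
        *next_pc = btb[index].target;
        return true;
    }
    return false;               /* miss: predict PC + 4 as usual */
}

/* Allocate/refresh an entry when a taken branch or jump resolves. */
void btb_update(uint32_t pc, uint32_t target)
{
    uint32_t index = (pc >> 2) & (BTB_ENTRIES - 1);
    btb[index] = (btb_entry_t){ .valid  = true,
                                .tag    = pc >> (2 + INDEX_BITS),
                                .target = target };
}
```

A fully associative organization would instead compare every entry's tag against the fetch PC in parallel; the tag check is the part that makes it a cache rather than a simple table.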
So this is mainly a matter of optimizing the implementation of both predictors used in conjunction, and I'm sure I'll figure it out.
I've said all this before, but I'll repeat it.
1) you don't need a one-cycle latency to calculate the next PC. You can do just enough decode of the instruction (figure out if it's a conditional branch, extract the offset and add it to the PC) right there in the instruction fetch stage (see the sketches after this list). This will result in a slightly lower maximum MHz, but not by much.
2) only JALR instructions logically require prediction of the branch target. They are comparatively rare, with function returns being the vast majority. Returns can be handled very well with a return address stack of typically 2 (FE310) to 6 (FU540) entries.
3) branch target predictors are *huge*. Each entry needs to store both the current (matching) PC and the next PC, which is 64 bits on RV32. Plus it needs to be a CAM, which is very expensive, adding a comparator for the PC (tag) of every entry. A branch predictor needs 2 bits for each entry and is directly addressed (sketched below). You can afford at least 32 times more branch predictor entries than branch target entries for the same area / LUTs. Maybe 50x.
4) a return address entry is just as large as a branch target entry, but you only need a few of them to be effective, and they don't need to be a CAM as you only need to check the top entry (also sketched below). (you *could* CAM it and let any entry match, but that's only going to help malformed programs)
5) yes, you can just use a branch target predictor for conditional branches and JAL as well as for JALR. That will save waiting for the instruction fetch and doing the minimal decode needed. It will need far more resources (or have a lower hit rate than a branch predictor, if it has fewer entries), and most of it will be used by things that aren't JALR.
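To make (1) concrete, here's a rough C model of the "just enough decode" needed in the fetch stage for RV32. I've included JAL as well, since its target is also just PC plus an immediate, and this assumes a plain 32-bit instruction stream (no C extension):

```c
#include <stdbool.h>
#include <stdint.h>

/* RV32 opcodes (instruction bits [6:0]) */
#define OPC_BRANCH 0x63
#define OPC_JAL    0x6F

/* Sign-extended B-type immediate (conditional branches). */
static int32_t imm_b(uint32_t inst)
{
    uint32_t imm = ((inst >> 31) & 0x1)  << 12   /* imm[12]   */
                 | ((inst >> 7)  & 0x1)  << 11   /* imm[11]   */
                 | ((inst >> 25) & 0x3F) << 5    /* imm[10:5] */
                 | ((inst >> 8)  & 0xF)  << 1;   /* imm[4:1]  */
    return (imm & 0x1000) ? (int32_t)(imm | 0xFFFFE000) : (int32_t)imm;
}

/* Sign-extended J-type immediate (JAL). */
static int32_t imm_j(uint32_t inst)
{
    uint32_t imm = ((inst >> 31) & 0x1)   << 20   /* imm[20]    */
                 | ((inst >> 12) & 0xFF)  << 12   /* imm[19:12] */
                 | ((inst >> 20) & 0x1)   << 11   /* imm[11]    */
                 | ((inst >> 21) & 0x3FF) << 1;   /* imm[10:1]  */
    return (imm & 0x100000) ? (int32_t)(imm | 0xFFE00000) : (int32_t)imm;
}

/* Done right in the fetch stage: if the word is a conditional branch or
 * JAL, the target is PC + immediate, no full decode stage needed. */
bool early_target(uint32_t pc, uint32_t inst, uint32_t *target)
{
    switch (inst & 0x7F) {
    case OPC_BRANCH: *target = pc + imm_b(inst); return true;
    case OPC_JAL:    *target = pc + imm_j(inst); return true;
    default:         return false;   /* JALR etc. still need more */
    }
}
```

In hardware that's a handful of muxes feeding one adder, which is why the MHz cost is small.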
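For (3), the branch predictor side really is nothing more than a directly addressed array of 2-bit saturating counters - no tags, no comparators. A toy model (size picked to match the FE310-G002 figures below):

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 512                 /* 2 bits of state per entry */

/* 2-bit saturating counters: 0,1 = predict not-taken, 2,3 = predict taken. */
static uint8_t bht[BHT_ENTRIES];

/* Direct addressed: just index with low PC bits, no tag check at all. */
bool bht_predict(uint32_t pc)
{
    return bht[(pc >> 2) & (BHT_ENTRIES - 1)] >= 2;
}

/* Saturating update once the branch actually resolves. */
void bht_update(uint32_t pc, bool taken)
{
    uint8_t *ctr = &bht[(pc >> 2) & (BHT_ENTRIES - 1)];
    if (taken  && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
}
```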
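And for (2)/(4), a toy model of a return address stack. Only the top entry is ever read, so there's no CAM; the depth here is arbitrary, and I'm ignoring empty/overflow handling, which real designs deal with in various ways:

```c
#include <stdint.h>

#define RAS_DEPTH 6                     /* FU540-sized; the FE310-G000 used 2 */

static uint32_t ras[RAS_DEPTH];
static unsigned ras_top;                /* circular; overflow overwrites oldest */

/* Push on a call (e.g. JAL/JALR with rd = x1/ra): save the return address. */
void ras_push(uint32_t return_pc)
{
    ras_top = (ras_top + 1) % RAS_DEPTH;
    ras[ras_top] = return_pc;
}

/* Pop on something that looks like a return (e.g. JALR with rs1 = x1/ra):
 * the prediction is simply whatever is on top of the stack. */
uint32_t ras_pop(void)
{
    uint32_t pred = ras[ras_top];
    ras_top = (ras_top + RAS_DEPTH - 1) % RAS_DEPTH;
    return pred;
}
```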
I mean ... experiment with different things by all means. I don't want to discourage that! And it depends on what your goal is.
For reference (BHT = Branch History Table, aka the 2-bits-per-entry predictor; BTB = predicts the next PC; RAS = Return Address Stack):
FE310-G000: 128 BHT, 40 BTB, 2 RAS
FE310-G002: 512 BHT, 28 BTB, 6 RAS
FU540-C000: 256 BHT, 30 BTB, 6 RAS
ARM A53: 3072 BHT, 256 BTB, 8 RAS
Note that the FU540 taped out almost exactly a year after the FE310-G000, while the FE310-G002 was another year after the FU540.