Author Topic: The RISC-V ISA discussion  (Read 17100 times)


Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #100 on: January 17, 2020, 02:50:26 am »
It's important to be able to try the same instruction again, especially for things such as page faults, but there are also other kinds of things that you can fix up and retry.

It's much much easier to figure out the instruction length and skip it before returning than it would be to try to work backwards to find the start of the previous instruction.
(...)

Yes, I realized mepc is writable... so you can change it to whatever fits your requirements before returning.
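And figuring out the length to skip is cheap on RISC-V: the low two bits of the first 16-bit parcel distinguish compressed (16-bit) from 32-bit encodings. A minimal sketch of the idea (my own illustration, assuming no encodings longer than 32 bits):

Code: [Select]
#include <stdint.h>

/* Length in bytes of the instruction whose first 16-bit parcel is 'parcel'.
   In RISC-V, low two bits == 0b11 means a 32-bit encoding; anything else is
   a 16-bit compressed one. (Encodings longer than 32 bits ignored here.) */
static unsigned insn_length(uint16_t parcel)
{
    return (parcel & 0x3) == 0x3 ? 4 : 2;
}

/* In a trap handler, to skip the faulting instruction instead of retrying: */
/*   mepc += insn_length(first parcel fetched from mepc);                   */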

Now that makes me think about something... I suppose the "atomicity" of operations on the CSRs means that when you write to a CSR in a pipelined core, you'll need to flush the pipeline before executing the instruction after the write; otherwise you may get data hazards. And I suppose that bypassing CSRs would be completely out of the question due to their number.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4037
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #101 on: January 17, 2020, 03:09:09 am »
Yes, the E31, S51, U54 manuals all state:

"Most CSR writes result in a pipeline flush with a five-cycle penalty."

On the 7-series it's a seven-cycle penalty.
 

Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #102 on: January 20, 2020, 03:55:35 pm »
Exceptions/traps are now implemented. That was not as easy as I expected. Exceptions in a pipelined CPU are a bit of a bitch.
For anyone interested, here is the first small piece of code I used to test them:

Code: [Select]
_start:
    /* Testing exceptions. */
    la      a1, ExceptionHandler
    csrw    mtvec, a1

    li      a0, 10          /* Number of times the exception will be raised */

    rdcycle s0

    sret                    /* This should raise an 'illegal instruction' exception here */

    rdcycle s1
    sub     a1, s1, s0      /* a1 <- number of elapsed cycles for the repeated exception */

    ebreak

ExceptionHandler:
    addi    a0, a0, -1
    bnez    a0, EH_End
    csrr    s2, mepc        /* Last time around: skip the faulting instruction */
    addi    s2, s2, 4
    csrw    mepc, s2
EH_End:
    mret

So, it basically forces an exception by using an instruction that is illegal where it appears (sret). To better exercise the whole exception handling and the pipeline, the handler returns to the same instruction nine times, and on the tenth exception returns to the next instruction instead (by adding 4 to mepc), so execution eventually completes. The total number of cycles for the 10 exceptions is counted with rdcycle. In the end, I get 122 cycles for the 10 exceptions.

During the process, I've been wondering about nested exceptions. Interrupts are automatically disabled upon trapping, so you get a chance to save mepc/mstatus before re-enabling them if you want to implement nested interrupts. There's no such provision for exceptions: if an exception is raised in an exception handler *before* you get a chance to save mepc/mstatus (unlikely maybe, but not impossible), the situation is unrecoverable. I've read a couple of articles that confirm the issue. I'm wondering how that should be handled?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4037
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #103 on: January 20, 2020, 05:53:05 pm »
During the process, I've been wondering about nested exceptions. Interrupts are automatically disabled upon trapping, so you get a chance to save mepc/mstatus before re-enabling them if you want to implement nested interrupts. There's no such provision for exceptions: if an exception is raised in an exception handler *before* you get a chance to save mepc/mstatus (unlikely maybe, but not impossible), the situation is unrecoverable. I've read a couple of articles that confirm the issue. I'm wondering how that should be handled?

Be very careful writing exception handlers :-) Or at least the first few instructions of them. Preferably use a shared, well-debugged entry sequence, or copy and paste one from a reference source.

Bugs in M-mode software are always very very bad. That's why we write most code in U-mode, when it exists.
 

Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #104 on: January 20, 2020, 07:02:54 pm »
During the process, I've been wondering about nested exceptions. Interrupts are automatically disabled upon trapping, so you get a chance to save mepc/mstatus before re-enabling them if you want to implement nested interrupts. There's no such provision for exceptions: if an exception is raised in an exception handler *before* you get a chance to save mepc/mstatus (unlikely maybe, but not impossible), the situation is unrecoverable. I've read a couple of articles that confirm the issue. I'm wondering how that should be handled?

Be very careful writing exception handlers :-) Or at least the first few instructions of them. Preferably use a shared, well-debugged entry sequence, or copy and paste one from a reference source.

Ahah, yep. But the exception could be caused not by a software bug but by something else. Even if it's kind of a corner case, it doesn't make me feel fully comfortable.

Some CPUs handle this case as a "double fault", which usually either stops the CPU completely or jumps to a predefined address. I'm considering adding this.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4037
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #105 on: January 20, 2020, 07:34:08 pm »
If you swap mscratch with SP or some other register (a single instruction), decrement it by the size of the registers you want to save, save those registers there (the value kept in mscratch being a pointer, preferably to some small SRAM area), and grab some or all of mstatus/mepc/mcause/mtval... and something goes wrong with this... then your basic CPU core is pretty seriously borked.
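Roughly this (a sketch of the idea only; register choices and the save-area size are arbitrary, and a real handler saves more state):

Code: [Select]
trap_entry:
    csrrw   sp, mscratch, sp    /* swap SP with mscratch (points to a small per-hart save area) */
    addi    sp, sp, -16         /* make room for the registers we clobber */
    sw      t0, 0(sp)
    sw      t1, 4(sp)
    csrr    t0, mepc            /* grab the trap state before anything else can trap */
    csrr    t1, mcause
    sw      t0, 8(sp)
    sw      t1, 12(sp)
    /* ... dispatch on mcause, handle the trap, restore, then: */
    lw      t1, 4(sp)
    lw      t0, 0(sp)
    addi    sp, sp, 16
    csrrw   sp, mscratch, sp    /* swap back */
    mret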

Taking a trap to some other vector is probably not going to help. Shutting down or looping forever is almost certainly the best you can do.
 

Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #106 on: January 21, 2020, 03:01:02 pm »
If you swap mscratch with SP or some other register (a single instruction), decrement it by the size of the registers you want to save, save those registers there (the value kept in mscratch being a pointer, preferably to some small SRAM area), and grab some or all of mstatus/mepc/mcause/mtval... and something goes wrong with this... then your basic CPU core is pretty seriously borked.

Taking a trap to some other vector is probably not going to help. Shutting down or looping forever is almost certainly the best you can do.

I'll think about it. Several CPUs actually implement what I mentioned. Some (at least x86, which you may not label as a good architecture ;D ) actually have a provision for double-fault handlers, and triple faults basically generate a CPU reset.

Stopping or preferably resetting the CPU, even in case of double faults, looks like a simple and reasonable way to handle it, so I'll implement that first. (Not doing anything about it if it ever happens still doesn't look like a good idea to me. The infamous "it should never happen" often leads to catastrophic events.  ;D )

So what I'm thinking is implementing a number of "reset sources", including watchdogs and double faults (nested exceptions). I may make all of them optional through a custom CSR. Or not.
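For the detection itself, the simulator side can be trivial; something like this sketch (the structure and cpu_reset() are hypothetical names, just to illustrate the idea; the policy here is the coarse one, where any trap taken before the handler's mret counts as a double fault):

Code: [Select]
/* 'in_trap' is set when a trap is taken and cleared when mret executes. */
typedef struct {
    int in_trap;
    /* ... pc, registers, CSRs ... */
} Hart;

void cpu_reset(Hart *hart);   /* hypothetical: whatever "reset source" means here */

static void take_trap(Hart *hart)
{
    if (hart->in_trap) {      /* trap taken while a handler is still active: */
        cpu_reset(hart);      /* treat as a double fault -> reset (or halt)  */
        return;
    }
    hart->in_trap = 1;
    /* ... save pc to mepc, set mcause/mtval, jump to mtvec ... */
}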
« Last Edit: January 21, 2020, 03:02:51 pm by SiliconWizard »
 

Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #107 on: January 28, 2020, 11:29:16 pm »
I've added interrupt support, and also peripheral simulation in separate thread(s), so that 1/ peripheral simulation doesn't hinder the overall simulation speed, and 2/ asynchronous events are easier to simulate properly.

I'll give one of the small assembly tests I devised (with just some constants slightly renamed), again for anyone interested. It uses the (simulated) machine timer and interrupts a loop 10 times; then the interrupts get disabled and it's done.

Code: [Select]
/* Testing interrupts (machine timer). */

    /* Set exception handler */
    la      a1, ExceptionHandler
    csrw    mtvec, a1

    /* Disable machine-level timer interrupts */
    li      t0, CSR_MIE_MTIE
    csrc    mie, t0

    /* Set mtimecmp to mtime + PERIOD (next interrupt) */
    li      a1, MTIME_ADDR
    lw      a2, 0(a1)
    addi    a2, a2, PERIOD
    li      a1, MTIMECMP_ADDR
    sw      a2, 0(a1)

    li      a0, 10          /* Number of times the machine timer interrupt will be handled */

    /* Enable machine-level interrupts */
    csrs    mie, t0
    csrsi   mstatus, CSR_MSTATUS_MIE

    rdcycle s0

Loop:
    bnez    a0, Loop        /* Loop until a0 == 0, i.e. the interrupt has fired 10 times */

    rdcycle s1

    /* Disable machine-level interrupts */
    csrc    mie, t0

    sub     a1, s1, s0      /* a1 <- number of elapsed cycles */

    ebreak

ExceptionHandler:
    csrr    s2, mcause
    bgez    s2, EH_End      /* MSB clear: not an interrupt */

    addi    a0, a0, -1

    /* Set mtimecmp to mtime + PERIOD (next interrupt) */
    li      a1, MTIME_ADDR
    lw      a2, 0(a1)
    addi    a2, a2, PERIOD
    li      a1, MTIMECMP_ADDR
    sw      a2, 0(a1)

EH_End:
    mret

One of the next steps will be to translate the simulated core to HDL. But before I do that, I'd like to get some kind of branch prediction done. I'm suspecting branch prediction might need some changes in the pipeline that I don't want to have to translate several times... so I'd like to get this right first.

So yeah, I'm currently studying branch prediction and trying to choose something effective enough, yet simple. For both simplicity and security, I'm ruling out anything that would let instructions reach the execute stage or later and then have to be "undone" when they turn out not to be executed. Way too slippery for my taste, so the furthest I'll let them go is the decode stage. (I'm not going for high performance either, just something reasonable.) Speculative execution, we all know where that can lead. Since I'm designing a 5-stage pipeline and don't really intend to push it deeper, that's doable. With a 20-stage pipeline or deeper, it's pretty hard to avoid speculative execution without badly hurting performance. Not my area.
 

Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: The RISC-V ISA discussion
« Reply #108 on: January 29, 2020, 12:37:26 am »
I'm assuming you've read your Hennessy & Patterson? It has pretty in-depth coverage of branch predictors. Strictly speaking, you've already implemented a static no-branch-taken prediction scheme. A bit more interesting might be statically predicting backwards branches taken and forwards branches not taken, which was used in some old CPUs. However, IIRC all static schemes turn out to add very little performance, and you really need something dynamic if it's going to be useful.
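For reference, the backward-taken/forward-not-taken rule is essentially a one-liner; a sketch in C, keying off the sign of the branch offset:

Code: [Select]
#include <stdbool.h>
#include <stdint.h>

/* Static BTFN prediction: loops branch backwards, so predict those taken. */
static bool predict_taken(int32_t branch_offset)
{
    return branch_offset < 0;   /* backward branch: probably a loop */
}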

Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #109 on: January 29, 2020, 02:11:59 am »
I'm assuming you've read your Hennessy & Patterson? It has pretty in-depth coverage of branch predictors.

I've read parts of it, and quite a few articles on the topic.

Strictly speaking, you've already implemented a static no-branch-taken prediction scheme.

Yes, that's what I noted a few posts earlier. On "general" code, it's actually not that bad. Of course it falls short on tight loops.

A bit more interesting might be statically predicting backwards branches taken and forwards branches not taken, which was used in some old CPUs. However, IIRC all static schemes turn out to add very little performance, and you really need something dynamic if it's going to be useful.

Yes, I realize that. What I'm really wondering about, though (and I'll be able to test it in the simulator before even considering an HDL implementation), is the impact of a good branch predictor on a relatively short pipeline. Given that branches are taken at the execute stage of my 5-stage pipeline, the penalty is relatively small (I lose 2 cycles on taken branches), so it only becomes significant for tight loops or series of branches with just a few instructions in between. A typical worst case is a branch instruction looping to itself: it has a CPI of 3, whereas it could be 1 with even a simple branch predictor. But on "real-life" code, the average impact is much smaller. So I need to figure out whether it's worth the additional logic (and the potential bugs), knowing that I'm not really looking to design high-performance cores, just something reasonable.


 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #110 on: February 03, 2020, 03:39:52 am »
I've just found "Clarvi"  - https://github.com/ucam-comparch/clarvi

It's by the Computer Architecture Group, University of Cambridge Computer Laboratory, and looks to be a very nice implementation to refer to. As per the README.md:

Quote
This is an RV32I core. It provides a minimum implementation of the v1.9 RISC-V privileged specification, including full support for interrupts and exceptions.

Lines 383 to 470 of clarvi.sv are the core of the trap/interrupt/exception handling, and have proven quite enlightening.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #111 on: February 06, 2020, 08:18:08 pm »
I've now completed a basic dynamic branch predictor with a classic 2-bit saturating counter, 2^n entries indexed by the n LSBs of the PC of the branch instructions (well, actually of the PC shifted right by 1 bit if the C extension is supported, and 2 bits otherwise, since those bits would always be zero). It's a tagged table, so the whole PC address is checked during prediction, and it also stores the target address.
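In simplified C, an entry and the two operations look roughly like this (an illustrative model, not my actual code; no C extension assumed, hence the pc >> 2):

Code: [Select]
#include <stdbool.h>
#include <stdint.h>

#define BP_BITS 10
#define BP_SIZE (1u << BP_BITS)

typedef struct {
    uint32_t tag;      /* full PC of the branch: makes a hit unambiguous */
    uint32_t target;   /* predicted target address */
    uint8_t  counter;  /* 2-bit saturating counter, 0..3; >= 2 means taken */
    bool     valid;
} BpEntry;

static BpEntry bp[BP_SIZE];

/* Lookup at fetch: returns true and fills *target when predicted taken. */
static bool bp_predict(uint32_t pc, uint32_t *target)
{
    BpEntry *e = &bp[(pc >> 2) & (BP_SIZE - 1)];
    if (!e->valid || e->tag != pc)
        return false;              /* miss: fall back to static not-taken */
    *target = e->target;
    return e->counter >= 2;
}

/* Update from the execute stage with the actual outcome. */
static void bp_update(uint32_t pc, bool taken, uint32_t target)
{
    BpEntry *e = &bp[(pc >> 2) & (BP_SIZE - 1)];
    if (!e->valid || e->tag != pc) {   /* (re)allocate the entry */
        e->valid = true;
        e->tag = pc;
        e->counter = taken ? 2 : 1;    /* start in a "weak" state */
    } else if (taken && e->counter < 3) {
        e->counter++;
    } else if (!taken && e->counter > 0) {
        e->counter--;
    }
    if (taken)
        e->target = target;
}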

I've tested it with a number of test programs and benchmarks. On CoreMark, I get the following results:
- Branch predictor disabled: ~2.56 CoreMark/MHz, average CPI = 1.313
- Branch predictor enabled: ~3.00 CoreMark/MHz, 89.0% of correct branch predictions (pretty typical for a bimodal predictor), 0.6% of mispredicted target addresses, average CPI = 1.112

Pretty happy with the results already.

I haven't implemented a return stack buffer yet, so "returns" are very likely to have a mispredicted target address. CoreMark doesn't make heavy use of function calls here. But I've tested it on a very simple, yet very taxing test of function calling, the Takeuchi function, called as Takeuchi(18, 12, 6), which BTW should return 7 (if not, your compiler or processor is borked). I get the following:

- Branch predictor disabled: average CPI = 1.243
- Branch predictor enabled: 61.8% of correct branch predictions, 27.2% of mispredicted target addresses, average CPI = 1.111

Interesting stuff.

For anyone curious, this is the Takeuchi function:

Code: [Select]
int Takeuchi(int x, int y, int z)
{
    if (x > y)
        return Takeuchi(Takeuchi(x - 1, y, z),
                        Takeuchi(y - 1, z, x),
                        Takeuchi(z - 1, x, y));
    else
        return z;
}
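And a trivial harness to try it on a hosted machine first:

Code: [Select]
#include <stdio.h>

int Takeuchi(int x, int y, int z);   /* as above */

int main(void)
{
    int r = Takeuchi(18, 12, 6);
    printf("Takeuchi(18, 12, 6) = %d (expected 7)\n", r);
    return r == 7 ? 0 : 1;
}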

Next step, I'm going to implement some kind of return stack buffer in the predictor and see if this is worth it.
But apart from particular cases like this, I get an average speed-up (on typical code) of 15%-20%. Not too shabby.
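The return-stack-buffer idea itself is simple; a small sketch of what I have in mind (not an implementation; sizes are arbitrary, and calls/returns are recognized through the usual RISC-V convention of jal/jalr with rd or rs1 being ra):

Code: [Select]
#include <stdint.h>

#define RSB_SIZE 8

static uint32_t rsb[RSB_SIZE];
static unsigned rsb_top;              /* wraps around: oldest entries get lost */

static void rsb_push(uint32_t return_addr)   /* on jal/jalr with rd == ra */
{
    rsb[rsb_top++ % RSB_SIZE] = return_addr;
}

static uint32_t rsb_pop(void)                /* on jalr with rs1 == ra */
{
    return rsb[--rsb_top % RSB_SIZE];        /* underflow just yields garbage,
                                                i.e. a normal misprediction */
}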
« Last Edit: February 06, 2020, 08:23:30 pm by SiliconWizard »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #112 on: February 06, 2020, 08:37:04 pm »
I've now completed a basic dynamic branch predictor with a classic 2-bit saturating counter, 2^n entries indexed by the n LSBs of the PC of the branch instructions (well, actually of the PC shifted right by 1 bit if the C extension is supported, and 2 bits otherwise, since those bits would always be zero). It's a tagged table, so the whole PC address is checked during prediction, and it also stores the target address.

What size values of 'n' are you using?

Do the tags help much?

If you don't have tags, then you might pollute the branch predictor tables when two branches share the same LSBs. But even then they will only fight half the time: if both branches are usually taken (or both usually not taken), the shared prediction is still correct.

However, if you do have tags, then on every tag mismatch the branch predictor throws out valid information and replaces it with the guessed "weakly taken" or "weakly not taken" state.

Wouldn't the bits be better spent making the branch predictor table larger, reducing the chance of a collision?

Using stale data (from an older branch at a matching address) only requires two mispredictions to change the behavior, so it doesn't seem that expensive, and the extra simplicity might be helpful.

Another question... When you first hit a branch, are you guessing that backwards jumps are "weakly taken", and forward jumps are "weakly not taken"?

(Asking because I'm looking into branch prediction at the moment....)
« Last Edit: February 06, 2020, 08:40:00 pm by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #113 on: February 06, 2020, 09:56:42 pm »
I've now completed a basic dynamic branch predictor with a classic 2-bit saturating counter, 2^n entries indexed by the n LSBs of the PC of the branch instructions (well, actually of the PC shifted right by 1 bit if the C extension is supported, and 2 bits otherwise, since those bits would always be zero). It's a tagged table, so the whole PC address is checked during prediction, and it also stores the target address.

What size values of 'n' are you using?

At this point, 10 bits (so 1024 entries). I've experimented a bit, and there was still a significant improvement going up to 10 bits; beyond that, the gain was completely negligible. But that's for a tagged version of the predictor. So far I haven't experimented much without tags. Of course, to get a better idea, I'd probably have to evaluate it on a much larger set of test cases...

Do the tags help much?

For up to 2^10 entries, definitely, at least in my first experiments. Without tags, on anything but the smallest pieces of code, the misprediction rate increases a lot, and it often seems a lot worse than plain static prediction.

I haven't tested for a much larger number of entries, so I can't tell where the sweet spot would lie. I'd have to experiment more.

Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address. Some studies have shown that direction itself could be predicted pretty well without tags (and the first, simple predictors only did that, I think) and a relatively small number of entries, but target addresses are another story.

Wouldn't the bits be better spent making the branch predictor table larger, reducing the chance of a collision?

It doesn't look like it from my (again limited) testing so far. Besides, even though dropping the tags would make each entry smaller, spending the saved bits on more entries (meaning more lines to address) may make the table slower to access (I'm not sure). And we're still talking about relatively small buffers here (a few KBytes); I don't think that'd be worth optimizing unless you were going for a very low-area core (in which case I don't think I would go for a 5-stage pipeline anyway...)

But I may experiment further with this.

Another question... When you first hit a branch, are you guessing that backwards jumps are "weakly taken", and forward jumps are "weakly not taken"?

Nope. When a branch is not predicted yet, it defaults to the static behavior of "branch not taken" - meaning it just fetches the next instruction.
I had thought about doing just what you say, but I'm really not sure it's worth it in practice once you have a dynamic predictor, and it would still require an extra address comparator and more...

As I said, I'm not looking to design the best prediction possible (Intel does that pretty well, although it's not always without faults). What I'm getting at this point is pretty satisfactory. The only thing that might add some more speed-up would be to improve indirect branches such as returns from calls (although I've read about exploits with return stack buffers - see SpectreRSB - so I'm also evaluating what is safe to implement in terms of security and what is not.)

If you're going for a different approach, don't hesitate to report back so we can see what really makes a difference and what doesn't.

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #114 on: February 06, 2020, 10:23:33 pm »
Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address.

Thinking out loud, for the RISCV ISA, is this only true for JALR?

JAL is always taken, and it, along with BEQ, BNE, BLT, BGE, BLTU and BGEU, uses an immediate offset from the current instruction's address, so the target address can be calculated during decode?

But JALR is a big problem as it is the call and return instruction, and for implementing jump tables and so on.
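Something like this, I suppose (just a sketch; immediate layouts per the RV32 B-type and J-type formats, 'insn' being the raw 32-bit instruction):

Code: [Select]
#include <stdint.h>

/* B-type immediate (BEQ/BNE/BLT/BGE/BLTU/BGEU): scattered bits, LSB always 0. */
static int32_t b_imm(uint32_t insn)
{
    uint32_t imm = ((insn >> 31) & 1)     << 12   /* imm[12]   */
                 | ((insn >> 7)  & 1)     << 11   /* imm[11]   */
                 | ((insn >> 25) & 0x3f)  << 5    /* imm[10:5] */
                 | ((insn >> 8)  & 0xf)   << 1;   /* imm[4:1]  */
    return (int32_t)(imm << 19) >> 19;            /* sign-extend from bit 12 */
}

/* J-type immediate (JAL). */
static int32_t j_imm(uint32_t insn)
{
    uint32_t imm = ((insn >> 31) & 1)     << 20   /* imm[20]    */
                 | ((insn >> 12) & 0xff)  << 12   /* imm[19:12] */
                 | ((insn >> 20) & 1)     << 11   /* imm[11]    */
                 | ((insn >> 21) & 0x3ff) << 1;   /* imm[10:1]  */
    return (int32_t)(imm << 11) >> 11;            /* sign-extend from bit 20 */
}

/* Decode-stage target: just PC + immediate, no register read needed. */
static uint32_t branch_target(uint32_t pc, uint32_t insn, int is_jal)
{
    return pc + (uint32_t)(is_jal ? j_imm(insn) : b_imm(insn));
}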
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #115 on: February 06, 2020, 10:46:34 pm »
Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address.

Thinking out loud, for the RISCV ISA, is this only true for JALR?

JAL is always taken, and it, along with BEQ, BNE, BLT, BGE, BLTU and BGEU, uses an immediate offset from the current instruction's address, so the target address can be calculated during decode?

Technically yes, but this is not how I have implemented things. The target address for all branches is calculated at the execute stage. I've tried to limit the number of special cases to make things easier to implement and easier to validate. Besides, in a 5-stage pipeline, the address of the currently fetched instruction would then potentially depend on the calculated address at the decode stage, which would lengthen the logic paths. All that may not be very favorable for speed.

In the same vein, tags in my case actually help speed as well. Here is why: with tags, I'm sure that a matching entry in the table IS a branch. Without tags, it could be any instruction with the same lower n bits, which is far more likely than a collision between two branch instructions. So without tags, I would need to determine that the instruction is a branch early (at the decode stage, obviously), which would again add a logic dependency and lengthen the logic path. So I'm trading some memory for speed.

Of course all this is heavily dependent on the way you implement your pipeline.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4037
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #116 on: February 07, 2020, 02:35:28 am »
Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address. Some studies have shown that direction itself could be predicted pretty well without tags (and the first, simple predictors only did that, I think) and a relatively small number of entries, but target addresses are another story.

The target address is known (if the branch is taken) for everything except JALR (which isn't conditional).

So the set of things for which a branch prediction is needed is completely disjoint from the set of things for which a branch target prediction is needed, so they are usually handled by completely different data structures.

Implementing "gshare" should massively increase the prediction rate. Keep track of the actual direction of the last n conditional branches (in a shift register), XOR this with the low n bits of the PC, and use this to select the counter.

Don't store the actual PC for the tag. Waste of time. It just forces branches that alias to fight each other every time instead of half the time.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #117 on: February 07, 2020, 03:07:07 am »
Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address. Some studies have shown that direction itself could be predicted pretty well without tags (and the first, simple predictors only did that, I think) and a relatively small number of entries, but target addresses are another story.

The target address is known (if the branch is taken) for everything except JALR (which isn't conditional).

So the set of things for which a branch prediction is needed is completely disjoint from the set of things for which a branch target prediction is needed, so they are usually handled by completely different data structures.

Implementing "gshare" should massively increase the prediction rate. Keep track of the actual direction of the last n conditional branches (in a shift register), XOR this with the low n bits of the PC, and use this to select the counter.

Don't store the actual PC for the tag. Waste of time. It just forces branches that alias to fight each other every time instead of half the time.

What surprises me about CPU design at this level is how much is done using "almost correct" approximate solutions and rules of thumb that work "correctly enough" most of the time and are designed to cope with imprecise information (like the two-bit branch predictor), because more theoretical, more accurate solutions would be too complex or expensive.

Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4037
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #118 on: February 07, 2020, 04:42:59 am »
Yeah, it's weird that "two" seems to be the optimum number of bits for a counter. More just means it takes too long to unwind the prediction when circumstances change.

The Pentium MMX and Pentium Pro suddenly got massively better branch prediction than either their predecessors or their peers (e.g. PowerPC) -- in the range of 98% or better -- and how it happened was top secret for some time. It was just XORing the taken/not-taken history of recent branches with the PC. Almost too simple. A 1024-entry table needs 2*2^10 + 10 = 2058 bits of storage, and that's *all*. (1024 entries with tags needs... 32k bits? OK, maybe (2 + (32 - 10)) * 2^10 = 24576 bits.)
 

Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #119 on: February 07, 2020, 01:49:51 pm »
Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address. Some studies have shown that direction itself could be predicted pretty well without tags (and the first, simple predictors only did that, I think) and a relatively small number of entries, but target addresses are another story.

The target address is known (if the branch is taken) for everything except JALR (which isn't conditional).

So the set of things for which a branch prediction is needed is completely disjoint from the set of things for which a branch target prediction is needed, so they are usually handled by completely different data structures.

You probably didn't follow what I replied to hamster. There are several reasons I mixed both. (And they are usually, but *not always*, handled separately. While I've studied the state of the art reasonably well, I'm implementing things with a specific set of requirements here.)

- Implementing a scheme that can be potentially reused for other ISAs without major modifications;
- Handling ALL branches in the same way;
- Security (yes, checking that a prediction actually belongs to the current branch instruction and not to another one will prevent a few potential exploit issues, so it doesn't just marginally improve prediction rate, but has another purpose);
- Simplifying logic (at the expense of more memory) - has benefits for design simplicity, verification AND length of logic paths.

As I implemented it, the branch predictor doesn't just predict branch direction (and target); it also acts as some kind of cache. So basically, at the fetch stage, the corresponding predictor table line is fetched, and at the decode stage we have all the info needed for branching, without adding dependencies on the instruction decoding itself. That limits logic depth.

Implementing "gshare" should massively increase the prediction rate.

I've read a few papers, and "massively" looks pretty overrated. A couple of papers notably find that, on a mix of benchmarks, gshare is slightly better, but nothing drastic.

Given the first results I get with my approach, there doesn't seem to be real room for "massive increase" anyway, and I'm not looking to extract the last few %. As I said, I'll leave that to Intel, AMD, ARM, (maybe SiFive? ;D ) Not my area.

At the beginning, I was even considering no dynamic BP at all. With the one I implemented, I get an average speedup of 15-20%, and even over 10% with degenerate cases such as the Takeuchi function, so I'm not convinced at this point that I'm going to try to improve it further. As I said, the only thing I'm still considering is a return stack buffer, and even on that I'm not completely decided.

My doubt about the real benefit of RSBs is that, except for very small functions (or functions that are called often and return early, thus not executing much on average), saving/restoring context on the stack takes a significant number of cycles compared to the return itself, so the misprediction penalty on returns may be relatively negligible. Of course there are always cases where it would make a difference, but I'm not sure it's worth it.

An illustration of this is the simple Takeuchi test case I showed. You can see that even with a relatively poor prediction rate (still better than static), the average CPI is close to 1.1, which is close to the average CPI I get on a mix of benchmarks. Not only do I consider 1.1 a decent figure here (for my requirements anyway), but it seems to show that branch prediction for function calls specifically may matter little in many cases compared to prediction of all the other kinds of branches. Don't hesitate to prove me wrong here, but my tests and rationale seem to hold.

Don't store the actual PC for the tag. Waste of time. It just forces branches that alias to fight each other every time instead of half the time.

My rationale explained above.

As to performance, I've done some more comparative testing today. Without tags, the prediction rate itself, not surprisingly, increases very slightly (marginally, on the order of +0.1%), but the target prediction rate decreases significantly (by several percent).

Again, this is to be considered with a specific set of requirements and a specific implementation of the pipeline.

Of course, I'll be interested to see real tests of different approaches, and see how they compare, but raw performance is not the only criterion here.

 

Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #120 on: February 08, 2020, 04:11:29 pm »
I added a test with the following benchmark: https://people.sc.fsu.edu/~jburkardt/c_src/linpack_bench/linpack_bench.html

I haven't implemented the F/D extensions yet, so I tested it with emulated floating point on an RV32IM. A lot of function calls for sure.
Of course it's dead slow with emulated FP. ;D

I get the following:
Avg. CPI = 1.030
Correct branch predictions = 87.8%
Mispredicted branch targets = 4.8%

As to the benchmark itself... I get a whopping ~0.62 MFlops (simulated RV32IM clocked at 100 MHz.)

I'd be curious to see results for this benchmark on ARM Cortex CPUs (also with emulated FP of course), if anyone has some time to spare.

Edit: for anyone feeling like trying: the default N value for this benchmark (1000) makes it pretty slow to execute AND requires a lot of RAM (~8 MBytes), impractical for MCUs unless you have external SDRAM. You can change N to 100; I think 128 KBytes of RAM should do it then (take a look at the memory allocations). It'll be a bit less "accurate", but I get a pretty similar figure, so that'll do for comparing. Another point: you will probably have to comment out the timestamp() function (not useful here, and it uses time functions probably not available on a typical MCU) and port the cpu_time() function. That's all.
« Last Edit: February 08, 2020, 07:26:33 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4037
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #121 on: February 08, 2020, 10:47:57 pm »
Have you tried my primes benchmark?

http://hoult.org/primes.txt

90% correct branch predictions is of course 90% of the maximum possible gain, so is fine for a simple processor.

The problem is once you have an out-of-order processor with say ~200 instructions decoded and dispatched into the execution engine, and there is a branch every five or six instructions on average. You have to have very close to perfect branch prediction to have any chance of those 200 instructions being the right 200 instructions. With a branch roughly every 5 instructions, that's about 40 branches in flight: at 90% accuracy the chance that all 40 are predicted correctly is 0.9^40, around 1.5%; even at 99% it's only about 67%.
 

Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #122 on: February 09, 2020, 12:05:17 am »
Have you tried my primes benchmark?

http://hoult.org/primes.txt

Yes I have! I get surprisingly good results with it: 97.7% correctly predicted branches.
I get an avg. CPI = 1.235 though, worse than what I get with most of my other tests, which are now closer to 1.11. I suspect it's mostly explained by stalls due to more load-use hazards than usual. I haven't implemented a specific instrumentation counter for those yet, but I will; that will confirm or refute this.

90% correct branch predictions is of course 90% of the maximum possible gain, so is fine for a simple processor.

The problem is once you have an out-of-order processor with say ~200 instructions decoded and dispatched into the execution engine, and there is a branch every five or six instructions on average. You have to have very close to perfect branch prediction to have any chance of those 200 instructions being the right 200 instructions.

I can see how that would become critical for complex OoO CPUs, and even more so with long pipelines. I haven't considered OoO execution yet though, and probably won't.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #123 on: February 09, 2020, 03:23:43 am »
I found it pretty interesting to think about how the two-bit saturating counter is 'good enough' for branch prediction.

If you have any loop (e.g. iterating over a list, or "for(i=0;i<10;i++) ..."), then after one or two passes it quickly learns the flow through the code, and the branch history doesn't get completely trashed when the loop exits.

Exiting the loop also only removes half of the history (going from 11 to 10, or from 00 to 01), so the next time the entry in the table is used, it will quickly adapt to the new pattern.  :-+
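You can watch it happen in a few lines of C (a toy trace of a 4-iteration loop branch; states 0..3, >= 2 predicts taken):

Code: [Select]
#include <stdio.h>

/* Two runs of a loop branch: taken,taken,taken,not-taken, twice over.
   Note the second pass: only the final loop exit is mispredicted. */
int main(void)
{
    int outcomes[] = {1, 1, 1, 0, 1, 1, 1, 0};
    unsigned state = 1;                       /* start "weakly not taken" */
    for (int i = 0; i < 8; i++) {
        int predict = state >= 2;
        printf("predict %d, actual %d, state %u -> ", predict, outcomes[i], state);
        state = outcomes[i] ? (state < 3 ? state + 1 : 3)
                            : (state > 0 ? state - 1 : 0);
        printf("%u\n", state);
    }
    return 0;
}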

Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #124 on: February 09, 2020, 02:45:25 pm »
I implemented a load-use hazard instrumentation counter, and I can confirm the point about the avg. CPI on Bruce's primes benchmark: I get ~21.5% load-use hazards (as a percentage of all executed instructions!). With CoreMark, for instance, I get ~6.5%; on many other tests I've done, less than 5%.

I've looked at the generated assembly and the culprit seems to lie in this small portion:
Code: [Select]
lw   t5, 0(t1)
blt  a3, t5, .L7    /* t5 consumed right after the load: 1-cycle stall */
lw   a2, 0(a6)
ble  a4, a2, .L9    /* same pattern with a2 */
Two load-use hazards that involve a 1-cycle stall each.

Looking at the whole assembly code (which is pretty small), I can't really figure out a way to reorder instructions to avoid this, so I can't really blame GCC. Maybe there is one.
 

