The RISC-V ISA discussion
SiliconWizard:
(Foreword: not sure this is the appropriate section, but no other really seems more appropriate so...)
I've been seriously considering and studying the RISC-V ISA lately. I also started developing a cycle-accurate CPU "emulator" (simulator may be a better word?), meant to be useful for testing new ideas, benchmarking, etc. (and the first target is RISC-V, but it won't be limited to that.)
So while taking a closer look at the RISC-V ISA, I have a few remarks/questions... if anyone who has worked with it and has experience/insight could chime in, that would be great. The discussion can then move on to many other aspects of it. Thought this could be interesting.
My first remarks:
1. Looks like a very nice "exercise" in simplicity. I like the "minimalist" approach. Makes implementing it pretty straightforward.
2. The minimalism looks a bit too much to me on some points. A couple examples:
2.1. The "bit manipulation" extension is not part of the base ISA. I personally think this decision is a bit too drastic. Bit manipulation can definitely be pretty useful in many cases (I'm thinking of some instruction akin to "clz" for instance... or byte swaps, bit reverse, etc.) Could be debated, but what's worse, this "B" extension is not even defined yet. I really think this is a problem at this point, because it's (in my eyes) part of basic operations and even if it's an extension, it should have been defined already IMO. As it is, core designers are likely to define their own extensions with this, and this is going to lead to useless fragmentation for something that again, seems basic to me.
2.2. The "no flag register" approach is interesting, but it makes some operations pretty clunky. For instance, working with integers wider than the native ISA width. No "add with carry" or anything like this. Would be interesting to see how you guys (experienced with RISC-V) would implement it and how much more efficient (or not?) it would be with at least additional operations with carry. You may say this could be a further extension, but again I think this is pretty basic?
I'll probably have tons of other remarks and questions later on, but I'd be interested in reading opinions on these first two to begin with.
ataradov:
Well, what is in the main set and what is an extension is a matter of preference. If you include everything into the basic set, then you will make basic implementations of the ISA much harder. And I personally appreciate the simplicity and ease of implementation of the basic set.
There are no advanced instructions in Cortex-M0+ either. You just go to the higher end core when you need them. Same with RISC-V, you go to a core that also implements an extension.
Yes, it sucks that extensions sit undefined for years. I believe the main stopping point here is a lack of confidence that a specific implementation or set of instructions is good. Hopefully, with more and more RISC-V devices appearing, there will be more push to standardize things.
Add with carry can be somewhat efficiently implemented using SLTU and the like. It is not as efficient as far as the number of instructions goes, but if you are doing an optimized microarchitecture, it makes for a much easier implementation.
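As a rough illustration of the SLTU approach, here is a minimal C sketch of a multi-word add, assuming 32-bit limbs stored least significant first. The name add_limbs is made up for this example. Each unsigned "less than" comparison is the kind of test that maps to sltu on RV32I.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: r = a + b over n 32-bit limbs, least significant
   limb first. Returns the carry out of the top limb (0 or 1). */
uint32_t add_limbs(uint32_t *r, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t s  = a[i] + b[i];   /* wraps modulo 2^32 */
        uint32_t c1 = s < a[i];      /* 1 if that add wrapped (sltu) */
        uint32_t t  = s + carry;     /* add the carry from the limb below */
        uint32_t c2 = t < s;         /* 1 if adding the carry wrapped (sltu) */
        r[i] = t;
        carry = c1 | c2;             /* c1 and c2 can never both be 1 */
    }
    return carry;
}
```

With a carry flag, the loop body would be a single add-with-carry per limb; without one, it is two adds and two comparisons, which is the cost being discussed here.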
brucehoult:
I think it's pretty amazing that despite being fairly minimalist with only 37 instructions a compiler will generate from C code (so leaving out fence, system call, debugger call, the CSR instructions), RV32I has everything necessary to efficiently support a modern software stack.
You could even make it a bit more minimalist without any great harm. I'd suggest, for example, leaving out all the "immediate" instructions except addi. Boom! You're now down to 29 instructions. And you've freed up 3% of the opcode space at a stroke. The cost? One instruction to load the desired immediate value into a register and then use the register-to-register version of the instruction instead.
Here are some instruction frequency stats I gathered from the RISC-V Debian distro with the standard packages and an assortment of extras. Format: percentage of total instructions, mnemonic, raw instruction count. I've listed the top 16 instructions in full, but only the register-immediate ones after that.
16.224593 addi 2528047
15.237536 jal 2374248
11.123998 auipc 1733294
9.981167 ld 1555223
6.658275 beq 1037464
4.305509 bne 670866
3.687067 sd 574503
3.418121 lbu 532597
3.376591 jalr 526126
2.435197 lw 379442
2.357368 lui 367315
1.800274 sb 280511
1.768576 addiw 275572
1.472592 sw 229453
1.430902 slli 222957
1.314809 andi 204868
:
0.433172 srli 67495
0.296973 ori 46273
0.244295 xori 38065
0.192047 sltiu 29924
0.131027 srai 20416
0.011071 slti 1725
addi is *the* most popular instruction. This happens on other code bases I've looked at as well. On Fedora addi comes in slightly behind jal. Part of this is that addi does triple duty as both the "move register" instruction and the "load immediate" instruction (both of which could be done by other instructions such as ori instead) but incrementing and decrementing loop variables and the stack pointer is anyway so common that addi would always be in the top instructions. This is 64 bit code, so addiw also makes a showing. If you want to think about RV32I then probably just lump addi and addiw together and call it 18%.
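To make the "triple duty" concrete, here is a small C sketch that encodes addi using the standard I-type layout (imm[11:0], rs1, funct3 = 000, rd, opcode = 0010011) and then expresses mv and a small li through it. The function names are invented for this example.

```c
#include <stdint.h>

/* Encode ADDI rd, rs1, imm (I-type, funct3 = 000, opcode = 0x13).
   imm is expected to fit in 12 signed bits; only the low 12 bits are kept. */
uint32_t encode_addi(unsigned rd, unsigned rs1, int32_t imm)
{
    return (((uint32_t)imm & 0xFFFu) << 20)
         | ((uint32_t)(rs1 & 31u) << 15)
         | ((uint32_t)(rd & 31u) << 7)
         | 0x13u;
}

/* "mv rd, rs" is the pseudo-instruction for "addi rd, rs, 0". */
uint32_t encode_mv(unsigned rd, unsigned rs)
{
    return encode_addi(rd, rs, 0);
}

/* "li rd, imm" for imm in [-2048, 2047] is "addi rd, x0, imm". */
uint32_t encode_li_small(unsigned rd, int32_t imm)
{
    return encode_addi(rd, 0, imm);
}
```

For example, encode_mv(10, 11) gives the word for "mv a0,a1" and encode_addi(2, 2, -16) gives the familiar "addi sp,sp,-16" from function prologues.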
What about the others? slli+andi+srli+ori+xori+sltiu+srai+slti together come to 4.05% of all instructions. That's more than the 3.125% of the opcode space they take up (along with addi), but not a lot more. If you left them all out then RISC-V programs would get at most 4% bigger (less, because the same constant could often be loaded once and left in a register, of which there are usually plenty, to be used several times), and probably no more than 1% slower (because the loading of the constant could often be done outside of a loop).
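The arithmetic behind those two figures can be checked with a few lines of C. The percentages are copied from the table above. The 1/32 figure is my reading of where 3.125% comes from (one major opcode out of the 32 selectable once the two low opcode bits are fixed for 32-bit instructions), so treat that part as an assumption. All names are made up for the example.

```c
#include <stddef.h>

struct freq { const char *mnemonic; double percent; };

/* Register-immediate instructions other than addi/addiw, from the table. */
static const struct freq imm_ops[] = {
    {"slli", 1.430902}, {"andi", 1.314809}, {"srli", 0.433172},
    {"ori",  0.296973}, {"xori", 0.244295}, {"sltiu", 0.192047},
    {"srai", 0.131027}, {"slti", 0.011071},
};

/* Combined share of all instructions, in percent (about 4.05). */
double imm_ops_total(void)
{
    double total = 0.0;
    for (size_t i = 0; i < sizeof imm_ops / sizeof imm_ops[0]; i++)
        total += imm_ops[i].percent;
    return total;
}

/* Assumed origin of 3.125%: one major opcode out of 32. */
double one_major_opcode_percent(void)
{
    return 100.0 / 32.0;
}
```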
Do I seriously suggest ripping those immediate instructions out of the standard? No, of course not. The standard is ratified :-) And they are carrying their weight, collectively, even if ori, xori, sltiu, srai, slti individually are not. It would also make the hardware *more* complex to disable them, given that the ALU supports those operations, and the data path for immediates from the instruction decoder to the ALU has to exist anyway.
It is however a simple mathematical fact that an immediate instruction takes up 128x more encoding space than the corresponding instruction with two register sources. We can add hundreds and hundreds of R-type instructions in future without problems, but it's going to need a very strong justification to add more immediate instructions -- at least within the 32 bit opcode space. Future 48 bit, 64 bit or longer instructions are a different matter.
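The 128x figure follows from counting operand bits: an I-type instruction carries rd, rs1 and a 12-bit immediate (22 bits), while an R-type carries rd, rs1 and rs2 (15 bits). A short C sketch of that count, with names invented for the example:

```c
#include <stdint.h>

enum { REG_BITS = 5, IMM_BITS = 12 };

/* Distinct encodings consumed by one instruction carrying this many operand bits. */
uint64_t encodings(unsigned operand_bits)
{
    return (uint64_t)1 << operand_bits;
}

/* I-type: rd + rs1 + 12-bit immediate = 22 operand bits. */
uint64_t i_type_encodings(void)
{
    return encodings(2 * REG_BITS + IMM_BITS);
}

/* R-type: rd + rs1 + rs2 = 15 operand bits. */
uint64_t r_type_encodings(void)
{
    return encodings(3 * REG_BITS);
}
```

Dividing the first by the second gives 2^22 / 2^15 = 2^7 = 128.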
I make an exception for the shift instructions. slli, srli, srai don't use the entire 12 bit immediate field, but only enough bits to encode a number up to the register size -- 5 bits for RV32, 6 for RV64. There is room to add more than 100 "shift-like" instructions in the unused all-zero bits of the slli and srli encodings. (srai already uses one of these). The proposal for the BitManip extension adds a number of "shift-like" instructions with immediate versions e.g. sloi, sroi, rori, grevi, gorci.
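A quick way to sanity-check the "more than 100" estimate, sketched in C. The counting rule is an assumption on my part: the 12-bit immediate field splits into a shift amount plus spare upper bits, there are two funct3 groups (slli, and srli/srai), and slli, srli and srai are the three patterns already taken. The helper name is hypothetical.

```c
/* Spare "shift-like" encodings under the slli and srli funct3 values.
   shamt_bits is 5 for RV32 and 6 for RV64. */
unsigned spare_shift_slots(unsigned shamt_bits)
{
    unsigned upper_bits = 12u - shamt_bits;       /* immediate bits not used by the shift amount */
    unsigned patterns = 2u * (1u << upper_bits);  /* two funct3 groups, each with 2^upper_bits patterns */
    return patterns - 3u;                         /* minus slli, srli and srai themselves */
}
```

Under these assumptions RV64 leaves 125 spare patterns and RV32 leaves 253, which is consistent with "more than 100" in either case.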
All this does I think demonstrate that while RV32I is fairly minimal, it could be made significantly more minimal without huge harm to code size or speed.
SiliconWizard:
--- Quote from: ataradov on December 27, 2019, 05:27:00 pm ---Well, what is in the main set and what is an extension is a matter of preference. If you include everything into the basic set, then you will make basic implementations of the ISA much harder. And I personally appreciate the simplicity and ease of implementation of the basic set.
--- End quote ---
Well of course not everyone will have the same opinion of what should be the minimal set, and it will largely depend on the kind of code they tend to work on.
The RISC-V idea is to put a minimal subset in the base set (I/E), which you can implement everything with (except for the more hardware-related extensions such as "A"). Additional extensions (except again the hardware-related ones) are for performance only. You can absolutely implement FP in software with RV32/64I for instance.
So yes, it all comes down to what you consider important for performance or not. As I said, for instance, I wouldn't have a problem with bit manipulation having its own extension (although I may have done it differently, but that's a matter of preference, as you said). I just think it's past time it got defined. I understand the whole idea of statistically evaluating the use of given instructions and deciding which ones to include based on that, but I also think this approach is not without flaws.
--- Quote from: ataradov on December 27, 2019, 05:27:00 pm ---There are no advanced instructions in Cortex-M0+ either. You just go to the higher end core when you need them. Same with RISC-V, you go to a core that also implements an extension.
--- End quote ---
OK, but the difference here is not as drastic. Cortex-M0 has (I don't know the difference with M0+; is its instruction set smaller than the M0's?) "add with carry" instructions and clz (I think), for instance, which were in question here.
--- Quote from: ataradov on December 27, 2019, 05:27:00 pm ---Yes, it sucks that extensions sit undefined for years. I believe the main stopping point here is a lack of confidence that a specific implementation or set of instructions is good. Hopefully, with more and more RISC-V devices appearing, there will be more push to standardize things.
--- End quote ---
Certainly. I don't quite know how priorities at the RISC-V Foundation level are defined though. I'd be interested in understanding what drives them. I'd suspect that they are largely influenced by the "main" big members.
--- Quote from: ataradov on December 27, 2019, 05:27:00 pm ---Add with carry can be somewhat efficiently implemented using SLTU and the like. It is not as efficient as far as the number of instructions goes, but if you are doing an optimized microarchitecture, it makes for a much easier implementation.
--- End quote ---
Well, sure it would use 'sltu'. As for it being much easier to implement... this seems slightly exaggerated. Handling a carry flag is pretty cheap IMO. You get it for almost no added cost with pretty much any multi-bit adder. Adding a couple of instructions (which would be derivatives of the normal add anyway) wouldn't massively hurt anything either.
Just a small example.
Consider the very simple code below, compiled for a 32-bit target:
--- Code: ---
#include <stdint.h>

uint64_t Add64(uint64_t n1, uint64_t n2)
{
    return n1 + n2;
}
--- End code ---
RV32I:
--- Code: ---
    mv a5,a0
    add a0,a0,a2
    sltu a5,a0,a5
    add a1,a1,a3
    add a1,a5,a1
    ret
--- End code ---
NanoMIPS (you can see that it's almost exactly the same as with RV32I):
--- Code: ---
    addu $a2,$a0,$a2
    addu $a1,$a1,$a3
    sltu $a4,$a2,$a0
    move $a0,$a2
    addu $a1,$a4,$a1
    jrc $ra
--- End code ---
ARM Cortex-M4:
--- Code: ---
    adds r0, r0, r2
    adc r1, r3, r1
    bx lr
--- End code ---
ARM Cortex-M0 (don't know the difference between adc and adcs, but it seems pretty equivalent to -M4):
--- Code: ---
    adds r0, r0, r2
    adcs r1, r1, r3
    bx lr
--- End code ---
It's basically 5 instructions (not counting ret) for RV32I (and interestingly NanoMIPS, which looks pretty close anyway - not that surprising), and 2 for Cortex-M0 and -M4.
No matter how efficient your implementation is, it's hard to beat that. If you're using a lot of large integer operations in some code, it'll make a pretty significant difference.
Not to mention that beyond code size (which can be mitigated using compressed instructions), you potentially get additional performance issues if you need more instructions to do the same operation. Data hazards are a lot more likely to occur between successive instructions and may not all be solvable without stalling the pipeline...
ataradov:
Adding carry and other flags has significant implications on the hardware design.
Having separate flags introduces additional pipeline hazards, which may make efficient implementation very hard.
The idea here is that the microarchitecture will take care of multiple instructions that can be fused together and executed as one.
Who cares how many instructions there are if they take the same amount of time to execute?
Of course, simplest implementations won't do any of this and will suffer a bit. But you shouldn't design modern architectures for simplest implementations.