The idea here is that the microarchitecture will take care of recognizing multiple instructions that can be fused together and executed as one.
That's going to be tricky for ADC: the decoder needs to remember which register holds the SLTU-computed carry bit of which addition, and then find the ADD instructions that consume that register. Those may come in a few permutations, with parts possibly even before the SLTU itself, and possibly spread out and interleaved with unrelated code if the compiler targets a superscalar core. Sounds fun; I'm not sure even x86 performs such complex fusions.
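For reference, here is roughly what a compiler emits for a 64-bit add on RV32, with the operands in a1:a0 and a3:a2 (a sketch; register choice and exact ordering vary by compiler):

add a0,a0,a2     # low word: a0 = a0 + a2
sltu a5,a0,a2    # carry = 1 if the low add wrapped (result < addend, unsigned)
add a1,a1,a3     # high word, not yet including the carry
add a1,a1,a5     # fold the carry into the high word
ret

Note that the sltu and the add that consumes its carry are not adjacent here, which is part of what makes this hard to fuse.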
I've heard/read about proposals to design RISC-V cores with instruction fusion like this, but have never seen one actually work. It sure sounds pretty complex to implement correctly, and I'm frankly not convinced it would end up being simpler than just handling 3-source instructions (which, again, already exist in some RISC-V extensions anyway...)
No one is going to do anything as complex as that.
Instruction fusion is of course a complex thing. The whole point of it is that it confines the complexity of decoding fused sequences, and of needing possibly more than two source operands, to the implementations that actually want to do that. On other implementations the potentially-fused instructions just execute serially as usual.
That means one constraint on fusible instruction pairs (or sequences) is that they must modify only one register. (Otherwise, when they execute fused, all result registers must still be properly written with the intermediate values, which is a whole other level of complexity.)
If we rearrange that 64 bit add example a little...
add a0,a0,a2     # low word: a0 = a0 + a2
sltu a5,a0,a2    # carry out of the low add: a5 = (a0 < a2) unsigned
add a5,a5,a3     # carry plus the high word of the second operand
add a1,a1,a5     # high word: a1 = a1 + a3 + carry
ret
... then the sltu and the following add can be fused: together they compute a5 = (a0 <u a2) + a3 and write only the single register a5, which satisfies the one-result-register constraint above.
I don't expect CPUs will go looking for fusible pairs that are randomly separated in the instruction stream. If you know you're going to be running on a CPU that fuses certain sequences then you tell the compiler to schedule (and register allocate) so as to generate those pairs.
We know recent Intel x86 CPUs fuse cmp/bCC pairs. They don't do it if the two instructions are separated.
SiFive's 7-series cores fuse conditional branches that branch forward over exactly one instruction into effectively predicating that following instruction.
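For example (my own sketch, not lifted from SiFive's documentation), a pair like:

beq a0,zero,1f   # forward branch over exactly one instruction
addi a1,a1,1     # executed only when a0 != 0
1:

can be turned internally into something like a predicated add, so there is no branch to mispredict.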
I don't know of other RISC-V cores that implement any instruction fusion yet. It's a high-end feature for high-performance CPUs, and virtually all RISC-V cores actually released so far have been aimed at the low end. Several companies and other organizations are working on high-performance RISC-V cores, so we'll find out in due course.
Anyway, it's all an exercise in striking a good compromise between simplicity and performance. A pretty tough endeavor, and one that's obviously bound not to please everyone.
Absolutely.
I completely understand the whole idea of having a simple ISA (RISC-V in that regard is very much in the RISC spirit of the early days, whereas most RISC processors have since become monsters and we can question what the "R" even means anymore). But putting all the work for performance on the microarchitecture's shoulders is debatable as well. One point of RISC-V is to make it very easy and lightweight to implement, but if we then need to design complex microarchitectures to really make it efficient, is the compromise really always worth it? At least it certainly doesn't look as easy as what we sometimes hear...
I think you should look at the situation as it already is before deciding any such heroics are needed.
Simple RISC-V cores such as the original open-source single-issue in-order "Rocket" have both code size and program execution cycles within epsilon (+/- a couple of percent) of Thumb2 cores such as the Cortex-M0/M3/M4. It depends on the benchmark: sometimes ARM comes out slightly ahead and sometimes Rocket does.
The Rocket cores also come out smaller in silicon area (cheaper, lower energy use) AND can be clocked higher in the same process. More recent core designs from SiFive, Andes, Western Digital, PULP and others improve on Rocket again.
A lot of people have looked at the FE310 and said "Why on earth would you want to make a microcontroller with only 16 KB of RAM run at 320 MHz?" A fair enough question. The answer is that 320 MHz is simply the speed all of the chips turned out to run at on the very cheap 180 nm process (some do 400 MHz), and there seemed to be no marketing reason to artificially limit it. They run perfectly well at 16 MHz or 80 MHz or whatever if you want (and at lower power levels, of course).
There doesn't seem to be anything else that comes close to matching RV32IMAC and Thumb2 on all axes of code size, clock cycles, and die area.
NanoMIPS, maybe, which looks very, very nice in the manual but seems basically stillborn, which is a great pity. There was an announcement of an RTL core in May 2018, and later of it being licensed to MediaTek, but nothing since then. It might never see the light of day anywhere else. I'd love to be wrong about that, as it seems to be very nice work.
SH4, ColdFire, and Thumb1 have good code size, but having only 2-address instructions loses too much on execution speed unless you go superscalar out-of-order like x86.
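To make the 2-address cost concrete (a sketch, written in RISC-V-style syntax for both cases): keeping both inputs live across an addition costs an extra copy on a 2-address machine:

add a0,a1,a2     # 3-address: one instruction, a1 and a2 both survive
mv a0,a1         # 2-address style: copy first, since the destination is also a source
add a0,a0,a2     # then a0 += a2

One instruction becomes two, and that extra work shows up directly in cycle counts on a simple in-order core.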
Three-address instructions but only 32 bit opcodes loses too much on code size.
Maybe Xtensa's mixed 16-bit and 24-bit instructions come close. I haven't studied it closely enough to know. I do know RISC-V is getting lots of design wins against Xtensa (and ARC), and is also displacing them from existing customers.