Completely agree. Fusion makes you sound smart, but in practice almost nobody does it.
Ironically, one example is that modern high-end x86 and ARM cores fuse a compare with an immediately following conditional branch. Ironic because this has always been a single instruction in RISC-V.
SiFive's 7-series dual-issue cores (e.g. the U74 in HiFive Unmatched, Beagle Starlight beta, VisionFive V1, VisionFive 2) link a short forward conditional branch over a single instruction with that following instruction. It's not fusion because they don't become a single instruction. Both instructions proceed down the two pipelines in parallel, and at the final stage, if the branch turns out to be taken, the other instruction is turned into a NOP (the register or memory write is squashed).
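To make the "both go down the pipes, the write gets squashed" behaviour concrete, here is a rough C model of a branch hopping over a single addi. This is only a sketch of the architectural effect, not of SiFive's actual pipeline; the function name and the choice of beq/addi are my own assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two-instruction sequence
 *       beq  a, b, skip
 *       addi rd, rs, imm
 *   skip:
 * Returns the value rd holds after the sequence. */
uint64_t branch_over_addi(uint64_t a, uint64_t b,
                          uint64_t rd_old, uint64_t rs, uint64_t imm)
{
    /* One pipe computes the addi result without waiting for the branch. */
    uint64_t speculative = rs + imm;

    /* The other pipe resolves the branch condition in parallel. */
    bool taken = (a == b);

    /* Final stage: a taken branch squashes the write, so rd keeps its
     * old value, exactly as if the addi had been a NOP. */
    return taken ? rd_old : speculative;
}
```

The net effect is a conditional move or conditional add without the branch ever being predicted or redirecting fetch.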
In one very simple example of fusion sometimes given, the 32x32->64 multiply requires two instructions to get the full 64-bit result, and fusion would avoid multiplying twice. The same can be achieved (and this is what I've done) by just registering the last multiply operation and outputting the registered result if a subsequent multiply matches the operands and signedness. That is even better, as it doesn't require the two multiply instructions to be directly consecutive, and it's much cheaper to implement than any kind of fusion.
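For anyone who wants to see the idea in code, here is a small behavioural model in C of a multiplier that registers its last result. It is a sketch under my own assumptions, not the actual RTL described above: the struct, the function names and the `multiplies` counter are all made up for illustration, and letting the low-half MUL hit on any signedness is an extra relaxation I added (the low 32 bits don't depend on signedness).

```c
#include <stdbool.h>
#include <stdint.h>

/* Signedness of the operands: MULH, MULHSU, MULHU respectively. */
typedef enum { MUL_SS, MUL_SU, MUL_UU } mul_sign_t;

/* Hypothetical multiply unit state: the registered last operation. */
typedef struct {
    bool       valid;
    uint32_t   a, b;
    mul_sign_t sign;
    uint64_t   product;
    unsigned   multiplies;   /* how often the real multiplier ran */
} mul_unit_t;

/* Full 64-bit product; operands are sign- or zero-extended first and
 * multiplied modulo 2^64, which yields the correct bit pattern. */
static uint64_t full_product(uint32_t a, uint32_t b, mul_sign_t sign)
{
    uint64_t xa = (sign == MUL_UU) ? (uint64_t)a
                                   : (uint64_t)(int64_t)(int32_t)a;
    uint64_t xb = (sign == MUL_SS) ? (uint64_t)(int64_t)(int32_t)b
                                   : (uint64_t)b;
    return xa * xb;
}

static uint64_t lookup(mul_unit_t *u, uint32_t a, uint32_t b,
                       mul_sign_t sign, bool low_only)
{
    /* Assumption: a low-half request may reuse a product of any
     * signedness, because bits 31..0 are identical in all three. */
    bool hit = u->valid && u->a == a && u->b == b &&
               (low_only || u->sign == sign);
    if (!hit) {
        u->product = full_product(a, b, sign);
        u->a = a;
        u->b = b;
        u->sign = sign;
        u->valid = true;
        u->multiplies++;
    }
    return u->product;
}

/* MUL: low 32 bits of the product. */
uint32_t mul_lo(mul_unit_t *u, uint32_t a, uint32_t b)
{
    return (uint32_t)lookup(u, a, b, MUL_SS, true);
}

/* MULH / MULHSU / MULHU: high 32 bits of the product. */
uint32_t mul_hi(mul_unit_t *u, uint32_t a, uint32_t b, mul_sign_t sign)
{
    return (uint32_t)(lookup(u, a, b, sign, false) >> 32);
}
```

A MULH followed at any later point by a MUL on the same operands runs the multiplier once, as long as no other multiply overwrote the registered result in between.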
Good optimisation. I'd speculate that its main impact might be catching simple repeated multiplication of the same operands, rather than the 32x32->64 cases!
Of course there are cases where it's less trivial and for which fusion would make more sense, but I'm not sure I have seen many examples of this since, as you said, OoO is a more generic approach in this case and works much better overall.
Some of the potential RISC-V instruction fusion candidates have simply been added as official instructions in later extensions. This loses the advantage of code running unchanged on CPUs that don't implement them, but otherwise gains the performance advantages in a simpler way, and also gives code size advantages that fusing wouldn't.
The main example that comes to mind is the Zba extension (one of the extensions commonly grouped as "B"): add.uw, sh1add, sh1add.uw, sh2add, sh2add.uw, sh3add, sh3add.uw, slli.uw.
These are various combinations of:
- zero extension of rs1 from 32 to 64 bits (which itself requires a SLLI;SRLI pair in base RVI)
- shift rs1 left by 0,1,2, or 3 bits (or 0..63 for slli.uw)
- add rs1 and rs2
Number of base RVI instructions each one replaces:
2: sh1add, sh2add, sh3add, slli.uw
3: add.uw, sh1add.uw, sh2add.uw, sh3add.uw
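As a reference for the combinations listed above, here is what each Zba instruction computes on RV64, written as plain C functions. The function names are just the mnemonics with underscores, and `sh2add_uw_base` is my own illustration of the three-instruction base RV64I sequence (slli, srli, add) that sh2add.uw stands in for.

```c
#include <stdint.h>

/* Zero-extend the low 32 bits of a 64-bit register value. */
static inline uint64_t zext32(uint64_t x) { return x & 0xFFFFFFFFu; }

/* Shift-and-add: rd = rs2 + (rs1 << N) */
uint64_t sh1add(uint64_t rs1, uint64_t rs2) { return rs2 + (rs1 << 1); }
uint64_t sh2add(uint64_t rs1, uint64_t rs2) { return rs2 + (rs1 << 2); }
uint64_t sh3add(uint64_t rs1, uint64_t rs2) { return rs2 + (rs1 << 3); }

/* .uw variants: rs1 is zero-extended from 32 bits first. */
uint64_t add_uw(uint64_t rs1, uint64_t rs2)    { return rs2 + zext32(rs1); }
uint64_t sh1add_uw(uint64_t rs1, uint64_t rs2) { return rs2 + (zext32(rs1) << 1); }
uint64_t sh2add_uw(uint64_t rs1, uint64_t rs2) { return rs2 + (zext32(rs1) << 2); }
uint64_t sh3add_uw(uint64_t rs1, uint64_t rs2) { return rs2 + (zext32(rs1) << 3); }

/* slli.uw: zero-extend, then shift left by 0..63. */
uint64_t slli_uw(uint64_t rs1, unsigned shamt)
{
    return zext32(rs1) << (shamt & 63);
}

/* The base RV64I sequence that sh2add.uw replaces:
 *     slli t, rs1, 32
 *     srli t, t, 30      (zero-extends and leaves a shift of 2)
 *     add  rd, t, rs2                                          */
uint64_t sh2add_uw_base(uint64_t rs1, uint64_t rs2)
{
    uint64_t t = rs1 << 32;
    t = t >> 30;
    return t + rs2;
}
```

The zero-extension and the small left shift fold into the same SLLI;SRLI pair, which is why the .uw shift-adds replace three instructions rather than four.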
Fusion might be reasonable for two instructions, but expecting it for three instructions is getting out of hand. Also, if you don't want to break the pipeline design then fused instruction sequences must all modify the same register (i.e. only modify one register), which would constrain a fused instruction sequence for the above to either have Rd distinct from Rs1 and Rs2 (the usual case) or else the same as Rs1. In the very common case where Rs1 is a loop index this means you'd need to either copy it to a new register first (making the fused sequence 1 instruction longer) or else require the first instruction of the fused sequence to be a 4 byte 3-address opcode. So either way you're looking at 6 or 8 bytes of code you're trying to fuse to get the effect of the above 4 byte instructions.
The reason for all the .uw variants is that it has been discovered that a frightening amount of code in the wild has been "optimised" by using 32 bit "unsigned", "uint32_t" etc variables as loop counters and array indexes in 64 bit code instead of the natural "int" or a 64 bit type such as "long", "unsigned long", "size_t", "ptrdiff_t", "int64_t", "uint64_t".
Legacy 32 bit code of course usually just uses "int". However using a signed 32 bit type such as "int" on amd64 or arm64 is suboptimal because a separate sign-extension step is required, as both ISAs automatically zero-extend 32 bit values to 64 bits. So a lot of people have been going around replacing int with -- not a 64 bit type, which would make sense -- but with a 32 bit unsigned type.
And this *pessimises* RISC-V, which automatically sign-extends 32 bit results to 64 bits. RISC-V code is optimal if people use either the legacy code "int" or any 64 bit type.
Grrrrr.
One of the places this showed up rather badly is in CoreMark. Early RISC-V CoreMark results typedef'd the offending 32 bit unsigned variables used for array indexing to int. But then ARM and their friends said "that is an illegal modification, you must not change typedefs".
So, sh1add, sh2add, sh3add (or just plain "add" for byte arrays) work for indexes of type int32, int64, uint64 while add.uw, sh1add.uw, sh2add.uw, sh3add.uw work for indexes of type uint32 and all the bases are covered equally efficiently.
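To show where the extra work for a uint32 index comes from, here is a small C model of computing the address of a[i] for 4-byte elements when the index sits in a 64-bit register in RV64's sign-extended form. It's a sketch only; the helper names are invented, and real compilers can often hoist or eliminate the extension.

```c
#include <stdint.h>

/* How RV64 keeps a 32-bit result in a 64-bit register:
 * sign-extended from bit 31 (as addw, lw, etc. produce). */
static inline uint64_t reg_from_w(uint32_t v)
{
    return (uint64_t)(int64_t)(int32_t)v;
}

/* int index: the register already holds the correct 64-bit value,
 * so base RV64I needs slli + add (or a single sh2add with Zba). */
uint64_t addr_int_index(uint64_t base, int32_t i)
{
    uint64_t reg = reg_from_w((uint32_t)i);
    return base + (reg << 2);
}

/* uint32_t index: the sign-extended register must be zero-extended
 * first, so base RV64I needs slli + srli + add
 * (or a single sh2add.uw with Zba). */
uint64_t addr_u32_index(uint64_t base, uint32_t i)
{
    uint64_t reg  = reg_from_w(i);       /* upper bits may be all ones */
    uint64_t zext = (reg << 32) >> 32;   /* the extra SLLI;SRLI pair   */
    return base + (zext << 2);
}
```

With an index of 0x80000000 the register's upper half is all ones, and skipping the zero-extension would compute an address below the array instead of 8 GiB above its base.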
The main reason RISC-V chose to automatically sign-extend 32 bit results, by the way, is that you then only need a single set of (64 bit) compare instructions, which automatically work on both signed and unsigned 32 bit values as well. If you zero-extend 32 bit results then you need both 64 bit and 32 bit compare instructions (at least for signed compares).
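The compare claim is easy to check with a few lines of C. The sketch below models registers as uint64_t and the 64-bit compares the way blt and bltu would evaluate them; all function names are mine and purely illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two ways a 32-bit value could be held in a 64-bit register. */
static inline uint64_t sext32(uint32_t v) { return (uint64_t)(int64_t)(int32_t)v; }
static inline uint64_t zext32(uint32_t v) { return (uint64_t)v; }

/* 64-bit compares, as blt and bltu would evaluate them. */
static inline bool lt64_signed(uint64_t a, uint64_t b)   { return (int64_t)a < (int64_t)b; }
static inline bool lt64_unsigned(uint64_t a, uint64_t b) { return a < b; }

/* Sign-extended registers: the 64-bit signed compare gives the
 * right answer for 32-bit signed values... */
bool lt_i32_on_sext(int32_t a, int32_t b)
{
    return lt64_signed(sext32((uint32_t)a), sext32((uint32_t)b));
}

/* ...and the 64-bit unsigned compare gives the right answer for
 * 32-bit unsigned values, because values with bit 31 set all land
 * above those without it. */
bool lt_u32_on_sext(uint32_t a, uint32_t b)
{
    return lt64_unsigned(sext32(a), sext32(b));
}

/* Zero-extended registers: the 64-bit signed compare is wrong for
 * negative 32-bit values, hence the need for a 32-bit compare. */
bool lt_i32_on_zext(int32_t a, int32_t b)
{
    return lt64_signed(zext32((uint32_t)a), zext32((uint32_t)b));
}
```

For example, -1 < 1 comes out true with sign-extended registers but false with zero-extended ones, since zero-extended -1 looks like 4294967295 to a 64-bit signed compare.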