Ok, so back after reading the manual for this chip.
The
__attribute__((interrupt("WCH-Interrupt-fast"))) saves 10 registers to the stack. The normal stack in RAM. When I heard about the feature, and that it supports 2-deep interrupt nesting, I'd assumed they'd put a 2nd and 3rd register set in the CPU and it would just be BOOM, single cycle or something, and you're done.
So, actually, saving to the stack in RAM won't be all that much faster than doing the saving in software, and if you have a small interrupt handler that only needs a couple of registers then using the standard RISC-V
__attribute__((interrupt)) that saves only what it uses will probably be faster.
The 10 registers saved:
x1: ra (only actually useful if your interrupt routine calls other functions)
x5-x7: t0-t2
x10-x15: a0-a5
So, the OP's disassembled code changes: ra, a3, a4, a5, s0, s1
The hardware stacking is needlessly saving and restoring: t0, t1, t2, a0, a1, a2
The hardware stacking is NOT saving: s0, s1
The OP would be just as well off turning off the "fast interrupt" feature and using standard RISC-V
__attribute__((interrupt)).
NB: I'm assuming here that the compiler is using the RV32E ABI, which is simply the RV32I ABI truncated to 16 registers. I haven't checked the hex opcodes against the disassembled instructions to confirm this.
There is a proposed EABI that should give considerably better code in many cases, mainly by increasing the S registers from 2 to 5, paid for by cutting down to 2 T registers and 4 A registers (the same number of argument registers as 32 bit ARM, so should be enough for ported software). This code, as compiled, really wants to have 3 S registers, and has to re-create the contents of a4 because it can't use an s2 instead.
Standard ABI vs EABI names for x5-x15, in register order:
Standard: t0, t1, t2, s0, s1, a0, a1, a2, a3, a4, a5
EABI: t0, s3, s4, s0, s1, a0, a1, a2, a3, s2, t1
Hardware stacking for the EABI would only need to stack x5, x10-x13, x15 i.e. 6 registers instead of 10. Which just goes to show WCH are not using it.
https://github.com/riscv-non-isa/riscv-eabi-spec/blob/master/EABI.adoc