My surprises show up when initializing peripherals. I expected code like:
PORT->Group[0].PINCFG[12].reg |= PORT_PINCFG_DRVSTR;
PORT->Group[0].DIRSET.reg |= 1<<12;
To be implementable with code something like:
ldr r1, =(PORT + <offset of GROUP[0]>)
ldr r2, [r1, #<offset of PINCFG[12]>]
orr r2, #PORT_PINCFG_DRVSTR
str r2, [r1, #<offset of PINCFG[12]>]
ldr r2, [r1, #PORT_DIRSET]
orr r2, #4096
str r2, [r1, #PORT_DIRSET]
Instead, you run into "orr doesn't have immediate arguments any more" and "PINCFG is beyond the range allowed by the [r, #const] encoding", so the code takes an extra 5 instructions and two additional registers. The extra instructions may be a wash with the 32bit forms on the v7m chips, but having to use the extra registers (out of the limited set available) is ... annoying.
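To make the "[r, #const] range" complaint concrete, here is a rough C sketch of the limit as I understand it for the 16-bit Thumb load/store forms: a 5-bit immediate scaled by the access size, so 0..31 for bytes, 0..62 for halfwords and 0..124 for words. The function name is made up for illustration, and it ignores the separate SP-relative and PC-relative encodings.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from any vendor header): can `offset` be
   encoded in the 16-bit Thumb LDR/STR [Rn, #imm] form for an access
   of `size` bytes (1, 2 or 4)?  Assumes the 5-bit immediate is scaled
   by the access size, so the offset must be aligned and at most
   31 * size. */
static bool thumb16_offset_fits(unsigned size, uint32_t offset)
{
    if (size != 1 && size != 2 && size != 4)
        return false;               /* only byte/halfword/word forms */
    if (offset % size != 0)
        return false;               /* scaled immediate: must be aligned */
    return offset / size <= 31;     /* 5-bit field */
}
```

So thumb16_offset_fits(4, 124) is true but thumb16_offset_fits(4, 128) is not, and a byte-wide register has to sit within 31 bytes of the base to avoid the extra address load.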
I guess there are two options: 1) let the C compiler figure it out, or 2) do something like
ldr r1, =(PORT + <offset of GROUP[0]> + <offset of PINCFG[12]>)
ldr r2, [r1]
ldr r3, =PORT_PINCFG_DRVSTR
orr r2, r3
str r2, [r1]
ldr r1, =(PORT + <offset of GROUP[0]> + PORT_DIRSET)
ldr r2, [r1]
ldr r3, =4096
orr r2, r3
str r2, [r1]
One extra register and three extra instructions. And four 32-bit values in a nearby constant pool instead of the three you'd have in ARM/Thumb2 mode, if that code was actually valid (I didn't check too hard).
So:
A32 is a total of 7*4 + 3*4 = 40 bytes
T16 is a total of 10*2 + 4*4 = 36 bytes
Some size savings, but not a lot. I *think* T32 would be the same size as the A32.
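For anyone who wants to redo that arithmetic with different counts, a trivial C sketch. The function name is mine, and it assumes every literal-pool entry is a 4-byte word and ignores any alignment padding in front of the pool.

```c
#include <stddef.h>

/* Hypothetical helper: total footprint of a code fragment in bytes.
   n_instr     - number of instructions
   instr_bytes - bytes per instruction (4 for A32, 2 for T16)
   n_literals  - number of 32-bit literal-pool entries
   Assumes no padding is needed to align the pool. */
static size_t code_bytes(size_t n_instr, size_t instr_bytes, size_t n_literals)
{
    return n_instr * instr_bytes + n_literals * 4;
}
```

With the counts above, code_bytes(7, 4, 3) gives the A32 figure and code_bytes(10, 2, 4) gives the T16 figure.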
Now, what Bruce's example code seems to demonstrate is that the "peripheral initialization" is essentially a degenerate case and that the issues I'm complaining about show up less in the "meat" of a real program. That could be, and it's an interesting result.
Sure. Computations with values that are already in registers are where 16 bit opcodes shine. That's equally true with PDP11, M68k, Thumb1, RISC-V C, MSP430, SH4. Or even x86 with opcode + ModR/M byte for reg-reg operations, until it starts needing prefix bytes to set the operand size.
(I was impressed by the RV32i summary that was posted, WRT the impressive array of "immediate" operands. But I haven't looked too carefully to see if it does the things I want.)
12 bit immediates and offsets on everything. It's often enough, but you can't do your #4096 as an immediate (only -2048...+2047 is covered). You can do it as LUI t0, #00001. In general you can make any 32 bit constant with LUI t0,#nnnnn; ADDI t0,t0,#nnn, or any 32-bit offset from the PC with AUIPC t0,#nnnnn; ADDI t0,t0,#nnn. Or you can load or store to any 32 bit absolute or PC-relative address with an LUI or AUIPC followed by a load or store with an offset.
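One wrinkle in the LUI + ADDI pairing is that ADDI sign-extends its 12-bit immediate, so the upper part has to be rounded up whenever bit 11 of the constant is set. A minimal C sketch of the split for RV32 follows. The type and function names are hypothetical, not from any toolchain, and this is only the arithmetic, not what any particular assembler does internally.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type, not from any RISC-V toolchain header. */
typedef struct {
    uint32_t hi20;  /* 20-bit immediate for LUI (or AUIPC) */
    int32_t  lo12;  /* signed 12-bit immediate for ADDI, -2048..2047 */
} rv_imm_split;

/* Split a 32-bit value so that (hi20 << 12) + lo12 == value (mod 2^32).
   Adding 0x800 before shifting rounds the upper part up by one when
   bit 11 is set, compensating for the negative low part. */
static rv_imm_split rv_split_imm32(uint32_t value)
{
    rv_imm_split s;
    s.hi20 = ((value + 0x800u) >> 12) & 0xFFFFFu;
    s.lo12 = (int32_t)(value & 0xFFFu);
    if (s.lo12 >= 0x800)
        s.lo12 -= 0x1000;           /* reinterpret as signed 12-bit */
    return s;
}

/* Recombine as LUI followed by ADDI would on RV32, wrapping at 32 bits. */
static uint32_t rv_join_imm32(rv_imm_split s)
{
    return (s.hi20 << 12) + (uint32_t)s.lo12;
}

/* Small self-check: splitting and rejoining must round-trip. */
static void rv_split_selfcheck(void)
{
    static const uint32_t samples[] = {
        0u, 2047u, 0x800u, 4096u, 0x12345FFFu, 0xFFFFFFFFu
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(rv_join_imm32(rv_split_imm32(samples[i])) == samples[i]);
}
```

For 4096 this gives hi20 = 1 and lo12 = 0, which is why a lone LUI is enough there. For something like 0x12345FFF it gives hi20 = 0x12346 and lo12 = -1.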
As with ARM's LDR Rd, =value, there are assembler pseudo ops (li and la on RISC-V) so you don't have to worry about the exact instructions used in a particular case.
RISC-V is allergic to constant pools. They are ok in low end processors, but as soon as you get an instruction cache you have the problem that the constant pools will likely get into the instruction cache, but be useless there. And if you have a data cache then instructions around the constant pool get into the data cache, and are useless there. Maybe the compiler/linker could arrange for the constant pools to be in different cache lines to instructions, but I haven't seen that happen.
So RISC-V, along with MIPS, Alpha, and ARM64, prefers using inline code to load constants, even if it needs several instructions to do it.