Yeah, fact 0 to know about avr-gcc is, it generates awful code. -O0 is especially verbose; realize that, by setting this, you are telling the compiler you want everything finalized into memory after every statement: use the initial, most literal, most verbose instruction sequence, don't touch it, leave it perfectly alone. So it's going to maximize memory load/store, and dot all the tees and cross all the eyes, for even the most trivial of operations. And yeah, CBI/SBI isn't on the list of standard generation, it's an optimization.
Even on higher settings, it leaves evidence of internal 16-bit operations all over the place (e.g. trivial register swaps that could've been renamed; word-level granularity only, i.e. no allocating chars to odd registers; sign/zero extensions that are never read; etc.), it's completely(?) ignorant of 16+-bit multiplication (uses library calls), and it just uses that much more code than is needed.
Fortunately the ISA is easy to understand, so as much as you're checking compiler output, you can just as well [re]write it yourself. The resources I use are:
https://sourceware.org/binutils/docs/as/index.html - assembler manual itself, don't forget to check the avr-specific section
https://gcc.gnu.org/wiki/avr-gcc - now you can predict where variables are put in registers when a function is called / where they're expected on return
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html - the best format for inline asm as the compiler can do something with it
If you'd find an example helpful, you may want to take a look at this project:
https://github.com/T3sl4co1l/Reverb/blob/master/asmdsp.S - compare with:
https://github.com/T3sl4co1l/Reverb/blob/master/dsp.c
I wrote this by starting with the C functions, as generated, and optimizing them down to less than half the cycle count, I think. (There are also commented-out versions of helper functions, like mac32p16p16, which I used to break down the transition a bit, from optimized core math to fully optimized functions.) Overhead on 16-bit MAC sort of arithmetic is pretty awful, so this is something of an exaggerated case. But I was able to do a few things a compiler can't (or C can't express, or in any case GCC certainly can't), like in dspReverbTaps, where I use a 24-bit data type. (I suppose an individual-register-aware compiler would be able to take advantage of the reduced stack usage, but GCC certainly can't.)
I forget anymore if it was this project or another, but occasionally GCC does produce "perfect" code. If it's not the ADC interrupt here, it'd be some timer or ADC interrupt in another project I did, that was basically: read registers and put them in memory. Trivial operation, not many ways you can screw it up, and fortunately it didn't.
Incidentally, there is one way to coax better generation: in another project, I made use of a struct to hold control loop parameters; GCC normally accesses this through a pointer with offset indirect instructions, which saves a cycle per access on XMEGA, plus a whole word over the LDS it would otherwise emit. This had the form,
ctrlState_t ctrlState; // max 32-word struct containing controller state
...
ISR(TIMER_OVERFLOW_vect) {
    ctrlRet_t r; // couple-word struct
    r = updateCtrl(&ctrlState);
    // set DAC and timer registers to r members
}
To prevent it from inlining the fixed parameter (&ctrlState), I have the target function set with:
#pragma GCC push_options
#pragma GCC optimize ("-fno-inline")
#pragma GCC optimize ("-Os")
#pragma GCC optimize ("-fno-ipa-cp")
ctrlRet_t updateCtrl(volatile ctrlState_t* s) {
    ...
}
#pragma GCC pop_options
That's definitely something a compiler should just know to do, and I mean, most CPUs prefer this motif as well, so, beats me how the hell AVR misses out on it?... (Relative to avr-gcc 8.1.0, so, hardly an exemplar.)
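For what it's worth, the same intent can probably be expressed per function with GCC attributes instead of pragmas: noinline keeps the call out of line, and noclone stops the constant-propagation pass from making a specialized copy with &ctrlState baked in. A minimal sketch, assuming made-up struct members (setpoint, measured, integ, dac, period) and a made-up KEEP_OUT_OF_LINE macro; not checked against avr-gcc 8.1.0, so compare the generated code before relying on it:

```c
#include <stdint.h>

/* noclone is GCC-specific, so only apply the attributes there.
 * KEEP_OUT_OF_LINE is a hypothetical helper macro, not from the post. */
#if defined(__GNUC__) && !defined(__clang__)
#define KEEP_OUT_OF_LINE __attribute__((noinline, noclone))
#else
#define KEEP_OUT_OF_LINE
#endif

/* Hypothetical members; the real struct layout isn't shown in the post. */
typedef struct {
    int16_t setpoint;
    int16_t measured;
    int16_t integ;
} ctrlState_t;

typedef struct {
    uint16_t dac;
    uint16_t period;
} ctrlRet_t;

ctrlState_t ctrlState;

/* Every access goes through s, so as long as the pointer survives as a
 * real argument, the compiler is free to use base+offset addressing. */
KEEP_OUT_OF_LINE
ctrlRet_t updateCtrl(volatile ctrlState_t *s) {
    ctrlRet_t r;
    int16_t err = (int16_t)(s->setpoint - s->measured);
    s->integ = (int16_t)(s->integ + err);   /* toy integrator */
    r.dac = (uint16_t)s->integ;
    r.period = 1000u;                        /* placeholder value */
    return r;
}
```

The ISR side stays the same as above: call updateCtrl(&ctrlState) and copy the members of the result into the DAC and timer registers.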
Speaking of the mul operations, I refined one further in another project:
/**
 * Multiplies two 16-bit integers, with rounding, as an intermediate
 * format in 16.16 fixed point, returning the top (integral, 16.0) part.
 */
uint16_t asm_umul16x0p16(uint16_t a, uint16_t b) {
    uint16_t acc;
    // acc = (((uint32_t)a * (uint32_t)b) + 0x8000ul) >> 16;
    __asm__ __volatile__(
        "mul %A[argB], %A[argA]\n\t"
        "mov r19, r1\n\t"
        "mul %B[argB], %B[argA]\n\t"
        "movw %A[aAcc], r0\n\t"
        "mul %A[argB], %B[argA]\n\t"
        "add r19, r0\n\t"
        "adc %A[aAcc], r1\n\t"
        "eor r18, r18\n\t"
        "adc %B[aAcc], r18\n\t"
        "mul %B[argB], %A[argA]\n\t"
        // add, not adc: mul leaves bit 15 of the product in C
        "add r19, r0\n\t"
        "adc %A[aAcc], r1\n\t"
        "adc %B[aAcc], r18\n\t"
        // round: subtracting 0xffff80 adds 0x80 to the 24-bit acc:r19
        "subi r19, 0x80\n\t"
        "sbci %A[aAcc], 0xff\n\t"
        "sbci %B[aAcc], 0xff\n\t"
        "eor r1, r1\n\t"
        : [aAcc] "=&d" (acc)
        : [argA] "r" (a), [argB] "r" (b)
        : "r18", "r19"
    );
    return acc;
}
so, taking advantage of that extended syntax. (The r18/r19 clobber, I think, could stand to be set to free scratch registers? Or maybe I tried that and it didn't take, as I don't quite understand the syntax. And, I don't claim that the above is correct in all contexts. It seems to work in a few, I certainly don't have exhaustive unit tests to prove otherwise.)
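On the scratch-register question: one approach that should work is to declare the temporaries as ordinary locals and bind them as early-clobber outputs, so the register allocator picks them instead of r18/r19 being hard-coded. The sketch below does that. The names umul16x0p16_alloc, umul16x0p16_ref, umul16x0p16_selftest, frac and zero are made up for illustration, and the AVR branch has not been run on hardware. The non-AVR branch models the same partial-product steps in plain C, so the arithmetic can at least be checked on a PC against the one-line expression. Note it uses add (not adc) right after each mul, since mul overwrites C with bit 15 of the product.

```c
#include <stdint.h>

/* Reference: the one-line expression from the comment in the function. */
static uint16_t umul16x0p16_ref(uint16_t a, uint16_t b) {
    return (uint16_t)((((uint32_t)a * (uint32_t)b) + 0x8000ul) >> 16);
}

/* Hypothetical variant: scratch registers are chosen by the compiler. */
uint16_t umul16x0p16_alloc(uint16_t a, uint16_t b) {
#if defined(__AVR__)
    uint16_t acc;   /* bits 16..31 of the product */
    uint8_t frac;   /* bits 8..15 of the product (was r19) */
    uint8_t zero;   /* constant 0 for carry propagation (was r18) */
    __asm__ (
        "mul  %A[argB], %A[argA]\n\t"
        "mov  %[frac], r1\n\t"
        "mul  %B[argB], %B[argA]\n\t"
        "movw %A[aAcc], r0\n\t"
        "mul  %A[argB], %B[argA]\n\t"
        "add  %[frac], r0\n\t"
        "adc  %A[aAcc], r1\n\t"
        "eor  %[zero], %[zero]\n\t"     /* eor leaves C untouched */
        "adc  %B[aAcc], %[zero]\n\t"
        "mul  %B[argB], %A[argA]\n\t"
        "add  %[frac], r0\n\t"
        "adc  %A[aAcc], r1\n\t"
        "adc  %B[aAcc], %[zero]\n\t"
        "subi %[frac], 0x80\n\t"        /* needs an upper register: "d" */
        "sbci %A[aAcc], 0xff\n\t"
        "sbci %B[aAcc], 0xff\n\t"
        "eor  r1, r1\n\t"               /* restore the zero register */
        : [aAcc] "=&d" (acc), [frac] "=&d" (frac), [zero] "=&r" (zero)
        : [argA] "r" (a), [argB] "r" (b)
    );
    return acc;
#else
    /* Host model of the same steps: a 24-bit accumulator acc:frac. */
    uint8_t al = (uint8_t)(a & 0xff), ah = (uint8_t)(a >> 8);
    uint8_t bl = (uint8_t)(b & 0xff), bh = (uint8_t)(b >> 8);
    uint32_t acc24 = (((uint32_t)ah * bh) << 8) | (((uint32_t)al * bl) >> 8);
    acc24 += (uint32_t)bl * ah;     /* first cross product */
    acc24 += (uint32_t)bh * al;     /* second cross product */
    acc24 += 0x80;                  /* rounding (the subi/sbci trio) */
    return (uint16_t)(acc24 >> 8);
#endif
}

/* Strided comparison against the reference; returns the mismatch count. */
long umul16x0p16_selftest(void) {
    long bad = 0;
    for (uint32_t a = 0; a <= 0xffff; a += 251)
        for (uint32_t b = 0; b <= 0xffff; b += 241)
            if (umul16x0p16_alloc((uint16_t)a, (uint16_t)b) !=
                umul16x0p16_ref((uint16_t)a, (uint16_t)b))
                bad++;
    return bad;
}
```

The strided loop is nowhere near exhaustive (all 2^32 pairs would be the real proof), but it runs in well under a second and a nonzero return from umul16x0p16_selftest() flags a carry-handling slip quickly.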
Tim