Topic: [solved] AVR-GCC generates ASM stuffed with bloatware use of reserved registers


T3sl4co1l

  • Super Contributor
  • Posts: 22388
  • Country: us
  • Expert, Analog Electronics, PCB Layout, EMC
    • Seven Transistor Labs
Yeah, fact 0 to know about avr-gcc is that it generates awful code.  -O0 is especially verbose.  Realize that by setting it, you are telling the compiler that you want everything finalized into memory after every statement: use the initial, most literal, most verbose instruction sequence, don't touch it, leave it perfectly alone.  So it's going to maximize memory loads/stores, and dot all the tees and cross all the eyes, for even the most trivial operation.  And yeah, CBI/SBI isn't part of the standard code generation; it's an optimization.

Even at higher settings, it leaves evidence of internal 16-bit operations all over the place (e.g. trivial register swaps that could've been renamed, word-level granularity only, i.e. no allocating chars to odd registers, sign/zero extensions whose results are never read, etc.), it's completely(?) ignorant of 16+-bit multiplication (it uses library calls), and it just uses that much more code than is needed.
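For reference: on AVRs that have a hardware multiplier, MUL produces an 8x8-to-16-bit product in r1:r0, so a full 16x16-to-32-bit product needs only four MULs plus a few adds. A minimal portable C sketch of that decomposition follows; the function names are made up for illustration, and mul8x8() merely models what the MUL instruction computes.

```c
#include <stdint.h>

/* Models the AVR MUL instruction: unsigned 8x8 -> 16-bit product (r1:r0). */
static uint16_t mul8x8(uint8_t x, uint8_t y)
{
    return (uint16_t)((unsigned)x * y);
}

/* 16x16 -> 32-bit unsigned product from four 8x8 partial products,
   the same shape a hand-written AVR routine takes. */
uint32_t umul16x16_32(uint16_t a, uint16_t b)
{
    uint8_t aL = (uint8_t)a, aH = (uint8_t)(a >> 8);
    uint8_t bL = (uint8_t)b, bH = (uint8_t)(b >> 8);

    uint32_t lo  = mul8x8(aL, bL);                  /* weight 2^0  */
    uint32_t mid = (uint32_t)mul8x8(aL, bH)         /* weight 2^8  */
                 + mul8x8(aH, bL);
    uint32_t hi  = mul8x8(aH, bH);                  /* weight 2^16 */

    return lo + (mid << 8) + (hi << 16);
}
```

The two cross products share the same weight, which is why a hand-coded version can fold them into the same byte lanes with ADD/ADC chains instead of calling a generic 32-bit routine.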

Fortunately the ISA is easy to understand, so if you're checking compiler output anyway, you might as well [re]write it yourself.  The resources I use are:
https://sourceware.org/binutils/docs/as/index.html - assembler manual itself, don't forget to check the avr-specific section
https://gcc.gnu.org/wiki/avr-gcc - now you can predict where variables are put in registers when a function is called / where they're expected on return
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html - the best format for inline asm, since you tell the compiler which operands your code reads, writes and clobbers, so it can allocate registers around it
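As a sketch of the argument-passing rule that the avr-gcc wiki page describes (assuming a non-variadic function on a regular core; as far as I know the reduced avrtiny core uses different rules), the hypothetical helper below computes the first register of each argument from its size in bytes: start at R26, round each size up to an even number, subtract it, and if the result is at least R8 the argument lives there; otherwise it, and everything after it, goes on the stack.

```c
#include <stddef.h>

/* Hypothetical helper modelling the avr-gcc argument rule as described on
   the GCC wiki (non-variadic functions, non-avrtiny cores only).
   sizes[i] is the size in bytes of argument i; first_reg[i] receives the
   register number holding its low byte, or -1 if it is passed in memory. */
static void avr_arg_regs(const unsigned *sizes, size_t n, int *first_reg)
{
    int rn = 26;          /* start just above R25 */
    int in_memory = 0;    /* once one argument spills, all later ones do */

    for (size_t i = 0; i < n; i++) {
        if (in_memory) {
            first_reg[i] = -1;
            continue;
        }
        /* Round odd sizes up to the next even number of bytes. */
        int rounded = (int)(sizes[i] + (sizes[i] & 1u));
        rn -= rounded;
        if (sizes[i] != 0 && rn >= 8) {
            /* Low byte in R(rn), higher bytes in increasing registers. */
            first_reg[i] = rn;
        } else {
            first_reg[i] = -1;
            in_memory = 1;
        }
    }
}
```

For asm_umul16x0p16(uint16_t a, uint16_t b), for instance, this rule puts a in R24:R25 and b in R22:R23.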

If you'd find an example helpful, you may want to take a look at this project:
https://github.com/T3sl4co1l/Reverb/blob/master/asmdsp.S - compare with:
https://github.com/T3sl4co1l/Reverb/blob/master/dsp.c
I wrote this by starting with the C functions, as generated, and optimized them down to less than half the cycle count, I think.  (There are also commented-out versions of helper functions, like mac32p16p16, which I used to break the transition down a bit, from optimized core math to fully optimized functions.)  Overhead on 16-bit MAC-type arithmetic is pretty awful, so this is something of an exaggerated case.  But I was able to do a few things a compiler can't (or that C can't express, or in any case that GCC certainly can't), like using a 24-bit data type in dspReverbTaps.  (I suppose an individual-register-aware compiler would be able to take advantage of the reduced stack usage, but GCC certainly can't.)
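As a portable illustration of why a 24-bit type helps memory footprint (this is not the dspReverbTaps code; the names and the delay-line layout are hypothetical), the sketch below packs signed 24-bit samples into 3 bytes each, 25% less RAM than int32_t, and sign-extends them on load.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed delay line: each sample occupies 3 bytes,
   stored little-endian like AVR multi-byte values. */

/* Store the low 24 bits of v (two's complement) at p[0..2]. */
static void s24_store(uint8_t *p, int32_t v)
{
    uint32_t u = (uint32_t)v;          /* well-defined bit pattern */
    p[0] = (uint8_t)u;
    p[1] = (uint8_t)(u >> 8);
    p[2] = (uint8_t)(u >> 16);
}

/* Load 3 bytes and sign-extend bit 23 into a full int32_t. */
static int32_t s24_load(const uint8_t *p)
{
    uint32_t u = (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    return (int32_t)(u ^ 0x800000u) - 0x800000;
}

/* Sum the samples found at the given tap positions of the delay line. */
static int32_t taps_sum(const uint8_t *line, const uint16_t *taps, size_t ntaps)
{
    int32_t acc = 0;
    for (size_t i = 0; i < ntaps; i++)
        acc += s24_load(line + 3u * taps[i]);
    return acc;
}
```

The XOR-then-subtract trick sign-extends without branches; on the AVR itself the same effect is just a matter of which registers hold the top byte.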

I forget anymore if it was this project or another, but occasionally GCC does produce "perfect" code.  If it's not the ADC interrupt here, it'd be some timer or ADC interrupt in another project I did, that was basically: read registers and put them in memory. Trivial operation, not many ways you can screw it up, and fortunately it didn't.

Incidentally, there is one way to coax better code generation: in another project, I used a struct to hold control loop parameters.  Accessed through a pointer, GCC normally uses offset indirect instructions (LDD) on it, which saves a cycle per access on XMEGA, plus a whole word, over the LDS it would otherwise emit.  This had the form:

Code: [Select]
ctrlState_t ctrlState;  //  max 32-word struct containing controller state
...
ISR(TIMER_OVERFLOW_vect) {
    ctrlRet_t r;  //  couple-word struct
    r = updateCtrl(&ctrlState);
    // set DAC and timer registers to r members
}

To prevent GCC from inlining updateCtrl() or propagating the fixed parameter (&ctrlState) into it, I wrap the target function with:

Code: [Select]
#pragma GCC push_options
#pragma GCC optimize ("-fno-inline")
#pragma GCC optimize ("-Os")
#pragma GCC optimize ("-fno-ipa-cp")
ctrlRet_t updateCtrl(volatile ctrlState_t* s) {
...
}
#pragma GCC pop_options
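GCC 8 and later also document a noipa function attribute that disables interprocedural optimizations (inlining, cloning, constant propagation and so on) for a single function, so it may cover the -fno-inline and -fno-ipa-cp part of the pragma block without push/pop; it does not change the optimization level, so the -Os part is not replicated. The sketch below is hypothetical: the struct members and the control law are invented only so it compiles and runs, since the real ones are elided above.

```c
#include <stdint.h>

/* Hypothetical members; the real struct is elided in the post. */
typedef struct {
    int16_t setpoint;
    int16_t measured;
    int16_t kp;        /* proportional gain, 8.8 fixed point (assumption) */
    int32_t integ;     /* integrator state */
} ctrlState_t;

typedef struct {
    uint16_t dac;
    uint16_t timer;
} ctrlRet_t;

/* noipa exists in GCC >= 8; clang lacks it, so fall back to noinline. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_IPA __attribute__((noinline, noipa))
#elif defined(__GNUC__)
#define NO_IPA __attribute__((noinline))
#else
#define NO_IPA
#endif

ctrlState_t ctrlState;

/* Accesses go through s, so the compiler cannot specialize on &ctrlState. */
NO_IPA ctrlRet_t updateCtrl(volatile ctrlState_t *s)
{
    ctrlRet_t r;
    int16_t err = (int16_t)(s->setpoint - s->measured);
    s->integ += err;
    int32_t out = ((int32_t)s->kp * err) / 256 + s->integ;
    if (out < 0) out = 0;
    if (out > 4095) out = 4095;     /* assume a 12-bit DAC */
    r.dac = (uint16_t)out;
    r.timer = 1000;                 /* placeholder: no timer law is given */
    return r;
}
```

Whether avr-gcc then actually emits LDD/STD for every member would still need checking in the listing.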

That's definitely something a compiler should just know to do, and I mean, most CPUs prefer this motif as well, so beats me how the hell avr-gcc misses out on it.  (This is with avr-gcc 8.1.0, so, hardly an exemplar.)

Speaking of the mul operations, I refined one further in another project:

Code: [Select]
/**
 * Multiplies two 16-bit integers, with rounding, as an intermediate
 * format in 16.16 fixed point, returning the top (integral, 16.0) part.
 */
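uint16_t asm_umul16x0p16(uint16_t a, uint16_t b) {
    uint16_t acc;

    // acc = (((uint32_t)a * (uint32_t)b) + 0x8000ul) >> 16;
    __asm__ __volatile__(
        "mul %A[argB], %A[argA]\n\t"   /* aL*bL: only the high byte (byte 1) matters */
        "mov r19, r1\n\t"
        "mul %B[argB], %B[argA]\n\t"   /* aH*bH -> bytes 2..3 */
        "movw %A[aAcc], r0\n\t"
        "mul %A[argB], %B[argA]\n\t"   /* aH*bL -> add into bytes 1..3 */
        "add r19, r0\n\t"
        "adc %A[aAcc], r1\n\t"
        "eor r18, r18\n\t"             /* r18 = 0; EOR leaves C untouched */
        "adc %B[aAcc], r18\n\t"
        "mul %B[argB], %A[argA]\n\t"   /* aL*bH -> add into bytes 1..3 */
        "add r19, r0\n\t"              /* ADD, not ADC: MUL sets C from bit 15 */
        "adc %A[aAcc], r1\n\t"
        "adc %B[aAcc], r18\n\t"
        "subi r19, 0x80\n\t"           /* round: byte 1 += 0x80 ... */
        "sbci %A[aAcc], 0xff\n\t"      /* ... and ripple the carry up */
        "sbci %B[aAcc], 0xff\n\t"
        "eor r1, r1\n\t"               /* restore __zero_reg__ */
        : [aAcc] "=&d" (acc)           /* upper register pair, needed by SBCI */
        : [argA] "r" (a), [argB] "r" (b)
        : "r18", "r19"
    );

    return acc;
}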

So that's taking advantage of the extended syntax.  (The r18/r19 clobbers, I think, could be replaced with scratch registers chosen by the compiler?  Or maybe I tried that and it didn't take, as I don't quite understand the syntax.  And I don't claim that the above is correct in all contexts.  It seems to work in a few; I certainly don't have exhaustive unit tests to prove otherwise.)
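One way to get such unit tests without hardware is a host-side C model of the same byte-lane sequence, compared against the reference expression in the function's comment. The sketch below is hypothetical (model_umul16x0p16 and ref_umul16x0p16 are invented names). It uses a plain add for both cross products; on the AVR, MUL sets C from bit 15 of its result, so an ADC placed directly after a MUL would fold that bit in as a spurious carry.

```c
#include <stdint.h>

/* Host-side model of asm_umul16x0p16(): same byte lanes, same carries. */
uint16_t model_umul16x0p16(uint16_t a, uint16_t b)
{
    uint8_t aL = (uint8_t)a, aH = (uint8_t)(a >> 8);
    uint8_t bL = (uint8_t)b, bH = (uint8_t)(b >> 8);
    unsigned p, c;
    uint8_t r19, accL, accH;

    p = (unsigned)bL * aL;          /* byte 0 is dropped, byte 1 kept */
    r19 = (uint8_t)(p >> 8);

    p = (unsigned)bH * aH;          /* movw: bytes 2..3 */
    accL = (uint8_t)p;
    accH = (uint8_t)(p >> 8);

    p = (unsigned)bL * aH;          /* first cross product */
    c = r19 + (p & 0xFFu);      r19  = (uint8_t)c; c >>= 8;
    c = accL + (p >> 8) + c;    accL = (uint8_t)c; c >>= 8;
    accH = (uint8_t)(accH + c);

    p = (unsigned)bH * aL;          /* second cross product, same chain */
    c = r19 + (p & 0xFFu);      r19  = (uint8_t)c; c >>= 8;
    c = accL + (p >> 8) + c;    accL = (uint8_t)c; c >>= 8;
    accH = (uint8_t)(accH + c);

    c = (r19 + 0x80u) >> 8;         /* rounding: keep only the carry out */
    c = accL + c;               accL = (uint8_t)c; c >>= 8;
    accH = (uint8_t)(accH + c);

    return (uint16_t)((accH << 8) | accL);
}

/* Reference expression from the comment in the original function. */
uint16_t ref_umul16x0p16(uint16_t a, uint16_t b)
{
    return (uint16_t)((((uint32_t)a * b) + 0x8000ul) >> 16);
}
```

A strided sweep catches most lane/carry mistakes quickly; a full sweep of all 2^32 input pairs should also be feasible on a desktop machine, just slower.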

Tim
« Last Edit: July 21, 2024, 05:03:26 am by T3sl4co1l »
Seven Transistor Labs, LLC
Electronic design, from concept to prototype.
Bringing a project to life?  Send me a message!
 

RoGeorge (topic starter)

  • Super Contributor
  • Posts: 6715
  • Country: ro
Quote
The funny thing is, the most popular AVR, the "Arduino UNO" is typically using an ATmega48 MCU, which is debugWIRE capable

Quote
That would be very strange. I've never seen one of those Unos. People buying an Uno expect 2k SRAM and 32k flash in an ATmega328.

Doh, I was thinking ATmega328, fixed the typo, thanks.  :)

coppice

  • Super Contributor
  • Posts: 9386
  • Country: gb
Quote
Currently debuggers are really bad at following these, so they can't tell you the value of various things when you stop the code to see what's up.

Quote
That shouldn't be the case, and I haven't seen that myself. If the variable is live then the debugger most certainly should be able to tell you its value, even if it is in different registers (or RAM) at different points. That's what those fancy debug metadata files are for.

Can you name a tool chain that has a robust ability to show the values of variables when they still exist in a register at the time of inspection, so you can hit a breakpoint when something interesting happens, and see what is going on?

Quote
The only exception is if you're trying to look at a stale value of a variable after its last use and the compiler reuses that register for a different variable. In that case the variable literally doesn't exist and the debugger should be telling you the variable is out of scope, or display a dash or something, not show you an incorrect value. But if you're stopped at (or before) an instruction that uses that variable then the debugger definitely should be showing you its value.

But you just said it should work. Now you are getting more realistic. It usually doesn't. The more effective the compiler gets, the worse the debugging issue gets. If the problem you are looking at is not in a time-critical area, and you can tolerate the code size increase, dropping to -O0 is usually the quickest way to investigate.

Quote
Complaining that a debugger doesn't show you the value of a variable after its last use is like complaining that the debugger can't show you the values variables had in the previous N loop iterations :-)  (If you want that, btw, then use nontrivial recursion, not a loop.)

I'm not strongly complaining. I understand why the problem exists, and I know in the general case it is unavoidable. However, it's frustrating to see a value of interest still in its register when you probe deeply enough to identify it, and the debugger hasn't followed it and won't display it. I'm pointing out that -O0 is a highly valuable tool for debugging code where time and size constraints permit. That turns out to be much of the time, once you strip things back to just the code of interest.

Quote
I guess that's another advantage of printf debugging. The printf is a use of the variable and forces the compiler to preserve its value at least until that point.

printf is for big-machine debugging. It's generally unusable for deeply embedded work. There is nowhere to output to without major changes to what is under test. In small MCU applications you are lucky to have a spare pin that some debug code can wiggle when something interesting happens.
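A related trick, assuming a GCC-compatible compiler: an empty extended-asm statement that takes the variable as an "r" input also counts as a use of the value, so the compiler has to materialize it in a register at that point, while the asm itself emits no instructions and needs no output channel. DBG_KEEP and sum_scaled() are hypothetical names, and whether the debugger then actually displays the value still depends on the debug info the compiler emits.

```c
#include <stdint.h>

/* Hypothetical helper: an empty asm that takes x as a register input.
   The compiler must have x's value in a register here, but the asm
   itself emits no instructions. */
#if defined(__GNUC__)
#define DBG_KEEP(x) __asm__ __volatile__("" : : "r"(x))
#else
#define DBG_KEEP(x) ((void)(x))
#endif

/* Example: keep each intermediate product observable at a breakpoint. */
uint32_t sum_scaled(const uint16_t *v, int n, uint16_t k)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++) {
        uint32_t term = (uint32_t)v[i] * k;
        DBG_KEEP(term);      /* 'term' is materialized in a register here */
        acc += term;
    }
    return acc;
}
```

Unlike a printf, this changes timing by at most a few register moves, so it can stay in while chasing a time-critical bug.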
 

tggzzz

  • Super Contributor
  • Posts: 20564
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Quote
Can you name a tool chain that has a robust ability to show the values of variables when they still exist in a register at the time of inspection, so you can hit a breakpoint when something interesting happens, and see what is going on?

I haven't spotted any problems with xTIMEComposer, an Eclipse-based IDE. Naturally I may not have been expecting as much as you.

More of a problem is setting the breakpoint uniquely/accurately in highly optimised (i.e. completely reordered) compiled code.

Quote
printf is for big-machine debugging. It's generally unusable for deeply embedded work. There is nowhere to output to without major changes to what is under test. In small MCU applications you are lucky to have a spare pin that some debug code can wiggle when something interesting happens.

I use printf to dump data up a USB link to a PC. Provided the volume isn't too much for the USB link, the hard real-time timing remains unchanged, since all of that runs on a different core.

However, wiggling pins is just about the first thing I do in a new environment, to verify that (1) it is working and (2) I am using it efficiently. Don't see how you could attempt real-time embedded without it :)
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
 

DiTBho

  • Super Contributor
  • Posts: 4230
  • Country: gb
Quote
printf is for big-machine debugging. It's generally unusable for deeply embedded work

No doubt, and printf is also an elephant!

However, the idea is good, especially the fact that it allows you to create an observable copy of a variable, effectively passed as a parameter:

         uint32_t speed;
         ...
         while(is_ok)
         {
               ...
               speed=imu_speed_get(p_obj);
               dbg_uint32_out(speed); /* debug point <---- observable copy, effectively passed as a parameter, and sent to the host */
               ...

and therefore not affected by how the compiler reuses registers.  And as I wrote before, in the embedded field you can use a channel left open by the ICE and map it into the peripheral space; that way you can write a family of functions
           dbg_${data_type}_out(${data_type} value)
           data_type={ uint32_t , sint32_t, string, char_t, ...}
for a char-device-like channel mapped onto the ICE.

I think it's good because it takes far fewer resources than printf, and still allows you to observe variables.

I used this trick to develop the firmware for an IMU on modern Cortex ARM  :o :o :o :o
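A minimal sketch of such a family, assuming a byte-wide output channel: here dbg_putc() appends to a host buffer so the code runs anywhere, whereas on a real target it would write to whatever (hypothetical) memory-mapped data register the ICE exposes. The one-byte type tags and the little-endian layout are assumptions, chosen so the host side knows how to decode each record.

```c
#include <stddef.h>
#include <stdint.h>

/* Host stand-in for the ICE channel. On a target, dbg_putc() would write
   one byte to the ICE-mapped data register instead (address not given
   here; it depends on the ICE and the part). */
static uint8_t dbg_log[64];
static size_t dbg_len;

static void dbg_putc(uint8_t c)
{
    if (dbg_len < sizeof dbg_log)
        dbg_log[dbg_len++] = c;
}

/* One-byte type tag so the host side knows how to decode what follows. */
enum { DBG_TAG_U32 = 0x01, DBG_TAG_S32 = 0x02 };

static void dbg_put_u32_le(uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        dbg_putc((uint8_t)(v & 0xFFu));
        v >>= 8;
    }
}

void dbg_uint32_out(uint32_t value)
{
    dbg_putc(DBG_TAG_U32);
    dbg_put_u32_le(value);
}

void dbg_sint32_out(int32_t value)
{
    dbg_putc(DBG_TAG_S32);
    dbg_put_u32_le((uint32_t)value);   /* two's-complement bit pattern */
}
```

Each call costs five byte writes and no formatting, which is the resource advantage over printf; the value reaches the host even if the compiler reuses its register right afterwards.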
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Nominal Animal

  • Super Contributor
  • Posts: 6851
  • Country: fi
    • My home page and email address
If you use GDB for debugging, note that it supports custom commands via plug-ins written in Python.

I investigated this myself a decade ago, and wrote an example to save or display C pointer-based binary trees in Graphviz DOT format.  It adds a tree command, which takes some optional parameters in name=value format (specifying the left and right child pointer member names for example) and the name of the variable pointing to the root node of the binary tree.

Browsing through the GDB Python API is very informative, and can make debugging much nicer, even on AVRs with JTAG via avarice and GDB.
 

