letting it pass with bandaids (like, oh, it appears to work with -O0) doesn't sound good.
As my project stands right now, the only -O0 bits are specific functions which are in a "boot block", have no access to stdlib, and contain loops which were getting replaced with memcpy() etc. Now, I know this can be blocked (and has been, with -fno-tree-loop-distribute-patterns), but I left these at -O0 so I don't have to re-test them, and in case that compiler option gets dropped one day by accident.
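For illustration, a minimal sketch of pinning one function to -O0 in the source, so the setting survives even if the command-line flag disappears. It assumes GCC (the optimize attribute is GCC-specific); the macro name BOOT_NO_OPT and the function boot_copy are made up for this example and are not from the project.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC only: the optimize attribute overrides the command-line -O level for
 * one function. On other compilers the macro expands to nothing. */
#if defined(__GNUC__) && !defined(__clang__)
#define BOOT_NO_OPT __attribute__((optimize("O0")))
#else
#define BOOT_NO_OPT
#endif

/* Hypothetical boot-block helper: a plain byte copy that must stay a loop,
 * because memcpy() is not linked into the boot block. */
BOOT_NO_OPT
void boot_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    while (n--) {
        *dst++ = *src++;
    }
}
```

The point of doing it per function is that the protection travels with the code rather than with the makefile.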
may be in 3rd-party libraries that you seem to be very reliant on, and I don't blame you for not willing (/having time) to debug other people's code. But sometimes you just don't have a choice.
I now use very few of the bloated ST HAL functions. I've been busy either removing these or replacing them with local versions, stripped down to only the bare minimum required. The SPI code has been eliminated and replaced mostly with a single generic DMA function.
and optimized builds for releases.
That is commonly done, but the amount of testing involved (this is now a complex product, with 2 years of my code, plus ETH, LWIP, MBEDTLS, USB CDC & MSC, FATFS, http server, https client, etc) is more than I want to do, and -Og is so damn close to the other optimisation levels that it is hardly worth the time.
And since, as they say, 99% of CPU time is spent in 1% of the code, one can always put a -O3 attribute on a specific function. But, as with assembler, the biggest speedups come not from making code faster but from doing it differently altogether. As I posted before, I once sped up an IAR Z180 float sscanf about 1000x by specialising it for the actual input format, which was always xx.yyyy (it was an HPGL to Postscript converter). I also wrote it in assembler, but that was secondary. The biggest speedups come from cunning use of hardware, e.g. I have a tracking waveform generator which runs almost no software (just DMA, timers, etc).
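To make the "do it differently" point concrete, here is a rough sketch of a parser specialised for the fixed xx.yyyy format, in C rather than the original Z180 assembler. It is not the original code: the function name parse_fixed_2_4, the HOT_O3 macro and the choice to return the value as an integer count of ten-thousandths are all assumptions made for this example. The per-function -O3 attribute is GCC-specific.

```c
#include <stdint.h>

/* GCC only: raise the optimisation level for this one hot function. */
#if defined(__GNUC__) && !defined(__clang__)
#define HOT_O3 __attribute__((optimize("O3")))
#else
#define HOT_O3
#endif

/* Parse exactly "dd.dddd" (two digits, a dot, four digits).
 * On success stores the value scaled by 10000 in *out and returns 0.
 * Returns -1 if the text does not match the format. */
HOT_O3
int parse_fixed_2_4(const char *s, int32_t *out)
{
    /* Decimal weight of each character position; index 2 is the dot. */
    static const int32_t weight[7] = {100000, 10000, 0, 1000, 100, 10, 1};
    int32_t acc = 0;

    for (int i = 0; i < 7; i++) {
        if (i == 2) {
            if (s[i] != '.') {
                return -1;
            }
            continue;
        }
        /* Anything that is not '0'..'9' (including the terminating NUL)
         * becomes a value above 9 after the unsigned conversion. */
        unsigned d = (unsigned)(s[i] - '0');
        if (d > 9u) {
            return -1;
        }
        acc += (int32_t)d * weight[i];
    }
    *out = acc;
    return 0;
}
```

A caller that needs a float can divide the result by 10000.0f once; the saving comes from skipping the general format-string machinery, not from the optimisation level.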
Some HAL code does not work with -O3, but I don't think I have any of that now. A lot of it was in the "min CS=1 time" department, where if you called 2 functions in rapid succession, the time between them was too short. There is also an awful lot of stuff on GitHub which was somebody's "work in progress on some 16MHz AVR, before he got bored" and which runs at 168MHz only by luck, especially if it involves driving chips like SPI FLASH chips. Basically there is a lot of software out there which has to be used with great caution. I spent much of yesterday digging into some 3-year-old code driving an STLED316 display driver, which falls over unless you have a ~10us gap between setting the digit cursor and sending the digit data. So I made it 20us and used a dedicated timing function (written in asm) to do it. It previously worked by accident.
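One way to get a gap like that 20us which does not depend on the optimisation level is to wait on a free-running cycle counter instead of counting loop iterations. The sketch below is a different technique from the asm timing function mentioned above, not a copy of it. All names are made up for this example, and the counter is passed in as a function pointer so the logic can be tried on a PC; on a Cortex-M4 the source could be the DWT cycle counter, which is an assumption about the target.

```c
#include <stdint.h>

/* Hypothetical: returns a free-running 32-bit cycle count. */
typedef uint32_t (*cycle_source_fn)(void);

/* Convert microseconds to core clock cycles, rounding up so the delay is
 * never shorter than requested. */
uint32_t us_to_cycles(uint32_t core_hz, uint32_t us)
{
    return (uint32_t)(((uint64_t)core_hz * us + 999999u) / 1000000u);
}

/* Busy-wait until at least 'cycles' counts have elapsed. The unsigned
 * subtraction stays correct when the counter wraps past 0xFFFFFFFF. */
void delay_cycles(cycle_source_fn now, uint32_t cycles)
{
    uint32_t start = now();
    while ((uint32_t)(now() - start) < cycles) {
        /* spin */
    }
}

/* Host-side stand-in for the hardware counter: advances 7 counts per read. */
uint32_t fake_count;

uint32_t fake_cycle_source(void)
{
    fake_count += 7u;
    return fake_count;
}
```

At 168MHz, us_to_cycles(168000000u, 20u) gives 3360 cycles. Because the wait is measured against the counter, the gap stays the same whether the surrounding code is built with -O0, -Og or -O3.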