Well, microcontroller projects (the ones that actually control things, using peripherals etc.) just are not portable; we have to accept this. By trying to make them portable, we either sacrifice so much of the MCU's capability that we simply say "no" to the projects - or use FPGAs to implement them. Or we write and verify so much extra abstraction that the original project could have been manually ported ten times over by the time the "portable" project is finished.
IMHO, the key to success is not to strive for an ideal world that does not exist in MCUs. Once you accept this, you can leverage the features that are there, at low cost, and save a lot of time and money compared to rolling out an FPGA design (or using an esoteric, vendor-lock-in solution like the XMOS).
You need to make compromises regarding idealism. You have to accept that -O0 is not supposed to produce a working project, so... just don't use it. You need to rely on compiler optimizations, but only to a point, not for cycle accuracy. If you need cycle accuracy, you are in a special case, and you need to prove to yourself and to others that the other ways of doing it are even more difficult or expensive.
A simple example: you need to react to an analog event within 100 clock cycles, by setting a pin high. You set up the comparator registers, write the interrupt address to the vector table, enable the interrupt at highest priority, and as the very first operation in the ISR, write to the GPIO register. 10 minutes of work. You test it, and it works perfectly, as expected.
Then you start to think about it. Interrupt entry latency is 12 cycles. Does the GPIO operation require loading an address into a register, from program memory? Am I running code out of flash, and if yes, is this part of program memory beyond the prefetch range of the flash? Heck, even if the code is in ITCM, is the vector table in flash? If it is, does the core load the vector address in parallel with stacking the registers? Probably yes, but do I need to fully read the Cortex-M-whatever manual every time I do this?
And at some point, thinking about it changes to overthinking about it. The threshold depends on the margin you originally had.
But in the end, having the port register qualified volatile also means the compiler cannot reorder the ISR so that the port write ends up last. The compiler is of course allowed to insert an unnecessary calculation of Pi before it, but why would it do that?
Finally, you measure the latency with an oscilloscope and see that it's actually taking 17 cycles +/- 1 cycle of jitter, and once a year, when a DMA transfer is triggered during a full moon and Michael Jackson's Thriller is playing on the radio, it has 3 cycles of jitter(!!).
Now, tggzzz would say you have proven nothing. But realistically, what are the chances that this breaks down beyond 100 clock cycles when the GCC version updates?
I don't know. At the same time, I get work done. And so does everybody else who works like this. And I have never, ever in my life had an issue where a high-priority interrupt execution would have significantly changed in timing due to some seemingly unrelated change. A few cycles, sure!
And quite frankly, keeping a fixed GCC version for the duration of an embedded MCU project is the sane thing to do. This isn't high-performance desktop computing requiring security updates. If the original microcontroller chip stays the same, and the code is verified to work within specification with good margins, why would you suddenly update to a new compiler version during production?
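One cheap way to enforce that is a pre-build guard that refuses to build with anything but the verified toolchain. A sketch (the compiler name and version number are placeholders for your project's own):

```shell
#!/bin/sh
# Hypothetical pre-build guard: fail early if the compiler is not the
# exact version the firmware was verified with.
check_toolchain() {
    expected="$1"
    actual="$2"
    if [ "$actual" != "$expected" ]; then
        echo "compiler is $actual, project verified with $expected" >&2
        return 1
    fi
    echo "toolchain ok: $actual"
}

# In a real project you would call something like:
#   check_toolchain "10.3.1" "$(arm-none-eabi-gcc -dumpversion)" || exit 1
# Demo invocation with matching placeholder versions:
check_toolchain "10.3.1" "10.3.1" || exit 1
```

Hook it into the Makefile as the first dependency of the build target and a drive-by toolchain upgrade becomes a loud build failure instead of a silent timing change.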