On modern MCUs with caches and branch prediction and whatnot, those delay loops are virtually useless...
I don't understand this. I have done quite a bit of work with modern MCUs with caches, and they mostly come with a core-coupled instruction scratchpad, so I simply put all even remotely timing-critical code there. It has the same performance as having everything 100% in cache with no misses, and it is predictable, including the first iteration.
The branch in a simple delay loop should be trivially predictable as well.
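For concreteness, here's roughly what "just put it there" looks like for me - a sketch for GCC on a Cortex-M7-style part with ITCM, assuming the project's linker script defines an ".itcm" output section mapped to the instruction TCM and the startup code copies it over from flash (both of those are project-specific assumptions, not something the toolchain gives you for free):

```c
/* Sketch only: GCC, Cortex-M7-style MCU with instruction TCM. The ".itcm"
 * section name and the startup copy from flash are assumptions about the
 * project's linker script, not a universal API. */
#include <stdint.h>

__attribute__((section(".itcm"), noinline))
void toggle_pin_n_times(volatile uint32_t *gpio_odr, uint32_t pin_mask,
                        uint32_t n)
{
    /* Runs entirely out of instruction TCM: no flash wait states and no
     * cache misses, so the first iteration costs the same as the rest. */
    while (n--) {
        *gpio_odr ^= pin_mask;
    }
}
```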
IMO, caches are there to increase the average performance of large routines when you run out of small core-coupled RAM and have to run "directly" out of flash, or, worse, out of an external SD card or similar. But this doesn't matter much for small timing-critical routines (which the delay loops obviously are) - just run them out of instruction RAM.
I have never turned caches on in an MCU project; turns out, I can always fit all timing-critical processing in tightly coupled RAM, and the rest can be "slow" from flash.
But even if you "have" to run it from cache, the effect is a small offset at the start, depending on whether the first iteration misses or not. Assuming you still run from the internal flash, the difference is a flash access cycle or two, maybe about 20 ns.
And I'm using delay loops extensively. Of course they aren't good for precision timing in the presence of interrupts, but neither are timer-based busy loops or interrupt handlers. Doing it accurately in a system with existing interrupts requires a big-picture understanding, i.e., setting your interrupt priorities right while making sure nothing breaks in edge cases.
Using a simple delay loop to implement something at least won't put the existing interrupts at risk, and it's obvious that the delays will come out longer than specified depending on the interrupt load. This is much easier than adding a new interrupt handler, configuring the pre-emptive priorities correctly, and still ending up with jitter in the less important task.
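For reference, the kind of delay loop I'm talking about is nothing more than this - a sketch for a Cortex-M core built with GCC, where the 48 MHz clock and the 3-cycles-per-iteration figure are assumptions that have to be calibrated once against the actual core and the memory the loop runs from:

```c
/* Minimal sketch of a calibrated busy-wait delay. CPU_HZ and
 * CYCLES_PER_LOOP are assumptions for this example; measure them once
 * (scope or cycle counter) for your core and memory. */
#include <stdint.h>

#define CPU_HZ           48000000u   /* assumed core clock */
#define CYCLES_PER_LOOP  3u          /* assumed cost of SUBS + BNE */

static inline void delay_cycles(uint32_t loops)
{
    __asm__ volatile (
        "1: subs %0, %0, #1 \n"
        "   bne  1b         \n"
        : "+r" (loops) : : "cc");
}

static inline void delay_us(uint32_t us)
{
    delay_cycles((CPU_HZ / 1000000u / CYCLES_PER_LOOP) * us);
}
```

Interrupts can only ever stretch such a delay, never shorten it, which is exactly the behaviour I'm describing above.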