Quote: 1) Caches are a moot point. They tend to be disabled by default. Just don't enable them.
You think so, eh? I'm talking less about formal "cache memory that you need to enable" and more about things like the "2*64bit prefetch buffer" (stm32f1xx) or the "Enhanced flash memory accelerator" (LPC176x). Your lovely single-cycle 120MHz RISC CPU isn't going to run so well if every instruction takes 5 additional cycles to fetch from the flash program memory (SAMD51; 50ns access times for flash seem to be "typical").
As said earlier, run timing-critical routines from core-coupled instruction memory (ITCM), and put the stack in core-coupled data memory.
In a typical mid-scale MCU project, I may have 30-50K of program in total, of which about 2-3K is timing critical, and that always fits in ITCM, even on small/cheap devices.
The absolute smallest and cheapest parts won't have core-coupled RAMs, of course.
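For illustration, here is a minimal sketch of how a routine is usually steered into ITCM with GCC. The section name ".itcm_text" and the function on_pulse() are made up for this example; the name has to match an output section that your linker script maps onto ITCM, and the startup code has to copy the section from flash into ITCM before first use. Stack placement is done in the linker script as well and is not shown here.

```c
#include <stdint.h>

/* Hypothetical section name: it must match an output section that the
 * project's linker script places in ITCM. noinline keeps the compiler
 * from folding the routine into a caller that still runs from flash. */
#define ITCM_CODE __attribute__((section(".itcm_text"), noinline))

/* State touched by the fast path. */
static volatile uint32_t pulse_count;

/* Timing-critical routine, intended to be fetched from ITCM instead of
 * from flash, so flash wait states do not apply to it. */
ITCM_CODE uint32_t on_pulse(uint32_t increment)
{
    pulse_count += increment;
    return pulse_count;
}
```

Everything that is not marked this way stays in flash as usual, which is how 2-3K of critical code can sit in ITCM while the rest of the 30-50K does not.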
A 12-cycle interrupt latency, including register push, at a clock of at least 48MHz, and possibly 400MHz-500MHz on more expensive ARM MCUs, is way faster than a 5-cycle interrupt latency plus about 2-4 cycles of software latency for a minimum stack push (1-2 registers) in most actual ISRs, run at 20MHz on an AVR.
Both are quick enough for most situations.
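To put numbers on that comparison, here is a small sketch that converts the cycle counts quoted above into nanoseconds. The function names are invented for this example, and the cycle counts are simply the ones from this post, not datasheet figures for any specific part.

```c
#include <stdint.h>

/* Convert a cycle count at a given core clock (in MHz) to nanoseconds. */
double cycles_to_ns(uint32_t cycles, double clock_mhz)
{
    return (double)cycles * 1000.0 / clock_mhz;
}

/* ARM figure from the post: 12 cycles, register push included. */
double arm_entry_ns(double clock_mhz)
{
    return cycles_to_ns(12u, clock_mhz);
}

/* AVR figure from the post: 5 cycles of hardware latency plus
 * 2 to 4 cycles of software register push, at 20 MHz. */
double avr_entry_ns(uint32_t push_cycles)
{
    return cycles_to_ns(5u + push_cycles, 20.0);
}
```

With these figures the ARM part needs 250 ns at 48MHz and 30 ns at 400MHz, while the AVR needs 350-450 ns at 20MHz.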
Also, interrupt latency needs to be calculated up to the point where execution has actually produced the result you need. Sometimes that is the first instruction in the ISR (e.g., safety turn-off of a single signal, responding to an analog comparator event); sometimes it may be 10-20 instructions later (e.g., checking for the actual reason for the interrupt, then turning off several signals). Overall quick execution really helps here, as do more flexible GPIOs that allow resetting and setting any arbitrary number of bits on a PORT with a single register access, which is typical of modern ARM controllers.
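As a sketch of that last point, here is a host-side model of an STM32-style bit set/reset register (BSRR), where the low 16 bits set pins and the high 16 bits reset pins in one write. The helper names bsrr_word() and bsrr_apply() are invented for this example, and other vendors lay out their set/clear registers differently.

```c
#include <stdint.h>

/* Build a BSRR-style word: low half sets pins, high half resets pins. */
uint32_t bsrr_word(uint16_t set_mask, uint16_t reset_mask)
{
    return (uint32_t)set_mask | ((uint32_t)reset_mask << 16);
}

/* Host-side model of what the port hardware does with that word.
 * On STM32 parts, the set request wins if a pin appears in both halves. */
uint16_t bsrr_apply(uint16_t odr, uint32_t word)
{
    uint16_t set_mask = (uint16_t)(word & 0xFFFFu);
    uint16_t reset_mask = (uint16_t)(word >> 16);
    return (uint16_t)((odr & (uint16_t)~reset_mask) | set_mask);
}
```

On real hardware the model function is not needed: the ISR does a single store of the prepared word to the port's set/reset register (with the usual STM32 device headers, something like GPIOA->BSRR = word), with no read-modify-write and no need to mask interrupts around it.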