Many "mid-range" chips use quite wide words for their internal flash, like 64 or 128 bits, meaning they fetch multiple instructions at once. Usually with these mid-range parts the flash runs at around 1/4 of the CPU frequency, so if you fetch 4 instructions at a time, there are no wait states... in theory, as long as there are no jumps in the code.
This may or may not be called some kind of "accelerator" or "prefetch" in the marketing material; it's really just a very trivial, tiny special case of a cache.
In practice with mid-range STM32s (F3, F4) with such a prefetch, I tend to see roughly a 30% penalty from running directly out of flash compared to running from the core-coupled instruction RAM region, which is usually not too bad.
Caches kind of suck in the microcontroller world, because you often want repeatability and predictability: worst-case performance, not average performance. Providing separate "scratchpad" RAM areas, i.e., separate RAM banks with their own interface directly on the side of the CPU core, does a much better job, so many MCUs provide those instead of, or in addition to, caches.
In any case, an instruction cache is not that bad, because you typically don't change the code on the fly, i.e., code is read-only, so there are no coherency issues and cached contents never go stale.
Data caching comes with a bunch of problems, and it's still rare in the microcontroller world. I would rather see a larger number of core-coupled memories, so you can put all the timing-critical data there and know it's always accessed in one cycle.