Even on NXP i.MX RT1062 (as used on Teensy 4.x) one has to do a manual cache flush between DMA and non-DMA accesses to the same memory. It's not too expensive, but it is basically impossible to optimize at the machine code generation level. (Teensyduino Core for these does such flushes around DMA operations, which isn't optimal, but not too bad either.)
In other words, some features cannot be automated, as they depend on programmer intent and are not really decidable by the compiler. (A single-threaded interpreter could track it and only do a cache flush when needed, but it turns out that on these chips, the cache flush is much cheaper than tracking all accesses.)
_ _ _
There is one very useful intermediate representation detail that nicely illustrates the subtle constraints (as mentioned by SiliconWizard) an IR may put on language implementations, somewhat related to the above: data-parallel loop optimizations.
Backends that support Fortran typically support at least two types of loops with completely different semantics: FOR loops, with semantics and requirements similar to those of normal C/C++ loops (essentially sequential data loops), and FORALL loops for parallel data loops, with their associated limitations and semantics. Essentially, a FORALL loop has no predetermined order of iterations: they conceptually occur in parallel, with an arbitrary or undefined order of side effects (like the memory access pattern). In particular, FORALL loop bodies cannot refer to results calculated by other iterations of the same loop.
If your backend IR supports both data-sequential "FOR" loops and data-parallel "FORALL" loops, even if your language does not have a similar concept, it is often possible to convert data-sequential loops into data-parallel loops as an algorithmic optimization, when the language constructs used within the loop body permit such conversion. In practice, the handling of the iterator variable and the iteration direction for "FORALL" loops varies depending on the target architecture, and is implemented using a machine code pattern best suited for that type of instruction set architecture.
Taking a wider view, this difference between FOR and FORALL loops is essentially algorithmic (data-sequential versus data-parallel loops), and if the IR does not support data-parallel loops, achieving similar benefits would require converting the loop representation into whatever form best suits each target architecture, which is quite a lot of effort.
To circle back to the things like cache flush patterns that compilers may not in all cases be able to deduce for themselves, the FOR/FORALL difference is opposite: it is not too difficult for the compiler to determine if the data references in the loop body require sequential data access, or whether the data accesses would still be valid if they were done all in parallel, and thus determine when FOR loops can be turned into FORALL loops.
Thus, it is not necessary for the programming language itself to differentiate between FOR and FORALL loops (even I can argue both for and against!), but to truly leverage this, the compiler must understand the difference between data-sequential and data-parallel loops, and the IR/backend must support both.
_ _ _
In high-performance computing, Fortran is still used because it is easier to write efficient computation in Fortran than in C or C++. Comparing microbenchmarks that exhibit the differences, and the machine code generated for them, the main reasons turn out to be surprisingly few, and all involve arrays: efficient array slicing, data-parallel loops, and filling an array with a constant by duplicating the initial member without interpreting its value.
On x86-64, data-parallel loops involving floating-point data are particularly easy to vectorize, leading to more efficient machine code. To achieve the same in C or C++, one needs to write the loop in a very specific form (specific to the compiler version), or use vector intrinsics provided by the compiler (<immintrin.h> for x86 and x86-64).
It is exactly for this kind of thing that I insist one must look at the machine code generated, and at wildly different programming languages, when designing a new language. True efficiency requires understanding and applying many such concepts and details, and I personally have not yet seen any books or articles that describe many of them – in particular because single-instruction, multiple-data (SIMD) vectorization is quite "new" in computer science terms, and a lot of the findings are still in peer-reviewed articles, not yet collected and their meaning distilled into books describing the best current understanding and approaches.
Concepts like the difference between FOR and FORALL loops (or rather, data-sequential and data-parallel loops) definitely look insignificant on the surface. But when you look into how they actually affect the code generated for different hardware architectures, and how much they help with SIMD vectorization (so much so that I personally don't think SIMD vectorization outside FORALL/data-parallel loops makes much sense!), you realize they can make a very significant difference on certain architectures: we're talking 2× to 8× computational performance in some cases.
And while FOR/FORALL is something the compiler can handle by itself without the language requiring explicit constructs, some other constructs, like cache flushes and even customized memory prefetch patterns, may have to be exposed to the human developer. This means that a "single universal language fit for all purposes" is not only unrealistic, it would have very suboptimal performance. There are not that many situations where one is willing to give up, say, nine tenths of the performance just to achieve exactly predictable behaviour; personally, I care about the limits of the behaviour, and need those to be well defined, but do not care at all about the distribution of the behaviour as long as it stays within those known limits.