It is very difficult to implement the Principle of Least Privilege at the function level with linear address spaces.
We'd really need hardware specifically oriented towards nonlinear address spaces, perhaps via segmented memory where each pointer is a (segment, offset) tuple and any region/area/array is a (segment, offset, length) tuple, with each segment having its own access protections. Virtual memory paging would be implemented on a per-segment basis, with even DMA using segment identifiers. This would also help with memory fragmentation, which is an issue both with linear address spaces and with address-randomization schemes intended to hinder exploits. (The corresponding scheme here would be randomized segment identifiers, as opposed to consecutive ones. Efficient hardware segment identifier lookup is a bit of a problem, though; we do need it to be O(1) and fast. The 80386 scheme of segment identifiers being indexes into a lookup table does not really work; they're too predictable.)
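To make the tuple idea concrete, here's a rough C sketch of what such references could look like in software. All names are mine, and the flat backing buffer just stands in for the segment's physical memory; on real hardware the bounds/segment check would happen on every access and raise a fault instead of returning NULL:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "far pointer": every address is a (segment, offset) tuple. */
typedef struct {
    uint32_t segment;   /* randomized segment identifier */
    uint32_t offset;    /* offset within the segment */
} seg_ptr;

/* A region/area/array adds a length, so bounds travel with the reference. */
typedef struct {
    uint32_t segment;
    uint32_t offset;
    uint32_t length;    /* in bytes */
} seg_region;

/* What the hardware would check on every access: the accessed range must
 * fall inside the region. Returns a plain pointer into a backing buffer
 * that simulates the segment's memory, or NULL where hardware would fault. */
static void *seg_access(seg_region r, uint32_t off, uint32_t size,
                        uint8_t *backing)
{
    if (off > r.length || size > r.length - off)
        return NULL;                 /* hardware would raise a fault here */
    return backing + r.offset + off;
}
```

The point is that the length is part of the reference itself, so out-of-bounds accesses are detectable at the access site, not just at segment boundaries.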
One model I've been thinking about (just as a mental exercise) is having each stack frame be a separate segment in a chain of segments. Each subroutine call machine instruction would have a separate bit indicating whether the current stack frame is locked against writes for the duration of the call.
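The lock-on-call idea can be crudely approximated today with page protection, which gives a feel for what the hardware bit would do. This POSIX sketch (names are mine; siglongjmp out of a SIGSEGV handler is a common but somewhat platform-dependent trick) write-protects a page standing in for the caller's frame for the duration of the call, so a stray write from the callee faults:

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;
static volatile sig_atomic_t faulted;

static void on_segv(int sig)
{
    (void)sig;
    faulted = 1;
    siglongjmp(fault_jmp, 1);        /* escape the faulting instruction */
}

/* The "callee": tries to scribble on the caller's locked frame. */
static void callee(char *caller_frame)
{
    if (sigsetjmp(fault_jmp, 1) == 0)
        caller_frame[0] = 'X';       /* blocked: the frame is read-only */
}

/* Simulate the proposed per-call lock bit: protect the caller's frame,
 * make the call, then unlock. Returns 1 if the write was caught and the
 * frame contents survived intact. */
static int call_with_locked_frame(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    char *frame = mmap(NULL, (size_t)pg, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (frame == MAP_FAILED)
        return -1;
    frame[0] = 'A';

    struct sigaction sa = {0};
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(frame, (size_t)pg, PROT_READ);    /* the "lock" bit */
    faulted = 0;
    callee(frame);                             /* the protected call */
    mprotect(frame, (size_t)pg, PROT_READ | PROT_WRITE);

    int ok = (faulted == 1 && frame[0] == 'A');
    munmap(frame, (size_t)pg);
    return ok;
}
```

The mprotect round trip per call is exactly the kind of overhead that makes the hardware-segment version attractive but the software version impractical.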
Because of the overheads involved in that, I don't think it is as useful as a separate stack for return addresses only, combined with programming-language constructs that allow array bounds checking at compile time. For C, making arrays first-class objects (constructable from expressions specifying the pointer and the length), and allowing the array size variable in function parameters to be declared after the variably-modified array parameter itself (i.e. (int a[n], size_t n)), would suffice, without forcing it upon all C programmers. This would be backwards-compatible at the C level, too.
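A sketch of what such a first-class array could look like, spelled out by hand in today's C as a (pointer, length) pair; the type and macro names are mine, and the proposed (int a[n], size_t n) parameter ordering is shown only in a comment since it isn't valid C today:

```c
#include <stddef.h>

/* Under the proposal this would be a built-in type; today it has to be
 * a hand-rolled (pointer, length) pair. */
typedef struct {
    int    *ptr;
    size_t  len;
} int_array;

/* The proposal would allow the parameter list to read (int a[n], size_t n),
 * with the size declared after the variably-modified parameter it sizes.
 * With a first-class array object, one parameter carries both. */
static long sum(int_array a)
{
    long total = 0;
    for (size_t i = 0; i < a.len; i++)
        total += a.ptr[i];
    return total;
}

/* "Constructable from expressions specifying the pointer and the length": */
#define MAKE_ARRAY(p, n) ((int_array){ (p), (n) })
```

Because the length travels with the pointer, a compiler (or a checked build) always has the bound available at the point of access, which is what makes compile-time and cheap run-time checking feasible.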
Current operating systems using virtual memory already essentially provide many "segments" to userspace applications, just located at separate addresses, with the rest of the address space inaccessible (yielding segmentation violation errors on attempted accesses). While I've worked on huge datasets, I think it might actually be beneficial to limit segments to 32 bits even on 64-bit architectures, because of the compactness and the savings in the code and memory references themselves. A pointer would still be 64 bits, with the segment identifier in the upper 32 bits; and Very Large maps or allocations could simply use consecutive segment identifiers. If the instruction set had separate registers for default segment identifiers for loads and stores, then a zero segment identifier could be used as a shorthand for those, allowing 32-bit pointers into 64-bit memory to be used efficiently. That would also reduce the segment-lookup caching overhead to cases where a different segment is used. In most cases, even complex program code uses very few segments concurrently, so hardware support for just a few concurrently accessible segments, say 8, should work very well.
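The pointer layout described above is easy to sketch; the helper names here are mine, and default_seg stands in for the proposed default load/store segment register:

```c
#include <stdint.h>

/* 64-bit pointer layout under the proposal: segment identifier in the
 * upper 32 bits, offset in the lower 32. */
static inline uint64_t make_ptr(uint32_t seg, uint32_t off)
{
    return ((uint64_t)seg << 32) | off;
}

static inline uint32_t ptr_segment(uint64_t p) { return (uint32_t)(p >> 32); }
static inline uint32_t ptr_offset(uint64_t p)  { return (uint32_t)p; }

/* Resolve the effective segment: a zero identifier is shorthand for the
 * default segment register, so a bare 32-bit offset (upper half zero)
 * works as a compact pointer into 64-bit memory. */
static inline uint32_t effective_segment(uint64_t p, uint32_t default_seg)
{
    uint32_t s = ptr_segment(p);
    return s ? s : default_seg;
}
```

The zero-means-default rule is what lets code that stays within one segment keep using plain 32-bit offsets, only paying for the full 64-bit form (and the segment-lookup caching) when it actually crosses segments.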
As usual, the biggest hurdle in such schemes is us humans. Zero999's assertion that anything non-backwards-compatible is impossible to sell is a good example. I do not exactly agree with it here, because both ARM and RISC-V already have various instruction set extensions and variants; but it is very true that in general, customers do expect and require backwards compatibility. Industry and toolchain vendors/projects have a LOT invested in the current architectures, and any kind of fundamental change in the approach/paradigm will be fought against, because it always incurs additional cost. Perfect is the enemy of good enough, true, but when utter shite is considered good enough for customers, we do get locked into decades of poor solutions.