If you're using the buffer outside of this block and want the compiler to assume it may have changed, you need to use a pointer to volatile at the level of the outer block to dereference it.
Yes, exactly. (I only quoted the above small part, but I do believe we are in full agreement.)
For example, GCC already has a built-in to describe known pointer alignment,
__builtin_assume_aligned(pointer, alignment [, offset ] ). This generates no code; it returns the pointer, and only changes what the compiler assumes about the alignment of that returned value.
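As a rough illustration of how that existing built-in is typically used, here is a minimal sketch. The function name sum_aligned16 and the 16-byte figure are made up for the example; the assert is only a debug-time check of the caller's promise, since the built-in itself verifies nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums n floats. The caller promises that data is 16-byte aligned.
 * sum_aligned16 is a hypothetical name used only for this sketch. */
static float sum_aligned16(const float *data, size_t n)
{
    /* Debug-time check of the promise; the built-in checks nothing. */
    assert(((uintptr_t)data % 16) == 0);

    /* Returns its first argument. The compiler may assume that the
     * returned pointer is 16-byte aligned when optimizing the loop. */
    const float *p = __builtin_assume_aligned(data, 16);

    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}
```

Note that the assumption attaches to the returned pointer p, not to data, so the loop has to go through p for it to have any effect.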
I expect a similar built-in to be provided, say
__builtin_assume_modified(pointer, size), through ARM GCC efforts initially, because this is becoming more and more of a problem on embedded targets. It simply tells the compiler to invalidate all its existing assumptions about the contents of the referred-to memory region, and does not generate any code (such as hardware read or write barriers). It fixes the issue without extra side effects.
An "implementation" of such a built-in is trivial, because we can do it with GCC already (since version 3, probably earlier), via
#define __builtin_assume_modified(ptr, len) __asm__ __volatile__ ("": "+m" (*(char (*)[(len)])(ptr)))
and the syntax itself is shown explicitly in the
GCC Extended Asm documentation. (The reason for making it a built-in is twofold: a built-in encourages compatibility across compilers, and being documented at the source would make it easier to point to, and to get embedded libraries and frameworks to use it where needed. As it is, the macro is a side effect of extended inline assembly, and happens to rely only on the "m" memory constraint (with the "+" read-write modifier), which is common to all architectures. Having something more explicit for the task would be clearer for all.)
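To show where such a macro would sit in practice, here is a minimal sketch. ASSUME_MODIFIED is a hypothetical, non-reserved name for the same construct, and fake_transfer and sum_received are invented stand-ins; on a real target the write would come from a DMA engine or other hardware rather than from a C function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, non-reserved name for the macro discussed above.
 * Empty asm with a read-write memory operand covering len bytes. */
#define ASSUME_MODIFIED(ptr, len) \
    __asm__ __volatile__ ("" : "+m" (*(char (*)[(len)])(ptr)))

/* Stand-in for a DMA engine or similar. On real hardware this write
 * would be done by the peripheral, invisibly to the compiler. */
static void fake_transfer(unsigned char *buf, size_t len, unsigned char value)
{
    memset(buf, value, len);
}

/* Sums a buffer that something else may have rewritten. */
static unsigned sum_received(unsigned char *buf, size_t len)
{
    /* A zero-length array type is not valid, so require len > 0. */
    assert(buf != NULL && len > 0);

    /* From here on, the compiler has to reload buf[0..len-1] from
     * memory instead of reusing values it read or wrote earlier. */
    ASSUME_MODIFIED(buf, len);

    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}
```

In this hosted sketch the fill is ordinary code the compiler can see, so the macro is not strictly needed for correctness; the point is only the placement, right after the transfer is known to be complete and before the first read of the buffer.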
This is specifically useful for DMA, and for any other mechanism where the storage really isn't
volatile in any sense, only modified once by a mechanism invisible to the compiler. In typical hosted C and C++ environments this is usually not an issue, because code compiled in a separate translation unit acts as a similar barrier to the compiler's assumptions about the contents; but in embedded and microcontroller environments, where everything is often compiled in the same unit, we do actually need something like this.
It might be worth talking to the ARM GCC folks about this, actually. I'm just a nobody, and don't exactly relish telling others what they should do to support their users better, but if somebody already has contacts with them, consider pushing this upstream a bit.