Because it's very very far from what the 386 designers intended.
Uh, who cares what the designers intended? The way they intended their design to be used was not the way it ended up being used. It would be pretty fair to say the '386 was a success not because of its design choices, but despite them.
The designers made the absolutely critical error of relying on descriptor tables anyway – and that's what killed it on the '386, and got full segmented memory model support removed from AMD64. Descriptor tables – mappings from arbitrary small integers ("segment selectors") used as address space keys to the definitions of those address spaces – just are not a useful abstraction, and they create an extra, unneeded point of failure, particularly from a security standpoint.
Also, having the result of the segment mappings be an address in a single virtual address space was probably thought of as a necessity, because there was no (and still is no) named address space support in standard C. Nevertheless, that turned out to be a failure, because later processors had to implement PAE to get over the 4GB hump.
If, in the first place, the segments themselves had had their own page tables, instead of being limited to a single virtual address space (which itself was then optionally paged on the '386), only the maximum contiguous memory region and the simultaneously addressable memory would have been limited (to 4GB, and number of simultaneous segments×4GB = 16 GB, respectively). So, by designing in a unified virtual address space, they shot themselves in the foot.
Right now, comparing the memory model in OpenCL to the neutered one provided by SYCL just to cater to compilers that do not want to or cannot provide named address space support, is an excellent repeat of the mistakes the i386 designers made in choosing how to implement segmented memory: they re-do the exact same choices already proven erroneous (by never being used the way their designers intended), and apparently hope that this time it leads to different results.
I do recommend reading the A. Gozillon, P. Keir: Towards Programmable Address Spaces paper from 2017. I cannot say I wholeheartedly agree with or support the choices they describe (which ended up being implemented, and are now available in, for example, Clang 10), but I did find it informative and interesting, and very relevant to address spaces and segmented memory in general. Note that OpenCL, as discussed in that paper, has a four-level memory hierarchy: "global", "constant", "local", "private". Although this hierarchy is based on the asymmetric multiprocessing hardware OpenCL runs on, the way this memory hierarchy is used matches pretty darn well with the segmented memory features I'd like to see; call my "ideas" security paranoia and attempts at future-proofing, meant to avoid the design choices already known to be erroneous.
For the topic at hand, assuming it is still something about pointers in embedded/freestanding/nonstandard-nonhosted C and C++ environments, those four also map very well to the address spaces I'd personally love to see in such environments, for security and robustness. The "constant" address space obviously matches the Flash and ROM currently ubiquitous; "local" and "private" match the two types of limited-duration/local-scope variables and objects (that Ataradov called
auto), the former being the ones that need to be accessible to the caller or callees if nontrivial function calls are made in this scope, and the latter being those that are completely local to the current scope. "global" matches whatever hardware or physical address space the environment uses, if it has such a single unified address space. The missing one is globally accessible data, possibly split into static mutable objects and variables and dynamically managed mutable objects and variables.
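For a taste of what named address spaces look like in practice, avr-gcc already implements the __flash qualifier from the Embedded C technical report (ISO/IEC TR 18037), so on AVR one can write, for example:

/* avr-gcc only: greeting lives in program memory (Flash), and reads
   from it use LPM instructions instead of RAM loads. */
const __flash char greeting[] = "Hello!";

char first_letter(void)
{
    return greeting[0];
}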
I find it funky that what works fine for OpenCL is considered "too hard" or "too complex" for the embedded/freestanding environments.
int foo(int a)
{
volatile char zz[10];
zz[a] = 1;
return zz[a];
}
[...] Call this with foo(100) or foo(-100) and it will overwrite the memory well outside the stack, while SP is perfectly fine.
And this is the most common way stack overflows happen, so any system that does not catch this is not worth considering.
If by "system" you include both the compiler and the hardware, then I absolutely agree, with both points in that last sentence.
And I will revise my understanding of your actual attitude toward robustness and reliability accordingly. (Not that it matters to anyone but me, but I do find it important to point out that my opinions are based on my observations, and when provided with new information, my opinions are likely to change.)
I do suspect that to truly fix this, we do need fundamental changes to C and C++.
Consider hardware that applies a check to each and every effective access using any stack pointer relative addressing mode. The check is a simple bounds check, perhaps written as (EA < base || EA >= limit), where base and limit are internal registers; when the check triggers, a hardware interrupt is raised, with the effective address available in another internal register. (As discussed, this interrupt can default to just updating base or limit, becoming mere stack size instrumentation.)
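A software model of that check, with hypothetical names just to make the description concrete, could look like this:

#include <stdint.h>

typedef uintptr_t addr_t;

static addr_t stack_base, stack_limit;  /* stand-ins for the internal registers */

static inline int stack_access_ok(addr_t ea)
{
    /* When this returns 0, the hardware would raise the interrupt,
       with ea made available in another internal register. */
    return !(ea < stack_base || ea >= stack_limit);
}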
If the compiler does not add an extra software check similar to one verifying
a >= 0 && a < 10 before
zz[a] = 1 in ataradov's example, then even the above hardware effective address check would fail to catch
int bar(char *p, int b);
int foo(int a, int b)
{
volatile char zz[10];
return bar(zz + a, b);
}
simply because the error occurs when the pointer value is constructed – it is out of bounds for the referred-to object – and the pointer p that function bar() receives will not be dereferenced using stack pointer relative addressing anyway (because a single unified address space is used).
A lot of the blame can be placed on the programmer, too. If we wanted bar to be able to detect invalid indexing, we should declare it something like int bar(char *buf, size_t len, size_t index, int b); instead. The standard C library in particular could have better interfaces. It would only take one added line of code to check that the value of a is a valid index into the zz array before constructing the pointer zz+a. And so on.
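As a minimal sketch of what I mean, using the suggested signature (the -1 error value is just a placeholder convention of mine):

#include <stddef.h>

/* Safer interface: the callee receives the buffer length and the index
   separately, so it can validate them itself. */
int bar(char *buf, size_t len, size_t index, int b);

int foo(int a, int b)
{
    char zz[10];
    /* The one added line: check that a is a valid index before
       constructing the pointer. */
    if (a < 0 || (size_t)a >= sizeof zz)
        return -1;
    return bar(zz, sizeof zz, (size_t)a, b);
}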
It is not an easy problem to solve, and it is basically impossible if the compiler developers choose not to participate.
For what it is worth, I have not found a combination of options to get gcc-7.5.0, g++-7.5.0, or clang-10, compiling C or C++, to emit even a peep of complaint about my example above. Yet, it is something that immediately catches my eye when I look at code, exactly because it so often leads to annoying bugs.
stack overflow [versus] buffer overflow [on a stack based buffer]
Very good point; I missed that myself.
Perhaps it is a good idea to remind oneself that on small microcontrollers with limited RAM, the heap and the stack are typically at opposite ends of a single contiguous block of RAM. Dynamic memory allocations reduce the space left for the stack (unless they reuse a hole left by an earlier allocation since freed), so basically we have a waterline that varies at runtime (marking the highest address of currently allocated dynamic memory) that the stack must not cross.
One reason runtime heuristics like stack canaries have such a bad time detecting this before the device has already crashed and pooped all over, is that the waterline does not stay constant: it moves (if any dynamic memory allocations are done), and it could be either a dynamic memory allocation or the stack growing that causes the waterline to be crossed.
Now, add a nasty buffer overrun – especially the kind that does not just fill an array past its allocated size, but nefariously accesses/modifies a single byte or a group of buffer entries way past the end (or the beginning) of the buffer – and you get the kind of bugnest that can cause one to decide to switch to woodworking. At least there you get to use a hammer on any bugs you see. Canaries are rather unlikely to happen to be exactly where such an access ended up modifying memory, so they may not help at all.
(That said, off-by-one errors seem to be the most common buffer overrun cases, i.e. overwriting the byte/int immediately preceding or succeeding the intended object. Those are relatively easy to catch. But the nasty ones are the jumpy ones, as they can be very hard to spot in the code. Integer promotion causing sign extension on something that was intended to be an unsigned value can be very hard to spot, and if it occurs in an index calculation, the end result can be way off. This is one reason you'll see my own code using way more explicit casts than are technically required; since the casts typically only cost human attention and do not generate extra machine code, I consider them an appropriate way to try and avoid some of those nasty kangaroo indexing bugs. A semi-related case in point: how many C programmers know or care that if they happen to have a char or int c, the proper way to test in a hosted environment whether c is a whitespace character is NOT isspace(c), but isspace((unsigned char)c)?)
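A minimal sketch of that pitfall, assuming a hosted environment and a platform where plain char is signed:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char c = '\xA0';  /* e.g. Latin-1 non-breaking space; negative if char is signed */

    /* isspace(c) would pass a negative value other than EOF to isspace(),
       which is undefined behaviour; the cast makes the value non-negative. */
    printf("%d\n", isspace((unsigned char)c));
    return 0;
}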
On an embedded architecture, it would be rather nice to have that waterline address in a special register – even one that is relatively slow to access and update – with the stack pointer crossing it causing a hardware interrupt. It would not help with the buffer underrun/overrun/overflow bugs, but it would make stack waterline crossing detection at runtime deterministic.
I can even imagine/describe a couple of C programming patterns (admittedly using setjmp()/longjmp(), which I do not like to use at all) that could set up a safe state to revert to if a waterline crossing event were to occur, so that a reboot or crash could actually be avoided in many situations. (It won't complete/revert I/O done in the meantime, so it is more about cancelling computational rather than I/O work, when that work cannot be done with the currently available stack space.)
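A bare-bones sketch of the shape of such a pattern – all names here are made up, and real code would need great care with interrupt context and volatile locals:

#include <setjmp.h>

static jmp_buf safe_state;

extern void heavy_recursive_computation(void);  /* hypothetical stack-hungry work */

/* Imagine the waterline-crossing interrupt handler arranging for this to run. */
void waterline_crossed(void)
{
    longjmp(safe_state, 1);  /* unwind back to the checkpoint below */
}

int do_work_or_cancel(void)
{
    if (setjmp(safe_state))
        return -1;  /* work was cancelled instead of crashing */
    heavy_recursive_computation();
    return 0;
}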
However, is there a case where a global variable shouldn't be declared as volatile? And could they always be volatile by default?
Declaring a variable
volatile is always safe, just potentially
inefficient.
You see, the C standards define
volatile as
Accesses to volatile objects are evaluated strictly according to the rules of the abstract
machine.
and points out in a footnote that
An implementation might define a one-to-one correspondence between abstract and actual semantics: at every sequence point, the values of the actual objects would agree with those specified by the abstract semantics. The keyword volatile would then be redundant.
Indeed, some C compilers did do just that.
A core method by which current C compilers generate much more efficient code is the assumption that if an object is not examined, and it is not volatile, its value does not matter. (This is also why you will see all memory-mapped I/O register objects in C and C++ declared volatile. If they are not, it is a bug.)
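For example, a memory-mapped register access typically looks something like this (the register name and address are made up):

#include <stdint.h>

/* Hypothetical status register; volatile is what forces every access
   to actually touch the hardware. */
#define STATUS_REG (*(volatile uint32_t *)0x40001000u)

static void wait_until_idle(void)
{
    while (STATUS_REG & 1u)
        ;  /* without volatile, the compiler could hoist this load out of the loop */
}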
The way I define volatile may not be exactly correct (in the language lawyer sense), but it is a very useful, intuitive definition and correct in the real world: it tells the compiler that the object may be concurrently modified by some other code the compiler does not know about, and therefore the compiler must not – is not allowed to – make any assumptions about it. Without volatile, one assumption a C compiler can make is, for example, that if object foo is not modified by any code the compiler knows about between sequence points X and Y, the compiler can reuse the value foo had at sequence point X at the later sequence point Y.
For example, if you have say
double doh(const double *const xref, const double *const yref, const double *const zref)
{
double result;
result = (*xref) * (*yref);
do_something_slow_1();
result += (*xref) * (*zref);
do_something_slow_2();
result += (*yref) * (*zref);
do_something_slow_3();
return result;
}
a C compiler is free to generate the same machine code as it would for
double doh(const double *const xref, const double *const yref, const double *const zref)
{
const double x = *xref;
const double y = *yref;
const double z = *zref;
do_something_slow_1();
do_something_slow_2();
do_something_slow_3();
return x*y + x*z + y*z;
}
only because the pointers do not point to
volatile doubles, and
result is only observable within its local scope (and not in
do_something_slow_n() functions).
If the pointers were declared as const volatile double *, then the compiler would NOT be allowed to do this: it would have to dereference the pointers between the calls to the do_something_slow_n() functions, to acquire the values of the referred-to objects without "caching" them across sequence points.
To see why volatile matters, just consider another thread modifying the values that xref, yref, and zref point to during the calls to the do_something_slow_n() functions. The result you obtain from the function call then depends on whether you declare the values the pointers point to volatile or not. (Declaring the pointer variable itself volatile, say const double *const volatile xref, would be silly, because it'd tell the compiler that the pointer itself may be modified by some unseen code.)
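Spelled out side by side, the two declarations differ like this:

const volatile double *p;        /* the pointed-to double may change unseen: what we want here */
const double *const volatile q;  /* the pointer itself may change unseen: silly in this case */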
In all cases, having the
volatile there means the compiler will follow the C standard abstract machine model more strictly, so if you ever find code that behaves correctly
without volatile, and incorrectly
with volatile, then that code is strange and very suspect indeed; it
must rely on the compiler to generate the code in some specific way, regardless of what the C standard says the compiler is allowed or should do in that situation. Bad, bad code, that; needs a rewrite for sure.
The final wrinkle is exactly what a sequence point is in the C standard. Fortunately, the standards have an informative annex (so not "this is what it is", but "we the standard writers believe the sequence points are these, but if the text of the standard disagrees, then the text of the standard is right and this list is wrong") stating that the sequence points are:
- Between the evaluations of the function designator and actual arguments in a function call and the actual call
- Between the evaluations of the first and second operands of logical AND (&&), logical OR (||), and the comma operator (,)
- Between the evaluations of the first operand of the conditional ? : operator and whichever of the second and third operands is evaluated
- The end of a full declarator
- Between the evaluation of a full expression and the next full expression to be evaluated. (Full expressions being an initializer that is not part of a compound literal, the expression in an expression statement, the controlling expression of an if or switch selection statement, the controlling expression of a while or do statement, each of the (optional) expressions of a for statement, and the (optional) expression in a return statement.)
- Immediately before a library function returns
- After the actions associated with each formatted input/output function conversion specifier
- Immediately before and immediately after each call to a comparison function, and also between any call to a comparison function and any movement of the objects passed as arguments to that call
according to the final published draft of the C11 standard, also known as n1570.pdf. Sequence points themselves are just the concept the C standard uses to define the order of effects. Between two sequence points, effects and side effects can occur in any order; but generally speaking, the sequence points are defined such that each useful effect or observable result or side effect is nicely bracketed between two sequence points.
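As a classic illustration of why that bracketing matters (this snippet is a textbook example, not from the code discussed here):

int sequencing_trap(int i)
{
    i = i++ + 1;  /* undefined behaviour: i is modified twice with no
                     intervening sequence point */
    return i;
}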
I anticipate that there might be some discussion looming whether a C compiler is allowed to generate the same code for the two doh() functions I showed.
Instead of getting bogged down in language-lawyerism, let's expand it a bit into a complete example we can compile and examine:
static volatile int n = 0;
void do_something_slow_1(void) { n += 1; }
void do_something_slow_2(void) { n += 2; }
void do_something_slow_3(void) { n += 3; }
static inline double doh1i(const double *const xref, const double *const yref, const double *const zref)
{
double result;
result = (*xref) * (*yref);
do_something_slow_1();
result += (*xref) * (*zref);
do_something_slow_2();
result += (*yref) * (*zref);
do_something_slow_3();
return result;
}
static inline double doh2i(const double *const xref, const double *const yref, const double *const zref)
{
const double x = *xref;
const double y = *yref;
const double z = *zref;
do_something_slow_1();
do_something_slow_2();
do_something_slow_3();
return x*y + x*z + y*z;
}
double doh1(const int ix, const int iy, const int iz)
{
const double x = ix, y = iy, z = iz;
return doh1i(&x, &y, &z);
}
double doh2(const int ix, const int iy, const int iz)
{
const double x = ix, y = iy, z = iz;
return doh2i(&x, &y, &z);
}
Using clang-10 -Wall -O2 -std=c11 -c doh.c this compiles to
doh1: doh2:
cvtsi2sd %edi, %xmm1 cvtsi2sd %edi, %xmm1
cvtsi2sd %esi, %xmm0 cvtsi2sd %esi, %xmm0
cvtsi2sd %edx, %xmm2 cvtsi2sd %edx, %xmm2
addl $1, n(%rip) addl $1, n(%rip)
movapd %xmm1, %xmm3 addl $2, n(%rip)
mulsd %xmm0, %xmm3 addl $3, n(%rip)
mulsd %xmm2, %xmm1 movapd %xmm1, %xmm3
addsd %xmm3, %xmm1 mulsd %xmm0, %xmm3
addl $2, n(%rip) mulsd %xmm2, %xmm1
mulsd %xmm2, %xmm0 addsd %xmm3, %xmm1
addsd %xmm1, %xmm0 mulsd %xmm2, %xmm0
addl $3, n(%rip) addsd %xmm1, %xmm0
retq retq
You do not need to understand AT&T syntax AMD64 assembly (which has the source on the left and the destination on the right, the opposite of Intel syntax): all you need to know is that the instructions that load the doubles from memory have the movsd -offset(%rsp), %xmmN form, and the slow function calls correspond to the addl $N, n(%rip) instructions.
Simply put, Clang-10 keeps the instruction order basically intact even without the volatile.
GCC-7.5.0 (gcc -Wall -O2 -std=c11 -c doh.c) generates
doh1: doh2:
pxor %xmm2, %xmm2 pxor %xmm2, %xmm2
movl n(%rip), %eax movl n(%rip), %eax
pxor %xmm3, %xmm3 pxor %xmm1, %xmm1
pxor %xmm1, %xmm1 pxor %xmm3, %xmm3
cvtsi2sd %edi, %xmm2 cvtsi2sd %edi, %xmm2
addl $1, %eax addl $1, %eax
cvtsi2sd %edx, %xmm3 cvtsi2sd %esi, %xmm1
movl %eax, n(%rip) movl %eax, n(%rip)
cvtsi2sd %esi, %xmm1 cvtsi2sd %edx, %xmm3
movl n(%rip), %eax movl n(%rip), %eax
addl $2, %eax addl $2, %eax
movl %eax, n(%rip) movl %eax, n(%rip)
movl n(%rip), %eax movl n(%rip), %eax
addl $3, %eax addl $3, %eax
movl %eax, n(%rip) movl %eax, n(%rip)
movapd %xmm2, %xmm4 movapd %xmm2, %xmm0
mulsd %xmm3, %xmm2 mulsd %xmm3, %xmm2
mulsd %xmm1, %xmm4 mulsd %xmm1, %xmm0
mulsd %xmm3, %xmm1 mulsd %xmm3, %xmm1
movapd %xmm2, %xmm0 addsd %xmm0, %xmm2
addsd %xmm4, %xmm0 addsd %xmm1, %xmm2
addsd %xmm1, %xmm0 movapd %xmm2, %xmm0
ret ret
which is basically identical for both, aside from register naming differences.
Language-lawyerism aside, it means that if you use GCC-7.5.0, with this kind of a code pattern, what I described in my previous post will happen to you too:
without volatile, the two versions of doh() will generate the same machine code.
The instruction pattern GCC-7.5.0 generates for updating the counter n is
movl n(%rip), %eax
addl $N, %eax
movl %eax, n(%rip)
which annoys the heck out of me. Instead of the single sane addl $N, n(%rip) that clang-10 uses, GCC does this superfluous register dance, and I cannot fathom why; I thought this kind of thing was more or less fixed a couple of major versions ago... This is also why I don't trust compilers any further than I can examine their output, and is the reason why I use extended inline assembly functions for oddball memory-mapped I/O register accesses: to ensure the exact instruction I want will be used.
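For illustration, here is a sketch of the kind of wrapper I mean, assuming an ARM Cortex-M target and GCC extended asm (the function name is mine, not from any library); the point is that the str instruction is spelled out, so the compiler cannot substitute anything else:

#include <stdint.h>

static inline void mmio_write32(volatile uint32_t *reg, uint32_t value)
{
    /* Emit exactly one 32-bit store; the "memory" clobber keeps it ordered
       with respect to surrounding memory accesses. */
    __asm__ __volatile__ ("str %1, [%0]" : : "r" (reg), "r" (value) : "memory");
}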
Nevertheless, I should be happy, because it backs up my argument. (I'm not, because I don't want to win. I want to help others write better code, and especially to write and show me better code than I myself can write, because I'm selfish and self-centered and only care about winning my past self. That dude was an asshole.)
If we replace const double *const with const volatile double *const, then clang-10 generates
doh1: doh2:
cvtsi2sd %edi, %xmm0 cvtsi2sd %edi, %xmm0
cvtsi2sd %esi, %xmm1 movsd %xmm0, -8(%rsp)
movsd %xmm0, -8(%rsp) xorps %xmm0, %xmm0
movsd %xmm1, -16(%rsp) cvtsi2sd %esi, %xmm0
xorps %xmm0, %xmm0 cvtsi2sd %edx, %xmm1
cvtsi2sd %edx, %xmm0 movsd %xmm0, -16(%rsp)
movsd %xmm0, -24(%rsp) movsd %xmm1, -24(%rsp)
movsd -8(%rsp), %xmm0 movsd -8(%rsp), %xmm1
mulsd -16(%rsp), %xmm0 movsd -16(%rsp), %xmm0
addl $1, n(%rip) movsd -24(%rsp), %xmm2
movsd -8(%rsp), %xmm1 addl $1, n(%rip)
mulsd -24(%rsp), %xmm1 addl $2, n(%rip)
addl $2, n(%rip) addl $3, n(%rip)
addsd %xmm0, %xmm1 movapd %xmm1, %xmm3
movsd -16(%rsp), %xmm0 mulsd %xmm0, %xmm3
mulsd -24(%rsp), %xmm0 mulsd %xmm2, %xmm1
addsd %xmm1, %xmm0 addsd %xmm3, %xmm1
addl $3, n(%rip) mulsd %xmm2, %xmm0
retq addsd %xmm1, %xmm0
retq
the difference being that now doh1() has exactly the behaviour we/I/the author intended.
Like I claimed, volatile stops clang-10 from generating the same code it does for doh2(). This means we can use volatile as I described in my previous post to control what kind of assumptions the compiler can make. Here, we want to do the slow calls in between accesses to the pointed-to doubles, so we need to tell the compiler the pointed-to objects are volatile, and it does what we expect it to. Nice.
As I'm writing this, I'm seriously considering switching from gcc-7.5.0 to clang-10 on at least AMD64. I didn't realize before that clang-10 output is that much better.
Anyway, here is the GCC-7.5.0 output, when function parameters are declared as const volatile double *const ref:
doh1: doh2:
pxor %xmm0, %xmm0 pxor %xmm0, %xmm0
subq $40, %rsp subq $40, %rsp
movq %fs:40, %rax movq %fs:40, %rax
movq %rax, 24(%rsp) movq %rax, 24(%rsp)
xorl %eax, %eax xorl %eax, %eax
cvtsi2sd %edi, %xmm0 cvtsi2sd %edi, %xmm0
movsd %xmm0, (%rsp) movsd %xmm0, (%rsp)
pxor %xmm0, %xmm0 pxor %xmm0, %xmm0
movsd (%rsp), %xmm1 movsd (%rsp), %xmm2
cvtsi2sd %esi, %xmm0 cvtsi2sd %esi, %xmm0
movsd %xmm0, 8(%rsp) movsd %xmm0, 8(%rsp)
pxor %xmm0, %xmm0 pxor %xmm0, %xmm0
cvtsi2sd %edx, %xmm0 movsd 8(%rsp), %xmm1
movsd %xmm0, 16(%rsp) cvtsi2sd %edx, %xmm0
movsd 8(%rsp), %xmm0 movsd %xmm0, 16(%rsp)
movl n(%rip), %eax movapd %xmm2, %xmm0
mulsd %xmm0, %xmm1 movsd 16(%rsp), %xmm3
addl $1, %eax movl n(%rip), %eax
movl %eax, n(%rip) mulsd %xmm1, %xmm0
movsd (%rsp), %xmm0 mulsd %xmm3, %xmm2
movsd 16(%rsp), %xmm2 mulsd %xmm3, %xmm1
movl n(%rip), %eax addl $1, %eax
mulsd %xmm2, %xmm0 movl %eax, n(%rip)
addl $2, %eax movl n(%rip), %eax
movl %eax, n(%rip) addsd %xmm2, %xmm0
addsd %xmm0, %xmm1 addl $2, %eax
movsd 8(%rsp), %xmm0 movl %eax, n(%rip)
movsd 16(%rsp), %xmm2 movl n(%rip), %eax
movl n(%rip), %eax addsd %xmm1, %xmm0
mulsd %xmm2, %xmm0 addl $3, %eax
addl $3, %eax movl %eax, n(%rip)
movl %eax, n(%rip) movq 24(%rsp), %rax
movq 24(%rsp), %rax xorq %fs:40, %rax
xorq %fs:40, %rax jne .L12
addsd %xmm1, %xmm0 addq $40, %rsp
jne .L8 ret
addq $40, %rsp .L12:
ret call __stack_chk_fail@PLT
.L8:
call __stack_chk_fail@PLT
If we were to pore through it, we'd find that doh1() indeed does the slow function calls (inlined) in between (re)loading the double values and using the recently (re)loaded values for the product it adds to the sum; exactly as we/I/the author apparently intended it to work.
doh2() now has completely different machine code compared to doh1(), and indeed does the slow function calls (inlined) in a batch near the end of the function.
Simply put, volatile made GCC generate machine code for doh1() with exactly the order of side effects (increments of n) we want, exactly as I claimed.
I'm just not at all happy with how inefficient the code GCC-7.5.0 generates here is. It does not detract from my argument, and kinda even reinforces the idea that no matter what the standards say, you're better off examining the actual output of your tools.
TL;DR: This long-ass examination of GCC-7.5.0 and Clang-10 output on AMD64 shows that even if my understanding of the C standard is wrong, what I described does happen in real life: it happens exactly as I described for this particular example code on AMD64. I could still be wrong, but even if I am, that only means the situation is even more complex in reality; my understanding may need fixing, but it does apply to at least this case, exactly.