I anticipate some discussion looming about whether a C compiler is allowed to generate the same code for the two doh() functions I showed.
Instead of getting bogged down in language-lawyerism, let's expand it a bit into a complete example we can compile and examine:
static volatile int n = 0;
void do_something_slow_1(void) { n += 1; }
void do_something_slow_2(void) { n += 2; }
void do_something_slow_3(void) { n += 3; }
static inline double doh1i(const double *const xref, const double *const yref, const double *const zref)
{
    double result;
    result = (*xref) * (*yref);
    do_something_slow_1();
    result += (*xref) * (*zref);
    do_something_slow_2();
    result += (*yref) * (*zref);
    do_something_slow_3();
    return result;
}
static inline double doh2i(const double *const xref, const double *const yref, const double *const zref)
{
    const double x = *xref;
    const double y = *yref;
    const double z = *zref;
    do_something_slow_1();
    do_something_slow_2();
    do_something_slow_3();
    return x*y + x*z + y*z;
}
double doh1(const int ix, const int iy, const int iz)
{
    const double x = ix, y = iy, z = iz;
    return doh1i(&x, &y, &z);
}
double doh2(const int ix, const int iy, const int iz)
{
    const double x = ix, y = iy, z = iz;
    return doh2i(&x, &y, &z);
}
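If you want to follow along, the assembly is easy to get at: compile with -S instead of -c to emit it directly, or disassemble the object file with objdump from GNU binutils:

clang-10 -Wall -O2 -std=c11 -S -o doh.s doh.c
objdump -d doh.o

The listings below show doh1() and doh2() side by side.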
Using clang-10 -Wall -O2 -std=c11 -c doh.c, this compiles to
doh1:                               doh2:
    cvtsi2sd %edi, %xmm1                cvtsi2sd %edi, %xmm1
    cvtsi2sd %esi, %xmm0                cvtsi2sd %esi, %xmm0
    cvtsi2sd %edx, %xmm2                cvtsi2sd %edx, %xmm2
    addl $1, n(%rip)                    addl $1, n(%rip)
    movapd %xmm1, %xmm3                 addl $2, n(%rip)
    mulsd %xmm0, %xmm3                  addl $3, n(%rip)
    mulsd %xmm2, %xmm1                  movapd %xmm1, %xmm3
    addsd %xmm3, %xmm1                  mulsd %xmm0, %xmm3
    addl $2, n(%rip)                    mulsd %xmm2, %xmm1
    mulsd %xmm2, %xmm0                  addsd %xmm3, %xmm1
    addsd %xmm1, %xmm0                  mulsd %xmm2, %xmm0
    addl $3, n(%rip)                    addsd %xmm1, %xmm0
    retq                                retq
You do not need to understand AT&T-syntax AMD64 assembly (which has the source on the left and the destination on the right, the opposite of Intel syntax). All you need to know is that the instructions that load the doubles from memory have the movsd -offset(%rsp), %xmmN form, and that the inlined slow function calls correspond to the addl $N, n(%rip) instructions; for example, addl $1, n(%rip) is just n += 1.
Simply put, clang-10 keeps the source order of doh1() basically intact even without the volatile.
GCC-7.5.0 (gcc -Wall -O2 -std=c11 -c doh.c) generates
doh1:                               doh2:
    pxor %xmm2, %xmm2                   pxor %xmm2, %xmm2
    movl n(%rip), %eax                  movl n(%rip), %eax
    pxor %xmm3, %xmm3                   pxor %xmm1, %xmm1
    pxor %xmm1, %xmm1                   pxor %xmm3, %xmm3
    cvtsi2sd %edi, %xmm2                cvtsi2sd %edi, %xmm2
    addl $1, %eax                       addl $1, %eax
    cvtsi2sd %edx, %xmm3                cvtsi2sd %esi, %xmm1
    movl %eax, n(%rip)                  movl %eax, n(%rip)
    cvtsi2sd %esi, %xmm1                cvtsi2sd %edx, %xmm3
    movl n(%rip), %eax                  movl n(%rip), %eax
    addl $2, %eax                       addl $2, %eax
    movl %eax, n(%rip)                  movl %eax, n(%rip)
    movl n(%rip), %eax                  movl n(%rip), %eax
    addl $3, %eax                       addl $3, %eax
    movl %eax, n(%rip)                  movl %eax, n(%rip)
    movapd %xmm2, %xmm4                 movapd %xmm2, %xmm0
    mulsd %xmm3, %xmm2                  mulsd %xmm3, %xmm2
    mulsd %xmm1, %xmm4                  mulsd %xmm1, %xmm0
    mulsd %xmm3, %xmm1                  mulsd %xmm3, %xmm1
    movapd %xmm2, %xmm0                 addsd %xmm0, %xmm2
    addsd %xmm4, %xmm0                  addsd %xmm1, %xmm2
    addsd %xmm1, %xmm0                  movapd %xmm2, %xmm0
    ret                                 ret
which is basically identical for both, aside from register allocation and minor scheduling differences.
Language-lawyerism aside, this means that if you use GCC-7.5.0 with this kind of code pattern, what I described in my previous post will happen to you too:
without volatile, the two versions of doh() generate the same machine code.
The instruction pattern GCC-7.5.0 generates for updating the counter n is
movl n(%rip), %eax
addl $N, %eax
movl %eax, n(%rip)
which annoys the heck out of me. It is not the sane single-instruction addl $N, n(%rip) that clang-10 uses, and I cannot fathom why; I thought this kind of superfluous register dance was more or less fixed a couple of major versions ago... This is also why I don't trust compilers any further than I can examine their output, and why I use extended inline assembly functions for oddball memory-mapped I/O register accesses: to ensure the exact instruction I want is the one that gets used.
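To show what I mean, here is a minimal sketch of that idiom (the function name and the plain movl are just my picks for this example; a real target dictates the actual instruction):

#include <stdint.h>

/* Force exactly one 32-bit store instruction to a memory-mapped register.
   The asm volatile keeps the compiler from eliding, duplicating, or
   rewriting the access; add a "memory" clobber if you also need ordering
   against surrounding non-volatile accesses. */
static inline void mmio_write32(volatile uint32_t *const reg, const uint32_t value)
{
    __asm__ __volatile__ ("movl %1, %0" : "=m" (*reg) : "r" (value));
}

The point is that the store is spelled out in full, so there is nothing for the compiler to get creative with.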
Nevertheless, I should be happy, because it backs up my argument. (I'm not, because I don't want to win. I want to help others write better code, and especially to write and show me better code than I myself can write, because I'm selfish and self-centered and only care about winning my past self. That dude was an asshole.)
If we replace const double *const with const volatile double *const, then clang-10 generates
doh1:                               doh2:
    cvtsi2sd %edi, %xmm0                cvtsi2sd %edi, %xmm0
    cvtsi2sd %esi, %xmm1                movsd %xmm0, -8(%rsp)
    movsd %xmm0, -8(%rsp)               xorps %xmm0, %xmm0
    movsd %xmm1, -16(%rsp)              cvtsi2sd %esi, %xmm0
    xorps %xmm0, %xmm0                  cvtsi2sd %edx, %xmm1
    cvtsi2sd %edx, %xmm0                movsd %xmm0, -16(%rsp)
    movsd %xmm0, -24(%rsp)              movsd %xmm1, -24(%rsp)
    movsd -8(%rsp), %xmm0               movsd -8(%rsp), %xmm1
    mulsd -16(%rsp), %xmm0              movsd -16(%rsp), %xmm0
    addl $1, n(%rip)                    movsd -24(%rsp), %xmm2
    movsd -8(%rsp), %xmm1               addl $1, n(%rip)
    mulsd -24(%rsp), %xmm1              addl $2, n(%rip)
    addl $2, n(%rip)                    addl $3, n(%rip)
    addsd %xmm0, %xmm1                  movapd %xmm1, %xmm3
    movsd -16(%rsp), %xmm0              mulsd %xmm0, %xmm3
    mulsd -24(%rsp), %xmm0              mulsd %xmm2, %xmm1
    addsd %xmm1, %xmm0                  addsd %xmm3, %xmm1
    addl $3, n(%rip)                    mulsd %xmm2, %xmm0
    retq                                addsd %xmm1, %xmm0
                                        retq
the difference being that doh1() now has exactly the behaviour we/I/the author intended: the doubles are (re)loaded from the stack in between the slow calls, in source order.
Like I claimed, volatile stops clang-10 from generating the same code for doh1() as it generates for doh2(). This means we can use volatile as I described in my previous post, to control what kind of assumptions the compiler can make. Here, we want the slow calls to happen in between the accesses to the pointed-to doubles, so we tell the compiler the pointed-to objects are volatile, and it does exactly what we expect. Nice.
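You don't even need to change the function signatures to get this: taking a volatile-qualified view of the pointers inside the function suffices. A sketch of the idiom, with doh3i being my name for it (and whether the standard nails down the semantics here is exactly the language-lawyer question I side-stepped at the top; in practice, both compilers key on the access being volatile-qualified):

static inline double doh3i(const double *const xref, const double *const yref, const double *const zref)
{
    /* Volatile-qualified views of the same pointers: every *x, *y, *z
       below is a volatile access, so it cannot be cached across the
       slow calls. */
    const volatile double *const x = xref;
    const volatile double *const y = yref;
    const volatile double *const z = zref;
    double result;
    result = (*x) * (*y);
    do_something_slow_1();
    result += (*x) * (*z);
    do_something_slow_2();
    result += (*y) * (*z);
    do_something_slow_3();
    return result;
}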
As I'm writing this, I'm seriously considering switching from GCC-7.5.0 to clang-10, at least on AMD64. I didn't realize before that clang-10's output is that much better.
Anyway, here is the GCC-7.5.0 output when the pointer parameters are declared const volatile double *const:
doh1:                               doh2:
    pxor %xmm0, %xmm0                   pxor %xmm0, %xmm0
    subq $40, %rsp                      subq $40, %rsp
    movq %fs:40, %rax                   movq %fs:40, %rax
    movq %rax, 24(%rsp)                 movq %rax, 24(%rsp)
    xorl %eax, %eax                     xorl %eax, %eax
    cvtsi2sd %edi, %xmm0                cvtsi2sd %edi, %xmm0
    movsd %xmm0, (%rsp)                 movsd %xmm0, (%rsp)
    pxor %xmm0, %xmm0                   pxor %xmm0, %xmm0
    movsd (%rsp), %xmm1                 movsd (%rsp), %xmm2
    cvtsi2sd %esi, %xmm0                cvtsi2sd %esi, %xmm0
    movsd %xmm0, 8(%rsp)                movsd %xmm0, 8(%rsp)
    pxor %xmm0, %xmm0                   pxor %xmm0, %xmm0
    cvtsi2sd %edx, %xmm0                movsd 8(%rsp), %xmm1
    movsd %xmm0, 16(%rsp)               cvtsi2sd %edx, %xmm0
    movsd 8(%rsp), %xmm0                movsd %xmm0, 16(%rsp)
    movl n(%rip), %eax                  movapd %xmm2, %xmm0
    mulsd %xmm0, %xmm1                  movsd 16(%rsp), %xmm3
    addl $1, %eax                       movl n(%rip), %eax
    movl %eax, n(%rip)                  mulsd %xmm1, %xmm0
    movsd (%rsp), %xmm0                 mulsd %xmm3, %xmm2
    movsd 16(%rsp), %xmm2               mulsd %xmm3, %xmm1
    movl n(%rip), %eax                  addl $1, %eax
    mulsd %xmm2, %xmm0                  movl %eax, n(%rip)
    addl $2, %eax                       movl n(%rip), %eax
    movl %eax, n(%rip)                  addsd %xmm2, %xmm0
    addsd %xmm0, %xmm1                  addl $2, %eax
    movsd 8(%rsp), %xmm0                movl %eax, n(%rip)
    movsd 16(%rsp), %xmm2               movl n(%rip), %eax
    movl n(%rip), %eax                  addsd %xmm1, %xmm0
    mulsd %xmm2, %xmm0                  addl $3, %eax
    addl $3, %eax                       movl %eax, n(%rip)
    movl %eax, n(%rip)                  movq 24(%rsp), %rax
    movq 24(%rsp), %rax                 xorq %fs:40, %rax
    xorq %fs:40, %rax                   jne .L12
    addsd %xmm1, %xmm0                  addq $40, %rsp
    jne .L8                             ret
    addq $40, %rsp                  .L12:
    ret                                 call __stack_chk_fail@PLT
.L8:
    call __stack_chk_fail@PLT
If we were to pore through it, we'd find that doh1() indeed does the slow function calls (inlined) in between (re)loading the double values and using the freshly (re)loaded values for the product it adds to the sum; exactly as we/I/the author apparently intended.
doh2() now has completely different machine code compared to doh1(), and indeed does the slow function calls (inlined) in a batch near the end of the function.
Simply put, volatile made GCC generate machine code for doh1() with exactly the order of side effects (increments of n) we want, exactly as I claimed.
I'm just not happy, at all, with how inefficient the code GCC-7.5.0 generates here is. That does not detract from my argument, and kinda even reinforces the idea that no matter what the standards say, you're better off examining the actual output of your tools.
TL;DR: This long-ass examination of GCC-7.5.0 and clang-10 output on AMD64 shows that even if my understanding of the C standard is wrong, what I described does happen in real life: it happens exactly as I described with this particular example code on AMD64. I could still be wrong, but if I am, it only means the situation is even more complex in reality; my understanding may need fixing, but it does apply to at least this case exactly.