Because it's very very far from what the 386 designers intended.
Uh, who cares what the designers intended? The way they intended their design to be used was not the way it ended up being used. It would be pretty fair to say the '386 was a success not because of its design choices, but despite them.
The designers made the absolutely critical error of relying on descriptor tables anyway – and that's what killed it on the '386, and got full segmented memory model support removed from AMD64. Descriptor tables – mappings from arbitrary small integers ("segment selectors") used as address space keys to the definitions of those address spaces – just are not a useful abstraction, and they create an extra, unneeded point of failure, particularly from a security standpoint.
Also, having the result of the segment mappings be an address in a single virtual address space was probably thought of as a necessity, because there was no (and still is no) named address space support in standard C. Nevertheless, that turned out to be a failure, because later processors had to implement PAE to get over the 4GB hump.
If, in the first place, the segments themselves had had their own page tables, instead of being limited to a single virtual address space (which itself was then optionally paged on the '386), only the maximum contiguous memory region and the simultaneously addressable memory would have been limited (to 4GB, and number of simultaneous segments×4GB = 16 GB, respectively). So, by designing in a unified virtual address space, they shot themselves in the foot.
Right now, comparing the memory model in OpenCL to the neutered one provided by SYCL just to cater to compilers that do not want to or cannot provide named address space support, is an excellent repeat of the mistakes the i386 designers made in choosing how to implement segmented memory: they re-do the exact same choices already proven erroneous (by never being used the way their designers intended), and apparently hope that this time it leads to different results.
I do recommend reading the A. Gozillon, P. Keir: Towards Programmable Address Spaces paper from 2017. I cannot say I wholeheartedly agree with or support the choices they describe (which ended up being implemented, and are now available in, for example, Clang 10), but I did find it informative and interesting, and very relevant to address spaces and segmented memory in general. Note that OpenCL, as discussed in that paper, has a four-level memory hierarchy: "global", "constant", "local", "private". Although this hierarchy is based on the asymmetric multiprocessing hardware OpenCL runs on, the way this memory hierarchy is used matches pretty darn well with the segmented memory features I'd like to see; call my "ideas" security paranoia and attempts at future-proofing, meant to avoid the design choices already known to be erroneous.
For the topic at hand, assuming it is still something about pointers in embedded/freestanding/nonstandard-nonhosted C and C++ environments, those four also map very well to the address spaces I'd personally love to see in such environments, for security and robustness. The "constant" address space obviously matches the Flash and ROM currently ubiquitous; "local" and "private" match the two types of limited-duration/local-scope variables and objects (that Ataradov called
auto), the former being the ones that need to be accessible to the caller or callees if nontrivial function calls are made in this scope, and the latter being those that are completely local to the current scope. "global" matches whatever hardware or physical address space the environment uses, if it has such a single unified address space. The missing one is globally accessible data, possibly split into static mutable objects and variables and dynamically managed mutable objects and variables.
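For a taste of what named address spaces look like in practice, avr-gcc already implements the __flash qualifier from the Embedded C technical report (ISO/IEC TR 18037), so on AVR one can write, for example:

/* avr-gcc only: greeting lives in program memory (Flash), and reads
   from it use LPM instructions instead of RAM loads. */
const __flash char greeting[] = "Hello!";

char first_letter(void)
{
    return greeting[0];
}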
I find it funky that what works fine for OpenCL is considered "too hard" or "too complex" for the embedded/freestanding environments.
int foo(int a)
{
volatile char zz[10];
zz[a] = 1;
return zz[a];
}
[...] Call this with foo(100) or foo(-100) and it will overwrite the memory well outside the stack, while SP is perfectly fine.
And this is the most common way stack overflows happen, so any system that does not catch this is not worth considering.
If by "system" you include both the compiler and the hardware, then I absolutely agree, with both points in that last sentence.
And I will revise my understanding of your actual attitude toward robustness and reliability accordingly. (Not that it matters to anyone but me, but I do find it important to point out that my opinions are based on my observations, and when provided with new information, my opinions are likely to change.)
I do suspect that to truly fix this, we do need fundamental changes to C and C++.
Consider hardware that applies a check to each and every effective access using any stack pointer relative addressing mode. The check is a simple bounds check, perhaps written as (EA < base || EA >= limit), where base and limit are internal registers; when the check triggers, a hardware interrupt is raised, with the effective address available in another internal register. (As discussed, this interrupt can default to just updating base or limit, becoming mere stack size instrumentation.)
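A software model of that check, with hypothetical names just to make the description concrete, could look like this:

#include <stdint.h>

typedef uintptr_t addr_t;

static addr_t stack_base, stack_limit;  /* stand-ins for the internal registers */

static inline int stack_access_ok(addr_t ea)
{
    /* When this returns 0, the hardware would raise the interrupt,
       with ea made available in another internal register. */
    return !(ea < stack_base || ea >= stack_limit);
}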
If the compiler does not add an extra software check similar to one verifying
a >= 0 && a < 10 before
zz[a] = 1 in ataradov's example, then even the above hardware effective address check would fail to catch
int bar(char *p, int b);
int foo(int a, int b)
{
volatile char zz[10];
return bar(zz + a, b);
}
simply because the error occurs when the pointer value is constructed – it is out of bounds for the referred-to object – and the pointer p that function bar() receives will not be dereferenced using stack pointer relative addressing anyway (because a single unified address space is used).
A lot of the blame can be placed on the programmer, too. If we wanted bar to be able to detect invalid indexing, we should declare it something like int bar(char *buf, size_t len, size_t index, int b); instead. The standard C library in particular could have better interfaces. It would only take one added line of code to check that the value of a is a valid index into the zz array before constructing the pointer zz+a. And so on.
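As a minimal sketch of what I mean, using the suggested signature (the -1 error value is just a placeholder convention of mine):

#include <stddef.h>

/* Safer interface: the callee receives the buffer length and the index
   separately, so it can validate them itself. */
int bar(char *buf, size_t len, size_t index, int b);

int foo(int a, int b)
{
    char zz[10];
    /* The one added line: check that a is a valid index before
       constructing the pointer. */
    if (a < 0 || (size_t)a >= sizeof zz)
        return -1;
    return bar(zz, sizeof zz, (size_t)a, b);
}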
It is not an easy problem to solve, and it is basically impossible if the compiler developers choose not to participate.
For what it is worth, I have not found a combination of options to get gcc-7.5.0, g++-7.5.0, or clang-10, compiling C or C++, to emit even a peep of complaint about my example above. Yet, it is something that immediately catches my eye when I look at code, exactly because it so often leads to annoying bugs.
stack overflow [versus] buffer overflow [on a stack based buffer]
Very good point; I missed that myself.
Perhaps it is a good idea to remind oneself that on small microcontrollers with limited RAM, the heap and the stack are typically at opposite ends of a single contiguous block of RAM. Dynamic memory allocations reduce the space left for the stack (unless they reuse a hole left by an earlier allocation since freed), so basically we have a waterline that varies at runtime (marking the highest address of currently allocated dynamic memory) that the stack must not cross.
One reason runtime heuristics like stack canaries have such a bad time detecting this before the device has already crashed and pooped all over, is that the waterline does not stay constant: it moves (if any dynamic memory allocations are done), and it could be either a dynamic memory allocation or the stack growing that causes the waterline to be crossed.
Now, add a nasty buffer overrun – especially the kind that does not just fill an array past its allocated size, but nefariously accesses/modifies a single byte or a group of buffer entries way past the end (or the beginning) of the buffer – and you get the kind of bugnest that can cause one to decide to switch to woodworking. At least there you get to use a hammer on any bugs you see. Canaries are rather unlikely to happen to be exactly where such an access ended up modifying memory, so they may not help at all.
(That said, off-by-one errors seem to be the most common buffer overrun cases, i.e. overwriting the byte/int immediately preceding or succeeding the intended object. Those are relatively easy to catch. But the nasty ones are the jumpy ones, as they can be very hard to spot in the code. Integer promotion causing sign extension on something that was intended to be an unsigned value can be very hard to spot, and if it occurs in an index calculation, the end result can be way off. This is one reason you'll see my own code using way more explicit casts than are technically required; since the casts typically only cost human attention and do not generate extra machine code, I consider them an appropriate way to try and avoid some of those nasty kangaroo indexing bugs. A semi-related case in point: how many C programmers know or care that if they happen to have a char or int c, the proper way to test in a hosted environment whether c is a whitespace character is NOT isspace(c), but isspace((unsigned char)c)?)
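A minimal sketch of that pitfall, assuming a hosted environment and a platform where plain char is signed:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char c = '\xA0';  /* e.g. Latin-1 non-breaking space; negative if char is signed */

    /* isspace(c) would pass a negative value other than EOF to isspace(),
       which is undefined behaviour; the cast makes the value non-negative. */
    printf("%d\n", isspace((unsigned char)c));
    return 0;
}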
On an embedded architecture, it would be rather nice to have that waterline address in a special register – even one that is relatively slow to access and update – with the stack pointer crossing it causing a hardware interrupt. It would not help with the buffer underrun/overrun/overflow bugs, but it would make stack waterline crossing detection at runtime deterministic.
I can even imagine/describe a couple of C programming patterns (admittedly using setjmp()/longjmp(), which I do not like to use at all) that could set up a safe state to revert to if a waterline crossing event were to occur, so that a reboot or crash could actually be avoided in many situations. (It won't complete/revert I/O done in the meantime, so it is more about cancelling computational rather than I/O work, when that work cannot be done with the currently available stack space.)
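A bare-bones sketch of the shape of such a pattern – all names here are made up, and real code would need great care with interrupt context and volatile locals:

#include <setjmp.h>

static jmp_buf safe_state;

extern void heavy_recursive_computation(void);  /* hypothetical stack-hungry work */

/* Imagine the waterline-crossing interrupt handler arranging for this to run. */
void waterline_crossed(void)
{
    longjmp(safe_state, 1);  /* unwind back to the checkpoint below */
}

int do_work_or_cancel(void)
{
    if (setjmp(safe_state))
        return -1;  /* work was cancelled instead of crashing */
    heavy_recursive_computation();
    return 0;
}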
However, is there a case where a global variable shouldn't be declared as volatile? And could they always be volatile by default?
Declaring a variable
volatile is always safe, just potentially
inefficient.
You see, the C standards define
volatile as
Accesses to volatile objects are evaluated strictly according to the rules of the abstract
machine.
and points out in a footnote that
An implementation might define a one-to-one correspondence between abstract and actual semantics: at every sequence point, the values of the actual objects would agree with those specified by the abstract semantics. The keyword volatile would then be redundant.
Indeed, some C compilers did do just that.
A core method by which current C compilers generate much more efficient code is the assumption that if an object is not examined, and it is not volatile, its value does not matter. (This is also why you will see all memory-mapped I/O register objects in C and C++ declared volatile. If they are not, it is a bug.)
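For example, a memory-mapped register access typically looks something like this (the register name and address are made up):

#include <stdint.h>

/* Hypothetical status register; volatile is what forces every access
   to actually touch the hardware. */
#define STATUS_REG (*(volatile uint32_t *)0x40001000u)

static void wait_until_idle(void)
{
    while (STATUS_REG & 1u)
        ;  /* without volatile, the compiler could hoist this load out of the loop */
}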
The way I define volatile may not be exactly correct (in the language lawyer sense), but it is a very useful, intuitive definition and correct in the real world: it tells the compiler that the object may be concurrently modified by some other code the compiler does not know about, and therefore the compiler must not – is not allowed to – make any assumptions about it. Without volatile, one assumption a C compiler can make is, for example, that if object foo is not modified by any code the compiler knows about between sequence points X and Y, the compiler can reuse the value foo had at sequence point X at the later sequence point Y.
For example, if you have say
double doh(const double *const xref, const double *const yref, const double *const zref)
{
double result;
result = (*xref) * (*yref);
do_something_slow_1();
result += (*xref) * (*zref);
do_something_slow_2();
result += (*yref) * (*zref);
do_something_slow_3();
return result;
}
a C compiler is free to generate the same machine code as it would for
double doh(const double *const xref, const double *const yref, const double *const zref)
{
const double x = *xref;
const double y = *yref;
const double z = *zref;
do_something_slow_1();
do_something_slow_2();
do_something_slow_3();
return x*y + x*z + y*z;
}
only because the pointers do not point to
volatile doubles, and
result is only observable within its local scope (and not in
do_something_slow_n() functions).
If the pointers were declared as const volatile double *, then the compiler would NOT be allowed to do this: it would have to dereference the pointers between the calls to the do_something_slow_n() functions, to acquire the values of the referred-to objects without "caching" them across sequence points.
To see why volatile matters, just consider another thread modifying the values that xref, yref, and zref point to during the calls to the do_something_slow_n() functions. The result you obtain from the function call then depends on whether you declare the values the pointers point to volatile or not. (Declaring the pointer variable itself volatile, say const double *const volatile xref, would be silly, because it'd tell the compiler that the pointer itself may be modified by some unseen code.)
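Spelled out side by side, the two declarations differ like this:

const volatile double *p;        /* the pointed-to double may change unseen: what we want here */
const double *const volatile q;  /* the pointer itself may change unseen: silly in this case */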
In all cases, having the
volatile there means the compiler will follow the C standard abstract machine model more strictly, so if you ever find code that behaves correctly
without volatile, and incorrectly
with volatile, then that code is strange and very suspect indeed; it
must rely on the compiler to generate the code in some specific way, regardless of what the C standard says the compiler is allowed or should do in that situation. Bad, bad code, that; needs a rewrite for sure.
The final wrinkle is exactly what a sequence point is in the C standard. Fortunately, the standards have an informative annex (so not "this is what it is", but "we the standard writers believe the sequence points are these, but if the text of the standard disagrees, then the text of the standard is right and this list is wrong") stating that the sequence points are:
- Between the evaluations of the function designator and actual arguments in a function call and the actual call
- Between the evaluations of the first and second operands of logical AND (&&), logical OR (||), and the comma operator (,)
- Between the evaluations of the first operand of the conditional ? : operator and whichever of the second and third operands is evaluated
- The end of a full declarator
- Between the evaluation of a full expression and the next full expression to be evaluated. (Full expressions being an initializer that is not part of a compound literal, the expression in an expression statement, the controlling expression of an if or switch selection statement, the controlling expression of a while or do statement, each of the (optional) expressions of a for statement, and the (optional) expression in a return statement.)
- Immediately before a library function returns
- After the actions associated with each formatted input/output function conversion specifier
- Immediately before and immediately after each call to a comparison function, and also between any call to a comparison function and any movement of the objects passed as arguments to that call
according to the final published draft of the C11 standard, also known as n1570.pdf. Sequence points themselves are just the concept the C standard uses to define the order of effects. Between two sequence points, effects and side effects can occur in any order; but generally speaking, the sequence points are defined such that each useful effect or observable result or side effect is nicely bracketed between two sequence points.
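As a classic illustration of why that bracketing matters (this snippet is a textbook example, not from the code discussed here):

int sequencing_trap(int i)
{
    i = i++ + 1;  /* undefined behaviour: i is modified twice with no
                     intervening sequence point */
    return i;
}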
I anticipate that there might be some discussion looming whether a C compiler is allowed to generate the same code for the two doh() functions I showed.
Instead of getting bogged down in language-lawyerism, let's expand it a bit into a complete example we can compile and examine:
static volatile int n = 0;
void do_something_slow_1(void) { n += 1; }
void do_something_slow_2(void) { n += 2; }
void do_something_slow_3(void) { n += 3; }
static inline double doh1i(const double *const xref, const double *const yref, const double *const zref)
{
double result;
result = (*xref) * (*yref);
do_something_slow_1();
result += (*xref) * (*zref);
do_something_slow_2();
result += (*yref) * (*zref);
do_something_slow_3();
return result;
}
static inline double doh2i(const double *const xref, const double *const yref, const double *const zref)
{
const double x = *xref;
const double y = *yref;
const double z = *zref;
do_something_slow_1();
do_something_slow_2();
do_something_slow_3();
return x*y + x*z + y*z;
}
double doh1(const int ix, const int iy, const int iz)
{
const double x = ix, y = iy, z = iz;
return doh1i(&x, &y, &z);
}
double doh2(const int ix, const int iy, const int iz)
{
const double x = ix, y = iy, z = iz;
return doh2i(&x, &y, &z);
}
Using clang-10 -Wall -O2 -std=c11 -c doh.c this compiles to
doh1: doh2:
cvtsi2sd %edi, %xmm1 cvtsi2sd %edi, %xmm1
cvtsi2sd %esi, %xmm0 cvtsi2sd %esi, %xmm0
cvtsi2sd %edx, %xmm2 cvtsi2sd %edx, %xmm2
addl $1, n(%rip) addl $1, n(%rip)
movapd %xmm1, %xmm3 addl $2, n(%rip)
mulsd %xmm0, %xmm3 addl $3, n(%rip)
mulsd %xmm2, %xmm1 movapd %xmm1, %xmm3
addsd %xmm3, %xmm1 mulsd %xmm0, %xmm3
addl $2, n(%rip) mulsd %xmm2, %xmm1
mulsd %xmm2, %xmm0 addsd %xmm3, %xmm1
addsd %xmm1, %xmm0 mulsd %xmm2, %xmm0
addl $3, n(%rip) addsd %xmm1, %xmm0
retq retq
You do not need to understand AT&T syntax AMD64 assembly (which has the source on the left and the destination on the right, the opposite of Intel syntax): all you need to know is that the instructions that load the doubles from memory have the movsd -offset(%rsp), %xmmN form, and the slow function calls correspond to the addl $N, n(%rip) instructions.
Simply put, Clang-10 keeps the instruction order basically intact even without the volatile.
GCC-7.5.0 (gcc -Wall -O2 -std=c11 -c doh.c) generates
doh1: doh2:
pxor %xmm2, %xmm2 pxor %xmm2, %xmm2
movl n(%rip), %eax movl n(%rip), %eax
pxor %xmm3, %xmm3 pxor %xmm1, %xmm1
pxor %xmm1, %xmm1 pxor %xmm3, %xmm3
cvtsi2sd %edi, %xmm2 cvtsi2sd %edi, %xmm2
addl $1, %eax addl $1, %eax
cvtsi2sd %edx, %xmm3 cvtsi2sd %esi, %xmm1
movl %eax, n(%rip) movl %eax, n(%rip)
cvtsi2sd %esi, %xmm1 cvtsi2sd %edx, %xmm3
movl n(%rip), %eax movl n(%rip), %eax
addl $2, %eax addl $2, %eax
movl %eax, n(%rip) movl %eax, n(%rip)
movl n(%rip), %eax movl n(%rip), %eax
addl $3, %eax addl $3, %eax
movl %eax, n(%rip) movl %eax, n(%rip)
movapd %xmm2, %xmm4 movapd %xmm2, %xmm0
mulsd %xmm3, %xmm2 mulsd %xmm3, %xmm2
mulsd %xmm1, %xmm4 mulsd %xmm1, %xmm0
mulsd %xmm3, %xmm1 mulsd %xmm3, %xmm1
movapd %xmm2, %xmm0 addsd %xmm0, %xmm2
addsd %xmm4, %xmm0 addsd %xmm1, %xmm2
addsd %xmm1, %xmm0 movapd %xmm2, %xmm0
ret ret
which is basically identical for both, aside from register naming differences.
Language-lawyerism aside, it means that if you use GCC-7.5.0, with this kind of a code pattern, what I described in my previous post will happen to you too:
without volatile, the two versions of doh() will generate the same machine code.
The instruction pattern GCC-7.5.0 generates for updating the counter n is
movl n(%rip), %eax
addl $N, %eax
movl %eax, n(%rip)
which annoys the heck out of me. Instead of the single sane addl $N, n(%rip) that clang-10 uses, GCC does this superfluous register dance, and I cannot fathom why; I thought this kind of thing was more or less fixed a couple of major versions ago... This is also why I don't trust compilers any further than I can examine their output, and is the reason why I use extended inline assembly functions for oddball memory-mapped I/O register accesses: to ensure the exact instruction I want will be used.
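For illustration, here is a sketch of the kind of wrapper I mean, assuming an ARM Cortex-M target and GCC extended asm (the function name is mine, not from any library); the point is that the str instruction is spelled out, so the compiler cannot substitute anything else:

#include <stdint.h>

static inline void mmio_write32(volatile uint32_t *reg, uint32_t value)
{
    /* Emit exactly one 32-bit store; the "memory" clobber keeps it ordered
       with respect to surrounding memory accesses. */
    __asm__ __volatile__ ("str %1, [%0]" : : "r" (reg), "r" (value) : "memory");
}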
Nevertheless, I should be happy, because it backs up my argument. (I'm not, because I don't want to win. I want to help others write better code, and especially to write and show me better code than I myself can write, because I'm selfish and self-centered and only care about winning my past self. That dude was an asshole.)
If we replace const double *const with const volatile double *const, then clang-10 generates
doh1: doh2:
cvtsi2sd %edi, %xmm0 cvtsi2sd %edi, %xmm0
cvtsi2sd %esi, %xmm1 movsd %xmm0, -8(%rsp)
movsd %xmm0, -8(%rsp) xorps %xmm0, %xmm0
movsd %xmm1, -16(%rsp) cvtsi2sd %esi, %xmm0
xorps %xmm0, %xmm0 cvtsi2sd %edx, %xmm1
cvtsi2sd %edx, %xmm0 movsd %xmm0, -16(%rsp)
movsd %xmm0, -24(%rsp) movsd %xmm1, -24(%rsp)
movsd -8(%rsp), %xmm0 movsd -8(%rsp), %xmm1
mulsd -16(%rsp), %xmm0 movsd -16(%rsp), %xmm0
addl $1, n(%rip) movsd -24(%rsp), %xmm2
movsd -8(%rsp), %xmm1 addl $1, n(%rip)
mulsd -24(%rsp), %xmm1 addl $2, n(%rip)
addl $2, n(%rip) addl $3, n(%rip)
addsd %xmm0, %xmm1 movapd %xmm1, %xmm3
movsd -16(%rsp), %xmm0 mulsd %xmm0, %xmm3
mulsd -24(%rsp), %xmm0 mulsd %xmm2, %xmm1
addsd %xmm1, %xmm0 addsd %xmm3, %xmm1
addl $3, n(%rip) mulsd %xmm2, %xmm0
retq addsd %xmm1, %xmm0
retq
the difference being that now doh1() has exactly the behaviour we/I/the author intended.
Like I claimed, volatile stops clang-10 from generating the same code it does for doh2(). This means we can use volatile as I described in my previous post to control what kind of assumptions the compiler can make. Here, we want to do the slow calls in between accesses to the pointed-to doubles, so we need to tell the compiler the pointed-to objects are volatile, and it does what we expect it to. Nice.
As I'm writing this, I'm seriously considering switching from gcc-7.5.0 to clang-10 on at least AMD64. I didn't realize before that clang-10 output is that much better.
Anyway, here is the GCC-7.5.0 output, when function parameters are declared as const volatile double *const ref:
doh1: doh2:
pxor %xmm0, %xmm0 pxor %xmm0, %xmm0
subq $40, %rsp subq $40, %rsp
movq %fs:40, %rax movq %fs:40, %rax
movq %rax, 24(%rsp) movq %rax, 24(%rsp)
xorl %eax, %eax xorl %eax, %eax
cvtsi2sd %edi, %xmm0 cvtsi2sd %edi, %xmm0
movsd %xmm0, (%rsp) movsd %xmm0, (%rsp)
pxor %xmm0, %xmm0 pxor %xmm0, %xmm0
movsd (%rsp), %xmm1 movsd (%rsp), %xmm2
cvtsi2sd %esi, %xmm0 cvtsi2sd %esi, %xmm0
movsd %xmm0, 8(%rsp) movsd %xmm0, 8(%rsp)
pxor %xmm0, %xmm0 pxor %xmm0, %xmm0
cvtsi2sd %edx, %xmm0 movsd 8(%rsp), %xmm1
movsd %xmm0, 16(%rsp) cvtsi2sd %edx, %xmm0
movsd 8(%rsp), %xmm0 movsd %xmm0, 16(%rsp)
movl n(%rip), %eax movapd %xmm2, %xmm0
mulsd %xmm0, %xmm1 movsd 16(%rsp), %xmm3
addl $1, %eax movl n(%rip), %eax
movl %eax, n(%rip) mulsd %xmm1, %xmm0
movsd (%rsp), %xmm0 mulsd %xmm3, %xmm2
movsd 16(%rsp), %xmm2 mulsd %xmm3, %xmm1
movl n(%rip), %eax addl $1, %eax
mulsd %xmm2, %xmm0 movl %eax, n(%rip)
addl $2, %eax movl n(%rip), %eax
movl %eax, n(%rip) addsd %xmm2, %xmm0
addsd %xmm0, %xmm1 addl $2, %eax
movsd 8(%rsp), %xmm0 movl %eax, n(%rip)
movsd 16(%rsp), %xmm2 movl n(%rip), %eax
movl n(%rip), %eax addsd %xmm1, %xmm0
mulsd %xmm2, %xmm0 addl $3, %eax
addl $3, %eax movl %eax, n(%rip)
movl %eax, n(%rip) movq 24(%rsp), %rax
movq 24(%rsp), %rax xorq %fs:40, %rax
xorq %fs:40, %rax jne .L12
addsd %xmm1, %xmm0 addq $40, %rsp
jne .L8 ret
addq $40, %rsp .L12:
ret call __stack_chk_fail@PLT
.L8:
call __stack_chk_fail@PLT
If we were to pore through it, we'd find that doh1() indeed does the slow function calls (inlined) in between (re)loading the double values and using the recently (re)loaded values for the product it adds to the sum; exactly as we/I/the author apparently intended it to work.
doh2() now has completely different machine code compared to doh1(), and indeed does the slow function calls (inlined) in a batch near the end of the function.
Simply put, volatile made GCC generate machine code for doh1() with exactly the order of side effects (increments of n) we want, exactly as I claimed.
I'm just not at all happy with how inefficient the code GCC-7.5.0 generates here is. It does not detract from my argument, and kinda even reinforces the idea that no matter what the standards say, you're better off examining the actual output of your tools.
TL;DR: This long-ass examination of GCC-7.5.0 and Clang-10 output on AMD64 shows that even if my understanding of the C standard is wrong, what I described does happen in real life: it happens exactly as I described for this particular example code on AMD64. I could still be wrong, but even if I am, that only means the situation is even more complex in reality; my understanding may need fixing, but it does apply to at least this case, exactly.