Author Topic: GCC compiler and optimization.  (Read 5229 times)


Offline luiHS (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 592
  • Country: es
GCC compiler and optimization.
« on: December 09, 2020, 02:15:10 am »
Does anyone know the reasons why sometimes using GCC optimization makes a program stop working? I am working with MCUXpresso and Kinetis microcontrollers from NXP. This optimization thing is a real nightmare at times.

I needed to turn optimization on in one program, as I was not capturing some interrupt-managed data reliably. Even keeping the interrupt routines very lean, I had data-loss problems.

I have another, similar program that did allow me to optimize, and it worked fine.

Finally, while debugging, I traced it to a routine of mine that does the same job as millis(), and another delay(), both also based on the system SysTick. The problem is that this routine was already being used by an SD-card library I had added to my program, so I could not define it again. I simply declared the SD library's variable as extern in my program and used it in my millis() routine. This worked perfectly until I compiled with optimization; then it stopped working.

I finally got it solved, but it was a nightmare: I just had to declare these variables as "volatile". Since then I have been finding other variables with the same problem: when GCC optimization is activated, they stop updating unless I declare them "volatile".

I see that there are three numbered optimization levels, plus one to reduce program size and one for Debug. And the Debug level itself has three possible degrees of optimization.
« Last Edit: December 09, 2020, 02:17:08 am by luiHS »
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
Re: GCC compiler and optimization.
« Reply #1 on: December 09, 2020, 03:27:39 am »
During optimisation the compiler will do data-flow analysis. If that analysis does not see your code make use of a stored value in memory, it will assume that it is OK to keep that value in a register, possibly even eliminate it entirely (e.g. loop induction variables often get this treatment if you don't use them in the loop). So the value that you use in a separate chunk of code that the compiler knows nothing about, such as an interrupt routine, is actually being kept in a register, not in memory, and when you access memory you get a stale value.

The purpose of volatile is to make explicit that "this variable may be changed, or used by, code that you don't know about Mr. Compiler, so please make sure that you keep the memory value up to date and don't optimise away any memory accesses".

This is a truly fundamental thing in writing concurrent code: understanding when a shared variable can be used in two places simultaneously (if you have multiple cores or processors sharing memory) or apparently simultaneously (interrupts on the same core/processor). It's quite possible for a variable held in memory to change between when you load it into a register from memory and when you actually use its value in the register an instruction or two later. All this stuff ought to be covered in a basic computer science course on concurrency or concurrent programming. There are standard methods for avoiding these problems; proper use of volatile is part of that, as is understanding how to use mutual exclusion to ensure that a shared variable gets updated in a fashion that looks atomic to all parties that use it.

You can see some simple examples of what optimisation the LLVM compiler backend does and doesn't do in the presence of volatile over here: https://www.eevblog.com/forum/general-computing/apples-new-m1-microprocessor/msg3350102/#msg3350102.
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 
The following users thanked this post: helius, Ian.M, newbrain, dtodorov

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1721
  • Country: se
Re: GCC compiler and optimization.
« Reply #2 on: December 09, 2020, 08:45:38 am »
This optimization thing is a real nightmare at times.
Translation:
"This optimization thing is a good test to see if my code is sound"

Of course, compilers have bugs too, but they're quite rare.
Nandemo wa shiranai wa yo, shitteru koto dake.
 
The following users thanked this post: agehall, Jacon

Offline cv007

  • Frequent Contributor
  • **
  • Posts: 828
Re: GCC compiler and optimization.
« Reply #3 on: December 09, 2020, 09:07:36 am »
>This optimization thing is a real nightmare at times.

You may possibly hear some protests about this, but: just set the optimization to -Os from the start. There is no need to find out, after you have spent a month tweaking all your code, that it all starts to break the moment you want to enable more optimization. I go straight to -Os all the time, no stops in between. That way I know right away when I write something the compiler thinks is useless and optimizes away, for example, and I can correct the problem soon after I write it, not a month later when I have forgotten the details. Since I use only one optimization level, I'm also always looking at the same asm listing, not at asm code that looks different depending on the optimization level currently in use.

You can produce correct code at any optimization level (correct code works at any optimization level), but I would rather battle the compiler right away in its most aggressive state than have it let me make potential mistakes because it lacks enthusiasm.



 
The following users thanked this post: lucazader, Jacon

Offline GromBeestje

  • Frequent Contributor
  • **
  • Posts: 280
  • Country: nl
Re: GCC compiler and optimization.
« Reply #4 on: December 09, 2020, 09:17:03 am »
You can produce correct code at any optimization level (correct code works at any optimization level), but I would rather battle the compiler right away in its most aggressive state than have it let me make potential mistakes because it lacks enthusiasm.

The thing with that is, while developing I would like to be able to step through the code line-by-line. That won't work when you optimise.

But yeah, everything shared that gets modified in an interrupt should be volatile. Maybe I should add an example:

Code: [Select]
bool button_pressed_flag;
void button_pressed_Handler(void) {
  button_pressed_flag = true;
}
int main() {
  button_pressed_flag = false;
  while (1) {
    if (button_pressed_flag) {
      button_pressed_flag = false;
      toggle_led();
    }
  }
}

When optimisation is enabled, the compiler sees that button_pressed_flag is set to false and never written afterwards in any code it can see (the interrupt handler is never called explicitly). Therefore, the compiler assumes the flag will always be false, optimises the test to if (false), and just removes the whole case.
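A minimal sketch of the fix: qualify the shared flag as volatile. The loop body is factored into a helper here (the helper and LED model are made up) so the behaviour is easy to exercise:

```c
#include <stdbool.h>

/* The one-line fix: volatile tells the compiler the flag can change
   behind its back (here, in the button interrupt handler), so every
   test of the flag must actually re-read it from memory. */
volatile bool button_pressed_flag;

static int led_state;                         /* hypothetical LED model */
static void toggle_led(void) { led_state = !led_state; }

void button_pressed_Handler(void) {
    button_pressed_flag = true;
}

/* One pass of the main loop, factored out so it can be exercised. */
void poll_button(void) {
    if (button_pressed_flag) {                /* fresh read every call */
        button_pressed_flag = false;          /* consume the event */
        toggle_led();
    }
}
```

With the volatile qualifier in place, the compiler can no longer prove the flag stays false, so the branch survives at any optimisation level.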
 
The following users thanked this post: TomS_

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8185
  • Country: fi
Re: GCC compiler and optimization.
« Reply #5 on: December 09, 2020, 09:57:49 am »
97% of the time, only already-broken code "breaks" when you adjust optimization settings: some hidden bugs become visible. Do note, compiling with less optimization does not mean those bugs are absent. They are all still there and may pop up unexpectedly at any time.

About 3% of the time, you see an actual compiler bug, or a language-lawyer decision in the compiler design that makes commonly used, sane coding constructs with the right intent, but technically undefined behavior, break. Such behaviour has caused friction between Linux kernel developers and GCC developers a few times, but it isn't very common; you are unlikely to hit such a problem.

Ignore advice like "turn optimization off to save time so that you can continue writing broken code"; it's unbelievable you still hear this argument. The only way to save time in the long run is to learn to write even remotely correct code (and to admit that, as a human being, you will make mistakes, and that you can fix them and learn). Adjusting optimization settings to make broken code seemingly "work" is error-prone, risky, and subject to environmental variations such as a compiler version change, or it leaves some hard-to-reproduce bugs behind nevertheless.

C isn't trivial.

It's a good idea to keep optimization on, not only to get a program with better performance, but also to force yourself to write even remotely correct code. The problem with the "debug the shit out of the system by single-stepping everything" mindset is that, for it to work, you want to compile at a low optimization level (to make the generated assembly match the C source better, enabling decent single-stepping of C source). This detaches the system-under-debug from the actual system.
« Last Edit: December 09, 2020, 10:07:05 am by Siwastaja »
 
The following users thanked this post: newbrain, Nominal Animal

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1721
  • Country: se
Re: GCC compiler and optimization.
« Reply #6 on: December 09, 2020, 10:26:10 am »
C isn't trivial.
This should be emblazoned across every C programmer's soul in fire letters.
 

Offline GromBeestje

  • Frequent Contributor
  • **
  • Posts: 280
  • Country: nl
Re: GCC compiler and optimization.
« Reply #7 on: December 09, 2020, 11:51:03 am »
C isn't trivial.
This should be emblazoned across every C programmer's soul in fire letters.

And one thing to keep in mind: this is outside the scope of C. At least in C89, the language had no notion of things like interrupts (or of threads, when running on an OS); it assumed only the code itself can influence program flow. Newer C standards such as C11 do have a notion of threads, and thus of interrupted program flow, but prior to that there was no such notion in the language. However, volatile exists even in older C, to support memory-mapped I/O (a register in an MCU is accessed through its memory address).
 

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1721
  • Country: se
Re: GCC compiler and optimization.
« Reply #8 on: December 09, 2020, 02:08:21 pm »
At least in C89, the language had no notion of things like interrupts (or of threads, when running on an OS); it assumed only the code itself can influence program flow. Newer C standards such as C11 do have a notion of threads, and thus of interrupted program flow, but prior to that there was no such notion in the language. However, volatile exists even in older C, to support memory-mapped I/O (a register in an MCU is accessed through its memory address).
This was one of the major changes from C90/C99 to C11: the definition of sequence points was reworked quite radically (in form, if not in substance for a single thread) to accommodate some multithreading concepts, especially in "5.1.2.3 Program execution", where "sequencing" is introduced in C11.
Accordingly, many other places where C90/99 referred to sequence points changed to references to sequencing.
A typical example is "6.5 Expressions", where §2 introduces one of the most common cases of undefined behaviour:
Quote from: ISO/IEC 9899:1999
Between the previous and next sequence point an object shall have its stored value modified at most once by the evaluation of an expression. Furthermore, the prior value shall be read only to determine the value to be stored.
Quote from: ISO/IEC 9899:2011
If a side effect on a scalar object is unsequenced relative to either a different side effect on the same scalar object or a value computation using the value of the same scalar object, the behavior is undefined. If there are multiple allowable orderings of the subexpressions of an expression, the behavior is undefined if such an unsequenced side effect occurs in any of the orderings.
The description of volatile did not really change, though; it still uses sequence points!
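The classic example ruled out by both wordings is modifying the same object twice in one expression; a minimal sketch (with a well-defined rewrite alongside):

```c
/* Undefined under both wordings above: two unsequenced side effects
   on the same scalar object within one expression. */
int sketch(void) {
    int i = 0;
    /* i = i++;   undefined behaviour: i is modified twice with no
                  sequencing between the side effects */
    i = i + 1;    /* well-defined: a single side effect on i */
    return i;
}
```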
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14536
  • Country: fr
Re: GCC compiler and optimization.
« Reply #9 on: December 09, 2020, 04:03:39 pm »
This is NOT a compiler problem (at least not in the current state of the C standard; I'll elaborate on that a bit below), but a pretty common trap for embedded software developers.

As already explained by others, optimizers statically analyze code and simplify/prune things as much as they can - while retaining correct behavior (unless there is an optimizer bug of course) as defined in the C standard.

A *very* common issue while writing embedded software is with global variables that are modified inside interrupt handlers (ISRs). There are other potential cases of course, but this one is one of the most common I've witnessed. As a general rule, you should declare such variables "volatile" if they are also used outside of the ISR.

The reason why is simpler than it looks. Optimizers have NO way of knowing whether ISRs will ever be executed, and worse yet, when, because they are never explicitly called in the code. So they will ignore the fact that a variable can get modified inside an ISR while another part of the code that uses the same variable is executing. Based on that, if the analyzable code flow of some part of the code shows a variable is never modified there, the optimizer may assume said variable has a fixed value in that part of the code and prune any expression using it (basically evaluating it at compile time).

The same problem can arise outside of pure "embedded" development and ISRs: in multi-threaded code for instance, for the same reasons.

The "volatile" C qualifier basically instructs the compiler (and its optimizer) that a given variable CAN change value at any time, even if that is not apparent from statically analyzing the code, so it can't assume anything about the value of said variable. This means: every time the variable is read, the compiled code HAS to actually read it (even though it may appear never to change value), and every time it is written, the compiled code HAS to actually write it (even though the new value may appear never to be used).

This is also why, for instance, MCU registers have to be declared "volatile".
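For illustration, this is roughly how vendor headers (CMSIS style) make register accesses volatile: a fixed address cast to a volatile pointer. The register name, the "ready" bit, and the variable standing in for the hardware address are all made up here so the sketch is self-contained:

```c
#include <stdint.h>

/* Hypothetical status register. Real headers cast a fixed address,
   e.g. (*(volatile uint32_t *)0x40001000u); a plain variable stands
   in for the hardware here so the sketch can run anywhere. */
static uint32_t fake_uart_status;                          /* the "hardware" */
#define UART_STATUS (*(volatile uint32_t *)&fake_uart_status)

/* Because the access is volatile, every call performs a fresh read
   of the register; bit 0 is an assumed "data ready" bit. */
int uart_ready(void) {
    return (UART_STATUS & 0x01u) != 0;
}
```

Without the volatile qualifier, a loop like `while (!uart_ready()) {}` could legally be compiled into a single read followed by an infinite spin on the stale value.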

Finally, I said I would elaborate a bit. Knowing all this, we might think it a good idea to define an additional qualifier in the C standard for functions whose execution compilers shouldn't assume anything about (typically ISRs); if a function were declared this way, compilers would treat any variable used inside it as "volatile". (You might additionally think that, since it's often - but not always - required to declare ISRs with specific attributes, said qualifier could be implied by those attributes instead of having to be added explicitly. But all this would only cover particular cases.) It could save a bit of work (although you'd still have to remember to declare such functions appropriately), but it would be overall inefficient. C prefers giving the programmer control over things like this.

Regarding C and the "volatile" qualifier, the lesson to learn is that reading the value of (respectively assigning to) a variable is NOT guaranteed to be executed verbatim in a given context, as long as the result is the same; and since the compiler has no means of knowing what happens outside of your code, if anything outside of your code can modify a variable, then it should be declared "volatile".

That's the basic rule, but "volatile" has other uses - always with the same principles in mind. For instance, a loop incrementing a local variable but doing nothing else has a big probability of being optimized out. Such as this:
Code: [Select]
for (int i = 0; i < 100; i++) {}
If you want to make sure the loop is actually generated, you can do this instead:
Code: [Select]
for (volatile int i = 0; i < 100; i++) {}
Note that the second version will NOT be optimized out, but it's also less "efficient" (read: will usually take more cycles to execute) than you might expect: in this case the local variable "i" won't be kept in a register, so the compiled code will read, increment, and write it on the stack at each iteration. This is at least what GCC does at -O3, as far as I've seen.
 

Offline Jan Audio

  • Frequent Contributor
  • **
  • Posts: 820
  • Country: nl
Re: GCC compiler and optimization.
« Reply #10 on: December 09, 2020, 04:32:28 pm »
I agree with cv007: you should set optimization when creating a new project (so you won't forget).
Use the volatile keyword sparingly for better optimization; try to limit it, using only one or two volatile variables in your interrupt, nothing more.
Copy or process the volatile things at the beginning of your main loop, then proceed with the normal reusable code.
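A sketch of that pattern, with made-up names: the ISR writes one volatile variable, and the main loop takes a single snapshot of it before doing any further work:

```c
#include <stdint.h>

/* Written by the (hypothetical) ADC interrupt handler only. */
volatile uint16_t adc_sample;

void ADC_IRQHandler(void) {
    adc_sample = 512;   /* in real code: read the ADC data register */
}

/* Main-loop side: one volatile read, then work on the plain copy. */
uint16_t take_sample(void) {
    uint16_t copy = adc_sample;   /* the single access to the volatile */
    return copy;                  /* the rest of the code sees a stable value */
}
```

Working on the snapshot keeps the volatile accesses to a minimum, so the rest of the loop body can be optimized freely.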
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8700
  • Country: gb
Re: GCC compiler and optimization.
« Reply #11 on: December 09, 2020, 04:33:34 pm »
It's a good idea to keep optimization on, not only to get a program with better performance, but also to force yourself to write even remotely correct code. The problem with the "debug the shit out of the system by single-stepping everything" mindset is that, for it to work, you want to compile at a low optimization level (to make the generated assembly match the C source better, enabling decent single-stepping of C source). This detaches the system-under-debug from the actual system.
The biggest problem with using a low optimisation option for an MCU is that most of the time the program won't actually fit in the device. Not running at an adequate speed, and other factors, are important, but size is the killer. You end up having to strip out a lot of the code until you get to a subset that fits and still exercises the section you are trying to debug. This can be useful when you really get stuck, but it's a desperation move. Symbolic debugging for MCUs is a much-oversold feature.
 
The following users thanked this post: Siwastaja

Offline ajb

  • Super Contributor
  • ***
  • Posts: 2617
  • Country: us
Re: GCC compiler and optimization.
« Reply #12 on: December 09, 2020, 05:08:26 pm »
Volatile qualification is important, but there's another fundamental concept that needs to be understood when writing multi-threaded code, including IRQs in MCUs: atomicity.

Consider a simple variable increment operation like:

Code: [Select]
foo++;
This looks like one operation, but except in specific cases, it will be at least three:
Code: [Select]
1: load 'foo' from memory into a CPU register
2: add 1 to the register value
3: store 'foo' from register back into memory

This is a read-modify-write pattern, and happens allllll the time in code.  Consider what happens if an IRQ comes along after step 1 and before step 3 and changes the value of foo in memory.  The ISR will complete and execution will return to line 2 or 3, and then the final store in line 3 will overwrite whatever value the ISR wrote.  Similar problems can crop up in other contexts, for example:
- copying a bunch of data from one buffer to another: if an ISR is executed halfway through the copy operation it could access a buffer that contains half stale data and half fresh data. 
- updating a chunk of state in multiple parts, like variable holding a pointer to a buffer and another variable holding the size of the buffer: if an IRQ tries to access that buffer when only one of the two is updated it may overrun or underrun the buffer.
- storing or retrieving a value that is too large for the platform to access in one instruction (i.e. variables 16 bits and larger on an 8-bit platform): same issue as the buffer copy, where an interruption will leave the contents of that location in an invalid, half-updated state

These are all examples of operations that must be atomic, which is to say that once started they must be completed without interruption.  These sections are called "atomic sections" or "critical sections".  A brute-force approach to achieving this atomicity is to simply disable interrupts before the atomic section begins and re-enable them after.  Just be aware that this approach does not nest: if a function that disables and re-enables interrupts around a critical section is called inside another critical section, the inner function will re-enable interrupts before the outer critical section is complete.  Some environments/libraries provide tools for this (macros or functions) that keep track of how deeply they're nested and only re-enable interrupts when the outermost critical section is complete.  In the context of an OS, you can also use a lock or mutex, although that may not be the best solution for protecting against IRQ interruptions.  As a rule, atomic/critical sections should be kept as short as possible to avoid delaying the servicing of interrupts.
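A sketch of the nesting-safe save/disable/restore pattern. On a Cortex-M this would read PRIMASK via __get_PRIMASK() and use __disable_irq(); a plain flag stands in for the hardware state here so the idea is visible and testable:

```c
#include <stdbool.h>

/* Stand-in for the hardware interrupt-enable state (PRIMASK etc.). */
static bool irq_enabled = true;

/* Save the current state, then disable; return the saved state. */
static bool enter_critical(void) {
    bool was_enabled = irq_enabled;
    irq_enabled = false;
    return was_enabled;
}

/* Restore the *saved* state rather than blindly re-enabling, so
   critical sections nest correctly. */
static void exit_critical(bool was_enabled) {
    irq_enabled = was_enabled;
}
```

Because exit_critical() restores the state observed on entry, an inner critical section ending inside an outer one leaves interrupts disabled until the outermost section completes.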

« Last Edit: December 09, 2020, 05:10:18 pm by ajb »
 
The following users thanked this post: Siwastaja, newbrain, Jacon

Offline cv007

  • Frequent Contributor
  • **
  • Posts: 828
Re: GCC compiler and optimization.
« Reply #13 on: December 09, 2020, 10:01:20 pm »
Even when you think you have things figured out, there are still traps lurking around.

This is from another thread I originated about nRF52 and dma-

A simple function to check whether a TWI device is ready (a 2-byte register read, interested in one status bit; ignore the lack of checking of the read's true/false return value, and note this function was later replaced with something better). The TWI read is blocking in this case but still uses a universal TWI function that uses the EasyDMA. It seems simple and straightforward, but the mistake here was initializing the buffer: the compiler then 'knew' what was in that buffer, so the return value was always false and of course the device was never ready. (In reality it did not 'know' what the buffer contained; similar to an ISR, the buffer was being changed via DMA without the compiler's knowledge.)

Code: [Select]
bool isReady(){
    uint8_t buffer[2] = { 0,0 }; //init buffer for some reason
    twi.read(buffer); //twi uses dma, blocking call
    return (buffer[0] & 0x20); //return a status bit
}

If I changed the optimization level, or did almost anything to disrupt this function in any way, it would then work; but as soon as I was done 'debugging' (where everything looks/works OK) and went back to -Os, the problem would show up again. So in this case dropping optimization let me see the function more clearly, but it also did not let me see the problem. I did figure out what was happening by looking at the asm listing. A better approach can sometimes be to mark the function in question noinline, so you get to see it standalone at the optimization level you are using, although that can also introduce changes which affect what is happening. (I don't remember now if I did that; in this case the isReady() function would most likely not have shown the problem when made noinline, which would have been a good clue, but I already knew optimization was involved, I just did not notice the DMA thing.) Initializing the buffer was also not necessary, but I thought it was harmless. There is also a point with this compiler (GCC ARM) where it gives up 'knowing' what is inside an array even when you have just initialized it (in this particular case, once the buffer grew to 4 bytes, it accessed the buffer as if it didn't know the contents).

A little bit of an odd/uncommon problem, but it does show that an ISR is not the only thing that can fool a compiler.

Now, had I been using something other than -Os, I probably would have been clueless about the potential problem, and after using the code for a while I would probably have enabled -Os with the same kind of result as in the original post: all was working just fine, now it's broken, what happened?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4048
  • Country: nz
Re: GCC compiler and optimization.
« Reply #14 on: December 09, 2020, 11:44:33 pm »
This optimization thing is a real nightmare at times.
Translation:
"This optimization thing is a good test to see if my code is sound"

Of course, compilers have bugs too, but they're quite rare.

Extremely rare. If you're using gcc or llvm on a popular platform then you should basically assume there are no bugs. And especially not in the machine-independent parts, such as most optimisation.
 
The following users thanked this post: newbrain

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6306
  • Country: fi
    • My home page and email address
Re: GCC compiler and optimization.
« Reply #15 on: December 10, 2020, 12:01:29 am »
The atomic built-ins GCC provides are useful on certain architectures (extremely useful on x86 and x86-64, for example), but on ARM Cortex-M0/M1/M4 they are implemented as calls (to, I guess, a libatomic library?), so I strongly recommend against using them on ARM.  I do believe this is also the reason CMSIS implements the necessary stuff as macros, instead of relying on GCC intrinsics.

Of the C extensions GCC provides, I commonly use on ARM
  • Binary literal constants like 0b101
  • __attribute__((aligned (bytes)) and __attribute__((packed)) variable and type attributes
  • __attribute__((fallthrough)) (via a macro, such as FALLTHROUGH) statement attribute indicating when a case deliberately falls through to the next case, so I can use -Wimplicit-fallthrough compiler option when writing complex switch statements and detect when I forget a break;
  • &&label to obtain the address of a label (for finite state machines)
  • typeof, especially in trivial helper macros
  • __attribute__((section ("name"))) function and variable attributes; latter especially to collect array element definitions in multiple compilation units into a single consecutive array via the GNU linker-defined __start_name and __stop_name symbols; and otherwise for controlling linking in general
  • __attribute__((weak)) function attribute, used in library code where user is allowed to replace the library implementation with their own, simply by implementing a function with the proper signature
  • __attribute__((const)) function attribute when the function does not change the state and the return value depends only on the arguments; things like math functions
  • __attribute__((pure)) function attribute when the function return value depends only on the arguments (but is allowed to change program state)
  • Extended inline assembly especially with functions defined as static inline in header files
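A couple of the attribute extensions listed above, sketched (the FALLTHROUGH macro, struct name, and classify() function are all made up for illustration; this assumes GCC or Clang):

```c
#include <stdint.h>

/* FALLTHROUGH wraps the statement attribute, so -Wimplicit-fallthrough
   stays quiet only where the fall-through is deliberate. */
#define FALLTHROUGH __attribute__((fallthrough))

/* packed: no padding between the members, so the struct is 5 bytes. */
struct __attribute__((packed)) frame {
    uint8_t  type;
    uint32_t payload;
};

int classify(int c) {
    switch (c) {
    case 0:
        c += 1;
        FALLTHROUGH;   /* deliberately continue into case 1 */
    case 1:
        return c + 10;
    default:
        return 0;
    }
}
```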

Code: [Select]
bool isReady(){
    uint8_t buffer[2] = { 0,0 }; //init buffer for some reason
    twi.read(buffer); //twi uses dma, blocking call
    return (buffer[0] & 0x20); //return a status bit
}
Ouch; right.  To simplify: the problem here is that the twi.read() call causes the buffer to be modified by hardware (or by some other code the C++ compiler cannot see).  It is the implementation's bug, though, not yours.

A memory barrier at the end of twi.read() would fix this.  Basically, it needs to tell the compiler that the parameter buffer may now have changed.

A general compiler memory barrier would be a bit too big a hammer for my tastes.  If we have n bytes modified this way, starting at the memory pointed to by buf, then
    asm volatile ("" : "+m" (*(char (*)[n])buf));
should suffice (and generate no additional code).  It simply tells the compiler that it should consider the n bytes of data at buf to have been modified.

In source code, this should probably be hidden behind a macro, say POSSIBLY_MODIFIED(ptr, size), defined for GCC as
Code: [Select]
#define  POSSIBLY_MODIFIED(ptr, len)  __asm__ __volatile__ ("": "+m" (*(char (*)[len])(ptr)))
(barring typos).  Then, programmers only need to remember that before they access an array that may just have been modified by something, they should use that macro on it before examining it (or returning it to the caller).
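A hypothetical usage sketch of that macro, applied to the isReady() case from earlier. A stub function stands in for the DMA hardware so the sketch is self-contained; in real code the bytes would change behind the compiler's back:

```c
#include <stdbool.h>
#include <stdint.h>

#define POSSIBLY_MODIFIED(ptr, len)  __asm__ __volatile__ ("" : "+m" (*(char (*)[len])(ptr)))

static uint8_t buffer[2];

/* Stub standing in for the DMA-driven twi.read(): the real hardware
   fills the buffer without the compiler seeing the writes. */
static void fake_dma_read(uint8_t *b) { b[0] = 0x21; b[1] = 0x00; }

static bool is_ready(void) {
    fake_dma_read(buffer);
    POSSIBLY_MODIFIED(buffer, sizeof buffer);  /* these bytes may have changed */
    return (buffer[0] & 0x20) != 0;            /* status bit, as in the post */
}
```

The barrier costs no code; it only stops the compiler from constant-folding the buffer contents it thought it knew.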
« Last Edit: December 10, 2020, 12:04:38 am by Nominal Animal »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4048
  • Country: nz
Re: GCC compiler and optimization.
« Reply #16 on: December 10, 2020, 12:03:00 am »
The thing with that is, while developing I would like to be able to step through the code line-by-line. That won't work when you optimise.

omg. Why?

My PC runs over 10 billion instructions a second. Trying to figure out what's going on by single-stepping at 1 instruction per second is -- well, none of us are going to live even 3 billion seconds, so we physically can't single-step even one second of what our computers do. Even 1 ms isn't at all practical, and 1 us would be tedious as all hell.

For a microcontroller adjust by a factor of 100 to 1000, but it's still ridiculous.

I can see a student who doesn't yet understand what computers or programming languages or algorithms do single-stepping a simple function. But it's honestly not something I've done or wanted to do in the last 35 years, probably.

What *is* useful are assertions -- to tell you that something happened you didn't expect -- and an ability to optionally (based on runtime or compile time flags) log critical values, perhaps conditionally.

That stuff lets you quickly glance at perhaps hundreds of lines of output and quickly spot where something went wrong or different. Or you can use tools such as perl or python to analyze the logs.

This logging/debugging stuff can and should be left in the source code forever, just disabled in production builds.

When analysing someone else's code that I don't understand, I *add* such logging code and assertions, rather than single-stepping.

You can also use a debugger to add breakpoint actions or conditional breakpoints. This is far better than single-stepping, but suffers from two problems: 1) such things are interpreted and add a lot of overhead to execution, and 2) they are transient, and the next person to debug the code (which might be you) has to recreate them. I'd only do this if the compile and download/flashing time is very long.
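One way to leave such logging in the source permanently, switched at compile time. The macro name, the disable flag, and the buffer sink are all made up; the sketch logs into a buffer rather than stderr so it is easy to inspect:

```c
#include <stdio.h>

static char log_buf[256];   /* stand-in sink; real code might use stderr or a UART */

#ifndef LOG_DISABLED        /* define LOG_DISABLED in production builds */
#define LOG(...)  snprintf(log_buf, sizeof log_buf, __VA_ARGS__)
#else
#define LOG(...)  ((void)0)
#endif
```

Compiling with -DLOG_DISABLED makes every LOG() call vanish at no runtime cost, so the instrumentation can stay in the source forever.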
 
The following users thanked this post: Siwastaja

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4048
  • Country: nz
Re: GCC compiler and optimization.
« Reply #17 on: December 10, 2020, 12:11:58 am »
Code: [Select]
bool isReady(){
    uint8_t buffer[2] = { 0,0 }; //init buffer for some reason
    twi.read(buffer); //twi uses dma, blocking call
    return (buffer[0] & 0x20); //return a status bit
}

If I understand you, you're saying the problem is that with optimisation the twi.read() function gets inlined.

The actual problem will be not using volatile.

Either the twi.read() function should take its argument as "volatile uint8_t*" (forcing you to also declare buffer that way, which can make processing it later slower than necessary), or else twi.read() should cast to volatile inside.
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6306
  • Country: fi
    • My home page and email address
Re: GCC compiler and optimization.
« Reply #18 on: December 10, 2020, 12:22:20 am »
Understanding what the code does is not that hard.  Understanding why the code does what it does, is sometimes hard.

Tracking program state pinpoints where one's understanding of what the code should do diverges from what the code actually does.
Single-stepping is just one way you can make sure your mental picture of the program state matches the actual program state.

So, we don't actually debug code that often; I'd say only compiler writers and inline assembly writers really do that, and those who suspect a compiler bug.
Usually we debug unexpected program state changes (and the call chains due to the program state).

or else tsi.read() should cast to volatile inside.
What matters is that the compiler understands that the buffer contents may have changed.  I'm not sure that aliasing the buffer through a pointer to volatile data suffices.  I definitely cannot decipher the C++ standard well enough to have an opinion on whether it should suffice or not.  (Technically, it is a matter of whether having a reference to volatile data is enough to indicate a possible change of those contents in the original block.)

Is there a reason you don't like the inline assembly mark-this-region-as-possibly-modified approach?

 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14536
  • Country: fr
Re: GCC compiler and optimization.
« Reply #19 on: December 10, 2020, 12:57:03 am »
I'm not sure about C++, but in C, I think defining whole buffers as volatile works as follows: for statically (or stack-) allocated buffers, just declare the array with the "volatile" qualifier, and the whole array will be considered volatile. But if the buffer is dynamically allocated, referring to it through a pointer declared like this should, or may, work (again, I'm not 100% sure, but it seems reasonable to me):
Code: [Select]
volatile Basetype *pointer;
The idea is, as when using the "const" qualifier, all dereference accesses with such a pointer will be considered volatile. But in this case, all pointers referencing said buffer should be declared like this, because there is no way you can instruct the compiler that the allocated memory block itself should be considered volatile. (Additionally, such a pointer will also be compatible with a volatile *array*.)

Which reminds me (I think) of an earlier thread about a similar issue, with a piece of code using memset() or memcpy() from the standard library. In that case it was unsolvable as-is, because those functions do not have their pointer parameters declared volatile in their prototypes, so the compiler could still optimize the calls out. The solution was to define your own memset() (et al.) with volatile-qualified parameters.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4048
  • Country: nz
Re: GCC compiler and optimization.
« Reply #20 on: December 10, 2020, 01:45:34 am »
Is there a reason you don't like the inline assembly mark-this-region-as-possibly-modified approach?

No, that looks fine. I came to that after writing my message. I consider that a (possibly superior) way to get the compiler to notice the variable is volatile -- but at a point in time, not always.

Ideally it should be used inside twi.read(), but if the person who programmed that was silly, and you can't fix it for some reason, then after it will be ok too.

The problem only occurs if it is inlined -- or possibly, in some cases, if it is visible to the compiler at the same time, so the compiler analyses it in the context of the call site even though it decides not to inline it.

I think some compilers are now too eager to inline code. If it's the only use, then fine, but if it's used multiple times then it should only be inlined if that actually reduces code size -- which includes not just the call site itself but also moving arguments and results to/from other places, and possibly turning the caller into a leaf function that no longer needs to build a stack frame or save/restore registers.

Some compilers will inline a function because the function can be simplified because certain arguments are constants. Often with modern branch predictors that's not worth it even for speed because the branches will be predicted correctly based on the different callers. If specialization does help the speed it's often better to clone a specialized version of the function, instead of inlining it, especially if it will have several callers, but few compilers do that.

It *is* becoming more common for compilers with -Os or -Oz to find repeated code sequences in a program (regardless of the origin) and "outline" them -- that is, extract them into a function. That needs to be done at a relatively high level -- before register and instruction selection -- to be effective.
 

Offline cv007

  • Frequent Contributor
  • **
  • Posts: 828
Re: GCC compiler and optimization.
« Reply #21 on: December 10, 2020, 02:05:44 am »
>If I understand you, you're saying the problem is with optimisation the twi.read() function gets inlined.

The problem was that below some optimisation level the compiler just did as instructed and used the buffer contents to determine the return value after the twi read (the read is done in any case, since volatile twi registers are involved). Above some optimisation level (I don't remember which level it started at, but I use -Os, so that was where I started), the compiler saw that nothing would change in the buffer as far as it could see: although the twi.read went through its process, the compiler saw nothing that would change the buffer.

So 'inline' is not involved, and contrary to what I said previously, if I had forced the function to be noinline (so it's a called function, where I can get a better look at it as a single thing at the same optimisation level), it would still exist because of the volatile twi register(s) used in the read function, but I would/should then have seen a return value of always false (the buffer not used in the last line, as the compiler already knows 0 & anything is 0). I went through the asm and discovered the buffer was not being used to produce the return value, and still scratched my head for a while until 'dma' entered my mind.


>The actual problem will be not using volatile.
>Either the twi.read() function should take its argument as "volatile uint8_t*" (forcing you to also declare buffer that way, which can make processing it later slower than necessary), or else twi.read() should cast to volatile inside.

The simplest solution was to not initialize the buffer, so the compiler had no way of knowing what it could contain. I think I already had a template in place for the twi function so I could pass various types that convert to an address for the dma to use, so I could also have made the buffer volatile (which was my first solution). Wanting a dma buffer marked volatile is probably quite rare, as you really don't want the compiler to access every buffer byte unnecessarily later, when processing the buffer in some way (casting is also not desired by me, unless there are no other options). It also turns out the gcc arm compiler appears to 'give up' trying to determine what an array may contain once its size exceeds 3-4 bytes (in what I tested, anyway), so the problem was quite a narrow one.

Not hard to fix once the problem is known, but the point I was trying to make is that there can be subtle things besides ISRs that trip you up. In this case, I called a function to fill in a buffer and didn't expect the compiler to conclude that nothing in the buffer would change. I just didn't see through the read function, which on its own was not actually writing to the buffer (so, unknown to the compiler). Rare conditions, but now I know DMA is also a possible way of hiding things from the compiler, and I will add it to my collection.


edit- I may have been wrong about that 3-4 byte thing-
https://godbolt.org/z/Ynoqzn
not sure what gcc version I was using earlier, but this seems to indicate the compiler can figure things out more than I had given credit.
« Last Edit: December 10, 2020, 02:22:46 am by cv007 »
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6306
  • Country: fi
    • My home page and email address
Re: GCC compiler and optimization.
« Reply #22 on: December 10, 2020, 02:22:56 am »
Code: [Select]
volatile Basetype *pointer;
Yes, but in this particular case, the problem is more like
Code: [Select]
{
    char buffer[2] = { 0, 0 };
    {
        volatile char *const ptr = buffer;
        // Use ptr to set up  machine registers for DMA, but do not explicitly dereference it.
        // Wait for DMA to complete
    }
    // Is the compiler allowed to assume buffer[0] == 0 and buffer[1] == 0 ?
}
The compiler sees a pointer to the volatile contents of the buffer, but never sees it dereferenced; does that mean that in the outer block it has to assume the buffer may have changed?  The aliasing rules, and the C++ standard in general, are not something I want to pore through to find out (and I suspect the answer is that it's implementation-defined or some such).

My suggested fix goes like this:
Code: [Select]
{
    char buffer[2] = { 0, 0 };
    {
        // Use buffer to set up  machine registers for DMA, but do not explicitly dereference it.
        // Wait for DMA to complete
        asm volatile ("": "+m" (*(char (*)[2])buffer));
    }
    // The compiler knows the memory (storage representation of the buffer array) may have changed,
    // so it is not allowed to assume buffer[0] == 0 and buffer[1] == 0.
}
The inline assembly statement is the only way I know of to ensure the compiler knows the data may have changed without generating any extra code.  (A call to an externally defined function that takes the buffer as a parameter would also work, but that would be extra code, even if the call simply returned immediately.)

Of course, this only works with GCC (and other C/C++ compilers that support the same inline assembly syntax; I think clang at least).

The simplest standards-compliant way is to mark the buffer volatile,
Code: [Select]
{
    volatile char buffer[2] = { 0, 0 };
    {
        // Use buffer to set up  machine registers for DMA, but do not explicitly dereference it.
        // Wait for DMA to complete
    }
    // The compiler is not allowed to assume buffer[0] == 0 and buffer[1] == 0?
}
(or equivalently, the buffer content references using pointers that declare the pointed-to data volatile).

If the method – the inner code block above – were compiled in a different compilation unit, there would be no problem, because the compiler would not be allowed to make any assumptions about the array contents across the method call.

However, when compiled in the same unit, where the compiler sees the full implementation, I am not certain that even making the method parameter a pointer to volatile data suffices, because the volatility of the data may not propagate across the aliasing to the enclosing block.  In other words, with a pointer to volatile data as its argument, the method body certainly would not be allowed to make any assumptions about the buffer contents, but whether compilers propagate that to the caller is ... questionable.  In the past, this has been exactly the point where idiotic interpretations of the standard have occurred, leading to compiler behaviour that is the opposite of what compiler users actually want, need, and assume.

You could say that it is obvious that the only thing that makes sense in practice is for the volatility to effectively do what the inline assembly does, tell the compiler that it no longer can make any assumptions about the contents of that buffer, but .. you know.  Standards.

I do expect the standards committees to talk about this at some point, because this is so closely related to atomicity, barriers, aliasing, etc. they've waffled on for years.  Something about invalidating compiler assumptions about exact contents of variable storage representation or memory ranges.
« Last Edit: December 10, 2020, 02:28:47 am by Nominal Animal »
 

Offline cv007

  • Frequent Contributor
  • **
  • Posts: 828
Re: GCC compiler and optimization.
« Reply #23 on: December 10, 2020, 03:16:49 am »
>The simplest standards-compliant way is to mark the buffer volatile,

This is equivalent to what I did-
https://godbolt.org/z/e8cMaz
just did not initialize the buffer so the compiler has no knowledge about its contents. Since there is no need to initialize a dma buffer to receive data, it is not necessary and solves the problem.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4048
  • Country: nz
Re: GCC compiler and optimization.
« Reply #24 on: December 10, 2020, 06:19:11 am »
but now I know dma is also a possible source of hiding things from the compiler and will add it to my collection.

DMA, other CPU cores, other threads, ISRs -- they are all the same thing. They can change global data between arbitrary instructions. Even if you load from the same memory location in two consecutive instructions, you may get a different value each time.

Disabling interrupts will prevent ISRs and other threads on the same CPU core, but it doesn't stop other cores or DMA.

Anything that can be modified by any of those -- or that controls any of those -- needs to be volatile, or marked as unknown using Nominal Animal's technique at appropriate times.
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 3918
  • Country: gb
Re: GCC compiler and optimization.
« Reply #25 on: December 10, 2020, 08:15:41 am »
I have recently worked on the control boards of a couple of industrial embroidery machines. There are hardware mailboxes and more than one CPU per board, and even the peripherals make massive use of DMA.

In their projects, I see that the Japanese engineers isolated the "critical code" into blocks that begin with #pragma-like directives (unfortunately specific to the proprietary toolchain they use) to locally disable all optimizations inside a block of statements.

#disable optimization
----- critical code
#enable optimization

With GCC, I have recently seen a workaround on a quad-copter drone, between the mission board (Linux-based) and the flight board (RTOS-driven): the two boards talk via DMA over shared RAM with hardware semaphores, and someone isolated the critical code into modules, forcing each module in the Makefile to be compiled without any optimization.

So the project has global CFLAGS applied to all the C files, but the critical modules override them locally in the Makefile with an explicit "no optimization here, please".

Sugar solution.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 3918
  • Country: gb
Re: GCC compiler and optimization.
« Reply #26 on: December 10, 2020, 08:28:59 am »
Someone also described all the pointers to the shared RAM as "opaque" (I mean, just declared as extern but never defined), and moved the definition into the linker script.

This "seems to work", probably because this way the C compiler doesn't assume anything: it just knows there is *somewhere* an object of a certain data type and size, but it cannot optimize anything away, while the linker script describes the data as a bare starting address, length, section kind and alignment, keeping the memory layout consistent.
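A sketch of how that technique typically looks (names, address, and the 256-byte size are invented for illustration; the exact syntax depends on the toolchain):

```c
/* C side: only a declaration, never a definition, so the compiler knows
   the type and size but can assume nothing about the contents. */
extern volatile unsigned char shared_mailbox[256];

/* Linker script side (in the .ld file), pinning the symbol to the
   shared-RAM region:

       shared_mailbox = 0x20004000;
*/
```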

Interesting  :D
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline TomS_

  • Frequent Contributor
  • **
  • Posts: 835
  • Country: gb
Re: GCC compiler and optimization.
« Reply #27 on: December 10, 2020, 10:18:48 am »
omg. Why?
I single-step my code quite often, maybe the first couple of times after initially writing it, while I verify that it actually does what I told it to do, and iron out any issues I might have introduced or logic I mixed up.

Quote
But it's honestly not something I've done or wanted to do in the last 35 years probably.
That's all well and good for you, but some people find single-stepping to be a very useful tool.

Logging is a lot easier to implement on a PC where you can simply open a file and dump stuff into it without potentially taking up real percentages of flash/RAM.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4048
  • Country: nz
Re: GCC compiler and optimization.
« Reply #28 on: December 10, 2020, 12:22:52 pm »
Logging is a lot easier to implement on a PC where you can simply open a file and dump stuff into it without potentially taking up real percentages of flash/RAM.

If there's a communications channel for a debugger then there's a communications channel for logging.
 
The following users thanked this post: Jacon

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8185
  • Country: fi
Re: GCC compiler and optimization.
« Reply #29 on: December 10, 2020, 03:39:30 pm »
Getting to single-stepping is so easy and quick; that's the whole point. You press some F key in an IDE, your program runs, you hit the single-step key, and you see how your program flows. Great for beginners; later it boils down to laziness and lack of skill, and finally, there's absolutely nothing wrong with that. Sometimes having just the hammer suffices. You can do a surprising amount with just a hammer.

You just hit the limits of what you can do with it, but often it's enough. The trap is that you think you have the best thing since sliced bread, while at some point you are wasting more time on substandard tools simply because they were easier to get running initially.

It's all fine until these single-steppers start claiming their tools are superior to anything else, and using a screwdriver to drive a screw is "unprofessional" and "stupid" because they are so brilliant they can use a professional, nicely wrapped hammer. At that point, I just chuckle to myself and ignore them.
 

Offline nfmax

  • Super Contributor
  • ***
  • Posts: 1562
  • Country: gb
Re: GCC compiler and optimization.
« Reply #30 on: December 10, 2020, 04:33:14 pm »
It can be quite useful to single-step through a convoluted piece of if-encrusted logic with pencil and paper to hand - but this does not have to be done on the target build. Code like that I will sometimes build into a simple test harness I can run under a full-featured IDE on the development computer. Single-stepping on the target, when the hardware is making things happen under you at uncontrollable speed and timing, is much more problematic.
 

Online voltsandjolts

  • Supporter
  • ****
  • Posts: 2306
  • Country: gb
Re: GCC compiler and optimization.
« Reply #31 on: December 10, 2020, 04:40:18 pm »
The thing with that is, while developing I would like to be able to step through the code line-by-line. That won't work when you optimise.
omg. Why?
<snip>....honestly not something I've done or wanted to do in the last 35 years probably.

Never single stepped out of a trap to see where it occurred?

Edit: fixed quote.
« Last Edit: December 10, 2020, 04:58:48 pm by voltsandjolts »
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14536
  • Country: fr
Re: GCC compiler and optimization.
« Reply #32 on: December 10, 2020, 04:43:12 pm »
Code: [Select]
volatile Basetype *pointer;
Yes, but in this particular case, the problem is more like
Code: [Select]
{
    char buffer[2] = { 0, 0 };
    {
        volatile char *const ptr = buffer;
        // Use ptr to set up  machine registers for DMA, but do not explicitly dereference it.
        // Wait for DMA to complete
    }
    // Is the compiler allowed to assume buffer[0] == 0 and buffer[1] == 0 ?
}
The compiler sees a pointer to volatile contents of the buffer, but not it being dereferenced; does that mean that in the outer block, it has to assume the buffer may have changed? 

Your example indeed won't "work". Declaring the pointer to volatile at the level you did here is useless, as you understood.
If you're using the buffer outside of this block and want the compiler to assume it may have changed, you need to use a pointer to volatile at the level of the outer block to dereference it.
Based on your example, the correct way of doing it would be:
Code: [Select]
{
    volatile char buffer[2] = { 0, 0 };
    {
        volatile char *const ptr = buffer;
        // Use ptr to set up  machine registers for DMA, but do not explicitly dereference it.
        // Wait for DMA to complete
    }
    // Is the compiler allowed to assume buffer[0] == 0 and buffer[1] == 0 ?
}

Alternatively, you can declare the array non-volatile and access the content via a pointer to volatile (but of course at the outer level), such as this:
Code: [Select]
{
    char buffer[2] = { 0, 0 };
    {
        char *const ptr = buffer;
        // Use ptr to set up  machine registers for DMA, but do not explicitly dereference it.
        // Wait for DMA to complete
    }
    // Is the compiler allowed to assume buffer[0] == 0 and buffer[1] == 0 ?
    volatile char *ptr = buffer;
    ... ptr[0] == 0 ...
}
« Last Edit: December 10, 2020, 04:47:14 pm by SiliconWizard »
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6306
  • Country: fi
    • My home page and email address
Re: GCC compiler and optimization.
« Reply #33 on: December 10, 2020, 06:06:56 pm »
If you're using the buffer outside of this block and want the compiler to assume it may have changed, you need to use a pointer to volatile at the level of the outer block to dereference it.
Yes, exactly.  (I only quoted the above small part, but I do believe we are in full agreement.)

GCC already has for example a built-in to describe known pointer alignment, __builtin_assume_aligned(pointer, alignment [, offset ] ).  This generates no code, just changes the compiler assumptions about how the pointer (target/value) is aligned.

I expect a similar built-in to be provided, say __builtin_assume_modified(pointer, size), initially through the ARM GCC efforts, because this is becoming more and more of a problem on embedded targets.  It simply tells the compiler to invalidate all its existing assumptions about the contents of the referred-to memory region, and does not generate any code (no hardware read or write barriers or such).  It fixes the issue without extra side effects.

An "implementation" of such built-in is trivial, because we can do it with GCC already (since version 3, probably earlier), via
Code: [Select]
#define __builtin_assume_modified(ptr, len)  __asm__ __volatile__ ("": "+m" (*(char (*)[len])(ptr)))
and the syntax itself is explicitly shown in the GCC Extended Asm documentation.  (The reason for making it a built-in is twofold: being a built-in encourages compatibility across compilers; and being documented at the source would make it easier to point it out and get embedded libraries and frameworks to use it where needed.  As it is, the macro is a side effect of extended inline assembly, and happens to rely only on the "m" output constraint, which is common to all architectures.  Having something more explicit for the task would be clearer for all.)

This is explicitly useful for DMA, and for any other mechanism where the storage representation really isn't volatile in any sense, only modified once by a mechanism invisible to the compiler.  In typical hosted C and C++ environments this is usually not an issue, because being compiled in a separate unit provides a similar barrier for the compiler assumptions of the contents; but in embedded and microcontroller environments, where everything is often compiled in the same unit, we do actually need something like this.

It might be worth it to talk to the arm gcc folks about this, actually.  I'm just a nobody, and don't exactly relish telling others what they should do to support their users better, but if somebody already has contacts with them, consider pushing this upstream a bit.
« Last Edit: December 10, 2020, 06:09:20 pm by Nominal Animal »
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 3918
  • Country: gb
Re: GCC compiler and optimization.
« Reply #34 on: December 10, 2020, 06:28:13 pm »
What about the above two points I posted?
Too bad? or may be interesting alternatives?
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6306
  • Country: fi
    • My home page and email address
Re: GCC compiler and optimization.
« Reply #35 on: December 10, 2020, 06:49:06 pm »
My opinion only:

Using the linker symbol is roughly equivalent to compiling in a separate unit.  It works, but it is complicated, and hides the actual issue from human programmers.  Future programmers will have to re-discover the problem and the solution for themselves.

Using pragmas to modify optimizations to get code to work like you want is a hack.  It hides the problem by making the compiler stupid.

I like the assume-modified marking approach, because it fits the C and C++ standard models, is easy to understand ("okay compiler, this region of memory may have changed, so don't make any assumptions about its contents, okay?"), and is simple to implement.

Others may disagree.
 
The following users thanked this post: DiTBho

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4048
  • Country: nz
Re: GCC compiler and optimization.
« Reply #36 on: December 10, 2020, 09:40:27 pm »
The thing with that is, while developing I would like to be able to step through the code line-by-line. That won't work when you optimise.
omg. Why?
<snip>....honestly not something I've done or wanted to do in the last 35 years probably.

Never single stepped out of a trap to see where it occurred?

It's pretty rare that I use an interactive debugger. On embedded systems there's often a lot of background processing, DMA, and network things that will time out, which means it's simply not practical to sit in a debugger staring at the screen, because everything else falls to pieces while you do that.

If I *was* in that situation I might set a breakpoint on the RFI instruction and then single-step *once*, if that was supported in that environment.

But more likely I'd look at what the interrupt stored on the stack or in the ExceptionPC CSR or wherever it is stored on that architecture and just read the exception PC from there.

Or log it at the start of the interrupt handler for later analysis.
 

