Author Topic: GCC compiler optimisation  (Read 38151 times)


Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14471
  • Country: fr
Re: GCC compiler optimisation
« Reply #100 on: August 11, 2021, 04:14:45 pm »
Could any of you experienced chaps suggest which of these are worth enabling?

-Wall
-Wconversion

This last one I bring up on a regular basis: it will warn you about all dubious implicit conversions. Very useful, especially for embedded development.
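To give a flavour of what it catches, a made-up snippet (not from any real project):

Code: [Select]
#include <stdint.h>

// With -Wconversion, GCC warns that the return implicitly narrows a 32-bit
// value to 8 bits, which may change its value.
uint8_t scale(uint32_t raw)
{
    return raw / 3;            // warning: conversion from 'uint32_t' to 'uint8_t'
}

// Once you have checked the range, make the narrowing explicit to silence it:
uint8_t scale_checked(uint32_t raw)
{
    return (uint8_t)(raw / 3);
}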

-Wextra can get you a lot of extra noise with not a lot of added value, compared to the two above. But you can experiment and see for yourself.

In any case, I suggest reading this to better understand what those flags are all about: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
 

Online ejeffrey

  • Super Contributor
  • ***
  • Posts: 3719
  • Country: us
Re: GCC compiler optimisation
« Reply #101 on: August 11, 2021, 04:22:29 pm »
Quote from: peter-h
I am using -Og now, and I get 160k (down from 225k with -O0 and max debugs) in Debug build versus 179k in Release build with -O3. How is that possible?

Not sure what size you are reporting, but why do you think this is strange?  None of those options are deliberately optimizing size, so the variation in size is incidental.

-O0 will have a bunch of extraneous and redundant data movement operations so it will be big.  -Og, -O1, and -O2 will enable a bunch of optimizations, and many of those involve removing redundant instructions or combining instructions, leading to smaller code size.  The gcc man page for -O2 (you did read that, right?) says it "performs nearly all supported optimizations that do not involve a space-speed tradeoff", although if you actually look at the list of optimizations some of them do moderately increase size.  -O3 enables (in particular) a number of loop optimizations that can increase code size.  If you want the smallest code size you need -Os.
Quote from: newbrain
As a habit, I use -Og for debug compiles: the code remains eminently debuggable, but a lot of pointelss memory/register shuffling and pushing/popping is removed.

As you should, -O0 won't give you all the warnings you want.  At least in the past, -O0 disabled reachability analysis so it wouldn't warn you about unreachable code, computed values never used, or use of uninitialized variables.  I'm not sure what the current behavior is exactly but definitely the recommendation continues to be to use -Og for debug builds to get maximum diagnostics.  I think the only reason to use -O0 is to get maximum compilation speed.
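A made-up illustration of a warning that needs the optimizer's data-flow analysis:

Code: [Select]
#include <stdint.h>

// At -O0 GCC typically stays silent here; with -Og (or higher) plus -Wall it
// can report that 'val' may be used uninitialized, because that diagnostic
// relies on analysis passes which only run when optimizing.
uint32_t pick(int cond)
{
    uint32_t val;
    if (cond)
        val = 42;
    return val;   // uninitialized when cond == 0
}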
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3697
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #102 on: August 11, 2021, 07:26:23 pm »
Very interesting.

-O2 gives me 160k, versus 179k with -O3.

I have switched the Debug build to -Og.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14471
  • Country: fr
Re: GCC compiler optimisation
« Reply #103 on: August 11, 2021, 07:30:12 pm »
With GCC at least, -O3 almost always yields larger code size. The main reason for this that I've seen is that the compiler will aggressively inline every function it can at -O3.
Up to you to determine whether -O3 is worth it performance-wise in your particular application. Sometimes it is, sometimes it's not.
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3697
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #104 on: August 11, 2021, 09:07:35 pm »
I wonder how this maps onto the 32F4 cache, which loads x bytes (?) from the FLASH at a time into zero-waitstate RAM, plus there is the instruction prefetch. That scheme is likely to run a loop (which fits into x bytes) a lot faster than unrolling the loop. Inlining functions probably is better because it would work with the prefetch.

Inlining functions is ok for small functions, obviously, but big ones will massively swell the code.

EDIT: I have just found that -Og has optimised out a huge amount of my startup code. The rest is basically running but I have e.g. a 512 byte buffer into which I read some eeprom data and then look at various values in that, and I think it stripped out all that code. It doesn't make sense to me why. Maybe putting 'volatile' in front of every variable might be worth a try. So I am back to -O0 and it works just fine.
« Last Edit: August 11, 2021, 09:51:57 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11258
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #105 on: August 11, 2021, 11:35:54 pm »
At this point putting an effort into understanding volatiles might be a better option.

If the compiler strips out half your program, you are clearly doing something wrong. Why struggle? Even if you just want to brute-force it, set seemingly related variables as volatile and see which ones really need to be volatile.

I would be very scared to work with fragile code like this.

Also, compilers will sometimes strip significant portions of the code if it invokes undefined behaviour: since one behaviour of UB code is no better than any other, the compiler might as well strip it.
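A contrived example of that, made up just to show the shape of it:

Code: [Select]
#include <stdint.h>
#include <stddef.h>

int get_first(const uint8_t *p)
{
    int first = p[0];   // unconditional dereference: UB if p is NULL
    if (p == NULL)      // so the compiler may assume this can never be true...
        return -1;      // ...and quietly delete the whole branch
    return first;
}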
Alex
 
The following users thanked this post: newbrain

Online ejeffrey

  • Super Contributor
  • ***
  • Posts: 3719
  • Country: us
Re: GCC compiler optimisation
« Reply #106 on: August 12, 2021, 01:11:04 am »
Another possibility is that the problem is not with volatile but with pointer casting abuse running afoul of the aliasing rules.  Only the hardware registers should really need to be volatile from what I can piece together.

For instance, if you have a block of memory that is zero-initialized, then load it from EEPROM via a function that takes an integer rather than a char * as a parameter, and then do some computation on it, the compiler may assume the data is still zero and optimize away everything that touches it. Peppering the code with volatile might make it appear to work, but it isn't a fix for the actual bug.
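A made-up snippet showing the shape of the aliasing problem (not your code):

Code: [Select]
#include <stdint.h>
#include <string.h>

// Formally UB: the bytes of buf have effective type uint8_t, but are read
// here through a uint32_t lvalue (and the address may not even be aligned).
// Because it is UB, an optimizing compiler can legally reorder or drop the
// access relative to the code that filled the buffer.
uint32_t read_size_bad(const uint8_t *buf)
{
    return *(const uint32_t *)&buf[4];
}

// Well-defined alternatives: assemble the value byte by byte, or memcpy().
uint32_t read_size_ok(const uint8_t *buf)
{
    uint32_t v;
    memcpy(&v, &buf[4], sizeof v);
    return v;
}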
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3697
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #107 on: August 12, 2021, 05:42:51 am »
I will investigate tonight but basically I am reading a 512 byte block from an eeprom, into a buffer on the stack. Then picking some boolean flags out of that, and based on those I do various other things.

That buffer is being incorrectly filled and is full of 0x55 which is the memory init value. The ports are all volatile-defined. ST use the __IO prefix which is #defined as volatile.

It looks like the compiler is seeing the content of the buffer as static values and deciding the booleans must be false and then removing a load of later tests on that basis.

Never seen anything like it... but it won't be hard to dig around, because when stepping you see half the variables showing as "optimised out".

Obviously I need to find out why that buffer read fails, but it is complicated code, from the ST SPI lib and the ST (or Adesto) 45DBxx lib.

Also stuff like

Code: [Select]
uint8_t ssa_buf[512];
SSA_read(0, ssa_buf);
uint32_t size1 = ssa_buf[4]|(ssa_buf[5]<<8)|(ssa_buf[6]<<16)|(ssa_buf[7]<<24);
uint32_t size2 = ssa_buf[8]|(ssa_buf[9]<<8)|(ssa_buf[10]<<16)|(ssa_buf[11]<<24);

where the two uint32s are being removed. Previously they were removed if not referenced, which is normal. They are valid statements regardless of the content of ssa_buf.

My code is damn simple. No pointers :) I do pass some parms by address but very rarely and only if necessary to do the job (like the crc accumulator in a one byte at a time crc func).

I had a quick look and I see this utterly weird thing:



The code starts off OK, but by the time I get to the bottom (the green-highlighted line) it is showing as 'optimised out'. How is that possible?

« Last Edit: August 12, 2021, 06:02:56 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1719
  • Country: se
Re: GCC compiler optimisation
« Reply #108 on: August 12, 2021, 06:19:05 am »
Please note that when you see 'optimized out' while debugging, it only means that the generated code is not reserving a real memory location for that variable; it does not mean the compiler is doing away with the statements you have written.

Classic examples are temporary variables one might use to hold intermediate results in a calculation, quite often also loop indexes etc.

That said, I would really generalize ataradov's advice about learning the meaning (semantics) of C: I'm under the impression you've been trying to run before you could walk (and plagued by having to cope with Eclipse - barf).
On how to do this, unfortunately I cannot really say: having learned bad C from Schildt's books in a distant past, my salvation came only when I took the (C89, at the time) standard and its rationale document and studied them cover to cover - but I understand that's not for everyone...

Sorry if I come across as patronizing, but the sheer number of posts does seem to indicate some basic (well, actually C  ;)) issue. Not that I don't enjoy the discussion; there is always room to learn and refine one's knowledge!
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11258
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #109 on: August 12, 2021, 06:36:01 am »
Are you sure the code is actually removed? Eclipse may just be confused by the debug information. Those variables may have just ended up in registers, so no memory is needed for them. Technically this is indicated in the debug information, but the ability of tools to interpret this information varies.

The easiest way to debug stuff like this is to "highlight" the section with nops and look in the disassembly:

Code: [Select]
asm("nop");
asm("nop");
asm("nop");
uint32_t size1 = ssa_buf[4]|(ssa_buf[5]<<8)|(ssa_buf[6]<<16)|(ssa_buf[7]<<24);
uint32_t size2 = ssa_buf[8]|(ssa_buf[9]<<8)|(ssa_buf[10]<<16)|(ssa_buf[11]<<24);
asm("nop");
asm("nop");
asm("nop");

The code for the expressions would be placed between the nops. You can look at the code and see what it is doing exactly and you will see where those values are stored.
« Last Edit: August 12, 2021, 06:40:30 am by ataradov »
Alex
 
The following users thanked this post: newbrain

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3697
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #110 on: August 12, 2021, 07:10:45 am »
OK, yes, disassembly shows the code is there; 'code' is held in a register. I am very familiar with asm :)

Amusingly, I think this is where one issue is: a delay func for use where there is no timer tick

Code: [Select]
// Hang around for delay in ms. Approximate but doesn't need interrupts etc working.

static void hang_around(uint32_t delay)
{
    uint32_t fred = 17000 * delay;

    while (fred > 0)
    {
        fred--;
    }
}

Putting volatile there makes a lot of stuff work :)

newbrain - the reason I post a lot is because I have nobody else I can ask. I am working more or less alone. I have nobody who is available as a "C consultant". I have actually tried to set something like that up (via my little business) but nobody is interested.

I paid one guy £500 to configure a server with a PHP prog running on an existing server, and he charges 50/hr for extra work, which is fine, but wanted 500/month for ongoing "support" which is way too much for what will be needed (maybe 1hr/month). Then I paid another guy ~10k to write a PHP site, from scratch, to a spec, which worked out well and he charges similarly for ongoing, but doesn't do embedded. These are all in the former Soviet Bloc, of course :) As it happens, I am looking for someone to do well defined portions of this current project too, but it is too early to do that since the API is not all done and documented.

I do a lot of googling of course, but sometimes a focused Q on a good forum is a good way, and the vast majority of stuff online is garbage... all the way to code examples which could not have ever worked. I have always used forums, for various topics, not just electronics, and run one myself. But I am learning. It is a steep curve in places. I am documenting everything too, with lots of comments in the code (most progs don't comment C at all) :)

I appreciate everyone's help, and hope that some others quietly reading this stuff might find it useful.

« Last Edit: August 12, 2021, 07:23:10 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11258
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #111 on: August 12, 2021, 07:16:15 am »
A much better version of a blocking delay:

Code: [Select]
__attribute__((noinline, section(".ramfunc")))
void delay_cycles(uint32_t cycles)
{
  cycles /= 4;

  asm volatile (
    "1: sub %[cycles], %[cycles], #1 \n"
    "   nop \n"
    "   bne 1b \n"
    : [cycles] "+l"(cycles)
  );
}
The instruction will be either sub or subs depending on the core type. The code assumes no wait states, so it is placed in SRAM. But you can obviously leave it in the flash; just adjust the division factor.
Alex
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3697
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #112 on: August 12, 2021, 07:23:57 am »
Is that better because the asm will not get modified by optimisation?

I guess fred will get put into a register...
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11258
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #113 on: August 12, 2021, 07:31:42 am »
Is that better because the asm will not get modified by optimisation?
It is better because it is predictable and does not depend on the compiler.

I guess fred will get put into a register...
Yes, but the surrounding code can be anything, and you will need to figure out the constant for the different cases.
Alex
 

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1719
  • Country: se
Re: GCC compiler optimisation
« Reply #114 on: August 12, 2021, 07:39:47 am »
OK, yes, disassembly shows the code is there; 'code' is held in a register. I am very familiar with asm :)

Amusingly, I think this is where one issue is: a delay func for use where there is no timer tick

Code: [Select]
// Hang around for delay in ms. Approximate but doesn't need interrupts etc working.

static void hang_around(uint32_t delay)
{
    uint32_t fred = 17000 * delay;

    while (fred > 0)
    {
        fred--;
    }
}

Putting volatile there makes a lot of stuff work :)

newbrain - the reason I post a lot is because I have nobody else I can ask. I am working more or less alone. I have nobody who is available as a "C consultant". I have actually tried to set something like that up (via my little business) but nobody is interested. I do a lot of googling of course but sometimes a focused Q on a good forum is a good way. I have always used forums, for various topics, not just electronics. But I am learning. It is a steep curve in places. I am documenting everything too, with lots of comments in the code (most progs don't comment C at all).
Yes, that's a typical case where an optimizer will completely remove the loop, and, as said, I'm enjoying this - so post away at your leisure and need!
(not that I have any authority to say otherwise ;D)

Maybe there's one basic concept that needs to be mentioned:
C defines an "abstract machine" that does what you tell it to.
BUT, the only way one has to check that the "abstract machine" is doing the right thing is by its side effects, observable outside of any particular function.
In this case fred is not visible or reachable outside the function, its address is never taken, and the loop does not modify any global object etc. etc.: the compiler is free to rewrite the code as fred=0 (and then throw even that away!).
Note that the time it takes to do something, or the fact that a debugger can probe the code, are (in the abstract) not of any concern to this "abstract machine".
Using the volatile type qualifier instructs the compiler to create code that (again, externally) completes all side effects before the volatile access, and completes the volatile access and its side effects before (again externally) performing other side effects.
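A minimal sketch of the difference, reusing the hang_around example quoted above:

Code: [Select]
// With a plain uint32_t the loop produces no observable side effects, so the
// implementation may drop it entirely. Qualifying the counter as volatile
// makes every access to it a side effect, so the compiler must keep the loop.
static void hang_around_v(uint32_t delay)
{
    volatile uint32_t fred = 17000 * delay;

    while (fred > 0)
    {
        fred--;
    }
}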

I've mostly refrained to quote the standard, but this is how it's described in C11, I think it's quite clear (emphasis mine):
Quote
5.1.2.3 Program execution
1 The semantic descriptions in this International Standard describe the behavior of an abstract machine in which issues of optimization are irrelevant.
2 Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment. Evaluation of an expression in general includes both value computations and initiation of side effects. Value computation for an lvalue expression includes determining the identity of the designated object.
3 [ I'll spare you the definition of sequencing ]
4 In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
5 [ signal handling ]
6 The least requirements on a conforming implementation are:
— Accesses to volatile objects are evaluated strictly according to the rules of the abstract machine.
— At program termination, all data written into files shall be identical to the result that execution of the program according to the abstract semantics would have produced.
— The input and output dynamics of interactive devices shall take place as specified in 7.21.3. The intent of these requirements is that unbuffered or line-buffered output appear as soon as possible, to ensure that prompting messages actually appear prior to a program waiting for input.
This is the observable behavior of the program.

EtA: note that for embedded programs, the first clause in 6 applies; the others might or might not (usually they don't matter much).
« Last Edit: August 12, 2021, 07:42:21 am by newbrain »
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: GCC compiler optimisation
« Reply #115 on: August 12, 2021, 01:28:55 pm »
... the vast majority of stuff online is garbage... all the way to code examples which could not have ever worked.

People copy things over. They call it "content" and use various techniques to move it to the top of Google searches. Thus, complete garbage often finds its way to the top of the results. Also, Google doesn't search for what you ask any more. Rather, they use AI to figure out what you really want to search for and show you that. Sometimes it works, but when it doesn't you're doomed.

IMHO, it's better to use datasheets and specs. The C standard may not be easy to read, but it'll give you a general idea of how things look from the standard's perspective.
 

Offline bson

  • Supporter
  • ****
  • Posts: 2270
  • Country: us
Re: GCC compiler optimisation
« Reply #116 on: August 12, 2021, 07:50:06 pm »
Is there a command-line switch to stop this interpretation of "standard" functions?  (globally, or on a per-function basis?)
Sure. "-fno-builtin" for all of them, and then there are flags like "-fno-builtin-memcpy".
Note that the arguments to memcpy() and friends are not declared volatile, so the calls may still be dropped if they're not needed.  (Or you get compile errors complaining about the loss of volatile.)  They may also have arguments declared "restrict", meaning the compiler will try to detect if they overlap.  For a function like this that would also be undesirable.

Better is to declare specific functions that embody the desired semantics since they differ from common usage, for example:
Code: [Select]
volatile void* vmemcpy(volatile void* dst, const volatile void* src, size_t len) {
   volatile char* dst0 = (volatile char*)dst;
   const volatile char* src0 = (const volatile char*)src;
   while (len-- > 0) { *dst0++ = *src0++; } // or asm volatile (...);
   return dst;
}
This is better than replacing memcpy() itself, which would only be confusing since memcpy() already has well-understood behavior.
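Hypothetical usage, with invented names, for a buffer whose final write must not be elided:

Code: [Select]
#include <stdint.h>

// If key_buf were a plain array that nothing reads afterwards, a final
// memset()/memcpy() of it could legally be dropped as a dead store. Making
// the buffer volatile and copying through vmemcpy() keeps the wipe.
static volatile uint8_t key_buf[32];

void wipe_key(void)
{
    static const uint8_t zeros[32] = {0};
    vmemcpy(key_buf, zeros, sizeof key_buf);
}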
« Last Edit: August 12, 2021, 07:54:30 pm by bson »
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3697
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #117 on: August 12, 2021, 09:58:54 pm »
I found this works exactly right

Code: [Select]
// Hang around for delay in ms. Approximate but doesn't need interrupts etc working.

__attribute__((noinline))
static void hang_around(uint32_t delay)
{
  delay *= (SystemCoreClock/4000);

  asm volatile (
    "1: subs %[delay], %[delay], #1 \n"
    "   nop \n"
    "   bne 1b \n"
    : [delay] "+l"(delay)
  );
}

for FLASH; for RAM-based code the /4000 has to be /6000. Yes, that surprised me too, since I thought RAM-based code runs at a genuine 0WS, while FLASH code runs at 0WS only if you believe the ST story about the ART :)

SystemCoreClock is set elsewhere with SystemCoreClock=168000000; because obviously the CPU can't have any way to determine its clock speed :) Well, it could use the camera interface to scan the text on the crystal and do OCR on it...

As regards other dodgy code, I wonder about this

Code: [Select]
uint32_t offset = 0;
uint32_t addr = 0x08000000;

for (uint32_t page = 4100; page <= 5119; page++)
{
    AT45dbxx_WritePage((uint8_t*)addr, 512, page*512);
    offset += 512;
    addr += 512;
}

where addr is reading the CPU FLASH. Addr is being modified in the loop so could this really be optimised out?

I then have this bit which tests the uppermost (128k) block of FLASH for erasure and programming. It references FLASH memory addresses through pointers cast to volatile, but I wonder if I am doing that correctly:

Code: [Select]


/**
  * @brief  Program word (32-bit) at a specified address.
  * @note   This function must be used when the device voltage range is from
  *         2.7V to 3.6V.
  *
  * @note   If an erase and a program operations are requested simultaneously,
  *         the erase operation is performed before the program one.
  *
  * @param  Address specifies the address to be programmed.
  * @param  Data specifies the data to be programmed.
  * @retval None
  * Waits for previous operation to finish
  *
  */

static void L_FLASH_Program_Word(uint32_t Address, uint32_t Data)
{
    // wait for any previous op to finish
    while (__HAL_FLASH_GET_FLAG(FLASH_FLAG_BSY) != RESET);
    // clear program size bits
    CLEAR_BIT(FLASH->CR, FLASH_CR_PSIZE);
    // reload program size bits
    FLASH->CR |= FLASH_PSIZE_WORD;
    // enable programming
    FLASH->CR |= FLASH_CR_PG;
    // write the data in
    *(volatile uint32_t*)Address = Data;
}


// ===== Make sure we can erase and program sector 11 - the top one =====
// If that works, the rest should work :)

uint32_t error = 0;
uint32_t data;
uint32_t address;

// Erase sector
L_HAL_FLASH_Unlock();
L_FLASH_Erase_Sector(11);
L_HAL_FLASH_Lock();

// Check it is all FFs
for (address = 0x080e0000; address < 0x080fffff; address += 4)
{
    data = *(volatile uint32_t*)address;
    if (data != 0xffffffff)
        error++;
}

// Erase sector again
L_HAL_FLASH_Unlock();
L_FLASH_Erase_Sector(11);

// Fill it with data
for (uint32_t i = 0; i < (128*1024); i += 4)
{
    L_FLASH_Program_Word(i + 0x080e0000, i);
}

// Probably always best to lock the flash again before reading it
L_HAL_FLASH_Lock();

// Check the data we have just written
data = 0;
for (address = 0x080e0000; address < 0x080fffff; address += 4)
{
    if ((*(volatile uint32_t*)address) != data)
        error++;
    data += 4;
}

On the other matter, which was variables being held in registers and thus the debugger being unable to display their values as you step through the code: am I right that this makes debugging rather useless (because half the variables you want to watch are no longer visible with -Og), and that the only way around it is to use -O0?

I will look into those mem functions. I replaced memcpy with memmove (to shuffle the 7 byte string 1 byte left i.e. src and dest overlap) and it appears to be running fine.

« Last Edit: August 12, 2021, 10:04:56 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11258
  • Country: us
    • Personal site
Re: GCC compiler optimisation
« Reply #118 on: August 12, 2021, 10:10:05 pm »
As regards other dodgy code, I wonder about this
.....
where addr is reading the CPU FLASH. Addr is being modified in the loop so could this be optimised out?

No, there is no way that could be optimized out, assuming AT45dbxx_WritePage() ends in an SPI (volatile register) access.


I then have this bit which tests the uppermost (128k) block of FLASH for erasure and programming. That is referencing FLASH memory addresses, but they are being declared volatile, but I wonder if correctly
The code looks fine to me.


On the other matter, which was variables being held in registers and thus the debug mode being unable to display their values as you step through the code, am I right that this is rather useless (because half the variables you wanted to watch are not visible anymore, with -Og) but the only way around it is to use -O0?
It depends on the debugger. I'm not sure if it is possible to make Eclipse show the value of variables held in registers. Low level debug info has this information.

But as a workaround when dealing with stuff like this, I just make dummy global variables and assign the values I need to see to those variables.
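Something along these lines (names made up):

Code: [Select]
#include <stdint.h>

// A global the optimizer must keep, so its value is always visible in the
// debugger even when the local only ever lives in a register.
volatile uint32_t debug_probe;

uint32_t read_size1(const uint8_t *ssa_buf)
{
    uint32_t size1 = (uint32_t)ssa_buf[4] | ((uint32_t)ssa_buf[5] << 8) |
                     ((uint32_t)ssa_buf[6] << 16) | ((uint32_t)ssa_buf[7] << 24);
    debug_probe = size1;   // copy the register-held value somewhere watchable
    return size1;
}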

Alex
 

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1719
  • Country: se
Re: GCC compiler optimisation
« Reply #119 on: August 12, 2021, 11:25:23 pm »
They may also have arguments declared "restrict", meaning the compiler will try to detect if they overlap.  For a function like this that would also be undesirable.
Quite the contrary.
The 'restrict' type qualifier is a contract binding the caller: the objects must not overlap*.

The compiler relies on the contract to be honoured, and this allows better optimizations.

If you break the contract, the program is broken (that is: Undefined Behaviour), there's no need for the compiler to check.

*In this simple case.
The complete definition is more complicated (C11: 6.7.3§8, 6.7.3.1).
Slightly more precisely, it says that an object pointed to by a restrict pointer is not accessed through any other pointer (for the duration of the restrict pointer's lifetime).
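A minimal sketch (not from the thread code) of what the contract buys the compiler:

Code: [Select]
// Because of restrict, the compiler may assume dst and src never alias, so it
// is free to keep values in registers, reorder accesses, or vectorise the loop.
// Calling this with overlapping buffers is undefined behaviour - nothing checks.
void add_arrays(float * restrict dst, const float * restrict src, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        dst[i] += src[i];
}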
Nandemo wa shiranai wa yo, shitteru koto dake.
 
The following users thanked this post: lucazader, SiliconWizard

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3697
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #120 on: August 13, 2021, 10:27:53 am »
Is there some way to see what the compiler is removing? I think this came up before and basically the answer was No.

I've come up against a much more complicated problem: the USB logical drive won't even format. Basically this is a removable block device implementation supplied by ST, which calls just two functions that interface onto the serial FLASH: write block (with transparent erase) and read block. The whole thing is interrupt driven, though it starts off as an RTOS thread. It works with -O0 but not with -Og. The two FLASH funcs are widely used elsewhere and appear to work fine, which leaves a large chunk of impenetrable USB code... The two funcs are these but they call other stuff, some of which is barely penetrable ST lib code

Code: [Select]
// Write a number of bytes to a single page through the buffer with built-in erase.
// 1/8/21 The last parm is actually a linear address within the device.
bool AT45dbxx_WritePage(
    const uint8_t *data,  // In  Data to write
    uint16_t len,         // In  Length of data to write (in bytes)
    uint32_t page         // In  Linear address, a multiple of 512
) {
    HAL_StatusTypeDef status = HAL_OK;

    if (len == 0) return (status == HAL_OK);

    page = page << AT45dbxx.Shift;
    at45dbxx_resume();
    at45dbxx_wait_busy();
    B_HAL_GPIO_WritePin(_45DBXX_CS_GPIO, _45DBXX_CS_PIN, GPIO_PIN_RESET);
    at45dbxx_tx_rx_byte(AT45DB_MNTHRUBF1);
    at45dbxx_tx_rx_byte((page >> 16) & 0xff);
    at45dbxx_tx_rx_byte((page >> 8) & 0xff);
    at45dbxx_tx_rx_byte(page & 0xff);
    status = B_HAL_SPI_Transmit(&_45DBXX_SPI, (uint8_t *) data, len, AT45DB_SPI_TIMEOUT_MS);
    B_HAL_GPIO_WritePin(_45DBXX_CS_GPIO, _45DBXX_CS_PIN, GPIO_PIN_SET);
    at45dbxx_wait_busy();

    return (status == HAL_OK);
}


// Read a number of bytes from a single page
// 1/8/21 The last parm is actually a linear address within the device. May have to be a
// multiple of 512, too.
// The device command used (0Bh) allows a continuous read of the whole device. This does not
// appear to be allowed here.

bool AT45dbxx_ReadPage(
    uint8_t *data,   // Out Buffer to read data to
    uint16_t len,    // In  Length of data to read (in bytes)
    uint32_t page    // In  Linear address, a multiple of 512
) {
    HAL_StatusTypeDef status = HAL_OK;

    if (len == 0) return (status == HAL_OK);

    page = page << AT45dbxx.Shift;

    // Round down length to the page size
    if (len > AT45dbxx.PageSize) {
        len = AT45dbxx.PageSize;
    }

    at45dbxx_resume();
    at45dbxx_wait_busy();
    B_HAL_GPIO_WritePin(_45DBXX_CS_GPIO, _45DBXX_CS_PIN, GPIO_PIN_RESET);
    at45dbxx_tx_rx_byte(AT45DB_RDARRAYHF);
    at45dbxx_tx_rx_byte((page >> 16) & 0xff);
    at45dbxx_tx_rx_byte((page >> 8) & 0xff);
    at45dbxx_tx_rx_byte(page & 0xff);
    at45dbxx_tx_rx_byte(0);
    status = B_HAL_SPI_Receive(&_45DBXX_SPI, data, len, AT45DB_SPI_TIMEOUT_MS);
    B_HAL_GPIO_WritePin(_45DBXX_CS_GPIO, _45DBXX_CS_PIN, GPIO_PIN_SET);

    return (status == HAL_OK);
}

I can be very sure that all my "IO" references are volatile because I only ever use the ST defs and all are prefixed with __IO which is #defined as volatile.

To be honest I am happy to just continue the project with -O0 because it runs easily fast enough. It still removes code which manifestly never gets reached, e.g. anything after a for( ;; );, but that's ok; I should not actually have any of that (except when testing, when you want to insert say an LED flash loop near the start of main.c but don't want the other 99% of the product, including the RTOS and all the timer ISRs, stripped out ;) and then you have to fool the compiler by decrementing a uint32_t around that loop).

With the other optimisation options I was seeing weird stuff where e.g. I had a variety of if() tests in a sequence, and at some point it decided that the expression being tested had to be always false, but I could not see that at all. Obviously the compiler could be way more clever than me in detecting an "impossible-true" situation, but the fact is that all those conditionals were working correctly. I think the compiler was deciding that some flags (various bytes in a 512-byte buffer read from an eeprom) would always test false. Maybe that whole buffer should have been volatile, but that would not make sense to me if it fixed it.

The bigger issue is that so much code could be broken in this way, and retesting absolutely everything one has written in 6 months is just not viable. And some 90% of this project is libs from ST and such (ETH and USB) which are practically impenetrable and take months to do any debugging on, and debugging -O issues is very slow since it involves stepping through code, which is often impossible because the code is real-time.

The code size with -O0 and -Og is 220k and 160k respectively, which is quite a difference and suggests that stuff is being removed wholesale, somewhere.

Maybe there is a specific compiler option one could try to narrow it down? Where would these be specified in Cube?

As an aside, could optimisation break code where you call a function which has say 4 parameters but the 4th one is never used inside that function?
« Last Edit: August 13, 2021, 11:33:29 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8172
  • Country: fi
Re: GCC compiler optimisation
« Reply #121 on: August 13, 2021, 12:15:57 pm »
My suggestion: always compile with -O2, -O3 or -Os. If your program does not work, don't go checking whether it would with -O0. Also don't try to work out what code is "removed" by the optimizer. Looking at the assembly listing is sometimes needed, but it shouldn't be your default strategy, either.

The reason the code does not work is that the code is wrong. It's buggy. You need to apply all the "normal" debugging strategies to find the reason. Adding logging or making use of debugger features brings you a long way.

You seem to be fixated on the idea that your program is fine, or at least "almost" fine, and it's the compiler that breaks it, so you need to kind of reverse-engineer what the compiler is doing to "break" the program and then "adjust" the program until it gets "through" the compiler. This is all wrong; don't do it. Your way of debugging the issues only amplifies this false premise.

Try more usual ways of dealing with bugs, even if they are bugs that only appear at certain compiler settings. If you decide not to mess with the settings all the time for no reason, you simply won't know which setting happens to mask a bug, and in this case that ignorance is a blessing, allowing you to treat bugs as bugs.

In other words, optimization is a tiny detail in how the compiler works, but broken code is broken regardless of compiler settings - even if some setting, like -O0, by accident seemingly fixes the problem. By focusing on the optimization settings, which are not the issue at all, you just waste your time.
 
The following users thanked this post: newbrain

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1719
  • Country: se
Re: GCC compiler optimisation
« Reply #122 on: August 13, 2021, 12:29:55 pm »
As an aside, could optimisation break code where you call a function which has say 4 parameters but the 4th one is never used inside that function?
Remember that optimization only breaks broken code - but having ST lib in the recipe, that is quite possible.
You might try disabling optimization for just those parts by changing the C/C++ properties on the relevant folders (or files) in the project view, IIRC (yes, that works, I checked).

The difference in size going from -O0 to -Og is similar to what I get on a project of similar volume (different MCU, an NXP iMX RT 1021), so no surprise there.

The read and write page functions look OK - when making some assumptions about the parts not shown.

You seem to be fixated to the idea that your program is fine or at least "almost" fine and it's the compiler that breaks it and then you need to kind of reverse-engineer what the compiler is doing to "break" the program then "adjust" the program until it gets "through" the compiler. This is all wrong, don't do it. But your way of debugging the issues amplifies this false premise.
Yes, very much this. It's a backasswards strategy.

I personally have yet to encounter a real compiler bug - of course they exist but the usual suspects (gcc, clang and even cl) are quite good.
« Last Edit: August 13, 2021, 12:37:07 pm by newbrain »
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: GCC compiler optimisation
« Reply #123 on: August 13, 2021, 01:23:06 pm »
Is there some way to see what the compiler is removing? I think this came up before and basically the answer was No.

Sure. Look at the assembler generated (the -save-temps switch in GCC). It'll show you exactly what the compiler did.

However, this is not a good idea. You cannot rely on this. The code generated may change any time. Write to the standard and let the compiler do code generation.
 
The following users thanked this post: newbrain

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3697
  • Country: gb
  • Doing electronics since the 1960s...
Re: GCC compiler optimisation
« Reply #124 on: August 13, 2021, 02:44:48 pm »
OK, yes, if I were writing something from scratch, then why not use -O2 or -O3. Then, if something doesn't run, you go straight to the last bit you did.

But here I have a load of code, most of it 3rd party libs, no support on any of it of course...

"Remember that optimization only breaks broken code - but having ST lib in the recipe, that is quite possible."

Exactly...
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

