Author Topic: Identify this GCC optimization (Read 3441 times)

ataradov · « **on:** October 15, 2017, 09:59:37 pm »

So I'm running into an interesting optimization GCC does. I really don't understand it, and it would be nice to see an explanation. This has been first identified on MIPS, but here I'm showing examples on ARM for simplicity.

So here is the code:

Code: [Select]

 
int main(void)
{
  int cnt = 0;
  int cnt1 = 0;

  while (1)
  {
    if (cnt++ == 2000000)
    {
      *(volatile uint32_t *)0x00020000 = cnt1++;
      cnt = 0;
    }

    //asm("nop"); // <- This line is important
  }
}

The idea is to have some delay, and write a slowly incrementing value into a memory mapped location.

When compiled with -O1, I get the following code:

Code: [Select]

00000008 <main>:
   8:	2200      	movs	r2, #0
   a:	4805      	ldr	r0, [pc, #20]	; (20 <main+0x18>)
   c:	2180      	movs	r1, #128	; 0x80
   e:	0289      	lsls	r1, r1, #10
  10:	1c03      	adds	r3, r0, #0
  12:	3b01      	subs	r3, #1
  14:	2b00      	cmp	r3, #0
  16:	d1fc      	bne.n	12 <main+0xa>
  18:	600a      	str	r2, [r1, #0]
  1a:	3201      	adds	r2, #1
  1c:	e7f8      	b.n	10 <main+0x8>
  1e:	46c0      	nop			; (mov r8, r8)
  20:	001e8481 	.word	0x001e8481

Which is basically correct.

But if compiled with a higher level of optimization (-O2, -O3, -Os), I get this:

Code: [Select]

00000008 <main>:
   8:	2300      	movs	r3, #0
   a:	2280      	movs	r2, #128	; 0x80
   c:	0292      	lsls	r2, r2, #10
   e:	6013      	str	r3, [r2, #0]
  10:	3301      	adds	r3, #1
  12:	e7fa      	b.n	a <main+0x2>

It completely eliminated the delay loop.

But if you uncomment the "nop", it does produce full code.

The logic behind this optimization is clear, I just wan to identify if there is GCC specific type of optimization that causes this. Going through all flags is probably an option, but may be someone knows this already?

nctnico · « **Reply #1 on:** October 15, 2017, 10:51:04 pm »

This link may be helpfull.
https://stackoverflow.com/questions/10809429/what-kind-of-dead-code-can-gcc-eliminate-from-the-final-output
I think you are looking at dead code removal.

ataradov · « **Reply #2 on:** October 15, 2017, 10:55:46 pm »

It is not really "dead code", since it just eliminates extra loops over the same live code. Anyway, none of the flags mentioned there did anything in this case.

hans · « **Reply #3 on:** October 15, 2017, 10:59:17 pm »

Could be many mechanics at play here, but I think a core one is:

https://en.wikipedia.org/wiki/Loop-invariant_code_motion

With the Nop you're breaking hoare logic mechanisms, as it cannot establish weakest preconditions for that assembly statement.

cv007 · « **Reply #4 on:** October 15, 2017, 11:10:34 pm »

I'm sure you know this already, but
int cnt = 0; -> volatile int cnt = 0;
since this is what you don't want eliminated

I would also suspect an asm("") would do the same as your asm("nop")

I can't answer your question on what optimization this falls under, though

ataradov · « **Reply #5 on:** October 16, 2017, 12:35:05 am »

Quote from: cv007 on October 15, 2017, 11:10:34 pm

int cnt = 0; -> volatile int cnt = 0;

Well, yes, this is obvious once you see the output.

Quote from: cv007 on October 15, 2017, 11:10:34 pm

also, forgive me if I change your code (nothing wrong with yours, but it is a little bit of a mind bender at first glance ,I'm not sure why, though but it may be I'm not too bright )-

The whole issue resulted from removing a bunch of stuff from a super loop, so that nop represents a ton of code that should not be blocked, as it would be with your change.

joshtyler · « **Reply #6 on:** October 16, 2017, 12:35:42 am »

As mentioned by cv007, the simple solution is to make cnt volatile, but it seems as though you are more interested as to why the compiler makes the optimisation.

When you write C code, the nominal goal of the compiler is to figure out what the output of your code will be and produce the fastest/smallest implementation that will achieve the intended result. The compiler can re-order your code, completely remove chunks, do whatever it wants so long as the output is unchanged.

The actual definition of the volatile keyword can be confusing, and actually slightly vague in one technicality. When it boils down the way I find helps me to picture it is to imagine that the volatile variable is being used in a printf() or scanf() call every time it is read/written to/from. The compiler must guarantee that someone looking at the output of that printf() would see an unchanged behaviour as a result of the compiler optimisations. In your case the optimised code still provides an increasing count on the volatile register, and so the compiler thinks that it has done an amazing job in allowing that to happen really quickly by removing your outer loop.

Adding the asm("nop"); is a somewhat "dangerous" way to fix the problem. GCC is obliged to put a nop at that point in the code, so it forces it to put the entire loop in there, but GCC is still perfectly entitled to move around that asm command around with other code statements in unexpected ways. It makes no difference in this case, but could trip up more complex code. The more elegant solution (only in my opinion) is to make cnt volatile, so that the language definition forces the compiler to create the delay loop you want.

There's a very good episode of Embedded with Matt Goldbolt, author of the excellent Compiler Explorer, where he discusses writing compilers, and how compilers are able to prove that they are allowed to optimise code http://embedded.fm/episodes/190.

As you suspected, if you are very keen to know exactly which GCC flag(s) enable this optimisation, then you'll probably have to go through them one at a time.

ataradov · « **Reply #7 on:** October 16, 2017, 12:38:10 am »

I know why this happens and why fixes work. I'm just interested in a specific type of optimization that did this. I'll just go over all of them once I have more time.

ataradov · « **Reply #8 on:** October 16, 2017, 01:35:36 am »

So I removed "-O" flag and added all the optimization options as a huge list. And the code is bigger that even -O1.

In fact, the code is very different between -O1 and plain listing flags that are mentioned in the documentation for -O1.

Some more digging is required, I guess.

ataradov · « **Reply #9 on:** October 16, 2017, 01:47:45 am »

Doing things in reverse, and enabling -O2, but then prefixing all options with -fno-xxx, I found that the option responsible for this is

Quote

-ftree-dce Perform dead code elimination (DCE) on trees. This flag is enabled by default at -O and higher.

So it looks like this falls under dead code for GCC, at least at some stage.

andersm · « **Reply #10 on:** October 16, 2017, 10:00:05 am »

IIRC, GCC was changed to special-case inline assembly statements with no inputs, outputs, or side-effects precisely so it won't optimize away nops. Can't find a reference for this offhand, though.

ataradov · « **Reply #11 on:** October 16, 2017, 02:53:19 pm »

Quote from: andersm on October 16, 2017, 10:00:05 am

IIRC, GCC was changed to special-case inline assembly statements with no inputs, outputs, or side-effects precisely so it won't optimize away nops. Can't find a reference for this offhand, though.

This is my experience, and I'm relying on it in day to day work.

rsjsouza · « **Reply #12 on:** October 16, 2017, 04:02:07 pm »

Quote from: ataradov on October 16, 2017, 01:47:45 am

Doing things in reverse, and enabling -O2, but then prefixing all options with -fno-xxx, I found that the option responsible for this is
Quote
-ftree-dce Perform dead code elimination (DCE) on trees. This flag is enabled by default at -O and higher.

So it looks like this falls under dead code for GCC, at least at some stage.

In my experience, I always got these types of loops suppressed when using high optimization levels - although I don't use GCC extensively in higher optimization codes, other compilers indeed treated this as a dead code.

Quote from: ataradov on October 16, 2017, 02:53:19 pm

Quote from: andersm on October 16, 2017, 10:00:05 am
IIRC, GCC was changed to special-case inline assembly statements with no inputs, outputs, or side-effects precisely so it won't optimize away nops. Can't find a reference for this offhand, though.
This is my experience, and I'm relying on it in day to day work.

In my experience, "C" optimizers will leave the function code alone the moment they find an assembly inline instruction - they don't go to the extend of evaluating what is inside the ASM code.

andersm · « **Reply #13 on:** October 16, 2017, 04:37:49 pm »

Quote from: rsjsouza on October 16, 2017, 04:02:07 pm

In my experience, "C" optimizers will leave the function code alone the moment they find an assembly inline instruction - they don't go to the extend of evaluating what is inside the ASM code.

The point of GCC's extended inline assembly syntax is to provide the compiler with some insight into the assembly code (inputs, outputs, clobbers), and let it optimize around it.

Jeroen3 · « **Reply #14 on:** October 23, 2017, 06:18:15 pm »

It's funny. I saw a delay like this on /r/programminghorror today.

ataradov · « **Reply #15 on:** October 23, 2017, 06:20:19 pm »

Quote from: Jeroen3 on October 23, 2017, 06:18:15 pm

It's funny. I saw a delay like this on /r/programminghorror today.

There is absolutely noting wrong with delays like this on their own. And in that case it is actually done right and ii is declared volatile.

It is very easy to over-complicate things because of some ideas that people consider "good", without ever questioning them.

rsjsouza · « **Reply #16 on:** October 23, 2017, 08:36:52 pm »

Quote from: andersm on October 16, 2017, 04:37:49 pm

Quote from: rsjsouza on October 16, 2017, 04:02:07 pm
In my experience, "C" optimizers will leave the function code alone the moment they find an assembly inline instruction - they don't go to the extend of evaluating what is inside the ASM code.
The point of GCC's extended inline assembly syntax is to provide the compiler with some insight into the assembly code (inputs, outputs, clobbers), and let it optimize around it.

Not at all, at least in the compilers and optimizers I work with. The "C" compiler and the "C" optimizer have zero visibility about any assembly inline and act conservatively when faced with one of these constructs.

You may be thinking about intrinsics, which are "C" constructs that have a 1:1 correlation with opcodes.

Quote from: ataradov on October 23, 2017, 06:20:19 pm

Quote from: Jeroen3 on October 23, 2017, 06:18:15 pm
It's funny. I saw a delay like this on /r/programminghorror today.
There is absolutely noting wrong with delays like this on their own. And in that case it is actually done right and ii is declared volatile.

It is very easy to over-complicate things because of some ideas that people consider "good", without ever questioning them.

+1. The code shown is perfectly fine.

ataradov · « **Reply #17 on:** October 23, 2017, 08:40:49 pm »

Quote from: rsjsouza on October 23, 2017, 08:36:52 pm

Not at all, at least in the compilers and optimizers I work with. The "C" compiler and the "C" optimizer have zero visibility about any assembly inline and act conservatively when faced with one of these constructs.

This.

GCC has the best inline assembly syntax out there, which makes integration of C and assembly easy and straightforward. But those inputs, outputs and clobbers lists are for C to adjust to the assembly, not the other way around.

The only optimization on assembly that I know of, is instruction move into the branch delay slot on MIPS. MIPS assembler from GCC can do this on its own.

andersm · « **Reply #18 on:** October 23, 2017, 09:31:56 pm »

Quote from: rsjsouza on October 23, 2017, 08:36:52 pm

Not at all, at least in the compilers and optimizers I work with. The "C" compiler and the "C" optimizer have zero visibility about any assembly inline and act conservatively when faced with one of these constructs.

You may be thinking about intrinsics, which are "C" constructs that have a 1:1 correlation with opcodes.

No, other compilers at least used to take the conservative approach that inline assembly could have completely altered the program state and eg. reload all variables from memory. GCC instead makes you annotate what the you mody, allowing the compiler to be less pessimistic around the inline assembly. GCC can also allocate registers for the inline assembly, again allowing it to produce better code.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: Identify this GCC optimization (Read 3441 times)

Share me