I'm running into an interesting optimization that GCC performs. I don't really understand it, and I'd appreciate an explanation. It was first observed on MIPS, but I'm showing the examples on ARM for simplicity.
Here is the code:
#include <stdint.h>

int main(void)
{
    int cnt = 0;
    int cnt1 = 0;

    while (1)
    {
        if (cnt++ == 2000000)
        {
            *(volatile uint32_t *)0x00020000 = cnt1++;
            cnt = 0;
        }
        //asm("nop"); // <- This line is important
    }
}
The idea is to have some delay between writes, and to write a slowly incrementing value to a memory-mapped location.
When compiled with -O1, I get the following code:
00000008 <main>:
8: 2200 movs r2, #0
a: 4805 ldr r0, [pc, #20] ; (20 <main+0x18>)
c: 2180 movs r1, #128 ; 0x80
e: 0289 lsls r1, r1, #10
10: 1c03 adds r3, r0, #0
12: 3b01 subs r3, #1
14: 2b00 cmp r3, #0
16: d1fc bne.n 12 <main+0xa>
18: 600a str r2, [r1, #0]
1a: 3201 adds r2, #1
1c: e7f8 b.n 10 <main+0x8>
1e: 46c0 nop ; (mov r8, r8)
20: 001e8481 .word 0x001e8481
Which is basically correct.
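To double-check the -O1 listing against the source, here is a minimal host-side sketch of the same loop. The function name run_loop and its parameters are hypothetical (they are not part of my firmware); the register address is passed in as a pointer and the loop is bounded, so it can run on a desktop instead of writing to 0x00020000.

```c
#include <stdint.h>

/* Hypothetical helper: same logic as the loop in main(), but bounded and
 * with the register passed in as a pointer so it can run on a host PC.
 * Returns the number of stores made to *reg. */
uint32_t run_loop(volatile uint32_t *reg, int period, uint32_t iterations)
{
    int cnt = 0;
    uint32_t cnt1 = 0;
    uint32_t writes = 0;

    for (uint32_t i = 0; i < iterations; i++)
    {
        if (cnt++ == period)   /* cnt is compared before it is incremented */
        {
            *reg = cnt1++;     /* the only observable side effect */
            cnt = 0;
            writes++;
        }
    }
    return writes;
}
```

With period = 3 the first store happens on the 4th iteration, i.e. once every period + 1 iterations. That matches the literal 0x001e8481 (2000001) that the -O1 code loads into r0 and counts down in the inner loop, and the lsls by 10 of 128 gives the 0x00020000 address.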
But if compiled with a higher level of optimization (-O2, -O3, -Os), I get this:
00000008 <main>:
8: 2300 movs r3, #0
a: 2280 movs r2, #128 ; 0x80
c: 0292 lsls r2, r2, #10
e: 6013 str r3, [r2, #0]
10: 3301 adds r3, #1
12: e7fa b.n a <main+0x2>
It completely eliminated the delay loop.
But if you uncomment the "nop", it does produce full code.
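My reading of the nop effect, for what it's worth: GCC documents that a basic asm statement is implicitly volatile, so the compiler may not delete it and the enclosing loop has to keep running. Below is a rough sketch of two ways a delay loop can be made to survive higher optimization levels under that assumption. The function names delay_volatile and delay_asm are made up for illustration, and the asm variant is a GCC/Clang extension, not standard C.

```c
#include <stdint.h>

/* Option 1: a volatile counter. Every read and write of cnt is an
 * observable access, so the compiler must perform all of them. */
uint32_t delay_volatile(uint32_t n)
{
    volatile uint32_t cnt = 0;

    while (cnt < n)
        cnt++;
    return cnt;
}

/* Option 2 (GCC/Clang extension): an empty volatile asm in the loop body.
 * It emits no instruction, but the compiler has to keep one occurrence
 * per iteration, much like the asm("nop") in the original code. */
uint32_t delay_asm(uint32_t n)
{
    uint32_t cnt = 0;

    while (cnt < n)
    {
        __asm__ volatile("" ::: "memory");
        cnt++;
    }
    return cnt;
}
```

Neither variant gives a precise delay; the time per iteration still depends on the optimization level and the target.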
The logic behind this optimization is clear; I just want to find out whether a specific GCC optimization (a pass or flag) is responsible for it. Going through all the flags one by one is an option, but maybe someone already knows the answer?