Author Topic: XC32/GCC C-to-ASM instruction ordering issue (Read 7282 times)

Chris C · « **on:** July 22, 2015, 11:21:29 pm »

In some parts of my code, I use short critical sections (all interrupts disabled) to avoid concurrency issues. But I find the C compiler is extending the time spent in critical sections, by putting things inside it that don't need to be there. Here's a specific example:

Code: [Select]

#define ZINLINE __attribute__((always_inline)) static __inline__

ZINLINE void MultiBitClear16(UI16 volatile * addr, UI16 mask) {
    mask=~mask; 
    UI32 _mbcs=__builtin_disable_interrupts(); 
    *addr&=mask;
    _mtc0(_CP0_STATUS, _CP0_STATUS_SELECT, _mbcs);
};   

ZINLINE void BitClear16(volatile UI16* addr, UNATIVE bit) { MultiBitClear16(addr,(1<<bit)); };

Calling BitClear16 and disassembling the code generated, I find that two steps (shift and invert) occur inside the critical section, that could have been done before it:

Code: [Select]

DI V1  // save status register and disable interrupts
EHB
LW A1, -32720(GP)  // load addr
SLLV A0, S3, V0  // mask=1<<bit
NOR A0, ZERO, A0 // mask=~mask
AND A0, A0, A1 // addr&=mask
SW A0, -32720(GP) // store addr
MTCO V1, Status  // restore prior status
EHB

And depending on the context in which BitClear16 is called, I suspect it could be worse. For example, if I called:

Code: [Select]

BitClear16(addr,bit+1);
There is nothing preventing the compiler from opting to do the +1 inside the critical section as well. It seems like ensuring operations occur in a particular order would be a common need, so I did some research, and found reference to this being done with a "compiler barrier":

Code: [Select]

asm volatile ("" ::: "memory")
But this doesn't help me. The offending instructions are performed on registers, so clobbering memory doesn't help. (I tried just to be sure.)

Is there some other hint I can provide to the compiler to prevent this? Preferably other than marking MultiBitClear16 with the 'noinline' attribute (adds overhead), or coding it in pure ASM (adds headaches)?

BloodyCactus · « **Reply #1 on:** July 23, 2015, 12:58:03 pm »

my guess is that the C standard does not say if you should resolve the expression as parameter before calling the function or during, but... since the function is inlined, it makes sense to resolve the expression inline with the function, so I dont see anything wrong with what the compiler is doing.

if your using xc32, the pic32 has a 5 stage pipeline, I dont know if they carried over pipeline reordering, but they do have the mips delay slots tho.

you might want to read section 50.4 on http://ww1.microchip.com/downloads/en/DeviceDoc/61192A.pdf regarding the hazards that trigger for reading/writeing to CP0 and see if any of those affect you.

Quote

It seems like ensuring operations occur in a particular order would be a common need, so I did some research, and found reference to this being done with a "compiler barrier":

well the problem is, this is a modern cpu core with full pipeline and a instruction prefetch cache and the compiler thinks it knows best when it doesnt

Quote

Is there some other hint I can provide to the compiler to prevent this? Preferably other than marking MultiBitClear16 with the 'noinline' attribute (adds overhead), or coding it in pure ASM (adds headaches)?

I would say, resolve your expression yourself before calling the inlined function.

Code: [Select]

uint32_t foo = 1 << bit;
MultiBitClear16(addr,foo);

you could maybe mess with trying to do trigger a branch delay slot.

you could also try '-fno-delayed-branch' but I dont know if that is active on xc32 gcc

there is also a ".set noreorder" instruction with a ".set reorder" to re-enable it you can try wrapping it with those asm statements.

the easiest thing to do I expect would just be, resolve your expression before passing it as a parameter.

Jeroen3 · « **Reply #2 on:** July 23, 2015, 01:08:03 pm »

What happens when you do this:

Code: [Select]

ZINLINE void MultiBitClear16(UI16 volatile * addr, UI16 mask) {
    mask=~mask; 
    UI32 _mbcs=__builtin_disable_interrupts(); 
    {
    *addr&=mask;
    }
    _mtc0(_CP0_STATUS, _CP0_STATUS_SELECT, _mbcs);
};

The alternative is to write MultiBitClear16 in asm completely.

Kalvin · « **Reply #3 on:** July 23, 2015, 01:13:29 pm »

What happens if you declare mask and _mbcs as volatile?

edavid · « **Reply #4 on:** July 23, 2015, 03:19:49 pm »

What is the load latency of the XC32? If it's 2 cycles or more, those instructions won't affect the interrupt latency.

Kalvin · « **Reply #5 on:** July 23, 2015, 03:58:18 pm »

BloodyCactus's and edavid's comments above about pipeline delays and slots may contain same wisdom here.

Here's an explanation on how to use the volatile and memory barrier in GCC:
https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html

Chris C · « **Reply #6 on:** July 23, 2015, 08:49:13 pm »

I tried the simplest suggestions first:

[BloodyCactus] - Wrapping the critical section with asm ".set noreorder" and ".set reorder" had no effect.
[Jeroen3] - Adding the extra brackets had no effect.
[Kalvin] - Declaring mask volatile moved the two unwanted instructions out of the critical section, but added two others, negating any gain. Declaring _mbcs volatile had no effect, alone or in combination with mask.

So it seems there's no fix that's easy, or without some trade-off. Therefore:

[BloodyCactus], [edavid], [Kalvin] - Based on your suggestions, at this point I think it's best that I measure actual interrupt latency, to determine the real-world impact of the example scenario and others like it. It may not be as bad as I expected.

Depending on the results, I'll proceed through the remaining suggestions if needed.

Thank you, everyone!

andersm · « **Reply #7 on:** July 23, 2015, 09:38:11 pm »

If you expanded your mask to 32 bits, you could use atomics and avoid disabling interrupts at all.

BloodyCactus · « **Reply #8 on:** July 24, 2015, 01:32:44 am »

Quote from: Chris C on July 23, 2015, 08:49:13 pm

I tried the simplest suggestions first:

did you try removing the expression as a parameter to just a variable?

Code: [Select]

uint32_t foo = 1 << bit;
MultiBitClear16(addr,foo);

ale500 · « **Reply #9 on:** July 24, 2015, 04:26:31 am »

Just curious, at what level of optimization are you working ?

Chris C · « **Reply #10 on:** July 24, 2015, 05:20:35 am »

[andersm] - You mentioned atomics recently in another post. I haven't forgotten! Currently I'm working on converting a hardware abstraction layer (HAL). Functions like MultiBitClear16 are part of the HAL services. Many of the calls to them were within the HAL itself, for hardware register manipulation, and have already been converted to register SET/CLR/INV as per your advice. The remaining calls are for structures defined by programs expected to work with either the 16-bit or 32-bit version of the HAL, so making everything 32-bit is not really an option in my case.

[BloodyCactus] - Oops, I did try that and forgot to say. No effect. Since everything is inlined, the compiler optimizes away the intermediate variable 'foo'. Unless I 'noinline' MultiBitClear16, which is one of the things I was trying to avoid. Or 'volatile foo', which results in the same outcome as [Kalvin]'s suggestion. Also tried a few variations on this theme involving recasts, but the compiler still outsmarted me.

[ale500] - Globally, -O2 and MIPS16 code generation enabled. ISRs and speed-critical functions are selectively marked with the 'nomips16' attribute. It seems that for inline functions, the compiler completely ignores 'nomips16', and possibly other directives as well; using the options of the calling function instead. For all tests related to this issue, the calling function is marked 'nomips16'.

mikerj · « **Reply #11 on:** July 24, 2015, 07:52:06 am »

Did you try putting a memory barrier after the mask invert?

Chris C · « **Reply #12 on:** July 24, 2015, 05:16:18 pm »

[Mikerj] - Yes, mentioned but not detailed in my original post. Tried two memory barriers, one after the mask invert, and one after the bit shift. No effect.

The good news - initial profiling of interrupt latency shows that it should be tolerable. There are other sources of latency in my code, probably unavoidable, which exceed that of my example and don't seem to be additive to it. So unless I find the compiler really gets carried away at some point, and shifts something more costly into a critical section, I can live with it.

Kalvin · « **Reply #13 on:** July 24, 2015, 05:39:08 pm »

Good analysis, Chris C!

If you have time & interest test this, does this generate any difference:

Code: [Select]

ZINLINE void MultiBitClear16(UI16 volatile * addr, UI16 mask) {
    do {
        mask=~mask;
    } while 0;
    UI32 _mbcs=__builtin_disable_interrupts();
    do {
        *addr&=mask;
    } while 0;
    _mtc0(_CP0_STATUS, _CP0_STATUS_SELECT, _mbcs);
};

Chris C · « **Reply #14 on:** July 24, 2015, 06:11:20 pm »

[Kalvin] - I'm still interested for curiosity's sake, and will continue to try all suggestions. Tested your code, with both instances of 'while 0' changed to 'while (0)'. No effect, disassembly remains identical.

Kalvin · « **Reply #15 on:** July 24, 2015, 06:18:24 pm »

Quote from: Chris C on July 24, 2015, 06:11:20 pm

[Kalvin] - I'm still interested for curiosity's sake, and will continue to try all suggestions. Tested your code, with both instances of 'while 0' changed to 'while (0)'. No effect, disassembly remains identical.

Thanks. That was just an idea if the compiler would have considered the loop as a single unit (or sort of barrier) for the code generation. Obviously not.

Ps. Sorry I missed the obvious parenthesis in while (0).


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

EEVblog Electronics Community Forum

Author Topic: XC32/GCC C-to-ASM instruction ordering issue (Read 7282 times)

Chris C

XC32/GCC C-to-ASM instruction ordering issue

BloodyCactus

Re: XC32/GCC C-to-ASM instruction ordering issue

Jeroen3

Re: XC32/GCC C-to-ASM instruction ordering issue

Kalvin

Re: XC32/GCC C-to-ASM instruction ordering issue

edavid

Re: XC32/GCC C-to-ASM instruction ordering issue

Kalvin

Re: XC32/GCC C-to-ASM instruction ordering issue

Chris C

Re: XC32/GCC C-to-ASM instruction ordering issue

andersm

Re: XC32/GCC C-to-ASM instruction ordering issue

BloodyCactus

Re: XC32/GCC C-to-ASM instruction ordering issue

ale500

Re: XC32/GCC C-to-ASM instruction ordering issue

Chris C

Re: XC32/GCC C-to-ASM instruction ordering issue

mikerj

Re: XC32/GCC C-to-ASM instruction ordering issue

Chris C

Re: XC32/GCC C-to-ASM instruction ordering issue

Kalvin

Re: XC32/GCC C-to-ASM instruction ordering issue

Chris C

Re: XC32/GCC C-to-ASM instruction ordering issue

Kalvin

Re: XC32/GCC C-to-ASM instruction ordering issue

Share me