When working with the HAL libraries on an STM32 using the OpenSTM32 workbench, I noticed a substantial performance hit when calling a function from an interrupt. It seemed to be on the order of 800 clock cycles.
Now, on another project, I'm working with the CMSIS libraries on an STM32F4. In the current revision of the firmware, functions are called to set up a DAC. I'm looking at changing this so that a macro is called instead, in an attempt to reduce the number of clock cycles it takes to complete the interrupt.
Is it correct to say that the compiler simply takes the code behind the macro and substitutes it where it's called in the interrupt, so no additional stack work is done, as happens when you call a function?
Timing in this situation is critical because the interrupt is being driven at a maximum of 320 kHz. This is indeed required by the specific application.
> When working with the HAL libraries on an STM32 using the OpenSTM32 workbench, I noticed a substantial performance hit when calling a function from an interrupt. It seemed to be on the order of 800 clock cycles.

Calling a function doesn't cost 800 cycles. The function you called may have cost 800 cycles. There is a big difference.
The function call overhead on a Cortex-M should be on the order of ten cycles. Any decent compiler will inline static functions that are defined above the call site, and that are called only once.
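For illustration (this is a sketch, not code from the thread, and the register is a stand-in variable rather than a real STM32 register), a small static inline function defined above its call site typically compiles to the same instructions a macro expansion would, with no branch and no stack traffic:

```c
#include <stdint.h>

/* Stand-in variable for a memory-mapped GPIO/DAC data register; on real
 * hardware this would be something like GPIOA->ODR from the device header. */
static volatile uint16_t fake_odr;

/* Small static function defined above the call site: any decent compiler
 * at -O1 or higher will inline it, substituting the body in place just
 * like a macro, with no call and no stack frame. */
static inline void dac_write(uint16_t value)
{
    fake_odr = value;
}

void isr_body(void)
{
    dac_write(0x0FFFu); /* inlined: effectively a single store */
}
```

With GCC you can additionally mark the function `__attribute__((always_inline))` if you want the substitution guaranteed even at low optimization levels.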
The called function is only ~12 lines of code setting up GPIO pins. When using HAL, this seemed to cause collisions in the interrupt; at the time it looked like it would take a couple hundred clock cycles for that to happen. This time it was the same exercise using the CMSIS libraries. I know I won't be able to get up to the 260 kHz that I want on the F4, but I was able to squeeze out 140 kHz, which is enough for now until I get around to laying out the F7 board.
Using the direct code on the HAL F7 substantially improved performance. I was hoping for the same on the F4, but using macros instead of copying and pasting the same 12 lines of code throughout the interrupt. The same code is executed at about 8 different locations, depending on the situation.
What are those 12 lines of code? Setting up I/O with any of the usual libraries could (will) do a crap ton of checking, and will possibly call further functions to do so.
Direct port manipulation: setting up a uint16_t variable based on an input value and control bits, then pushing it into a GPIO register, clearing, and repeating.
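As a sketch only (the mask is hypothetical and the register is a stand-in variable, not the actual code from this project), that kind of sequence can live in a pair of static inline functions rather than a macro:

```c
#include <stdint.h>

/* Stand-in for the GPIO output data register driving the parallel DAC;
 * on the real board this would be a volatile memory-mapped register. */
static volatile uint16_t fake_gpio_odr;

#define DAC_CTRL_BITS 0xF000u  /* hypothetical control-bit mask */

/* Compose the output word from a 12-bit input value plus control bits. */
static inline uint16_t dac_word(uint16_t value)
{
    return (uint16_t)((value & 0x0FFFu) | DAC_CTRL_BITS);
}

/* Push the word onto the port, then clear it, ready to repeat. */
static inline void dac_push(uint16_t value)
{
    fake_gpio_odr = dac_word(value);
    fake_gpio_odr = 0u;
}
```

Calling `dac_push()` from the 8 sites in the ISR keeps the code in one place, and an optimizing compiler will expand it in place at each site.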
The contents of the function aren't the issue; it's as streamlined as possible. The issue is the best way to write the code so it exists once but is used in multiple places. With functions there will always be stack work. I wanted to be certain how the compiler treats a macro, in order to minimize the number of clock cycles it takes to complete the interrupt. I am going to be triggering it as fast as possible, right up to where it starts to break down.
Sometimes a function (in particular a static function) will be more efficient than a macro, because it allows the compiler to make optimizations that it would otherwise not make.
Don't know about HAL, but there is a small gotcha with macros: they paste text, so watch out for formatting like brackets etc. if you're using complex constructs. Maybe wrap them in do { } while (0).
#define do_something( a ) do { do_this( a ); do_that( a ); } while (0)

uint16_t mustIncrementOnce = 0;
do_something( mustIncrementOnce++ ); // Oops! 'a' is pasted twice, so this increments twice.
> Sometimes a function (in particular a static function) will be more efficient than a macro, because it allows the compiler to make optimizations that it would otherwise not make.

Why is that? Macros are expanded by the preprocessor (cpp in the GNU tools), which is nothing more than a sophisticated text search & replace. After cpp is done, the resulting text is fed into the C compiler. All in all, an equally well-written inline function or macro shouldn't make any difference.
Why would anyone ever write:

    do { ... } while (0)

What's the point? (In other words, why not just write:

    { ... }

without the redundant do and while bits?)
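For what it's worth, the usual reason (a minimal sketch, with a made-up counter-bumping macro standing in for real statements) is that a bare { ... } block plus the semicolon the caller naturally writes breaks if/else, while the do { } while (0) form behaves like a single ordinary statement:

```c
static int calls;

/* The do { } while (0) wrapper swallows the caller's trailing
 * semicolon, so the macro behaves like one ordinary statement. */
#define DO_BOTH(x) do { calls += (x); calls += (x); } while (0)

/* With a bare { calls += (x); calls += (x); } instead, this function
 * would not compile: "if (cond) BARE(x); else ..." expands to a block
 * followed by an empty statement ';', and the 'else' then has no 'if'
 * to bind to. */
static int demo(int cond)
{
    calls = 0;
    if (cond)
        DO_BOTH(1);
    else
        DO_BOTH(2);
    return calls;
}
```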
> an equally well written inline function or macro

Before optimizing, using inlines, etc., it would be better to understand what's actually happening.
> The contents of the function isn't an issue. It's as streamlined as possible.
If you want to know it more accurately, you could set up an internal timer, ready to go, in your main code. Then clear & start it when you enter the ISR and stop it when you leave. The timer value then contains the number of cycles taken (or a multiple of it), which can be read while running your main code.
Alternatively, you could set/clear a GPIO upon entering/leaving the ISR and measure its timing with a scope or LA.
You could obviously also narrow this down around particular parts of the code.
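On a Cortex-M3/M4 the DWT cycle counter (DWT->CYCCNT in CMSIS, enabled via CoreDebug->DEMCR and DWT->CTRL) is a convenient "timer" for exactly this. A rough harness, with a plain variable standing in for the counter so the sketch is self-contained off-target:

```c
#include <stdint.h>

/* Stand-in for DWT->CYCCNT; on the target, read the real register after
 * enabling it once at startup:
 *   CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
 *   DWT->CTRL        |= DWT_CTRL_CYCCNTENA_Msk;  */
static uint32_t fake_cyccnt;

static uint32_t read_cycles(void)
{
    return fake_cyccnt; /* on hardware: return DWT->CYCCNT; */
}

static uint32_t start_cycles;

static void measure_start(void) { start_cycles = read_cycles(); }

/* Unsigned subtraction stays correct even if the 32-bit counter wraps
 * once between start and stop. */
static uint32_t measure_stop(void) { return read_cycles() - start_cycles; }
```

Bracket the ISR body (or any narrower region) with `measure_start()`/`measure_stop()` and stash the result in a global you can inspect from the main loop or a debugger.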