Author Topic: [ARM] optimization and inline functions  (Read 7702 times)


Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
[ARM] optimization and inline functions
« on: January 18, 2022, 09:36:27 am »
I'm playing. So I wrote a test project for a SAMC. I am toggling a port pin in a loop - the hello world project.

So on a 4MHz clock with a static inline function:
with no optimization I get a symmetrical 50% duty output at 45kHz,
with optimization 1 I get a 666kHz 50% duty output,
with optimization 2 I get 1MHz but now it's a 25% duty,
with optimization 3 I get the same.

With a non-inlined function:
with no optimization I get a symmetrical 50% duty output at 45kHz,
with optimization 1 I get a 111kHz 50% duty output,
with optimization 2 I get a 111kHz 50% duty output,
with optimization 3 I get a 111kHz 50% duty output.

So actually forcing the inlining makes the biggest gain. But surely, if the compiler is optimizing hard, it is looking at code size, execution time, and the cost of calling the function, and it would inline automatically?

I'm a little confused about the 25% duty: the instructions to go low and high are identical. It's like the instructions are prefetched in twos into some sort of fetch cache? It's only an M0+.
 

Online eutectique

  • Frequent Contributor
  • **
  • Posts: 393
  • Country: be
Re: [ARM] optimization and inline functions
« Reply #1 on: January 18, 2022, 10:48:12 am »
Would you share the generated assembly code?
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6266
  • Country: fi
    • My home page and email address
Re: [ARM] optimization and inline functions
« Reply #2 on: January 18, 2022, 10:56:51 am »
I'm a little confused about the 25% duty: the instructions to go low and high are identical.
Including the loop jump?

If you use something like
    while (keep_generating) {
        GPIOn_PSOR = 1 << GPIOn_BIT;  /* Set output high */
        GPIOn_PCOR = 1 << GPIOn_BIT;  /* Set output low */
    }
then the pin stays high for the duration of the single instruction, but low for the duration of the loop jump (at least one clock cycle on Cortex-M0+) and whatever the keep_generating test takes, if it is not always true.

If your code is
    while (pulses-->0) {
        GPIOn_PSOR = 1 << GPIOn_BIT;  /* Set output high */
        GPIOn_PCOR = 1 << GPIOn_BIT;  /* Set output low */
    }
then in the optimized Thumb code for Cortex-M0+, the output would be high for one clock cycle, but low for three (one for the assignment/store, one for decrementing the pulse count in a register variable, and one for the branch), giving a 25% duty cycle.

The tightest GPIO toggle loop with 50% duty without unrolling uses the GPIO TOGGLE register (GPIOn_PTOR in NXP KL26 family Cortex-M0+ controllers), using
    while (keep_generating) {
        GPIOn_PTOR = 1 << GPIOn_BIT;
    }
which, if keep_generating always evaluates to True, and GPIOn_BIT is 0, compiles to the following Thumb assembly on Cortex-M0+:
        ldr     rA, .address
        movs    rB, 1
    .loop:
        str     rB, [rA]
        b       .loop
    .address:
        .long   2684354628
where rA and rB are some general registers, often r0 and r1.

The TOGGLE register is useful in that the bit mask on the right side can toggle any set of bits in the same GPIO port, and if they are initialized to opposite output states, they will also toggle in opposite states.  For more complex sequences, you can use an array of toggle masks, say
Code: [Select]
void toggle32(volatile uint32_t *reg, uint32_t mask[], uint32_t masks, uint32_t pulses)
{
    while (pulses-->0) {
        uint32_t *const  ends = mask + masks;
        uint32_t        *curr = mask;

        while (curr < ends) {
            *reg = *(curr++);
        }
    }
}
where masks is usually even, and (mask[0] ^ mask[1] ^ ... ^ mask[masks-2] ^ mask[masks-1]) == 0, i.e. the full toggle sequence returns the GPIO outputs to their original states.  That sequence is then repeated pulses times, but note that there is a "delay" (due to the outer loop and inner loop setup) after the last toggle.  If you have the memory, and especially if you can limit to a single byte within the toggle register (at least NXP KL26 allows 8-bit, 16-bit and 32-bit accesses to the register), you can do pretty complicated digital waveform patterns on the output at a pretty darned high rate.
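For example, a hypothetical call (assuming GPIOn_PTOR is a dereferenced-pointer macro like the ones above) that toggles bits 3 and 6 alternately:
Code: [Select]
uint32_t seq[4] = { 1u << 3, 1u << 6, 1u << 3, 1u << 6 };
/* seq[0]^seq[1]^seq[2]^seq[3] == 0: the outputs end in their initial states */
toggle32(&GPIOn_PTOR, seq, 4, 1000);  /* repeat the 4-step sequence 1000 times */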
« Last Edit: January 18, 2022, 11:01:54 am by Nominal Animal »
 

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #3 on: January 18, 2022, 11:30:39 am »
I am using toggle, so it's the same instructions all round. I guess the initial state is 0, so it looks like it is faster to flip low to high than it is to flip high to low.
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #4 on: January 18, 2022, 01:18:09 pm »
The peripheral clock might be running slower than the main CPU clock, so because of aliasing effects it can do strange things. I'm not sure of the part number (MCU), so I can't check the datasheet to see a possible divider ratio.
« Last Edit: January 18, 2022, 01:23:04 pm by MK14 »
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6266
  • Country: fi
    • My home page and email address
Re: [ARM] optimization and inline functions
« Reply #5 on: January 18, 2022, 01:56:08 pm »
I am using toggle, so it's the same instructions all round. I guess the initial state is 0, so it looks like it is faster to flip low to high than it is to flip high to low.
If you have an oscilloscope, it would be very interesting to see the result if you initialize two outputs (say, consecutive bits in the same GPIO register) to different states, and then toggle both of them, using either one as the trigger.  (Just write a two-bit mask, instead of a one-bit mask, to the toggle register.)

I would expect them to change in tandem, and the asymmetric delay be explained by the assembly code (that you can see with the -S flag, or by using godbolt.org and selecting the correct compiler and setting the compiler options).  However, if the toggles indeed are asymmetric, it would be very useful and interesting to see exactly how; whether it is just a pin output driving issue (going high being slower than going low), or something else.

(In case anyone reading this is not aware, each GPIO port on Atmel/Microchip SAMC (and NXP KL26 family) has four registers that define the output pin states in that port: the data register, a set register, a clear register, and a toggle register.  Only the data register is read-write; the set/clear/toggle registers are write-only.  Writing a value to the set/clear/toggle register will set/clear/toggle those bits in the output state register that were set in the value.  So, writing 0x00000001 to the clear register will set bit 0 in the data register to zero, but leave all other bits unchanged.  The register names and the macros/variables to access them do vary a bit, but the logic is the same.)
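In code, with hypothetical register names, the idea is simply:
Code: [Select]
PORT_OUTSET = (1u << 4);              /* drive bit 4 high; other pins unchanged */
PORT_OUTCLR = (1u << 4);              /* drive bit 4 low; other pins unchanged  */
PORT_OUTTGL = (1u << 4) | (1u << 5);  /* flip bits 4 and 5 in a single write    */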

Which reminds me:

If this is a SAMC20 or SAMC21, make sure you access the toggle register using the IOBUS; the address range for the port registers via the IOBUS is 0x60000000 to 0x60000200, with 0x80 bytes per port.  Via the slower AHB-APB bridge B, at 0x41000000 to 0x41000200, the latency depends on the APB clock.  Thus, you'll want to use
    #define  PORTA_TOGGLE  (*(volatile uint32_t *)0x6000001C)
    #define  PORTB_TOGGLE  (*(volatile uint32_t *)0x6000009C)
    #define  PORTC_TOGGLE  (*(volatile uint32_t *)0x6000011C)
for 32-bit accesses to the toggle register, to make sure you use the single-cycle IOBUS, and not the slower AHB-APB bridge B:
Code: [Select]
    while (counter-->0) {
        *(volatile uint32_t *)0x6000001C = (1<<3) | (1<<6);  /* Toggle PA03 and PA06 */
    }
 
The following users thanked this post: MK14

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3147
  • Country: ca
Re: [ARM] optimization and inline functions
« Reply #6 on: January 18, 2022, 03:05:43 pm »
So on a 4MHz clock with a static inline function:
with optimization 2 I get 1MHz but now it's a 25% duty

1 clock high, 3 clocks low. It is probably doing some sort of loop unrolling. You need to post the disassembly.
 
The following users thanked this post: MK14

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #7 on: January 18, 2022, 03:26:41 pm »
The peripheral clock might be running slower than the main CPU clock, so because of aliasing effects it can do strange things. I'm not sure of the part number (MCU), so I can't check the datasheet to see a possible divider ratio.

it's a bog-standard SAMC START project with the default clock; I think everything gets 4MHz
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #8 on: January 18, 2022, 03:47:29 pm »
it's a bog-standard SAMC START project with the default clock; I think everything gets 4MHz

Then it's a 1:1 clock ratio (no clock prescalers), assuming you're not using the slower buses mentioned by Nominal Animal in an earlier post. So it should be fine as regards port clock frequency.
I didn't mention seeing the assembly listing, as others had already asked for that.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14490
  • Country: fr
Re: [ARM] optimization and inline functions
« Reply #9 on: January 18, 2022, 05:52:31 pm »
So, someone is apparently still trying to predict execution time on a modern ARM MCU down to the cycle and still wonders why it doesn't work.  :popcorn:
 
The following users thanked this post: hans, grumpydoc, mikerj

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #10 on: January 18, 2022, 07:38:30 pm »
it's a bog-standard SAMC START project with the default clock; I think everything gets 4MHz

Then it's a 1:1 clock ratio (no clock prescalers), assuming you're not using the slower buses mentioned by Nominal Animal in an earlier post. So it should be fine as regards port clock frequency.
I didn't mention seeing the assembly listing, as others had already asked for that.

The default is that the internal clock source is divided down to 4MHz for the entire chip; it's the oscillator that is being prescaled, not any of the other stuff down the line.
 
The following users thanked this post: MK14

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #11 on: January 18, 2022, 07:42:38 pm »
So, someone is apparently still trying to predict execution time on a modern ARM MCU down to the cycle and still wonders why it doesn't work.  :popcorn:


Actually my biggest realization is how misguided people are who just throw code at a compiler and say: modern compilers are so good that they will deal with it and optimize it. I don't know how the optimization works, but it is obviously not good enough to take the single line of code out of a function and inline it, even though with 256kB of flash and a few kB of code at most, optimizing for speed is a given.
 

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #12 on: January 18, 2022, 07:56:59 pm »
It's also interesting that I get the 25% duty when I use the inline and 50% with non-inline. I sort of assume that, given the 9x speedup with inlining, down to 2 cycles per toggle on average, maybe there is caching: starting low, the first two instructions are done fast, and then it takes another 2 cycles for the fetching to catch up. I'm not relying on these timings for anything; it's just interesting to see how the inlining goes beyond what the optimization level alone can do.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: [ARM] optimization and inline functions
« Reply #13 on: January 18, 2022, 08:01:12 pm »
ARM assembly is surprisingly simple, so I suggest you enable generation of assembly files; on GCC, it's the -S flag, so instead of
gcc -c file.c -o file.o, use
gcc -S file.c -o file.s

Then look at the generated code, and you'll learn even more.
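For a cross-compiled MCU project the same idea applies; a plausible invocation for a Cortex-M0+ target, assuming the arm-none-eabi toolchain:
Code: [Select]
arm-none-eabi-gcc -mcpu=cortex-m0plus -mthumb -O2 -S main.c -o main.s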
 
The following users thanked this post: MK14

Offline ajb

  • Super Contributor
  • ***
  • Posts: 2608
  • Country: us
Re: [ARM] optimization and inline functions
« Reply #14 on: January 18, 2022, 08:43:46 pm »
It's also interesting that I get the 25% duty when I use the inline and 50% with non-inline. I sort of assume that, given the 9x speedup with inlining, down to 2 cycles per toggle on average, maybe there is caching: starting low, the first two instructions are done fast, and then it takes another 2 cycles for the fetching to catch up. I'm not relying on these timings for anything; it's just interesting to see how the inlining goes beyond what the optimization level alone can do.

Well, there's a fair amount of housekeeping that comes with entering a function and then returning.  Registers need to be saved to the stack, and there's a jump to the function entry point, which may cause a pipeline flush as well as a delay in retrieving new instructions from flash depending on the prefetch behavior, flash speed relative to the core, and whether or not the jump is conditional.  If the function is not inlined then the instructions in the function itself must not make any assumptions about the context in which they occur, so that also takes more time.  All of that can easily swamp the additional jump required to close the main loop in a program, which is why you see something much closer to 50% (but almost certainly not exactly 50%) with an inlined function.
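As a rough Thumb sketch of that call housekeeping (illustrative only, not Simon's actual output):
Code: [Select]
        bl      pin_toggle     ; call: branch with link, plus pipeline refill
        ...
pin_toggle:
        push    {r4, lr}       ; prologue: save a callee-saved reg and the return address
        ; ... body: build the register address, store the mask ...
        pop     {r4, pc}       ; epilogue: restore registers and return in one go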
 

Offline T3sl4co1l

  • Super Contributor
  • ***
  • Posts: 21701
  • Country: us
  • Expert, Analog Electronics, PCB Layout, EMC
    • Seven Transistor Labs
Re: [ARM] optimization and inline functions
« Reply #15 on: January 18, 2022, 08:58:12 pm »
Inlining is something of a non sequitur, anyway.  The compiler treats inline as a recommendation, not a requirement.  On various -O's, inlining is attempted even if you don't specify it, and if it doesn't meet various optimization goals or semantic requirements, will be rejected even if you do.

For example, my typical programs all end up with monstrous main() routines, because all the one-off init and do-this and do-that functions all end up in one big pile.  It's structured that way for my readability and flexibility; the compiler knows to disregard that structure and cram it all together, saving both time and space.  (I forget what all options enable this; it may do it for example within a compilation unit (main.c contains main() which calls foo() and bar() in the same file) but not across files (main() calls baz() from baz.c), on the assumption that those functions may need to be independent for linkage purposes (what if bam.c happens to call foo() as well?).  In any case, -flto (link-time optimization) says to cram them all together if useful, don't worry about leaving references.)
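For reference, -flto just needs to be passed at both the compile and link steps, e.g.:
Code: [Select]
gcc -O2 -flto -c main.c baz.c
gcc -O2 -flto main.o baz.o -o firmware.elf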

(And by "references", I mean a usable pointer to a function.  An inlined function is stripped of pre/postamble (i.e. saving/restoring registers on the stack, return instruction, etc.), so cannot be called from anywhere at random.  The compiler needs to know if you're dereferencing a function pointer, so that it will either avoid inlining that function at all (more likely -Os), or leave a free copy in addition to the inlined version(s) (more likely -O3).)
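A minimal illustration of that function-pointer case (hypothetical function, not from this thread):
Code: [Select]
static inline void blip(void) { /* toggle a pin, say */ }

void (*handler)(void) = blip;  /* taking the address forces the compiler to keep
                                  a standalone, callable copy of blip() */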

That said, it's interesting that your case evidently isn't inlining on any -O.  You didn't show your code, so it's impossible to tell why.  Perhaps you used the set/reset pair Nominal Animal led with, and the compiler sees the duplication and declines to inline it for that reason (it'll seemingly take more time/space to double it up)?  Perhaps the inlining is done at an earlier stage -- before other optimizations have been applied -- and therefore it doesn't realize the true gain of this operation.  (Mind also that GCC optimizes its internal representation, not the final target output.  Which has fairly embarrassing results on something ill-fitting like AVR, but should be pretty good on ARM.)


ARM assembly is surprisingly simple, so I suggest you enable generation of assembly files; on GCC, it's the -S flag, so instead of
gcc -c file.c -o file.o, use
gcc -S file.c -o file.s

Then look at the generated code, and you'll learn even more.

I'm fond of using this for assembly generation:
Code: [Select]
objdump -h -S project_output.elf > Release/project_output.lss
I put that in the project post-build options (among other things), and this generates the commented (sometimes) output, along with everything else.  (Commented with the source, that is, if optimization is low, and symbols are preserved.  On -O3 or -Os, it's just the raw listing with only labels (symbols).)

Tim
Seven Transistor Labs, LLC
Electronic design, from concept to prototype.
Bringing a project to life?  Send me a message!
 

Offline emece67

  • Frequent Contributor
  • **
  • !
  • Posts: 614
  • Country: 00
Re: [ARM] optimization and inline functions
« Reply #16 on: January 18, 2022, 09:42:21 pm »
.
« Last Edit: August 19, 2022, 05:09:38 pm by emece67 »
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: [ARM] optimization and inline functions
« Reply #17 on: January 18, 2022, 10:04:42 pm »
Please show us your actual code.

 
The following users thanked this post: mikerj, MK14

Online hans

  • Super Contributor
  • ***
  • Posts: 1641
  • Country: nl
Re: [ARM] optimization and inline functions
« Reply #18 on: January 18, 2022, 11:09:49 pm »
Don't want to repeat what others have said, but if you really want to dive into this problem and know what the compiler is doing: post the code, post the assembly listing. There is even a website for reviewing compiled code with many, many GCC/Clang compilers, including cross-compilers: https://godbolt.org/

Here is an example of a simple toggle I/O program (which I assume is similar to the blinky you've made -- but that is purely an assumption): https://godbolt.org/z/KrzYE3ffx
Code: [Select]
#include <cstdint>

volatile uint32_t* port = reinterpret_cast<uint32_t*>(0xCAFECAFE);

int toggle() {
    while (1) {
        *port = 1;
    }
}

On the Godbolt link you can play around with the compiler settings. The assembly output at -O2 is:
Quote
toggle():
        ldr     r3, .L4
        ldr     r2, [r3]
        movs    r3, #1
.L2:
        str     r3, [r2]
        str     r3, [r2]
        b       .L2
.L4:
        .word   .LANCHOR0
port:
        .word   -889271554

Label .L2 is the while loop. In the compiler's infinite wisdom, it has unrolled the loop twice. You could try compiling for size (-Os in GCC) and see that it then only does 1 store, to save on code size. I presume you then get a 50% 667kHz square wave again: one store (1 cycle) + one branch (2 cycles) per toggle, with 2 toggles per period, gives 6 cycles, and 4M/6 ≈ 667kHz.
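The -Os inner loop would then plausibly reduce to (a sketch, not verified compiler output):
Code: [Select]
.L2:
        str     r3, [r2]   ; one toggle per iteration: 1 cycle
        b       .L2        ; branch back: 2 cycles, i.e. 3 cycles per toggle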

I do agree with SiliconWizard though; it gets increasingly harder to predict how CPUs run code these days. Seemingly simple code can lead to odd behaviour. This story is perhaps a bit off-topic for Simon's program, but it's something I encountered last week that was also heavily confusing, with compiler settings in the mix.

I was debugging a timebase. A hardware timer is extended in software with a 64-bit integer value. The live timestamp is calculated with the hardware and software value added together. Example: https://godbolt.org/z/76GGM8qT7

The timer runs at high speed, which is the reason to use 64-bit ints. This code looks quite simple and works... The code waits until multiples of 0x100000 cycles have passed. If 'target' is a multiple of 0x100000, then obviously (N * 0x100000)&0x10000 will always be 0? Not quite: this code will hang at some point on C-code line 25. If you look in the assembly listing at lines 5 and 6, there is a 1-cycle opportunity for register `CNT` to be loaded as 0xFFFF on line 5, which would overflow and fire the IRQ before line 6. Then the variable time is loaded, which was just incremented by 0x10000, and so 0xFFFF has been counted twice. One instruction doesn't sound like a big deal. What are the odds?  Well.....

- I could add (one) extra line of debug or computation statement in foo() (e.g. toggle a LED), and the problem would disappear (or reappear).
- I could shift some lines of code in foo() around, and the problem would disappear or reappear.
- I could enable the CPU cache (it was a Cortex-M7 chip), and the problem would disappear, mostly.
- I could flush the instruction buffer, and the problem would appear again, mostly.
- I could influence the problem with compiler settings, as well, with varying results.

It's like probing your circuit with your oscilloscope stops it resonating. The actual fix was to deal with the race condition on the variable time.

The moral of my story is not to expect the compiler to be some magic tool that fixes your code or will always make it run fast, or the way you want it to run. In my example, if it had loaded `time` before CNT, this problem would not occur. But since volatile is not "magic", and such accesses are allowed to be reordered (among the other, non-volatile values that are manipulated), you can get some annoying bugs. A more aggressively optimizing compiler can lead to weirder behaviour. The worst of all is that from the C-code perspective everything looks like predictable C code (as in: trivial statements), but how it compiles and runs on the hardware varies a lot (especially on the M7 with caches, dual issue, etc.). So best to look at the assembly output, and assume the worst will at some point happen.
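One common way to close that kind of race is a read-retry loop; a sketch with hypothetical names (time_hi is incremented by the overflow IRQ, TIMER_CNT is a 16-bit hardware counter at a made-up address):
Code: [Select]
#include <stdint.h>

extern volatile uint32_t time_hi;  /* overflow count, updated in the IRQ */
#define TIMER_CNT (*(volatile uint32_t *)0x40000024u)  /* hypothetical address */

uint64_t now(void)
{
    uint32_t hi1, hi2, cnt;
    do {                  /* retry if an overflow slipped in between the reads */
        hi1 = time_hi;
        cnt = TIMER_CNT;
        hi2 = time_hi;
    } while (hi1 != hi2);
    return ((uint64_t)hi1 << 16) | cnt;
}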
« Last Edit: January 18, 2022, 11:15:19 pm by hans »
 
The following users thanked this post: MK14

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: [ARM] optimization and inline functions
« Reply #19 on: January 19, 2022, 12:57:49 am »
Quote
In the compiler's infinite wisdom, it has unrolled the loop twice.
That's a particularly odd choice, IMO...
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26915
  • Country: nl
    • NCT Developments
Re: [ARM] optimization and inline functions
« Reply #20 on: January 19, 2022, 01:13:04 am »
ARM assembly is surprisingly simple, so I suggest you enable generation of assembly files; on GCC, it's the -S flag, so instead of
gcc -c file.c -o file.o, use
gcc -S file.c -o file.s

Then look at the generated code, and you'll learn even more.
I agree. Only with the actual code, the resulting assembly, and the compiler/version used is there something useful to be said about the question.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3147
  • Country: ca
Re: [ARM] optimization and inline functions
« Reply #21 on: January 19, 2022, 01:30:16 am »
... it gets increasingly harder to predict how CPUs run code these days. Seemingly simple code can lead to odd behaviour. This story is perhaps a bit off-topic for Simon's program, but it's something I encountered last week that was also heavily confusing, with compiler settings in the mix.

I don't think this example has anything to do with C compilers or with modern CPUs. The same thing can happen on very old 8-bit CPUs as well, even if you write in assembler. Actually, it is more likely to happen on old 8-bitters because their timers are shorter.

Moreover, you can get a similar effect even without an ISR. For example, an 8-bit MCU may have a 16-bit timer consisting of two 8-bit registers - TIMERL and TIMERH. To get the 16-bit value you need to read them both. Since both are controlled by the hardware, the carry from TIMERL into TIMERH may happen after you read TIMERL but before you read TIMERH, so the result will be off by 256.

Most timer modules are aware of this and handle such cases by latching TIMERH every time you read TIMERL, and then giving you the latched value every time you read TIMERH. For this to work, you need to read TIMERL before TIMERH, but if you write in C:

Code: [Select]
time = TIMERL | (TIMERH << 8);
where the order of fetching TIMERL and TIMERH is undetermined, you may get garbage. Finding such a thing may be difficult. Not only are the errors rare, but the error may go away simply because the compiler changed the reading order after another recompile.
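Splitting the accesses into separate statements pins the order down, since volatile accesses in different statements cannot be reordered relative to each other; a minimal sketch, assuming TIMERL and TIMERH are volatile-qualified register macros:
Code: [Select]
uint8_t  lo = TIMERL;            /* read the low byte first: hardware latches TIMERH */
uint8_t  hi = TIMERH;            /* then read the latched high byte */
uint16_t time = lo | ((uint16_t)hi << 8);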

These are very old things. Any embedded programmer must be aware of them, whether old or new.

PS. Sorry, this is probably off topic.
 
The following users thanked this post: hans, newbrain

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3147
  • Country: ca
Re: [ARM] optimization and inline functions
« Reply #22 on: January 19, 2022, 01:35:19 am »
Quote
In the compiler's infinite wisdom, it has unrolled the loop twice.
That's a particularly odd choice, IMO...

Makes it much faster at the expense of a single extra instruction.
 

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #23 on: January 19, 2022, 09:05:26 am »
main.c
Code: [Select]

#include "pinControl.h"

int main(void)
{
    /* Initialize the SAM system */
    SystemInit();
makePinOutput(PB9);


    /* Replace with your application code */
    while (1)
    {
pinT(PB9) ;
    }
}

pinControl.h
Code: [Select]
static inline void makePinOutput(uint8_t x)
{
    *(volatile uint32_t *)(PORTS_BASE + (x >> 5) * PORTn_offset + PORTn_DIRSET) = 0x01 << (x & PinNumberMask);
}

static inline void pinT(uint8_t x)
{
    *(volatile uint32_t *)(PORTS_BASE + (x >> 5) * PORTn_offset + PORTn_OUTTGL) = 0x01 << (x & PinNumberMask);
}

From what Tim said, it is more about the inlined functions being in a header file rather than in another .c file that this .c file is aware of.
 

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #24 on: January 19, 2022, 09:14:22 am »
If I remove the inlining I get the same result; it's the fact that the code is in a header file, so it is seen as part of the same compilation unit.
 

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #25 on: January 19, 2022, 09:32:49 am »
where exactly do I put the -s option for assembler output?
 

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #26 on: January 19, 2022, 09:44:15 am »
Is this what you all wanted?
 

Online eutectique

  • Frequent Contributor
  • **
  • Posts: 393
  • Country: be
Re: [ARM] optimization and inline functions
« Reply #27 on: January 19, 2022, 10:28:19 am »
Is this what you all wanted?

Only this part:

Code: [Select]
void pinT(uint8_t x)
{
( ( *(volatile uint32_t * )( PORTS_BASE + (x >> 5) * PORTn_offset + PORTn_OUTTGL ) ) = 0x01 << (x & PinNumberMask) );
 1c0: 4a05      ldr r2, [pc, #20] ; (1d8 <main+0x30>)
 1c2: 6014      str r4, [r2, #0]
 1c4: 6014      str r4, [r2, #0]
 1c6: e7fc      b.n 1c2 <main+0x1a>
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #28 on: January 19, 2022, 10:29:31 am »
Is this what you all wanted?

Well it is keeping me happy  8) .
So, it is doing (presumably, as only one version is listed) ONE or TWO of the port toggles (depending on compiler optimization level, and on whether the code is in a header or in another .c file), followed by an unconditional branch. If I'm interpreting things correctly, and seeing the right file, etc.
Hence the duty cycle is 50% for a single (port pin) toggle per loop, but 25% : 75% for the pair, because the adjacent pair of port toggles are quick, but the unconditional branch is slower (extra cycle(s)) on an older-generation ARM core (M0+).
I'm still surprised it is 25% : 75%, rather than 33.33% : 66.66%. But I suppose the branch could be especially slow, because of extra cycle count and/or pipeline delays (mentioned elsewhere).

There could be a GCC flag to change how much loop unrolling occurs, which might put more (or fewer) port toggles into the code, depending on the setting.
 

Online eutectique

  • Frequent Contributor
  • **
  • Posts: 393
  • Country: be
Re: [ARM] optimization and inline functions
« Reply #29 on: January 19, 2022, 10:40:23 am »
Code: [Select]
1c2: str r4, [r2, #0]  <-- toggle bit, 1 cycle
 1c4: str r4, [r2, #0]  <-- toggle bit, 1 cycle
 1c6: b.n 1c2           <-- jump back, 2 cycles

4 cycles in total, 25% duty cycle.
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #30 on: January 19, 2022, 10:46:32 am »
Code: [Select]
1c2: str r4, [r2, #0]  <-- toggle bit, 1 cycle
 1c4: str r4, [r2, #0]  <-- toggle bit, 1 cycle
 1c6: b.n 1c2           <-- jump back, 2 cycles

4 cycles in total, 25% duty cycle.

Thanks, that makes it nice and clear.

I am a bit confused as to why it takes 2 cycles rather than 1 for the unconditional branch. I thought it was only CONDITIONAL branches which took longer?
N.B. Not disagreeing, just wondering why. Essentially the unconditional branch takes one more cycle than a normal (1-cycle) instruction. I did read it might be to do with the pipeline, while putting together the answer for Simon.
I suppose it is because the pipeline needs an extra cycle to know what the next instruction is, or something like that.
« Last Edit: January 19, 2022, 10:49:23 am by MK14 »
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: [ARM] optimization and inline functions
« Reply #31 on: January 19, 2022, 10:48:32 am »
Yeah, and now running from flash vs. running from RAM might become a difference, too, if clock frequency is high enough so that flash needs wait states, prefetching multiple instructions so that "linear" code runs at full speed, but jumps require waiting for a flash access. Then again, some very simple cache system like ST's "ART accelerator" might make that jump happen in 1 cycle, after all.
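On GCC you can also take the flash out of the equation by placing the critical routine in RAM; a sketch, assuming the linker script provides a .ramfunc section (the section name and the startup code that copies it vary per vendor):
Code: [Select]
__attribute__((section(".ramfunc"), noinline))
void toggle_burst(void)
{
    /* timing-critical loop here runs from SRAM: no flash wait states */
}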

where exactly do I put the -s option for assembler output?

-S, not -s, and it goes on the GCC command line. If you're using an IDE, refer to its documentation for how to generate assembly output. Starting a debugging session would be an obvious way to see the assembly and even single-step through it.
 
The following users thanked this post: MK14

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6266
  • Country: fi
    • My home page and email address
Re: [ARM] optimization and inline functions
« Reply #32 on: January 19, 2022, 10:52:50 am »
To make it absolutely clear to those who do not understand the disassembly, the key part is that
Code: [Select]
1aa: 2480  movs  r4, #128 ; r4 = 0x80
     [unrelated stuff]
1b2: 00a4  lsls  r4, r4, #2 ; r4 = 0x200
     [unrelated stuff]
1c0: 4a05  ldr   r2, [pc, #20] ; r2 = 0x1d8

1c2: 6014  str   r4, [r2, #0]
1c4: 6014  str   r4, [r2, #0]
1c6: e7fc  b.n   1c2

1d8: 6000009c .word 0x6000009c
is equivalent to
Code: [Select]
    while (1) {
        *(volatile uint32_t *)0x6000009c = 0x200;
        *(volatile uint32_t *)0x6000009c = 0x200;
    }
in C.

I am a bit confused as to why it takes 2 cycles rather than 1 for the unconditional branch. I thought it was only CONDITIONAL branches which took longer?
No: Conditional branches on Cortex-M0+ take 2 cycles if taken, 1 if not taken; unconditional branches take 2 cycles; unconditional branches with link (BL) take 3 cycles; and branches with exchange (BX), or with link and exchange (BLX), take 2 cycles.  Slightly odd, but it's documented that way.
« Last Edit: January 19, 2022, 10:56:09 am by Nominal Animal »
 
The following users thanked this post: MK14

Online eutectique

  • Frequent Contributor
  • **
  • Posts: 393
  • Country: be
Re: [ARM] optimization and inline functions
« Reply #33 on: January 19, 2022, 10:54:00 am »
On M0+ cores, conditional branches are executed in 1 cycle if not taken, and in 2 cycles if taken. True, because of the pipeline.
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #34 on: January 19, 2022, 10:59:06 am »
I am a bit confused as to why it takes 2 cycles rather than 1 for the unconditional branch. I thought it was only CONDITIONAL branches which took longer?
No: Conditional branches on Cortex-M0+ take 2 cycles if taken, 1 if not taken; unconditional branches take 2 cycles; unconditional branches with link (BL) take 3 cycles; and branches with exchange (BX), or with link and exchange (BLX), take 2 cycles.  Slightly odd, but it's documented that way.

Thanks, I see. So, ironically, conditional branches can actually be 1 cycle faster than unconditional ones, if you suitably re-arrange the software so that the conditional branch is (mostly) NOT taken.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4040
  • Country: nz
Re: [ARM] optimization and inline functions
« Reply #35 on: January 19, 2022, 11:07:42 am »
Code: [Select]
1c2: str r4, [r2, #0]  <-- toggle bit, 1 cycle
 1c4: str r4, [r2, #0]  <-- toggle bit, 1 cycle
 1c6: b.n 1c2           <-- jump back, 2 cycles

4 cycles in total, 25% duty cycle.

Thanks, that makes it nice and clear.

I am a bit confused as to why it takes 2 cycles rather than 1 for the unconditional branch. I thought it was only CONDITIONAL branches which took longer?
N.B. Not disagreeing, just wondering why. Essentially the unconditional branch takes one more cycle than a normal (1-cycle) instruction. I did read it might be to do with the pipeline, while putting together the answer for Simon.
I suppose it is because the pipeline needs an extra cycle to know what the next instruction is, or something like that.

Minimum size CPU cores such as the Cortex M0 or various tiny RISC-V cores (SiFive E20, PULP Zero-Riscy) dispense with such "wastes" of silicon area as branch prediction and caches. They somewhat compensate by using a 2 or 3 stage pipeline instead of 5 stage (or more), which limits the clock speed but also keeps the "branch mispredict penalty" (*every* branch, since they don't try to predict) down to 1 extra cycle.

It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.
 
The following users thanked this post: hans, MK14

Online eutectique

  • Frequent Contributor
  • **
  • Posts: 393
  • Country: be
Re: [ARM] optimization and inline functions
« Reply #36 on: January 19, 2022, 11:08:25 am »
So, ironically, conditional branches can actually be 1 cycle faster than unconditional ones, if you suitably re-arrange the software so that the conditional branch is (mostly) NOT taken.

Yep, and gcc provides a function for this, __builtin_expect().

The use would be:

Code: [Select]
#define likely(x)      __builtin_expect(!!(x), 1)
#define unlikely(x)    __builtin_expect(!!(x), 0)

if (likely (condition)) {
    ...
} else {
    ...
}

 
The following users thanked this post: hans

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4040
  • Country: nz
Re: [ARM] optimization and inline functions
« Reply #37 on: January 19, 2022, 11:09:45 am »
I am a bit confused as to why it takes 2 cycles rather than 1 for the unconditional branch. I thought it was only CONDITIONAL branches which took longer?
No: Conditional branches on Cortex-M0+ take 2 cycles if taken, 1 if not taken; unconditional branches take 2 cycles; unconditional branches with link (BL) take 3 cycles; and branches with exchange (BX), or with link and exchange (BLX), take 2 cycles.  Slightly odd, but it's documented that way.

Thanks, I see. So, ironically, conditional branches can actually be 1 cycle faster than unconditional ones, if you suitably re-arrange the software so that the conditional branch is (mostly) NOT taken.

That's still 1 cycle slower than an unconditional branch to the next instruction, which can be optimised out by the programmer/assembler/linker, thus taking 0 cycles.
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #38 on: January 19, 2022, 11:24:46 am »
That's still 1 cycle slower than an unconditional branch to the next instruction, which can be optimised out by the programmer/assembler/linker, thus taking 0 cycles.

True, you are right.
But I really meant hand-coded assembly, where there are alternative ways of coding the innermost parts of a loop: one way with the innermost branch being unconditional, the other way with a conditional branch (which is NOT normally taken).
The eventual loop back to the beginning of the loop being another, additional branch instruction.

The above explanation is probably NOT the best or most efficient/fastest way of programming it, just my way of explaining the concept.
« Last Edit: January 19, 2022, 11:28:10 am by MK14 »
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #39 on: January 19, 2022, 11:44:51 am »
Minimum size CPU cores such as the Cortex M0 or various tiny RISC-V cores (SiFive E20, PULP Zero-Riscy) dispense with such "wastes" of silicon area as branch prediction and caches. They somewhat compensate by using a 2 or 3 stage pipeline instead of 5 stage (or more), which limits the clock speed but also keeps the "branch mispredict penalty" (*every* branch, since they don't try to predict) down to 1 extra cycle.

It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.

Thanks, that makes a huge amount of sense. Cycle-accurate code (hence deterministic) is extremely useful for embedded applications, especially when bit-banging, but also for working out latencies and checking worst-case execution times.

It is amazing how small the 'tiny' RISC-V cores have been able to be made. Like the 1-bit serial ones: not especially fast (slow, even), but they fit into amazingly tight (FPGA) spaces.

I'm used to the luxury of later architectures (such as PCs), which can effectively eliminate unconditional branches (assuming the compiler/linker etc. put them in the code) by using out-of-order execution and/or instruction-prefetch queues. I.e. in most cases it runs as fast as if the unconditional branch wasn't even there.
« Last Edit: January 19, 2022, 11:47:22 am by MK14 »
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: [ARM] optimization and inline functions
« Reply #40 on: January 19, 2022, 12:39:41 pm »
It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.

Yes, it is quite simple. Cortex M7 has branch prediction (which can be turned off for execution of cycle-accurate routines!) and dual issue, which make this harder, but if we forget about the M7, simpler 32-bit ARMs are not that weird.

There are a few traps for young players, like flash wait states and prefetch, but other than that you definitely can write cycle-accurate code, like on AVR; I have done that once or twice, and I don't find it difficult.

Now, the point some are making is not so much that it is difficult, but that it's simply not necessary as often as it used to be. The point I repeatedly make when replying to tggzzz's posts: in actual control applications (as opposed to academic interest), only absolute time matters. If something needs to happen at earliest in 1µs and at latest in 2µs, this might require assembly programming and cycle counting on an AVR running at 8MHz, but is a breeze to do on a Cortex-M7 @ 400MHz, using interrupts and configurable interrupt priorities, in simple C.
 
The following users thanked this post: hans

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4040
  • Country: nz
Re: [ARM] optimization and inline functions
« Reply #41 on: January 19, 2022, 12:40:30 pm »
That's still 1 cycle slower than an unconditional branch to the next instruction, which can be optimised out by the programmer/assembler/linker, thus taking 0 cycles.

True, you are right.
But I really meant hand-coded assembly, where there are alternative ways of coding the innermost parts of a loop: one way with the innermost branch being unconditional, the other way with a conditional branch (which is NOT normally taken).
The eventual loop back to the beginning of the loop being another, additional branch instruction.

I'd struggle to think of a CPU, simple or not, on which it is faster to execute...

Code: [Select]
loop:
   // stuff x, possibly empty
   Bcond exit
   // stuff y
   B loop
exit:

... than ...

Code: [Select]
  B entry
loop:
  // stuff y
entry:
  // stuff x
  Binvcond loop

In the rather common case where stuff x is in fact empty, e.g. a common or garden while(){} loop, it's generally faster and no more code size to even do this ...

Code: [Select]
  Bcond exit
loop:
  // stuff y
  Binvcond loop
exit:
« Last Edit: January 19, 2022, 12:42:08 pm by brucehoult »
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #42 on: January 19, 2022, 12:58:15 pm »
I'd struggle to think of a CPU, simple or not, on which it is faster to execute...

Sure, here we go. It always starts at PointA, and always (except when carry==1) must finish at PointB. Assumes the C(arry) flag is normally/mostly clear for this program extract.

Code: [Select]

PointA:
    BEQ OtherStuff

    DoSomeStuff....
    BRA PointB   // Alternative scenario would be ... +2 cycles  // Either put this line in
                 // BCS SomewhereToDoStuff  {which will fall into OtherStuff:..} ... +1 cycle  // Or this line

OtherStuff:....

PointB:
    // Checks the status of the Carry flag and does something if it is set
    BCS SomewhereToDoStuff



EDIT: That example above is not especially convincing, so let me try to explain it in words. The unconditional branch only jumps, and consumes 2 cycles.
But the conditional branch only takes one cycle (when NOT taken), and it also usefully checks the status of something (usually a flag), so it can occasionally/rarely actually take the branch as well. All for 1 cycle (2 if the branch is taken). So there is plenty of room for a speed-up: one less cycle, plus extra functionality (an optional condition check).
tl;dr
It should be possible to get a speed-up, even though I can't immediately create a complete and efficient program that demonstrates it in the code section above. But at this stage, I won't rule out it NOT being possible to gain a speed advantage.
« Last Edit: January 19, 2022, 01:35:06 pm by MK14 »
 

Online hans

  • Super Contributor
  • ***
  • Posts: 1641
  • Country: nl
Re: [ARM] optimization and inline functions
« Reply #43 on: January 19, 2022, 03:53:49 pm »
It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.

Yes, it is quite simple. Cortex M7 has branch prediction (which can be turned off for execution of cycle-accurate routines!) and dual issue, which make this harder, but if we forget about the M7, simpler 32-bit ARMs are not that weird.

There are a few traps for young players, like flash wait states and prefetch, but other than that you definitely can write cycle-accurate code, like on AVR; I have done that once or twice, and I don't find it difficult.

Now, the point some are making is not so much that it is difficult, but that it's simply not necessary as often as it used to be. The point I repeatedly make when replying to tggzzz's posts: in actual control applications (as opposed to academic interest), only absolute time matters. If something needs to happen at earliest in 1µs and at latest in 2µs, this might require assembly programming and cycle counting on an AVR running at 8MHz, but is a breeze to do on a Cortex-M7 @ 400MHz, using interrupts and configurable interrupt priorities, in simple C.

I still find writing cycle-accurate code a bit iffy. I can see where Bruce is coming from: the pipeline is short and a branch will only flush 1 stage, so it's easy to predict how fast code runs. Flash accelerators/caches can be taken out of the picture by putting the critical routine in (TCM)RAM. Put variables in a different SRAM so there is no bus conflict on load/store instructions. I can see how that will work out... but when streamlining development I would have some concerns with that; some are technical, some are about the development process, I suppose.

First, Simon's example demonstrates a huge (>20x) speed-up: from 45kHz (no optimization), through 111kHz (no inlining), to 1MHz, with a peak toggle rate of 2MHz; unfortunately we can't unroll the loop infinitely. An IRQ that must be handled within a short deadline (e.g. your 2µs example) is probably not a problem on a fast MCU, but perhaps only when the code is compiled with some kind of optimization, or written in assembler so that it is hardened against whatever the compiler does. It's not useful to have a program that breaks on different compiler settings, as demonstrated by my race condition example.
(You could choose optimization levels per source file, or even per function, to mitigate this problem to some degree. And in some applications, like real-time control, setting breakpoints can be a sin, so having the code work on multiple optimization settings is not always necessary.)

I like to keep my code as portable as I can. I'm not going to port my application to a different MCU every week (although with these part shortages...), but I do want to be able to test the functional behaviour of the exact same code on a PC (unit tests) and on the actual MCU. Unit tests are hard to run on assembler code, unless one happens to have an emulator.

In addition, I'd rather bit-bang a protocol with DMA/timers/GPIOs/SPI and some glue logic. The DMA buffers that are read/written can be unit tested. However, timing is not possible to unit-test on a PC, so I prefer to map timing-critical operations onto hardware peripherals.

It's OK not to agree with my concerns. Perhaps my approach has been influenced too much by regular software programming techniques/trends (like test-driven development and continuous integration). Or perhaps I haven't come across a project where a requirement absolutely needs to be squeezed onto a small CPU, with no other way than doing the hard work directly in assembler with just the MCU and a programming cable. But if at all possible, I like to make those tedious/error-prone design steps unnecessary.
« Last Edit: January 19, 2022, 03:56:05 pm by hans »
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3147
  • Country: ca
Re: [ARM] optimization and inline functions
« Reply #44 on: January 19, 2022, 03:57:12 pm »
It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.

The absence of branch prediction doesn't make the CPU cycle-accurate. There are caches and bus contention. For example, the CPU may compete with DMA for bus cycles, which may lead to unpredictable delays.

BTW: MIPS eliminates the taken-branch penalty completely by using delay slots. This doesn't make it cycle-accurate either.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: [ARM] optimization and inline functions
« Reply #45 on: January 19, 2022, 04:15:23 pm »
Well, microcontroller projects (the ones that actually control things, using peripherals etc.) just are not portable; we have to accept this. By trying to make them portable, we either sacrifice so much of the MCU capability that we just say "no" to the projects - or use FPGAs to implement them. Or we write and verify so much extra abstraction that the original project could have been manually ported ten times over by the time the "portable" project finishes.

IMHO, the key to success is not to strive for ideal world that does not exist in MCUs. After you accept this, you can leverage the features that are there, for the low cost, and save a lot of time and money compared to rolling out an FPGA design (or using an esoteric, vendor-lock-in solution like the XMOS).

You need to make compromises regarding idealism. You have to accept that -O0 is not supposed to produce a working project, so... just don't use it. You need to rely on compiler optimizations, but only to a point, not for cycle accuracy. If you need cycle accuracy, you are in a special case, and you need to prove to yourself and the others that other ways of doing it are even more difficult or expensive.

A simple example: you need to react to an analog event within 100 clock cycles, by setting a pin high. You set up the comparator registers, write the interrupt address to the vector table, enable the interrupt at highest priority, and as a very first operation on the ISR function, write to the GPIO register. 10 minutes of work. You test it, and it works perfectly, as expected.
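As a sketch of that 10-minute version (all names here are placeholders, not actual vendor headers):
Code: [Select]
void AC_Handler(void)               /* comparator IRQ, highest priority */
{
    PORT_OUTSET = (1u << OUT_PIN);  /* very first action: drive the pin high */
    AC_INTFLAG  = AC_FLAG_COMP0;    /* then acknowledge the interrupt source */
}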

Then you start to think about it. Interrupt entry latency is 12 cycles. Does the GPIO operation require loading an address to a register, from program memory? Am I running code out of flash, and if yes, is this part of program memory beyond the prefetch range of the flash? Heck, even if the code is in ITCM, is the vector table in flash? If it is, does the core load the vector address in parallel with stacking the registers? Probably yes, but do I need to fully read the Cortex-M-whatever manual every time I do this?

And at some point, thinking about it changes to overthinking about it. The threshold depends on the margin you originally had.

But in the end, having the port register qualified volatile also means the compiler cannot reorder the ISR so that the port write would be the last thing. The compiler is of course allowed to insert an unnecessary calculation of Pi before it, but why would it do that?

Finally, you measure the latency with an oscilloscope and see that it's actually taking 17 cycles +/- 1 cycle of jitter, and once in a year, when a DMA transfer is triggered during a full moon and Michael Jackson's Thriller is playing on the radio, it has 3 cycles of jitter(!!).

Now, tggzzz would say you have proven nothing. But realistically, what are the chances that this breaks down beyond 100 clock cycles when the GCC version updates?

I don't know. At the same time, I get work done. And so does everybody else who works like this. And I have never, ever in my life had an issue where a high-priority interrupt execution would have significantly changed in timing due to some seemingly unrelated change. A few cycles, sure!

And quite frankly, keeping a fixed GCC version during an embedded MCU project is the sane thing to do. This isn't high performance desktop computing requiring security updates. If the original microcontroller chip stays the same, if the code is verified to work within specifications with good margins, why would you suddenly update to a new compiler version during production?
« Last Edit: January 19, 2022, 04:19:38 pm by Siwastaja »
 
The following users thanked this post: nctnico, MK14

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3147
  • Country: ca
Re: [ARM] optimization and inline functions
« Reply #46 on: January 19, 2022, 04:39:05 pm »
A simple example: you need to react to an analog event within 100 clock cycles, by setting a pin high. You set up the comparator registers, write the interrupt address to the vector table, enable the interrupt at highest priority, and as a very first operation on the ISR function, write to the GPIO register. 10 minutes of work. You test it, and it works perfectly, as expected.

But the comparator can set a pin high by itself, without any help from the CPU, with much better latency.
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26915
  • Country: nl
    • NCT Developments
Re: [ARM] optimization and inline functions
« Reply #47 on: January 19, 2022, 04:55:14 pm »
Well, microcontroller projects (the ones that actually control things, using peripherals etc.) just are not portable; we have to accept this. By trying to make them portable, we either sacrifice so much of the MCU capability that we just say "no" to the projects - or use FPGAs to implement them. Or we write and verify so much extra abstraction that the original project could have been manually ported ten times over by the time the "portable" project finishes.

IMHO, the key to success is not to strive for ideal world that does not exist in MCUs. After you accept this, you can leverage the features that are there, for the low cost, and save a lot of time and money compared to rolling out an FPGA design (or using an esoteric, vendor-lock-in solution like the XMOS).

Finally, you measure the latency with an oscilloscope and see that it's actually taking 17 cycles +/- 1 cycle of jitter, and once in a year, when a DMA transfer is triggered during a full moon and Michael Jackson's Thriller is playing on the radio, it has 3 cycles of jitter(!!).

Now, tggzzz would say you have proven nothing. But realistically, what are the chances that this breaks down beyond 100 clock cycles when the GCC version updates?
Very small.

I agree with your pragmatic approach. IMHO products become way too fragile when they rely on cycle-accurate execution. That is something you don't want to have to care about when writing software in C for a product, both from a development time perspective (NRE costs) and a maintenance / life cycle perspective (project handover to a different programmer). Cycle-accurate execution is nice for esoteric tinkering, but not for real-world products.

In many cases there is a simple workaround possible that gives both flexibility in software timing and provides perfectly predictable timing for the hardware.
« Last Edit: January 19, 2022, 04:57:51 pm by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online eutectique

  • Frequent Contributor
  • **
  • Posts: 393
  • Country: be
Re: [ARM] optimization and inline functions
« Reply #48 on: January 19, 2022, 05:08:31 pm »
And while we are talking about Cortex M0+, gcc, and code generation, it would be worthwhile to mention that gcc produces bloated code which will waste flash space and CPU cycles (and, hence, battery life):

https://community.nxp.com/t5/MCUXpresso-IDE/M0-M0-optimization-bug-in-GCC/m-p/653235

Just be warned.
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #49 on: January 19, 2022, 05:12:55 pm »
Cycle-accurate execution is nice for esoteric tinkering, but not for real-world products.

Cycle-accurate execution, if used, is a pain in the neck to write. Even more of a pain if the timing was bang on, but then you find bugs in the software and have to somehow change/repair it, yet keep the cycle totals within bounds. Worse yet, the customer requirements, legislation, CPU used, etc. may change, forcing much/all of the work to be redone.
You're right, it is best avoided; often the hardware peripheral set can create the accurate timing for you, and/or rapid responses to interrupts can make it 'good enough', including the jitter that introduces.

On the other hand, determining the worst-case interrupt latencies (especially beyond just measuring them over a period of time) might be easier if the architecture is basically cycle-accurate. The more somewhat-indeterminate mechanisms there are (cache hits, long pipelines, instructions with widely varying cycle counts (e.g. some divide instructions, where timing can depend on the operand bits), the aforementioned DMA, and other stuff), the harder it is to determine the worst-case interrupt delays, hence more jitter.
« Last Edit: January 19, 2022, 05:16:15 pm by MK14 »
 

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3147
  • Country: ca
Re: [ARM] optimization and inline functions
« Reply #50 on: January 19, 2022, 05:40:47 pm »
IMHO products become way too fragile when they need to rely on cycle accurate execution. That is not something you want to have to care about when writing software in C for a product.

You cannot possibly rely on cycle accuracy if you write in C. The C compiler is a source of uncertainty by itself - the next compilation may be different from the current one. Caches, prefetchers, and bus entanglements are other sources of uncertainty. Therefore, you need to have a margin. As in Siwastaja's example - you get 17 cycles while anything below 100 cycles would do. These 17 cycles can grow some, but will probably never exceed 100 cycles, so the project is safe. This example has roughly a 500% margin: the 100-cycle budget is nearly six times the 17 cycles observed.

You only need cycle accuracy when you want to push the limit and eliminate the safety margin. For that you need zero uncertainty. This lets you increase performance dramatically, but is rarely needed, because there's no premium on doing things faster than required by the project.
 
The following users thanked this post: MK14

Online hans

  • Super Contributor
  • ***
  • Posts: 1641
  • Country: nl
Re: [ARM] optimization and inline functions
« Reply #51 on: January 19, 2022, 05:56:10 pm »
@Siwastaja: Complete portability is definitely a unicorn. But that doesn't make a BSP a bad idea to have. It's likely that only a small portion of the code ends up in there. I don't think most MCU requirements are anything extraordinary. Canonical implementations of SPI, I2C, UART, etc. can be assumed to all have a similar "interface" (in terms of software, that is). Anything outside the BSP can easily be written in a portable way, and be tested with more productive tools than an electronics bench + slow debug dongles.

I agree with you that fighting for the last cycle (or explanation thereof) quickly becomes chasing red herrings or your own tail. But there is a difference between allowing for some overhead and having timing-sensitive code that actually breaks.

Quote
But in the end, having the port register qualified volatile also means the compiler cannot reorder the ISR so that the port write would be the last thing. The compiler is of course allowed to insert an unnecessary calculation of Pi before that, but why would it do that?

Finally, you measure the latency with an oscilloscope and see that it's actually taking 17 cycles +/- 1 cycle of jitter, and once a year, when a DMA transfer is triggered during a full moon and Michael Jackson's Thriller is playing on the radio, it has 3 cycles of jitter(!!).

Now, tggzzz would say you have proven nothing. But realistically, what are the chances that this breaks down beyond 100 clock cycles when the GCC version updates?
The C/C++ specification isn't always that well defined; some things are left to the compiler's infinite wisdom. For example, in my example with the race condition, the accesses to two volatile values were reordered within that one statement. Did that code rely on UB? Yes, it probably did.

I agree that the chances of a GCC update making such a large difference are very, very slim. But "academically speaking", I can see where tggzzz is coming from when he says nothing has been proven. Measurements only suggest that it's unlikely to happen.
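To make that reordering hazard concrete, here is a minimal hedged sketch (the register addresses are invented for illustration): two volatile accesses inside one expression are unsequenced, so the compiler may perform them in either order.

Code: [Select]
#include <stdint.h>

/* Hypothetical 8-bit latches of a 16-bit free-running counter. */
#define TIMER_HI (*(volatile uint8_t *)0x40000000u)
#define TIMER_LO (*(volatile uint8_t *)0x40000001u)

uint16_t read_timer(void)
{
    /* The two volatile reads are unsequenced within this expression:
       the compiler may read TIMER_LO before TIMER_HI or vice versa.
       If the hardware expects a particular order (e.g. reading LO
       latches HI), one of the two orders returns a torn value. */
    return (uint16_t)(((uint16_t)TIMER_HI << 8) | TIMER_LO);
}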

On a philosophical tangent: it's odd that some industries (including automotive) have very strict requirements on using only cache-free microcontrollers, so that worst-case execution times can be accurately determined and the firmware can therefore be verified to work reliably in all cases. On the other hand, people are also working on self-driving cars, which have such a plethora of unknowns and such high complexity that we'll have to accept that these computer systems too won't be 100% safe. Nonetheless, some people still put it forth as a requirement or expectation.
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #52 on: January 19, 2022, 06:55:17 pm »
On the other hand, people are also working on self-driving cars, which have such a plethora of unknowns and such high complexity that we'll have to accept that these computer systems too won't be 100% safe. Nonetheless, some people still put it forth as a requirement or expectation.

Even if the timings and most other functionality were bolted down perfectly, we could never really be sure how a self-driving car is going to perform/behave, given the numerous variations in real-life situations it will experience.
Posted somewhere (I vaguely remember it being here in the Tea thread) is a video of a self-driving Tesla (without getting into arguments as to whether Autopilot is fully self-driving or not), where the lorry in front had a large number of traffic lights in the back of it. Because there is normally only a limited number of traffic lights, and traffic lights don't usually drive off in front of the Tesla, the software went partly crazy: it recognized the traffic lights but got wildly confused, as shown on some kind of Autopilot status screen.

The first link (a Twitter video) is the truck with the traffic lights in the back; the second is a YouTube video about Autopilot getting mixed up with the Moon.

https://twitter.com/i/status/1400207129479352323

 

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #53 on: January 19, 2022, 07:14:26 pm »
Well, microcontroller projects (the ones that actually control things, using peripherals etc.) just are not portable; we have to accept this. By trying to make them portable, we either sacrifice so much of the MCU capability that we just say "no" to the projects - or use FPGAs to implement them. Or we end up writing and verifying so much extra abstraction that the original project could have been manually ported ten times by the time the "portable" project finishes.

IMHO, the key to success is not to strive for an ideal world that does not exist in MCUs. After you accept this, you can leverage the features that are there, at low cost, and save a lot of time and money compared to rolling out an FPGA design (or using an esoteric, vendor-lock-in solution like the XMOS).

Which is why the arduino is a piece of shit!
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #54 on: January 19, 2022, 07:30:06 pm »
Which is why the arduino is a piece of s**t!

On the one hand it has an unbelievably massive amount of RAM for such a small device (2K), making it especially suited for massive programming projects (sarcasm).

But on the other hand, as you imply, its huge compatibility (with itself), tremendous popularity, availability (the IDE, open source, and cheap clones) and huge ecosystem (a massive range of libraries and example/completed software and hardware) have made it a massive success. Just a pity they couldn't have chosen some kind of up-and-coming ARM chip instead of the AVR they use. I know you can now get ARM-based ones, but the huge inertia of the pre-existing stuff means that the Arduino/ARM combination is somewhat unheard of compared to the usual AVR Atmel (now Microchip) versions.
« Last Edit: January 19, 2022, 07:35:56 pm by MK14 »
 

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #55 on: January 19, 2022, 07:44:22 pm »
As brief as my experience of programming is, every time I have to use an Arduino I find myself highly frustrated by the lack of access to things that I know the microcontroller has and take for granted, but that are unsupported in Arduino, like interrupts. The only interrupts available are the pin interrupts. Given the laughable utility of the system, having pin interrupts is over the top and is, well, only useful for keeping some part of the system working when the gobshite that some libraries are is running slow.

Let's face it, the best way to do code scheduling in Arduino is to get a PWM pin going and then set up an interrupt on it, or on another pin connected to it, to trigger a scheduler; see the sketch below. Any attempt at an Arduino scheduler that I have seen is an incomplete mess, because you can't have one scheduler working against the existing one, and surprise surprise, the one I looked at still used the delay function to flash the LED it was supposed to be flashing with the interrupt :palm:
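A minimal hedged sketch of that trick, assuming an Uno with PWM pin 9 jumpered to interrupt pin 2 (the pin choice and wiring are assumptions here, not a tested design):

Code: [Select]
/* PWM output on pin 9 is fed back into external-interrupt pin 2;
   the pin interrupt then provides a steady scheduler tick. */
volatile unsigned long ticks = 0;

void onTick(void) { ticks++; }          /* fires at the PWM frequency */

void setup(void) {
    pinMode(9, OUTPUT);
    analogWrite(9, 128);                /* ~490 Hz, 50% duty on an Uno */
    attachInterrupt(digitalPinToInterrupt(2), onTick, RISING);
}

void loop(void) {
    static unsigned long next = 490;    /* roughly one second of ticks */
    unsigned long t;
    noInterrupts(); t = ticks; interrupts();  /* atomic 32-bit read on AVR */
    if (t >= next) {
        next += 490;
        /* dispatch a periodic task here */
    }
}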

Trying to use millis() and micros() proved a nightmare for me, and I was trying to do things that I would normally simply do with a counter firing interrupts.....
 
The following users thanked this post: MK14

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #56 on: January 19, 2022, 07:45:11 pm »
Oh, and want a good one? Apparently on the Arduino, in integer math, 0/100 = 1!!!!!!!! And yes, I had to put a check in the program that would only allow the calculation to go ahead if the numerator was more than 100.............
 
The following users thanked this post: MK14

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4040
  • Country: nz
Re: [ARM] optimization and inline functions
« Reply #57 on: January 19, 2022, 08:32:57 pm »
It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.

The absence of branch prediction doesn't make the CPU cycle-accurate. There are caches and bus contention. For example, the CPU may compete with DMA for bus cycles which may lead to unpredictable delays.

Cortex M0{+} doesn't have caches.

Many systems they are used in don't have DMA (and you should know whether you have DMA or not).

Quote
BTW: MIPS eliminates "branch prediction penalty" completely by using delay slots. This doesn't make it cycle-accurate neither.

One or two early simple MIPS implementations with short pipelines and single-issue managed to cover the delay in fetching code on the branch-taken path by executing the instruction in the delay slot.

Later implementations with longer pipelines or dual or multiple issue or long latency RAM need branch prediction for performance just as much as anything else does.

Even with the simple short pipeline single-issue CPUs it's not always easy to find something useful for the branch delay slot -- something that is needed on both the branch taken and branch not taken paths, but that isn't needed to determine the branch condition -- and some large percentage of the time you end up with a NOP there, wasting time and program size.
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26915
  • Country: nl
    • NCT Developments
Re: [ARM] optimization and inline functions
« Reply #58 on: January 19, 2022, 08:51:10 pm »
It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.

The absence of branch prediction doesn't make the CPU cycle-accurate. There are caches and bus contention. For example, the CPU may compete with DMA for bus cycles which may lead to unpredictable delays.

Cortex M0{+} doesn't have caches.
Many have 'caches' in the form of flash accelerators (pre-fetch buffers), so the execution time of branches isn't 100% predictable.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 
The following users thanked this post: hans

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14490
  • Country: fr
Re: [ARM] optimization and inline functions
« Reply #59 on: January 19, 2022, 09:24:33 pm »
Yes. Some will reply that you can always (usually) force some piece of code to run from RAM to avoid flash caching issues. Then, even without DMA, you may still have interrupts, and having to disable interrupts may or may not cause other issues... etc.

So anyway. Relying on exact cycle execution on those modern CPUs often looks like tinkering and is very rarely worth it. As I said numerous times, using appropriate peripherals to do the job is the way to go in most cases. Good thing that many of those "modern" MCUs do embed a lot of peripherals with which you can implement a lot of stuff. Sure, NXP's FlexIO (or RPi's PIO) is very flexible for this, but even with just timers, PWM, input/output compares, FMC... you can implement a lot of "bit banging" with accurate timings.

If you still really need to toggle GPIOs directly with some *minimal* delay - again, it will be hard to make that accurate - there is one thing you can often do that many people do not think about: set the clock rate for the GPIO peripheral to an appropriate frequency. Most ARM Cortex-based MCUs allow this. So you could have the core run at, say, 100 MHz, and the GPIOs clocked at, say, 10 MHz (if this divider is available). Then two consecutive GPIO writes, without any explicit delay, will always take effect at least 100 ns apart. Maybe obvious, but something to think about.
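A sketch of that idea, with invented register addresses (real parts put their OUTSET/OUTCLR-style registers at part-specific addresses):

Code: [Select]
#include <stdint.h>

/* Hypothetical set/clear registers of a GPIO port clocked at 10 MHz. */
#define GPIO_OUTSET (*(volatile uint32_t *)0x40010004u)
#define GPIO_OUTCLR (*(volatile uint32_t *)0x40010008u)
#define PIN_MASK    (1u << 5)

void pulse_at_least_100ns(void)
{
    /* With the GPIO peripheral on a 10 MHz clock, these back-to-back
       stores take effect at least one GPIO clock (100 ns) apart, no
       matter how fast the core issues them. */
    GPIO_OUTSET = PIN_MASK;
    GPIO_OUTCLR = PIN_MASK;
}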
 
The following users thanked this post: hans, nctnico, MK14

Online NorthGuy

  • Super Contributor
  • ***
  • Posts: 3147
  • Country: ca
Re: [ARM] optimization and inline functions
« Reply #60 on: January 19, 2022, 09:28:45 pm »
One or two early simple MIPS implementations with short pipelines and single-issue managed to cover the delay in fetching code on the branch-taken path by executing the instruction in the delay slot.

Later implementations with longer pipelines or dual or multiple issue or long latency RAM need branch prediction for performance just as much as anything else does.

We're talking about "small" MCUs which don't have branch prediction, remember? This includes, for example, the PIC32MZ running at 250 MHz. For these, the delay slot works very well.

Look, for example, at the OP's original post. His loop runs at 1 MHz on a 4 MHz MCU. If this were MIPS, it would run at 2 MHz - twice as fast - thanks to the delay slot.

Even with the simple short pipeline single-issue CPUs it's not always easy to find something useful for the branch delay slot -- something that is needed on both the branch taken and branch not taken paths, but that isn't needed to determine the branch condition -- and some large percentage of the time you end up with a NOP there, wasting time and program size.

Usually you can find something quite easily. If you cannot, you don't have to use it - just put a NOP in there. It's certainly better than the M0, where you cannot put any instruction into the delay slot because there is no delay slot at all. On the M0, you get the delay whether you like it or not.
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: [ARM] optimization and inline functions
« Reply #61 on: January 19, 2022, 11:40:37 pm »
Quote
Cortex M0{+} doesn't have caches.
This doesn't stop vendors from implementing caches "outside" of the "CPU proper" (as part of the memory or flash system, for instance).

The Raspberry Pi RP2040 has a 16k cache (sorely needed since it fronts a QSPI program store), and almost every M0 implementation I've seen that runs faster than 32MHz has some kind of "flash accelerator" that is more complicated than just a wide bus.  I've seen the time taken by cycle-counting "delay" functions vary wildly depending on exactly where they ended up in the code (though that was on a CM4 which apparently had both a cache AND flash acceleration; still not an ARM-defined cache, though).

Often the cache and/or accelerator behavior is not documented very well.
 
The following users thanked this post: MK14

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: [ARM] optimization and inline functions
« Reply #62 on: January 19, 2022, 11:54:30 pm »
Quote
  By trying to make them portable, we ... sacrifice so much of the MCU capability
Quote
Which is why the arduino is a piece of shit!

Arduino certainly sacrifices much of the MCU capabilities in its provided "common function libraries."
But complaining about that discounts the VAST number of applications that do not in fact need any of the specialized MCU capabilities.
(And also the fact that you can still access those MCU capabilities with only slightly more trouble than it would have been to access them via bare metal programming or a vendor-specific "bloated intentionally to provide access to all features!" library.)

All my life I've watched features implemented to make efficient use of some special capability that in time became deprecated and ignored in favor of simply using faster infrastructure (CPU, memory, bandwidth).  Just the other night at a cisco Anniversary Pizza Party (virtual), we were talking about "remember how we thought doing video over the Internet would be impossible without IP Multicast?"  Sigh.
 
The following users thanked this post: MK14, DiTBho

Offline mikerj

  • Super Contributor
  • ***
  • Posts: 3240
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #63 on: January 20, 2022, 12:33:47 am »
As brief as my experience of programming is, every time I have to use an Arduino I find myself highly frustrated by the lack of access to things that I know the microcontroller has and take for granted, but that are unsupported in Arduino, like interrupts. The only interrupts available are the pin interrupts.

Untrue.  Only external interrupts are supported by the attachInterrupt() function (which uses function pointers), but regular avr-libc interrupt handlers can be included for any peripheral not already used by the Arduino ecosystem, e.g.:

Code: [Select]
ISR(TIMER3_COMPA_vect)
{
    /* Handle the Timer 3 output-compare A interrupt */
}
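For completeness, a hedged sketch of the setup side that would make that handler fire, using ATmega2560 Timer 3 register names (adjust for the actual part and clock):

Code: [Select]
#include <avr/io.h>
#include <avr/interrupt.h>

void timer3_init(void)
{
    TCCR3A = 0;                          /* CTC mode via WGM32 below */
    TCCR3B = (1 << WGM32) | (1 << CS31); /* CTC, clk/8 prescaler */
    OCR3A  = 1999;                       /* 16 MHz / 8 / 2000 = 1 kHz */
    TIMSK3 = (1 << OCIE3A);              /* enable compare-A interrupt */
    sei();                               /* global interrupt enable */
}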

« Last Edit: January 20, 2022, 12:36:14 am by mikerj »
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: [ARM] optimization and inline functions
« Reply #64 on: January 20, 2022, 07:54:34 am »
You cannot possibly rely on cycle accuracy if you write in C.

You can, if you put the timing-critical part in its own module, adjust it until it matches expectations (using a scope, for example), keep it small, compile it once, verify the module, and then just keep the object file.

For someone who doesn't feel confident writing assembly, this could be the easiest way. And if the other option is to explode the $2 BOM to $50 by introducing an FPGA and then starting to learn VHDL, or hiring someone who can do it, I can see the appeal of using non-optimal tools to get to the goal. Sometimes you just use your screwdriver as a hammer.
 
The following users thanked this post: MK14

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: [ARM] optimization and inline functions
« Reply #65 on: January 20, 2022, 08:01:18 am »
Arduino, by their own definition, is not supposed to be used by programmers or electronics designers, but by artists. The whole idea is you can just buy a shield and write led.blink(); and get an art project out of it. It needs to be dumbed down, it needs to be limited. Art projects also don't have strict requirements so you can always work with what you have.

By trying anything more challenging than that, you hit the limits, and it's not Arduino's fault. If you want to blame someone, blame fanboys who don't understand the limits.

But instead, I suggest you just completely ditch the Arduino software ecosystem. You can still use the boards, just program them like you program the microcontroller on the board.
« Last Edit: January 20, 2022, 08:02:50 am by Siwastaja »
 

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #66 on: January 20, 2022, 08:23:19 am »
Arduino, by their own definition, is not supposed to be used by programmers or electronics designers, but by artists. The whole idea is you can just buy a shield and write led.blink(); and get an art project out of it. It needs to be dumbed down, it needs to be limited. Art projects also don't have strict requirements so you can always work with what you have.

By trying anything more challenging than that, you hit the limits, and it's not Arduino's fault. If you want to blame someone, blame fanboys who don't understand the limits.

But instead, I suggest you just completely ditch the Arduino software ecosystem. You can still use the boards, just program them like you program the microcontroller on the board.

That is exactly what I tell people who come to me excited about the Arduino. I inherited one Arduino-based project, and my boss, who initially came to me with this revelation after seeing it in action, has accepted my opinion and is not insisting I use it. Not to mention the fact that, uh, all the SAMD micros are out of stock; not sure why that is.......
 
The following users thanked this post: MK14

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6266
  • Country: fi
    • My home page and email address
Re: [ARM] optimization and inline functions
« Reply #67 on: January 20, 2022, 09:07:47 am »
If you want cycle accuracy, C is indeed not the right tool.  Even when you go through the steps to get the binary that does the thing right, you need to keep using that binary (and not the C source), because any little change in the compiler or linker can/will throw the timing off.

Me, I like using GCC/IntelCC/Clang extended asm for such critical parts.  It differs from external assembly sources in that the extended asm construction explicitly tells the C compiler about the input and output registers (and if so wanted, even lets the C compiler choose the exact registers used), and what registers or memory gets clobbered.  When used in an inlined function, the compiler can adjust the assembly (register use) to best fit the surrounding code (and vice versa, because it knows exactly what registers etc. are used in the extended asm).
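As a small illustration (a sketch, not anyone's production code), here is a Cortex-M busy-wait written as extended asm; the point is the constraints, which tell the compiler exactly what the fragment consumes and clobbers:

Code: [Select]
#include <stdint.h>

static inline void delay_loops(uint32_t n)
{
    /* "+l": n is both read and written, and must live in a low
       register (r0-r7), as required by the 16-bit Thumb SUBS
       encoding.  "cc": the condition flags are clobbered by SUBS. */
    __asm__ volatile (
        "1: subs %0, %0, #1 \n\t"
        "   bne  1b         \n\t"
        : "+l" (n)
        :
        : "cc"
    );
}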

Things like timing-critical interrupt handlers are better written as external assembly files.  The C compiler really only needs its calling address, which is available at link time, and can be exported to C using a simple extern declaration.  (I like to rely on other ELF object file properties, like section attribute, to make the build machinery smoother, more capable, but still robust wrt. source code changes.)

For single-instruction multiple-data or SIMD stuff, I like using GCC/IntelCC/Clang vector extensions via the vector_size attribute (on the variable type).  Addition, subtraction, and multiplication on vector variables does the corresponding component-wise operation, and for the rest, the compiler provides built-in intrinsics (and a standardized set for x86-64 in <immintrin.h>).  The basic vector extensions work regardless of hardware support –– for example, when the hardware only supports say two components, the compiler uses two registers for a vector with four components, transparently and quite efficiently ––, so such code is actually portable across hardware.  For the same reasons as for extended asm, the compiler also generates quite efficient code for the intrinsics; typically much better than hand-written assembly, if we include the compiler-generated surrounding code in the consideration.
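For example, a tiny sketch of the vector-extension style (compiles with GCC and Clang for any target; on hardware with narrower SIMD, the compiler splits the work transparently):

Code: [Select]
/* Four packed single-precision floats, 16 bytes total. */
typedef float v4sf __attribute__((vector_size(16)));

/* Component-wise multiply-add: a[i]*b[i] + c[i] for i = 0..3.
   Maps to SIMD instructions where available, scalar code elsewhere. */
static inline v4sf muladd4(v4sf a, v4sf b, v4sf c)
{
    return a * b + c;
}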

Arduino is a somewhat toy-like/silly/coddling environment, designed to make quick development easy for non-programmers and non-technical people.
As usual, a lot of Arduino stuff is quite crappy, but there are some very nice nuggets in there among the cores and support libraries.  So, it's not all bad.
I happily use it for quick prototyping, although I am quite familiar with the cores (Arduino core code for the specific microcontroller chip) and the libraries I use, as I look through their source codes before I rely on them even a tiny little bit.  (That is the difference between "I think there is a library for that" and "I use this library for that" in my case; involving several hours of source code examination.)

In all this, the ability to read and understand assembly code is quite useful and rather important.  If you are like me, you rarely write any assembly from scratch, but you end up reading (the important or strange or critical parts of) compiler-generated assembly at least once every week.  I myself often end up examining the generated assembly for wildly different hardware –– x86-64, MIPS, AVR, some ARM Cortex –– to find out whether a particular C expression has issues on any hardware when compiled with a small set of compilers and compiler versions.

That is, I am basically never interested in what code ends up being optimal; I am interested in what code performs acceptably well in all situations, and what code patterns have issues with specific compilers and/or specific hardware architectures – the latter being more important than the former.  Even when I'm using the nonstandard C compiler and linker features above, I like to know what weaknesses the pattern I am applying has.  That way, not only do I have that tool in my toolbox, but it also has a small note with it that lists the known risks/deficiencies/weaknesses in addition to its strengths, and I can rummage quickly through those to find a tool that suits a particular situation.

(From my point of view, this also explains why I dislike language-lawyerism: from this point of view, the language-lawyers are claiming the text of the standard is more important than those notes based on real life observations.  The standard is better than nothing, but can never override the behaviour observed in the real world.)

Whether the same approach works for others, depends on their personal strengths and what they get paid for, I believe.
 
The following users thanked this post: hans, MK14

Online SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17821
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #68 on: January 20, 2022, 11:07:20 am »
The SAMC has a cache in the non-volatile memory controller, enabled by default. That may explain it.
 
The following users thanked this post: MK14

Offline josip

  • Regular Contributor
  • *
  • Posts: 152
  • Country: hr
Re: [ARM] optimization and inline functions
« Reply #69 on: January 20, 2022, 11:15:50 am »
The Raspberry Pi RP2040 has a 16k cache (sorely needed since it fronts a QSPI program store), and almost every M0 implementation I've seen that runs faster than 32MHz has some kind of "flash accelerator" that is more complicated than just a wide bus.  I've seen the time taken by cycle-counting "delay" functions vary wildly depending on exactly where they ended up in the code...

I am using a Kinetis M0+ @ 96 MHz with cycle-aligned code (bit banging) that runs from flash or RAM without issues. But I am coding in assembler. Branching can also be done right. Yes, undocumented things must be resolved by yourself. FlexIO is great, but some things can't be done with it.
« Last Edit: January 20, 2022, 11:17:25 am by josip »
 
The following users thanked this post: MK14

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 3915
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #70 on: January 20, 2022, 11:57:54 am »
Quote
BTW: MIPS eliminates "branch prediction penalty" completely by using delay slots. This doesn't make it cycle-accurate neither.

One or two early simple MIPS implementations with short pipelines and single-issue managed to cover the delay in fetching code on the branch-taken path by executing the instruction in the delay slot.

Later implementations with longer pipelines or dual or multiple issue or long latency RAM need branch prediction for performance just as much as anything else does.

I have worked with MIPS II, III, IV, including the R10K, R12K, R14K and R16K: yup, definitely long-pipeline MIPS, very different from the R2K (5 stages).

MIPS64R2 is also a long-pipeline MIPS (I have never yet seen a short one).
MIPS32R2 ... well, it depends on the chip manufacturer. Microchip's is a short-pipeline version, while the Atheros one is a long-pipeline version.

Even with the simple short pipeline single-issue CPUs it's not always easy to find something useful for the branch delay slot -- something that is needed on both the branch taken and branch not taken paths, but that isn't needed to determine the branch condition -- and some large percentage of the time you end up with a NOP there, wasting time and program size.

Yup, that's why academic code for the educational MIPS32 5-stage pipeline is often stuffed with NOPs.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #71 on: January 20, 2022, 01:14:51 pm »
If you want cycle accuracy, C is indeed not the right tool.

I like and appreciate the rest of the post, thanks for making it. But I find the bit I quoted misleading. Technically speaking you are right, it is not the right tool. But real life is not that simple.
For various reasons, even though assembly code might well be the 'RIGHT' tool for the job, it is not necessarily an allowed option. It could be a place of work where most of the programmers don't do assembly code, or an open source project where a decision has been made to leave assembly code out of it; there are numerous other examples of why assembly code is either disallowed by rules (e.g. by the company that wants the software) or considered a very bad idea because different architectures might be used, either now or in the future.
As already mentioned in this thread by a number of people, there are various suggestions on how to use C code and get the timing accuracy you desire. So there are many ways of achieving it, such as compiling the code, looking at the assembly listing and modifying the source if it doesn't come out as you intended, and/or watching/measuring what it does with a scope, adjusting the code as necessary.
Although usual MCUs do have hardware timers, they can be tied up doing other things, be unsuitable for extremely rapid transactions, or be unsuitable for the task in hand for many other reasons.

Increasingly these days, even assembly code itself is NOT that simple as regards consistent/exact timing. Caches and other stuff have already been mentioned, but once you get to out-of-order execution (and hence an already superscalar core), e.g. the Raspberry Pi 4, tiny changes to the assembly code may have dramatic effects on its running time/efficiency (i.e. which instructions you use, and where they are, can make the difference between 3 things being done in 1 clock cycle, or it taking 3 clock cycles to do the 3 instructions). Just putting in an apparently simple memory access may slow it down by a huge amount, because it relies on relatively slow DRAM accesses instead of the cache (which has run out of capacity, or is getting too many random accesses to keep up).
Which can make C the better tool, as its optimizer will tend to avoid such slowdowns automatically. I suppose I'm saying it's a double-edged sword: on the one hand you might want consistent and fast software responses, and hence need to keep the code optimized; but hand-coding in assembly becomes considerably harder if you have to keep it optimised on modern architectures.

Sorry, the last paragraph strayed from 'cycle accurate' coding into potentially highly optimised coding. But some projects need to do BOTH, i.e. they need to be fast (on a cost-efficient MCU) and reasonably consistent in their timing.
« Last Edit: January 20, 2022, 01:23:51 pm by MK14 »
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: [ARM] optimization and inline functions
« Reply #72 on: January 20, 2022, 01:33:14 pm »
"Reasonably consistent" is well said actually, and that is what even the most modern microcontrollers easily achieve. Running the code from external SD card aside (which you shouldn't do for timing-critical routines), it's all pretty much noise. Is this thing going to take 57µs or maybe sometimes 57.1µs? Who cares? Compare this to CS (software) mindset where Java suddenly garbage collecting for five seconds is A-OK.

Even with the old "simple" microcontrollers, actually writing cycle-accurate assembly was not the typical case, but a special rarity. Sometimes to bitbang an interface with no peripheral available.

But today, we have a better selection of peripherals. Even then, if no peripheral is available, sheer processing power lets you "bitbang" without cycle-accurate code; examples would be combining timer peripherals and code (polling for a flag), or interrupt-driven code. (An example would be an SDLC implementation at 1 Mbit/s, which is well possible on a Cortex-M7 @ 400 MHz without cycle accuracy, but challenging on an AVR @ 16 MHz even with cycle-accurate code.) A sketch of the timer-polling approach follows.
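A hedged sketch of that timer-polling flavour, using STM32/CMSIS register names; it assumes TIM3 has already been configured to update once per bit period, and that the pin lives on GPIOA:

Code: [Select]
#include "stm32f4xx.h"   /* assumed device header */

/* Bit-bang n bits on one GPIO pin, paced by TIM3's update flag, so
   the bit timing comes from the hardware timer rather than from
   counting instruction cycles. */
void send_bits(const uint8_t *bits, uint32_t n, uint32_t pin_mask)
{
    for (uint32_t i = 0; i < n; i++) {
        while (!(TIM3->SR & TIM_SR_UIF)) { }  /* wait one bit period */
        TIM3->SR = ~TIM_SR_UIF;               /* clear the update flag */
        GPIOA->BSRR = bits[i] ? pin_mask          /* drive pin high */
                              : (pin_mask << 16); /* drive pin low  */
    }
}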

Just the STM32 basic timer's One Pulse Mode, combined with ITCM and a relocatable RAM-based vector table, configurable interrupt priorities and software interrupts, creates a freely programmable state-machine engine capable of timesteps in excess of 5-10 MHz equivalent or so, with jitter of just hundreds of nanoseconds; and it can run the rest of the application in "parallel"! For speed, this is basically only one order of magnitude behind FPGAs.

So the fact that writing cycle-accurate code is "difficult" is quite uninteresting in practice, because it is very rarely needed. But every now and then that special niche pops up, and if you can save $50 in BOM and 1 year in development time by avoiding turning it into an FPGA project, by using a screwdriver as a hammer and leaving a few ugly dents in the process, maybe, why not.
« Last Edit: January 20, 2022, 01:43:20 pm by Siwastaja »
 
The following users thanked this post: MK14

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6266
  • Country: fi
    • My home page and email address
Re: [ARM] optimization and inline functions
« Reply #73 on: January 20, 2022, 02:07:00 pm »
As already mentioned in this thread by a number of people, there are various suggestions on how to use C code and get the timing accuracy you desire.
You do not actually get the timing accuracy from the C code; you get it from that specific version of the C compiler and linker, and the options used.

I am not saying that one should never do such stuff in C.  I am saying that if you do, the timing depends on the C compiler and linker, their exact versions, and the compile options (including target options, obviously).  There are absolutely no guarantees, or even hints, that a different compiler, or even a future version of the same compiler, will optimize it the same way.

While you know this, many beginner C programmers do not, even though they tend to be very keen on optimization.  Thus, being careful with the wording here, to convey an accurate understanding of the limitations, is important.
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #74 on: January 20, 2022, 02:41:46 pm »
As already mentioned in this thread by a number of people, there are various suggestions on how to use C code and get the timing accuracy you desire.
You do not actually get the timing accuracy from the C code; you get it from that specific version of the C compiler and linker, and the options used.

I am not saying that one should never do such stuff in C.  I am saying that if you do, the timing depends on the C compiler and linker, their exact versions, and the compile options (including target options, obviously).  There are absolutely no guarantees, or even hints, that a different compiler, or even a future version of the same compiler, will optimize it the same way.

While you know this, many beginner C programmers do not, even though they tend to be very keen on optimization.  Thus, being careful with the wording here, to convey an accurate understanding of the limitations, is important.

I agree with you. It is a problem, as many people, from complete beginners to advanced C programmers and beyond, may well read this and other threads.

You can partially control the timing of C software by triggering it from interrupts on, or by polling, the MCU's hardware timers, right at the start of the applicable routine. That gives it a degree of timing stability/consistency. You can also use the hardware timers to record the precise end time, so that you can generate diagnostic timing-jitter information, to help get the software somewhat right even when expensive test equipment is not being used.
For example, with PID control loops it is not necessarily about being run at precisely the right time, perhaps once every millisecond, but more about using the MCU's hardware timers to read what the precise time is NOW, doing the PID calculations taking the jitter into account, producing new outputs, and then re-enabling interrupts (if applicable; obviously disable them towards/at the beginning of that section of the code).
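A hedged sketch of that idea; read_timer_us() stands in for the MCU's free-running microsecond counter and is an assumption, not any particular vendor's API:

Code: [Select]
#include <stdint.h>

extern uint32_t read_timer_us(void);  /* free-running HW counter (assumed) */

struct pid {
    float kp, ki, kd;
    float integral, prev_error;
    uint32_t prev_us;
};

/* One PID step using the measured elapsed time, so scheduling jitter
   feeds into the integral and derivative terms instead of silently
   corrupting them. */
float pid_step(struct pid *p, float setpoint, float measured)
{
    uint32_t now = read_timer_us();
    float dt = (float)(now - p->prev_us) * 1e-6f;  /* seconds; wrap-safe */
    p->prev_us = now;

    float error = setpoint - measured;
    p->integral += error * dt;
    float derivative = (dt > 0.0f) ? (error - p->prev_error) / dt : 0.0f;
    p->prev_error = error;

    return p->kp * error + p->ki * p->integral + p->kd * derivative;
}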

There are lots of pitfalls with assembly code. Some customers/products have far too woolly/vague requirements and specifications, and/or you can't trust the people in charge NOT to change their mind about what needs to be done on a week-by-week basis. This can make assembly code especially problematic. A long time ago there were many good assembly language programmers around; increasingly these days, decent assembly language programmers are becoming a rare thing.
I like to do fun things in assembly. But the latest processors, with thousands of complicated instructions, can be a real chore to write for, rather than the fun experience older architectures can be.

EDIT: A vague spec is not necessarily a problem, as long as the assembly code is strictly limited to just one or two interface ports and defined stuff like that. It is when large chunks of the project are coded in assembly that regularly changing requirements become an issue.
My details in this post only apply to some particular embedded projects; in the big wide world there is a huge variety of possibilities, and anyway, some use a real-time operating system (RTOS), which is another big ball game in itself.
« Last Edit: January 20, 2022, 02:59:59 pm by MK14 »
 
The following users thanked this post: Nominal Animal

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4040
  • Country: nz
Re: [ARM] optimization and inline functions
« Reply #75 on: January 20, 2022, 08:52:51 pm »
You cannot possibly rely on cycle accuracy if you write in C.

You can, if you put the timing-critical part in its own module, adjust it until it matches expectations (using a scope, for example), keep it small, compile it once, verify the module, and then just keep the object file.

For someone who doesn't feel confident writing assembly, this could be the easiest way.

Or, do gcc -S and keep the .s assembly language file as the source code from then on, possibly after some manual cleanup.
 
The following users thanked this post: MK14

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4040
  • Country: nz
Re: [ARM] optimization and inline functions
« Reply #76 on: January 20, 2022, 09:08:16 pm »
Arduino, by their own definition, is not supposed to be used by programmers or electronics designers, but by artists. The whole idea is you can just buy a shield and write led.blink(); and get an art project out of it. It needs to be dumbed down, it needs to be limited. Art projects also don't have strict requirements so you can always work with what you have.

By trying anything more challenging than that, you hit the limits, and it's not Arduino's fault. If you want to blame someone, blame fanboys who don't understand the limits.

But instead, I suggest you just completely ditch the Arduino software ecosystem. You can still use the boards, just program them like you program the microcontroller on the board.

Not only the boards. Other parts of the ecosystem are usable.

The Arduino ecosystem has quite a number of different parts:

- boards with a particular form-factor and connector layout (several of them: Uno, Mega, pro mini), often copied by other manufacturers.

- accessories designed to be plugged onto those boards

- an IDE with a limited but usable text editor and terminal emulator

- a project manager integrated into the IDE. This is pretty crap as soon as you want multiple source files.

- a manager for build tools. This is pretty darn useful -- enter a URL in a preferences dialog and BOOM you get a compiler&linker (usually GCC), download tool (avrdude, openocd etc), headers and libraries for a new CPU type instantly, all working together.

- a library providing access to some basic functions of MCUs, with an API that is portable to just about anything. Good enough for many beginner tasks, but maligned by "professionals" who sometimes don't seem to understand that they DON'T HAVE TO USE IT and can program some or all of their app to the bare metal instead, if they don't care about portability.

- a set of example programs. Very useful to get beginners going.

- a huge set of 3rd party libraries to interface to just about anything. Code is of very variable quality, and quite a lot of it is a bit AVR specific.

- extensive tutorials all over the internet


I don't understand how anyone can form an opinion that this is not an extremely valuable (if inevitably imperfect) contribution to the microcontroller community.
 
The following users thanked this post: MK14

