Author Topic: [ARM] optimization and inline functions  (Read 7693 times)


Offline SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17819
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #25 on: January 19, 2022, 09:32:49 am »
where exactly do I put the -s option for assembler output?
 

Offline SimonTopic starter

  • Global Moderator
  • *****
  • Posts: 17819
  • Country: gb
  • Did that just blow up? No? might work after all !!
    • Simon's Electronics
Re: [ARM] optimization and inline functions
« Reply #26 on: January 19, 2022, 09:44:15 am »
Is this what you all wanted?
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 392
  • Country: be
Re: [ARM] optimization and inline functions
« Reply #27 on: January 19, 2022, 10:28:19 am »
Is this what you all wanted?

Only this part:

Code: [Select]
void pinT(uint8_t x)
{
( ( *(volatile uint32_t * )( PORTS_BASE + (x >> 5) * PORTn_offset + PORTn_OUTTGL ) ) = 0x01 << (x & PinNumberMask) );
 1c0: 4a05      ldr r2, [pc, #20] ; (1d8 <main+0x30>)
 1c2: 6014      str r4, [r2, #0]
 1c4: 6014      str r4, [r2, #0]
 1c6: e7fc      b.n 1c2 <main+0x1a>
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #28 on: January 19, 2022, 10:29:31 am »
Is this what you all wanted?

Well it is keeping me happy  8) .
So, it is doing (presumably, as only one version is listed) ONE or TWO of the port toggles (depending on compiler optimization level, and on whether the code is in a header or in another .c file), followed by an unconditional branch. If I'm interpreting things correctly, and seeing the right file, etc.
Hence the duty cycle is 50% for a single port-pin toggle, but 25% vs 75% for the pair: the adjacent pair of port toggles is quick, while the unconditional branch is slower (extra cycle(s)) on an older-generation ARM core (M0+).
I'm still surprised it is 25% : 75% rather than 33.33% : 66.66%. But I suppose the branch could be especially slow, because of extra cycle count and/or pipeline delays (mentioned elsewhere).

There could be a GCC flag setting to change how much 'loop unrolling' occurs, which might put more (or fewer) port toggles into the code, depending on the setting.
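For what it's worth, GCC's unrolling can be steered globally with `-funroll-loops` or, per loop, with `#pragma GCC unroll N` (GCC 8 and later). A minimal sketch; `toggle_n_times` and its shadow register are my own stand-ins (not from Simon's code), used so the effect is testable off-target:

```c
#include <stdint.h>

/* Hypothetical stand-in for writing the OUTTGL register: each "toggle"
   XORs the pin mask into a shadow value instead of real hardware. */
uint32_t toggle_n_times(uint32_t mask, unsigned n)
{
    volatile uint32_t outtgl_shadow = 0;

    /* Ask GCC to unroll this loop 4x; against a real OUTTGL register,
       each unrolled iteration would become one back-to-back str. */
    #pragma GCC unroll 4
    for (unsigned i = 0; i < n; i++)
        outtgl_shadow ^= mask;

    return outtgl_shadow;
}
```

An even number of toggles cancels out and an odd number leaves the mask set; the unrolling changes only the generated code, not the result.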
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 392
  • Country: be
Re: [ARM] optimization and inline functions
« Reply #29 on: January 19, 2022, 10:40:23 am »
Code: [Select]
1c2: str r4, [r2, #0]  <-- toggle bit, 1 cycle
 1c4: str r4, [r2, #0]  <-- toggle bit, 1 cycle
 1c6: b.n 1c2           <-- jump back, 2 cycles

4 cycles in total, 25% duty cycle.
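Spelled out as arithmetic: the pin is high for the 1 cycle between the two stores, and low for the remaining 3 cycles (the second store plus the 2-cycle branch). A trivial helper (the name is mine, not from the thread) makes the calculation explicit:

```c
/* Duty cycle of a toggle loop, in percent: the pin is high for
   `high_cycles` and low for `low_cycles` of each period. */
unsigned duty_percent(unsigned high_cycles, unsigned low_cycles)
{
    return 100u * high_cycles / (high_cycles + low_cycles);
}
```

With the listing above, duty_percent(1, 3) gives the 25% figure.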
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #30 on: January 19, 2022, 10:46:32 am »
Code: [Select]
1c2: str r4, [r2, #0]  <-- toggle bit, 1 cycle
 1c4: str r4, [r2, #0]  <-- toggle bit, 1 cycle
 1c6: b.n 1c2           <-- jump back, 2 cycles

4 cycles in total, 25% duty cycle.

Thanks, that makes it nice and clear.

I am a bit confused as to why it takes 2 cycles rather than 1 for the unconditional branch. I thought it was only CONDITIONAL branches which took longer?
N.B. Not disagreeing, just wondering why. Essentially the unconditional branch takes one more cycle than a normal (1-cycle) instruction. While putting together the answer for Simon, I did read it might be to do with the pipeline.
I suppose the pipeline needs an extra cycle to know what the next instruction is, or something like that.
« Last Edit: January 19, 2022, 10:49:23 am by MK14 »
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: [ARM] optimization and inline functions
« Reply #31 on: January 19, 2022, 10:48:32 am »
Yeah, and now running from flash vs. running from RAM might make a difference too, if the clock frequency is high enough that flash needs wait states: prefetch fetches multiple instructions so that "linear" code runs at full speed, but jumps require waiting for a flash access. Then again, some very simple cache system like ST's "ART accelerator" might make that jump happen in 1 cycle after all.

where exactly do I put the -s option for assembler output?

-S, not -s, and it goes on the GCC command line. If using an IDE, refer to its documentation on how to generate assembly output. Starting a debugging session is another obvious way to see the assembly, and even single-step through it.
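For the record, a typical command line (the file names and target flags here are illustrative; adjust for your toolchain and part):

```shell
# Emit assembly instead of an object file; result lands in blink.s
arm-none-eabi-gcc -mcpu=cortex-m0plus -mthumb -O1 -S blink.c -o blink.s

# Or disassemble an already-built ELF, interleaved with source lines
arm-none-eabi-objdump -dS firmware.elf > firmware.lst
```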
 
The following users thanked this post: MK14

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6266
  • Country: fi
    • My home page and email address
Re: [ARM] optimization and inline functions
« Reply #32 on: January 19, 2022, 10:52:50 am »
To make it absolutely clear to those who do not understand the disassembly, the key part is that
Code: [Select]
1aa: 2480  movs  r4, #128 ; r4 = 0x80
     [unrelated stuff]
1b2: 00a4  lsls  r4, r4, #2 ; r4 = 0x200
     [unrelated stuff]
1c0: 4a05  ldr   r2, [pc, #20] ; r2 = 0x1d8

1c2: 6014  str   r4, [r2, #0]
1c4: 6014  str   r4, [r2, #0]
1c6: e7fc  b.n   1c2

1d8: 6000009c .word 0x6000009c
is equivalent to
Code: [Select]
    while (1) {
        *(volatile uint32_t *)0x6000009c = 0x200;
        *(volatile uint32_t *)0x6000009c = 0x200;
    }
in C.

I am a bit confused as to why it takes 2 cycles rather than 1 for the unconditional branch. I thought it was only CONDITIONAL branches which took longer?
No: Conditional branches on Cortex-M0+ take 2 cycles if taken, 1 if not taken; unconditional branches take 2 cycles; unconditional branches with link take 3 cycles; and unconditional branches with exchange, or with link and exchange, take 2 cycles. Slightly odd, but it's documented that way.
« Last Edit: January 19, 2022, 10:56:09 am by Nominal Animal »
 
The following users thanked this post: MK14

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 392
  • Country: be
Re: [ARM] optimization and inline functions
« Reply #33 on: January 19, 2022, 10:54:00 am »
On M0+ cores, conditional branches execute in 1 cycle if not taken, and in 2 cycles if taken. True, because of the pipeline.
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #34 on: January 19, 2022, 10:59:06 am »
I am a bit confused as to why it takes 2 cycles rather than 1 for the unconditional branch. I thought it was only CONDITIONAL branches which took longer?
No: Conditional branches on Cortex-M0+ take 2 cycles if taken, 1 if not taken; unconditional branches take 2 cycles; unconditional branches with link take 3 cycles; and unconditional branches with exchange, or with link and exchange, take 2 cycles. Slightly odd, but it's documented that way.

Thanks, I see. So, ironically, conditional branches can actually be 1 cycle faster than unconditional ones, if you suitably re-arrange the software so that the conditional branch is (mostly) NOT taken.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4040
  • Country: nz
Re: [ARM] optimization and inline functions
« Reply #35 on: January 19, 2022, 11:07:42 am »
Code: [Select]
1c2: str r4, [r2, #0]  <-- toggle bit, 1 cycle
 1c4: str r4, [r2, #0]  <-- toggle bit, 1 cycle
 1c6: b.n 1c2           <-- jump back, 2 cycles

4 cycles in total, 25% duty cycle.

Thanks, that makes it nice and clear.

I am a bit confused as to why it takes 2 cycles rather than 1 for the unconditional branch. I thought it was only CONDITIONAL branches which took longer?
N.B. Not disagreeing, just wondering why. Essentially the unconditional branch takes one more cycle than a normal (1-cycle) instruction. While putting together the answer for Simon, I did read it might be to do with the pipeline.
I suppose the pipeline needs an extra cycle to know what the next instruction is, or something like that.

Minimum size CPU cores such as the Cortex M0 or various tiny RISC-V cores (SiFive E20, PULP Zero-Riscy) dispense with such "wastes" of silicon area as branch prediction and caches. They somewhat compensate by using a 2 or 3 stage pipeline instead of 5 stage (or more), which limits the clock speed but also keeps the "branch mispredict penalty" (*every* branch, since they don't try to predict) down to 1 extra cycle.

It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.
 
The following users thanked this post: hans, MK14

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 392
  • Country: be
Re: [ARM] optimization and inline functions
« Reply #36 on: January 19, 2022, 11:08:25 am »
So, ironically, conditional branches can actually be 1 cycle faster than unconditional ones, if you suitably re-arrange the software so that the conditional branch is (mostly) NOT taken.

Yep, and gcc provides a function for this, __builtin_expect().

The use would be:

Code: [Select]
#define likely(x)      __builtin_expect(!!(x), 1)
#define unlikely(x)    __builtin_expect(!!(x), 0)

if (likely (condition)) {
    ...
} else {
    ...
}
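A concrete use (the function below is my own illustration, not from the thread): annotate the rare overflow path as unlikely, so GCC lays the common case out as the straight-line fall-through, i.e. the cheap not-taken direction on M0+:

```c
#include <stdint.h>

#define likely(x)      __builtin_expect(!!(x), 1)
#define unlikely(x)    __builtin_expect(!!(x), 0)

/* Add with overflow check: the error return is the cold, taken branch;
   the successful add falls through. Semantics are unchanged either way,
   only the branch layout (and thus the cycle count of the hot path). */
int checked_add(uint32_t a, uint32_t b, uint32_t *sum)
{
    if (unlikely(a > UINT32_MAX - b))
        return -1;          /* rare: overflow */
    *sum = a + b;           /* hot path */
    return 0;
}
```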

 
The following users thanked this post: hans

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4040
  • Country: nz
Re: [ARM] optimization and inline functions
« Reply #37 on: January 19, 2022, 11:09:45 am »
I am a bit confused as to why it takes 2 cycles rather than 1 for the unconditional branch. I thought it was only CONDITIONAL branches which took longer?
No: Conditional branches on Cortex-M0+ take 2 cycles if taken, 1 if not taken; unconditional branches take 2 cycles; unconditional branches with link take 3 cycles; and unconditional branches with exchange, or with link and exchange, take 2 cycles. Slightly odd, but it's documented that way.

Thanks, I see. So, ironically, conditional branches can actually be 1 cycle faster than unconditional ones, if you suitably re-arrange the software so that the conditional branch is (mostly) NOT taken.

That's still 1 cycle slower than an unconditional branch to the next instruction, which can be optimised out by the programmer/assembler/linker, thus taking 0 cycles.
 
The following users thanked this post: MK14

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #38 on: January 19, 2022, 11:24:46 am »
That's still 1 cycle slower than an unconditional branch to the next instruction, which can be optimised out by the programmer/assembler/linker, thus taking 0 cycles.

True, you are right.
But I really meant hand-coded assembly, where there are alternative ways of coding the innermost part of a loop: one with the innermost branch being unconditional, the other with it being a conditional branch (which is NOT normally taken).
The eventual loop back to the beginning of the loop is then another, additional branch instruction.

The above explanation is probably NOT the best or most efficient/fastest way of programming it, just my way of explaining the concept.
« Last Edit: January 19, 2022, 11:28:10 am by MK14 »
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #39 on: January 19, 2022, 11:44:51 am »
Minimum size CPU cores such as the Cortex M0 or various tiny RISC-V cores (SiFive E20, PULP Zero-Riscy) dispense with such "wastes" of silicon area as branch prediction and caches. They somewhat compensate by using a 2 or 3 stage pipeline instead of 5 stage (or more), which limits the clock speed but also keeps the "branch mispredict penalty" (*every* branch, since they don't try to predict) down to 1 extra cycle.

It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.

Thanks, that makes a huge amount of sense. Cycle-accurate code (hence deterministic) is extremely useful for embedded applications, especially when bit-banging, but also for working out and checking worst-case execution times and latencies.

It is amazing how small the 'tiny' RISC-V cores have been made. Like the bit-serial ones: not especially fast, but they fit into amazingly tight (FPGA) spaces.

I'm used to the luxury of later architectures (such as PCs), which can eliminate unconditional branches (assuming the compiler/linker etc. put them in the code) by using out-of-order execution and/or instruction-prefetch queues. I.e. in most cases it runs as fast as if the unconditional branch wasn't even there.
« Last Edit: January 19, 2022, 11:47:22 am by MK14 »
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: [ARM] optimization and inline functions
« Reply #40 on: January 19, 2022, 12:39:41 pm »
It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.

Yes, it is quite simple. Cortex M7 has branch prediction (which can be turned off for execution of cycle-accurate routines!) and dual issue, which make this harder, but if we forget about the M7, simpler 32-bit ARMs are not that weird.

There are a few traps for young players, like flash wait states and prefetch, but other than that you definitely can write cycle-accurate code, like on AVR. I have definitely done that once or twice, and I don't find it difficult.

Now, the point some are making is not so much that it is difficult, but that it's simply not necessary as often as it used to be. The point I repeatedly make when replying to tggzzz's posts: in actual control applications (as opposed to academic interest), only absolute time matters. If something needs to happen at earliest in 1µs and at latest in 2µs, this might require assembly programming and cycle counting on an AVR running at 8MHz, but is a breeze on a Cortex-M7 @ 400MHz, using interrupts and configurable interrupt priorities, in simple C.
 
The following users thanked this post: hans

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4040
  • Country: nz
Re: [ARM] optimization and inline functions
« Reply #41 on: January 19, 2022, 12:40:30 pm »
That's still 1 cycle slower than an unconditional branch to the next instruction, which can be optimised out by the programmer/assembler/linker, thus taking 0 cycles.

True, you are right.
But I really meant hand-coded assembly, where there are alternative ways of coding the innermost part of a loop: one with the innermost branch being unconditional, the other with it being a conditional branch (which is NOT normally taken).
The eventual loop back to the beginning of the loop is then another, additional branch instruction.

I'd struggle to think of a CPU, simple or not, on which it is faster to execute...

Code: [Select]
loop:
   // stuff x, possibly empty
   Bcond exit
   // stuff y
   B loop
exit:

... than ...

Code: [Select]
  B entry
loop:
  // stuff y
entry:
  // stuff x
  Binvcond loop

In the rather common case where stuff x is in fact empty e.g. a common or garden while(){} loop, it's generally faster and no more code size to even do this ...

Code: [Select]
  Bcond exit
loop:
  // stuff y
  Binvcond loop
exit:
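In C terms (my sketch, not brucehoult's code), that last rotated shape is what optimizing compilers typically produce for a plain while loop; written out by hand with gotos it looks like this:

```c
/* Bottom-tested loop with one guard test at the top: the guard (Bcond exit)
   is taken only when the loop runs zero times, and the back-branch
   (Binvcond loop) is not taken only on the final iteration. */
unsigned sum_below(unsigned n)
{
    unsigned s = 0, i = 0;

    if (i >= n) goto exit;    /* Bcond exit: skip an empty loop      */
loop:
    s += i;                   /* stuff y                             */
    i++;
    if (i < n) goto loop;     /* Binvcond loop: falls through once   */
exit:
    return s;
}
```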
« Last Edit: January 19, 2022, 12:42:08 pm by brucehoult »
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #42 on: January 19, 2022, 12:58:15 pm »
I'd struggle to think of a CPU, simple or not, on which it is faster to execute...

Sure, here we go. It always starts at PointA, and always (except when carry==1) must finish at PointB. Assume the C(arry) flag is normally/mostly clear for this program extract.

Code: [Select]

PointA:
    BEQ OtherStuff

    DoSomeStuff....
    BRA PointB                // option 1: unconditional jump, +2 cycles
 // BCS SomewhereToDoStuff    // option 2: conditional (falls into OtherStuff: if not taken), +1 cycle

OtherStuff:
    ....

PointB:
    // Checks the status of the Carry flag and does something if it is set
    BCS SomewhereToDoStuff


EDIT: That example above is not especially convincing, so let me try to explain it in words. The unconditional branch only jumps, and consumes 2 cycles.
The conditional jump takes only one cycle (when NOT taken), and also usefully checks the status of something (usually a flag), so it can occasionally/rarely take the branch as well; all for 1 cycle (2 if the branch is taken). So there is plenty of room for a speed-up, as it is one less cycle plus extra functionality (an optional condition check).
tl;dr
It should be possible to get a speed-up, even if I can't immediately create a complete and efficient program that demonstrates it in the code section above. But, at this stage, I won't rule out it NOT being possible to gain a speed advantage.
« Last Edit: January 19, 2022, 01:35:06 pm by MK14 »
 

Online hans

  • Super Contributor
  • ***
  • Posts: 1641
  • Country: nl
Re: [ARM] optimization and inline functions
« Reply #43 on: January 19, 2022, 03:53:49 pm »
It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.

Yes, it is quite simple. Cortex M7 has branch prediction (which can be turned off for execution of cycle-accurate routines!) and dual issue, which make this harder, but if we forget about the M7, simpler 32-bit ARMs are not that weird.

There are a few traps for young players, like flash wait states and prefetch, but other than that you definitely can write cycle-accurate code, like on AVR. I have definitely done that once or twice, and I don't find it difficult.

Now, the point some are making is not so much that it is difficult, but that it's simply not necessary as often as it used to be. The point I repeatedly make when replying to tggzzz's posts: in actual control applications (as opposed to academic interest), only absolute time matters. If something needs to happen at earliest in 1µs and at latest in 2µs, this might require assembly programming and cycle counting on an AVR running at 8MHz, but is a breeze on a Cortex-M7 @ 400MHz, using interrupts and configurable interrupt priorities, in simple C.

I still find writing cycle-accurate code a bit iffy. I can see where Bruce is coming from: the pipeline is short and a branch will only flush 1 stage, so it's easy to predict how fast code runs. FLASH accelerators/caches can be avoided by putting the critical routine in (TCM)RAM. Put variables in a different SRAM so there is no bus conflict on mem-load/store instructions. I can see how that will work out, but when streamlining development I would have some concerns with that; some technical, some about the development process, I suppose.

First, the example of Simon demonstrates a >50x speed-up: from 45kHz (no optimization), to 111kHz (no inlining), to 1MHz, with a peak toggle rate of 2MHz; unfortunately we can't unroll the loop infinitely. An IRQ that must be handled within a short deadline (e.g. your 2µs example) is probably not a problem on a fast MCU, but perhaps only when the code is compiled with some kind of optimization, or written in assembler so that it is hardened against what the compiler does. It's not useful to have a program that breaks on different compiler settings, as also demonstrated with my race-condition example.
(You could choose optimizations per source file or even function to mitigate this problem to some degree. And in some applications, like real-time control, setting breakpoints can be a sin, so having code work on multiple optimization settings is not necessary)

I like to keep my code portable as much as I can. I'm not going to port my application to a different MCU every week (although with these part shortages...), I do want to be able to test functional behaviour of the exact same code on a PC (unit tests) and on the actual MCU. Unit tests are hard to run for assembler code, unless one happens to have an emulator.

In addition, I'd rather bit-bang a protocol with DMA/timers/GPIOs/SPI and some glue logic. The DMA buffers that are read/written can be unit tested. However, timing is not possible to unittest on a PC, so therefore I prefer to map timing critical operations onto hardware peripherals.

It's OK not to agree with my concerns. Perhaps my approach has been influenced too much by regular software programming techniques/trends (like test-driven development and continuous integration systems). Or perhaps I haven't come across a project where a requirement absolutely needs to be squeezed onto a small CPU, with no other way than doing the hard work directly in assembler with just the MCU and a programming cable. But if at all possible, I like to make those tedious/error-prone design steps unnecessary.
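To illustrate the split being described (all names here are hypothetical, not from any real HAL): keep the timing in a timer-paced DMA channel, and keep the logic in a pure function that only fills the DMA buffer. The pure part compiles and unit-tests on a PC unchanged:

```c
#include <stddef.h>
#include <stdint.h>

/* Build one output frame into a caller-supplied DMA buffer:
   start marker, payload, XOR checksum. No register access, so the same
   code runs under a PC unit test and on the target. */
size_t build_frame(uint8_t *buf, size_t cap,
                   const uint8_t *payload, size_t n)
{
    if (cap < n + 2)
        return 0;                 /* buffer too small */

    buf[0] = 0xAA;                /* hypothetical start-of-frame marker */
    uint8_t sum = buf[0];
    for (size_t i = 0; i < n; i++) {
        buf[1 + i] = payload[i];
        sum ^= payload[i];
    }
    buf[n + 1] = sum;             /* trailing checksum */
    return n + 2;
}
```

On target, the buffer this fills would simply be handed to the (vendor-specific) DMA setup; the timing lives entirely in the peripheral.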
« Last Edit: January 19, 2022, 03:56:05 pm by hans »
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3147
  • Country: ca
Re: [ARM] optimization and inline functions
« Reply #44 on: January 19, 2022, 03:57:12 pm »
It also means that -- contrary to what some are saying here -- it is, by design, very easy to write cycle-accurate code on them.

The absence of branch prediction doesn't make the CPU cycle-accurate. There are caches and bus contention. For example, the CPU may compete with DMA for bus cycles which may lead to unpredictable delays.

BTW: MIPS eliminates the "branch prediction penalty" completely by using delay slots. This doesn't make it cycle-accurate either.
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8180
  • Country: fi
Re: [ARM] optimization and inline functions
« Reply #45 on: January 19, 2022, 04:15:23 pm »
Well, microcontroller projects (the ones that actually control things, using peripherals etc.) just are not portable, we have to accept this. By trying to make them portable, we either sacrifice so much of the MCU capability that we just say "no" to the projects - or use FPGAs to implement them. Or, we are writing and verifying so much extra abstraction that the original project would have been manually ported ten times when the "portable" project finishes.

IMHO, the key to success is not to strive for ideal world that does not exist in MCUs. After you accept this, you can leverage the features that are there, for the low cost, and save a lot of time and money compared to rolling out an FPGA design (or using an esoteric, vendor-lock-in solution like the XMOS).

You need to make compromises regarding idealism. You have to accept that -O0 is not supposed to produce a working project, so... just don't use it. You need to rely on compiler optimizations, but only to a point, not for cycle accuracy. If you need cycle accuracy, you are in a special case, and you need to prove to yourself and others that other ways of doing it are even more difficult or expensive.

A simple example: you need to react to an analog event within 100 clock cycles, by setting a pin high. You set up the comparator registers, write the interrupt address to the vector table, enable the interrupt at highest priority, and as a very first operation on the ISR function, write to the GPIO register. 10 minutes of work. You test it, and it works perfectly, as expected.
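As a sketch of that 10-minute version (the register address, bit, and handler name below are placeholders, not any particular vendor's map):

```c
#include <stdint.h>

/* Hypothetical GPIO bit-set register and pin mask. */
#define GPIO_BSRR  (*(volatile uint32_t *)0x48000018u)
#define PIN5_HIGH  (1u << 5)

/* Comparator interrupt handler: the pin write is the very first
   statement, so the response time is entry latency plus one store. */
void COMP_IRQHandler(void)
{
    GPIO_BSRR = PIN5_HIGH;
    /* ...then clear the comparator's interrupt flag, etc. */
}
```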

Then you start to think about it. Interrupt entry latency is 12 cycles. Does the GPIO operation require loading an address to a register, from program memory? Am I running code out of flash and if yes, is this part of program memory beyond the prefetch range of the flash? Heck, even if the code is in ITCM, is the vector table in flash? If it is, does the core load the vector address in parallel with stacking the registers? Probably yes, but do I need to fully read the Cortex-Mwhatever manual every time I do this?

And at some point, thinking about it changes to overthinking about it. The threshold depends on the margin you originally had.

But in the end, having the port register qualified volatile also means the compiler cannot reorder the ISR so that the port write ends up last. The compiler is of course allowed to insert an unnecessary calculation of Pi before it, but why would it?

Finally, you measure the latency with an oscilloscope and see that it's actually taking 17 cycles +/- 1 cycle of jitter, and once a year, when a DMA transfer is triggered during a full moon and Michael Jackson's Thriller is playing on the radio, it has 3 cycles of jitter(!!).

Now, tggzzz would say you have proven nothing. But realistically, what are the chances that this breaks down beyond 100 clock cycles when the GCC version updates?

I don't know. At the same time, I get work done. And so does everybody else who works like this. And I have never, ever in my life had an issue where a high-priority interrupt execution would have significantly changed in timing due to some seemingly unrelated change. A few cycles, sure!

And quite frankly, keeping a fixed GCC version during an embedded MCU project is the sane thing to do. This isn't high performance desktop computing requiring security updates. If the original microcontroller chip stays the same, if the code is verified to work within specifications with good margins, why would you suddenly update to a new compiler version during production?
« Last Edit: January 19, 2022, 04:19:38 pm by Siwastaja »
 
The following users thanked this post: nctnico, MK14

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3147
  • Country: ca
Re: [ARM] optimization and inline functions
« Reply #46 on: January 19, 2022, 04:39:05 pm »
A simple example: you need to react to an analog event within 100 clock cycles, by setting a pin high. You set up the comparator registers, write the interrupt address to the vector table, enable the interrupt at highest priority, and as a very first operation on the ISR function, write to the GPIO register. 10 minutes of work. You test it, and it works perfectly, as expected.

But the comparator can set a pin high by itself, without any help from the CPU, and with much better latency.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26909
  • Country: nl
    • NCT Developments
Re: [ARM] optimization and inline functions
« Reply #47 on: January 19, 2022, 04:55:14 pm »
Well, microcontroller projects (the ones that actually control things, using peripherals etc.) just are not portable, we have to accept this. By trying to make them portable, we either sacrifice so much of the MCU capability that we just say "no" to the projects - or use FPGAs to implement them. Or, we are writing and verifying so much extra abstraction that the original project would have been manually ported ten times when the "portable" project finishes.

IMHO, the key to success is not to strive for ideal world that does not exist in MCUs. After you accept this, you can leverage the features that are there, for the low cost, and save a lot of time and money compared to rolling out an FPGA design (or using an esoteric, vendor-lock-in solution like the XMOS).

Finally, you measure the latency with an oscilloscope and see that it's actually taking 17 cycles +/- 1 cycle of jitter, and once a year, when a DMA transfer is triggered during a full moon and Michael Jackson's Thriller is playing on the radio, it has 3 cycles of jitter(!!).

Now, tggzzz would say you have proven nothing. But realistically, what are the chances that this breaks down beyond 100 clock cycles when the GCC version updates?
Very small.

I agree with your pragmatic approach. IMHO products become way too fragile when they rely on cycle-accurate execution. That is something you don't really want to care about when writing software in C for a product, both from a development-time perspective (NRE costs) and a maintenance / life-cycle perspective (project handover to a different programmer). Cycle-accurate execution is nice for esoteric tinkering but not for real-world products.

In many cases there is a simple workaround possible that gives both flexibility in software timing and provides perfectly predictable timing for the hardware.
« Last Edit: January 19, 2022, 04:57:51 pm by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 392
  • Country: be
Re: [ARM] optimization and inline functions
« Reply #48 on: January 19, 2022, 05:08:31 pm »
And while we are talking about Cortex M0+, gcc, and code generation, it would be worthwhile to mention that gcc produces bloated code which will waste flash space and CPU cycles (and, hence, battery life):

https://community.nxp.com/t5/MCUXpresso-IDE/M0-M0-optimization-bug-in-GCC/m-p/653235

Just be warned.
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4540
  • Country: gb
Re: [ARM] optimization and inline functions
« Reply #49 on: January 19, 2022, 05:12:55 pm »
Cycle-accurate execution is nice for esoteric tinkering but not for real-world products.

Cycle-accurate execution, if used, is a pain in the neck to write. Even more of a pain if the timing was bang on, but then you find bugs in the software and have to somehow change/repair it, yet keep the cycle totals within bounds. Worse yet, the customer's requirements, legislation, the CPU used, etc. can change, forcing much/all of the work to be redone.
You're right, it is best avoided; often the hardware peripheral set can create the accurate timing for you, and/or rapid response to interrupts can make it 'good enough', including the jitter that introduces.

On the other hand, determining the worst-case interrupt latencies (especially beyond just measuring over a period of time) might be easier if the architecture is basically cycle-accurate. The more somewhat-indeterminate mechanisms there are, such as cache hits, long pipelines, instructions with widely varying cycle times (e.g. some divide instructions, where more 1 bits set means slower, depending on how divide works in that CPU), the aforementioned DMA and other stuff, the harder it becomes to determine the worst-case interrupt delays, hence more variable (jitter).
« Last Edit: January 19, 2022, 05:16:15 pm by MK14 »
 

