Author Topic: How to get very short delays for timing (32F4xx)  (Read 1581 times)


Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3701
  • Country: gb
  • Doing electronics since the 1960s...
How to get very short delays for timing (32F4xx)
« on: May 02, 2021, 08:50:04 pm »
I came across this snippet in the code generated by Cube IDE



The ... delay_us is in this case 3, for 3 us. The counter-- will take 1 cycle per decrement and 1 cycle for the test for zero, unless the two get pipelined, in which case it is just 1 cycle per loop.

Obviously this will give you a minimum delay, and interrupts etc will extend it.

Getting short delays has generally been dodgy because a compiler could optimise-out stuff.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11269
  • Country: us
    • Personal site
Re: How to get very short delays for timing (32F4xx)
« Reply #1 on: May 02, 2021, 09:27:01 pm »
Here is the code I use that is not subject to compiler optimizations or flash wait states and cache optimizations. Still subject to interrupts, of course.
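Code:
#include <stdint.h>

// Placed in RAM so flash wait states and caches do not affect the timing.
__attribute__((noinline, section(".ramfunc")))
static void delay_ms(int ms)
{
  // The loop is assumed to take 3 cycles per iteration.
  // Dividing F_CPU first keeps the product within 32 bits.
  uint32_t cycles = (uint32_t)ms * (F_CPU / 3 / 1000);

  if (cycles == 0)
    return; // a zero count would wrap around and loop 2^32 times

  asm volatile (
    "1: subs %[cycles], %[cycles], #1 \n"
    "   bne 1b \n"
    : [cycles] "+r"(cycles)
    :
    : "cc" // subs changes the condition flags
  );
}

__attribute__((noinline, section(".ramfunc")))
void delay_cycles(uint32_t cycles)
{
  cycles /= 4; // 4 cycles per iteration with the extra nop

  if (cycles == 0)
    return; // fewer than 4 cycles requested: nothing left to count

  asm volatile (
    "1: subs %[cycles], %[cycles], #1 \n"
    "   nop \n"
    "   bne 1b \n"
    : [cycles] "+l"(cycles)
    :
    : "cc"
  );
}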

Replace "subs" with "sub" for Cortex-M0+.

You can divide by 3 in the last example and remove the "nop", of course. Division by 4 is more efficient, especially on CM0+. See what works better in your particular case.
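
To make the arithmetic concrete, here is a minimal host-side sketch (not part of the original post) of how a delay in microseconds maps onto an iteration count for the second loop above. The helper name `us_to_loops` and the 168 MHz clock used in the example are assumptions for illustration; the cycles-per-iteration figure is whatever the loop is assumed to cost and should be confirmed by measurement on the real part.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a delay in microseconds into loop iterations, given the CPU clock
 * and the number of cycles one loop iteration is assumed to take. */
static uint32_t us_to_loops(uint32_t us, uint32_t f_cpu_hz, uint32_t cycles_per_loop)
{
    assert(cycles_per_loop != 0);

    /* 64-bit intermediate so us * f_cpu_hz cannot overflow 32 bits. */
    uint64_t cycles = (uint64_t)us * f_cpu_hz / 1000000u;
    uint64_t loops = cycles / cycles_per_loop;

    if (loops == 0)
        loops = 1;              /* a count of 0 would wrap around in a subs/bne loop */
    if (loops > UINT32_MAX)
        loops = UINT32_MAX;     /* saturate rather than silently truncate */
    return (uint32_t)loops;
}
```

With a 168 MHz clock, 3 us is 504 cycles, so `us_to_loops(3, 168000000u, 4)` gives 126 iterations. On a small core the 64-bit division is itself slow, so in practice the count would probably be computed at compile time or once at startup.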
« Last Edit: May 02, 2021, 09:29:00 pm by ataradov »
Alex
 

Online Doctorandus_P

  • Super Contributor
  • ***
  • Posts: 3369
  • Country: nl
Re: How to get very short delays for timing (32F4xx)
« Reply #2 on: May 02, 2021, 09:32:13 pm »
Adding the "volatile" keyword to the counter variable or an "asm("nop")" will prevent optimising code away and force the compiler to keep your code, though it may still shove things around a bit.

If you want accurate timing, then have a look at using a timer in one-shot mode.

I'm not a fan of software delays, but sometimes I do use them. One example was multiplexing some relatively big 7-segment displays in an ISR. These needed a few us between turning one digit off and turning the next digit on, to prevent ghosting. Timing for this is not critical.

If your needs are beyond something simple like this, then spend some time on the high level design and consider whether you really want to use software delays.
 

Offline rhodges

  • Frequent Contributor
  • **
  • Posts: 306
  • Country: us
  • Available for embedded projects.
    • My public libraries, code samples, and projects for STM8.
Re: How to get very short delays for timing (32F4xx)
« Reply #3 on: May 02, 2021, 09:37:16 pm »
This is what I use. The variable cpu_speed is set in my board setup.
Code:
void delay_cycles(int cycles);  /* defined below */

/*
 * Loop for N microseconds.
 * Only valid for delays shorter than one SysTick period, so avoid
 * delays that are a significant fraction of one millisecond.
 */
void delay_usecs(int usecs)
{
    delay_cycles(usecs * (cpu_speed / 1000000));
}
/*
 *  Loop for N SysTick (==CPU) cycles
 */
void delay_cycles(int cycles)
{
    int start, diff;

    start = SysTick->VAL;
    for (;;) {
        diff = start - SysTick->VAL;    /* SysTick counts down */
        if (diff < 0)                   /* counter reloaded since start */
            diff += SysTick->LOAD + 1;  /* one period is LOAD + 1 ticks */
        if (diff > cycles)
            break;
    }
}
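
The wrap-around handling is the subtle part of this approach, so here is a small host-testable sketch of just that arithmetic (not part of the original post). It assumes SysTick counts down from `LOAD` to 0 and then reloads, so one full period is `LOAD + 1` ticks, and it is only meaningful for delays shorter than one period. The helper name `systick_elapsed` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two readings of a down-counter that reloads to
 * `reload` after reaching 0. Valid only if at most one reload happened. */
static uint32_t systick_elapsed(uint32_t start, uint32_t now, uint32_t reload)
{
    assert(start <= reload && now <= reload);

    if (now <= start)
        return start - now;               /* no reload in between */
    return start + (reload + 1u) - now;   /* counter passed 0 and reloaded once */
}
```

For example, with a 1 ms SysTick on a 168 MHz clock the reload value would be 167999, and a 3 us delay is 504 ticks, which is comfortably inside one period.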
Currently developing STM8 and STM32. Past includes 6809, Z80, 8086, PIC, MIPS, PNX1302, and some 8748 and 6805. Check out my public code on github. https://github.com/unfrozen
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3701
  • Country: gb
  • Doing electronics since the 1960s...
Re: How to get very short delays for timing (32F4xx)
« Reply #4 on: May 02, 2021, 09:37:23 pm »
Yes I would think loading a hardware timer and hanging on it until it goes to zero (or overflows, whichever way the timers work; on many chips they can only increment) is the best way, but is obviously not "thread safe".

Can systick be used for microsecond delays?
« Last Edit: May 02, 2021, 09:39:31 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline cv007

  • Frequent Contributor
  • **
  • Posts: 828
Re: How to get very short delays for timing (32F4xx)
« Reply #5 on: May 02, 2021, 10:14:28 pm »
If you have an M3 or greater, you may have a DWT, which has a CYCCNT register. For the nRF52 they use that to do us delays with nrfx_coredep_delay_us, and they use it quite often themselves for small single-digit values, so I assume it works well (I use it and its ms version indirectly, but not for small values of us).

Google DWT and CYCCNT and you will most likely find something. Whether it's better than other solutions, I'm not sure, but you will be counting clock cycles directly, and in C code.
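
A minimal sketch of a CYCCNT-based delay, assuming a CMSIS device header has been included before this code. The register and mask names (`CoreDebug->DEMCR`, `CoreDebug_DEMCR_TRCENA_Msk`, `DWT->CTRL`, `DWT_CTRL_CYCCNTENA_Msk`, `DWT->CYCCNT`) are the CMSIS-Core ones; the function names are hypothetical. The hardware part is compiled only when the `DWT` macro exists, so the wrap-safe comparison can also be checked on a host.

```c
#include <stdbool.h>
#include <stdint.h>

/* True once at least `cycles` counter ticks have passed since `start`.
 * Unsigned subtraction gives the right answer across a 32-bit wrap. */
static inline bool cycles_elapsed(uint32_t start, uint32_t now, uint32_t cycles)
{
    return (uint32_t)(now - start) >= cycles;
}

#ifdef DWT  /* CMSIS defines DWT as a macro on cores that have the unit */
static inline void dwt_delay_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; /* enable the trace/debug blocks */
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;            /* start the cycle counter */
}

static inline void dwt_delay_cycles(uint32_t cycles)
{
    uint32_t start = DWT->CYCCNT;
    while (!cycles_elapsed(start, DWT->CYCCNT, cycles)) {
        /* busy-wait; interrupts can still lengthen the delay */
    }
}
#endif
```

Because the counter is free-running and only read here, several pieces of code can use it at the same time without interfering with each other.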

For lower resolution times, I usually find a timer of low importance that can run off an internal fixed clock, which can then be used without regard to the cpu clock. Not many change cpu clock speed at runtime I assume (other than sleep/run), so it is probably not important, but it is handy to have something like an rtc or lp timer that can count while the cpu is not running (using a 32k internal clock). For the samD10 I use the Rtc for any blocking ms delays, where I can also set it to use any of the sleep modes and wake via the rtc compare irq, which works out well for power saving.

Another simple option-
https://godbolt.org/z/oE9cMnvhY
If using gcc you can make sure optimization is what you wanted for those functions, and of course the CYCLES_PER_LOOP is not necessarily something you can eyeball by counting instructions, but can be measured. This would fit in the 'better than nothing' category, and also probably in many cases 'good enough'.
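
One way to pin the optimization of a single delay function with GCC is the `optimize` function attribute. The sketch below is an assumption about what that could look like, not the code behind the link; the value of `CYCLES_PER_LOOP` is a placeholder to be replaced by a measurement, and the function names are hypothetical. GCC's own documentation describes the attribute as intended mainly for debugging, and Clang does not implement it, so it is compiled out there.

```c
#include <stdint.h>

/* GCC-only: fix the optimization level of this one function so the loop body
 * does not change shape when the project-wide -O level changes. */
#if defined(__GNUC__) && !defined(__clang__)
#define DELAY_FN __attribute__((noinline, optimize("O2")))
#else
#define DELAY_FN
#endif

/* Hypothetical placeholder: measure this on the real target, do not guess it. */
#define CYCLES_PER_LOOP 4u

DELAY_FN static void delay_loops(uint32_t loops)
{
    while (loops--) {
        __asm__ volatile ("nop");   /* keeps the loop from being removed */
    }
}

/* Approximate delay in CPU cycles, rounded down to whole loop iterations. */
static inline void delay_cycles_approx(uint32_t cycles)
{
    delay_loops(cycles / CYCLES_PER_LOOP);
}
```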
« Last Edit: May 03, 2021, 02:24:56 am by cv007 »
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11269
  • Country: us
    • Personal site
Re: How to get very short delays for timing (32F4xx)
« Reply #6 on: May 02, 2021, 11:07:21 pm »
Taking over SysTick to just do a simple short delay loop seems like a waste of resources. There are usually better uses for SysTick. Edit: that specific code does not take over the SysTick.

DWT cycle counter is much better if available. But a simple blocking loop is good enough in most cases.
« Last Edit: May 02, 2021, 11:09:26 pm by ataradov »
Alex
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3701
  • Country: gb
  • Doing electronics since the 1960s...
Re: How to get very short delays for timing (32F4xx)
« Reply #7 on: May 05, 2021, 10:13:31 am »
Does the ST 32F4 GNU compiler in Cube IDE ever optimise out useless loops?

I see the ST code uses a lot of short delays, which are implemented as a for loop.

I thought that even some CPUs (latest 80x86?) skip stuff like that.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline cv007

  • Frequent Contributor
  • **
  • Posts: 828
Re: How to get very short delays for timing (32F4xx)
« Reply #8 on: May 05, 2021, 11:57:17 am »
>I see the ST code uses a lot of short delays, which are implemented as a for loop.


They make the variables volatile. It may not be obvious, because the variable is declared at the top of the function with the __IO modifier (volatile); further down you no longer see that and wonder how the loop can work, when the compiler eats such unnecessary loops for breakfast (unless you set the compiler to dumb mode).
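
A minimal sketch of the pattern described above, not ST's actual code. CMSIS core headers define `__IO` as `volatile`; the definition is repeated here only so the example stands alone. The function name `busy_wait` and the returned iteration count are illustrative assumptions.

```c
#include <stdint.h>

#ifndef __IO
#define __IO volatile   /* same meaning as in the CMSIS core headers */
#endif

/* Count `count` down to zero. Because the counter is volatile, the compiler
 * must perform every read and write, so the loop survives optimization. */
static uint32_t busy_wait(uint32_t count)
{
    __IO uint32_t counter = count;
    uint32_t iterations = 0;

    while (counter != 0) {
        counter--;
        iterations++;
    }
    return iterations;
}
```

The loop is kept, but how many cycles each iteration takes still depends on the compiler and optimization level, so the resulting delay is only a rough minimum.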
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11269
  • Country: us
    • Personal site
Re: How to get very short delays for timing (32F4xx)
« Reply #9 on: May 05, 2021, 04:16:25 pm »
Yes, unless the variable is volatile, it will be optimized, for sure.

The way I do dumb loops like this is:
Code:
for (int i = 0; i < 100000; i++)
  asm("nop");
You can add the volatile specifier to asm() if you want, but in practice no compiler optimizes asm() sections.
Alex
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3701
  • Country: gb
  • Doing electronics since the 1960s...
Re: How to get very short delays for timing (32F4xx)
« Reply #10 on: May 05, 2021, 04:45:43 pm »
Presumably instructions which output to pins are never optimised.

The compiler could assume that repeatedly outputting the same value to the pin doesn't do anything, but that would be quite amazing.

I thought NOPs were optimised out by most tools. But maybe not if the optimisation is done only at C source level. Also I thought 80x86 throws them away at execution time, but maybe not these ARM chips.
« Last Edit: May 05, 2021, 04:47:17 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11269
  • Country: us
    • Personal site
Re: How to get very short delays for timing (32F4xx)
« Reply #11 on: May 05, 2021, 05:07:07 pm »
There are no instructions that output to the pins. GPIOs are memory mapped, and generally declared as volatile in the header file. So there is nothing the compiler can optimize here.

Compilers do not output nops, but there is no compiler at the moment that looks inside asm("....") sections and optimizes them. They just assume it is valid assembly and use it as is.

x86 still needs to fetch the nop and work out that it is a nop. So it still takes time, but due to the complexity of the architecture it is meaningless to talk about abstract cycles. Nop in x86 is an alias to "xchg eax,eax", which is executed as a normal ALU instruction. On Skylake up to 4 such nops would be executed at the same time.
Alex
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14490
  • Country: fr
Re: How to get very short delays for timing (32F4xx)
« Reply #12 on: May 05, 2021, 05:09:29 pm »
>Yes, unless the variable is volatile, it will be optimized, for sure.
>
>The way I do dumb loops like this is:
>Code:
>for (int i = 0; i <100000; i++)
>  asm("nop");
>You can add volatile specifier to asm() if you want, but in practice no compiler optimizes asm() sections.

I dunno. I wouldn't count on that though, so I always add the volatile qualifier.

Note that the above (with a nop) is usually more "efficient" than the following, which also works:
Code:
for (volatile int i = 0; i < N; i++) {}

The first version will usually use a register for the loop counter; the second will usually keep 'i' in a variable on the stack, reading it and writing it back on each iteration, so for a given N the second version will take significantly more cycles.

Note that those loops may be implemented a little differently depending on optimization level, so they are not accurate at all and can give you headaches when you wonder why your code behaves differently when optimized.

So I practically never use this kind of loop and use the DWT counter register (or equivalent) instead. Except for very small values of N, it's almost cycle-accurate.
 

Online ajb

  • Super Contributor
  • ***
  • Posts: 2608
  • Country: us
Re: How to get very short delays for timing (32F4xx)
« Reply #13 on: May 05, 2021, 08:22:09 pm »
>Presumably instructions which output to pins are never optimised.

>There are no instructions that output to the pins. GPIOs are memory mapped, and generally declared as volatile in the header file.
 

Tangential aside: Not quite the same thing, but the behavior of the MCU does (potentially) vary depending on the memory address targeted by an instruction according to the device's memory map, and this includes writing to a memory-mapped IO register versus RAM.

The ARMv7-M architecture (Cortex-M3, -M4, and -M7) defines Normal, Device, and Strongly-Ordered memory types. Accesses to Normal memory are expected to not have side effects, so those accesses are not guaranteed to happen in the order or quantity or at the size that the program specifies. "Program" here does not refer to the C source code, but the actual machine instructions being executed by the processor, so this is entirely separate from any compiler considerations. Accesses to Device and Strongly-Ordered memory are assumed to have side effects and are therefore guaranteed to happen in program order, quantity, and size.

In practice, the as-executed memory accesses are more likely to vary from program memory accesses in devices with cache, or more sophisticated devices like the M7, which is dual issue, superscalar, and has a longer pipeline with branch prediction, but the architecture documentation does not guarantee that the M3 or M4 will access memory in any particular way except as required by the memory attributes.

Fortunately most of this is transparent to the programmer, as the memory map for a given device should place all of the registers and memories into parts of the default memory map with the appropriate types, with GPIO and other peripherals in the "Peripheral" block which is defined as "Device" memory, but in some cases you may still need to use memory barriers or MPU attributes to ensure correct behavior.

The relevant documentation for this is the ARMv7-M Architecture Reference Manual, section A3.5 for the curious/masochistic.
« Last Edit: May 05, 2021, 08:25:53 pm by ajb »
 

