Author Topic: How to get very short delays for timing (32F4xx) (Read 1581 times)

peter-h · « **on:** May 02, 2021, 08:50:04 pm »

I came across this snippet in the code generated by Cube IDE

The ... delay_us is in this case 3, for 3 us. And the counter-- will take 1 cycle per decrement and 1 cycle for the test for zero, unless they get pipelined and then it is just 1 cycle per loop.

Obviously this will give you a minimum delay, and interrupts etc will extend it.

Getting short delays has generally been dodgy because a compiler could optimise-out stuff.

ataradov · « **Reply #1 on:** May 02, 2021, 09:27:01 pm »

Here is the code I use that is not subject to compiler optimizations or flash wait states and cache optimizations. Still subject to interrupts, of course.

Code: [Select]

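__attribute__((noinline, section(".ramfunc")))
static void delay_ms(int ms)
{
  // The /3 assumes 3 cycles per loop iteration (subs + taken bne).
  // Dividing F_CPU first keeps the product within 32 bits for typical delays.
  uint32_t cycles = (uint32_t)ms * (F_CPU / 1000) / 3;

  if (cycles == 0)
    return; // a zero count would wrap around and spin for 2^32 iterations

  asm volatile (
    "1: subs %[cycles], %[cycles], #1 \n"
    "   bne 1b \n"
    : [cycles] "+r"(cycles)
    :
    : "cc" // subs updates the condition flags
  );
}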

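__attribute__((noinline, section(".ramfunc")))
void delay_cycles(uint32_t cycles)
{
  cycles /= 4; // 4 cycles per iteration assumed: subs + nop + taken bne

  if (cycles == 0)
    return; // a zero count would wrap around and spin for 2^32 iterations

  asm volatile (
    "1: subs %[cycles], %[cycles], #1 \n"
    "   nop \n"
    "   bne 1b \n"
    : [cycles] "+l"(cycles)
    :
    : "cc" // subs updates the condition flags
  );
}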

Replace "subs" with "sub" for Cortex-M0+.

You can divide by 3 in the last example and remove the "nop", of course. Division by 4 is more efficient, especially on CM0+. See what works better in your particular case.
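
The conversion from time to loop count used above is easy to get wrong for longer delays, because a product like ms * F_CPU can overflow 32 bits. Below is a minimal host-testable sketch of that arithmetic. The helper name delay_loops_for_us is hypothetical (not from the code above), and the cycles-per-loop figure is passed in by the caller: 3 or 4 matches the loops above, but it is worth measuring on the actual part.

```c
#include <stdint.h>

/*
 * Convert a delay in microseconds to a busy-loop iteration count.
 * Hypothetical helper for illustration only.
 *   us              requested delay in microseconds
 *   f_cpu_hz        core clock in Hz (caller's assumption)
 *   cycles_per_loop measured or assumed cost of one loop iteration
 * The result is rounded up, and is never 0, so a "subs/bne" style loop
 * cannot wrap around to 2^32 iterations.
 */
uint32_t delay_loops_for_us(uint32_t us, uint32_t f_cpu_hz, uint32_t cycles_per_loop)
{
    if (cycles_per_loop == 0u)
        cycles_per_loop = 1u;                       /* avoid division by zero */

    uint64_t total_cycles = (uint64_t)us * f_cpu_hz; /* 64-bit: cannot overflow */
    uint64_t denom = (uint64_t)cycles_per_loop * 1000000u;

    /* Ceiling division without risking overflow in the numerator. */
    uint64_t loops = total_cycles / denom + ((total_cycles % denom) != 0u);

    if (loops == 0u)
        loops = 1u;
    if (loops > UINT32_MAX)
        loops = UINT32_MAX;
    return (uint32_t)loops;
}
```

For example, 3 us at an assumed 168 MHz core clock with a 3-cycle loop works out to 168 iterations.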

Doctorandus_P · « **Reply #2 on:** May 02, 2021, 09:32:13 pm »

Adding the "volatile" keyword to the counter variable or an "asm("nop")" will prevent optimising code away and force the compiler to keep your code, though it may still shove things around a bit.

If you want accurate timing, then have a look at using a timer in one-shot mode.

I'm not a fan of using software delays, but sometimes I do use them. One example of a software delay was multiplexing some relatively big 7-segment displays in an ISR. These needed a few us between turning one digit off and turning the next digit on to prevent ghosting. Timing for this is not critical.

If your needs are beyond something simple like this, then spend some time on the high level design and consider whether you really want to use software delays.

rhodges · « **Reply #3 on:** May 02, 2021, 09:37:16 pm »

This is what I use. The variable cpu_speed is set in my board setup.

Code: [Select]

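/*
 *  Loop for N SysTick (==CPU) cycles.
 *  SysTick counts down from LOAD to 0, so one full period is LOAD + 1 ticks.
 *  Defined first so that delay_usecs() below sees a declaration.
 */
void delay_cycles(int cycles)
{
    int start, diff;

    start = SysTick->VAL;
    for (;;) {
        diff = start - SysTick->VAL;
        if (diff < 0)
            diff += SysTick->LOAD + 1;  /* counter reloaded since start */
        if (diff > cycles)
            break;
    }
}
/*
 * Loop for N microseconds.
 * Avoid delays that are a significant fraction of one millisecond:
 * the wrap handling only copes with delays shorter than one SysTick period.
 */
void delay_usecs(int usecs)
{
    delay_cycles(usecs * (cpu_speed / 1000000));
}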
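
The wrap handling in the loop above can be pulled out into a pure function, which makes it testable on a PC. This is only a sketch: systick_elapsed is a hypothetical name, and it assumes a down-counter that reloads to the LOAD value after reaching 0, so one full period is LOAD + 1 ticks.

```c
#include <stdint.h>

/*
 * Ticks elapsed on a down-counter that reloads to `load` after reaching 0.
 * `start` and `now` are two readings of the counter (SysTick->VAL on target),
 * `load` is the reload value (SysTick->LOAD on target).
 * Only valid while less than one full period (load + 1 ticks) has passed
 * between the two readings.
 */
uint32_t systick_elapsed(uint32_t start, uint32_t now, uint32_t load)
{
    if (now <= start)
        return start - now;             /* no reload in between */
    return start + (load + 1u) - now;   /* counter passed through 0 once */
}
```

On target, the busy-wait would then keep reading SysTick->VAL until systick_elapsed() reaches the requested number of cycles.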

peter-h · « **Reply #4 on:** May 02, 2021, 09:37:23 pm »

Yes I would think loading a hardware timer and hanging on it until it goes to zero (or overflows, whichever way the timers work; on many chips they can only increment) is the best way, but is obviously not "thread safe".

Can SysTick be used for microsecond delays?

cv007 · « **Reply #5 on:** May 02, 2021, 10:14:28 pm »

If you have a M3 or greater, you may have a DWT which has a CYCCNT register. For the nRF52 they use that to do us delays with nrfx_coredep_delay_us, and they are using it quite often themselves for small single digit values so I assume it works well (I use it and its ms version indirectly but not for small values of us).

Google DWT and CYCCNT and you will most likely find something. Whether it's better than other solutions, I'm not sure, but you will be counting clock cycles directly and in C code.
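
A minimal sketch of the cycle-counter approach, written so the wrap-around logic can be checked on a PC. The function names are hypothetical. On a Cortex-M3 or later with CMSIS headers, the counter would be DWT->CYCCNT, enabled beforehand with something like CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; DWT->CYCCNT = 0; DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk; (check the reference manual for your part, as not every device implements the DWT).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True once at least `cycles` ticks of a free-running 32-bit up-counter
 * have passed since `start`. Unsigned subtraction gives the right answer
 * even when the counter has wrapped past 0xFFFFFFFF in between.
 */
bool cycles_elapsed(uint32_t start, uint32_t now, uint32_t cycles)
{
    return (uint32_t)(now - start) >= cycles;
}

/*
 * Busy-wait on any free-running 32-bit up-counter.
 * On target, pass &DWT->CYCCNT (assumption: the DWT is present and enabled).
 * The pointer is volatile so every loop pass re-reads the hardware counter.
 */
void delay_on_counter(const volatile uint32_t *counter, uint32_t cycles)
{
    uint32_t start = *counter;

    while (!cycles_elapsed(start, *counter, cycles)) {
        /* spin */
    }
}
```

Interrupts still stretch the delay, but unlike an instruction-counting loop, the result does not depend on optimization level or flash wait states.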

For lower resolution times, I usually find a timer of low importance that can run off an internal fixed clock, which can then be used without regard to the cpu clock. Not many change cpu clock speed at runtime I assume (other than sleep/run), so probably not important, but it is handy to have something like an rtc or lp timer that can count while the cpu is not running (using a 32k internal clock). For the samD10 I use the Rtc for any blocking ms delays, where I can also set it to use any of the sleep modes and wake via the rtc compare irq, which works out well for power saving.

Another simple option-
https://godbolt.org/z/oE9cMnvhY
If using gcc you can make sure optimization is what you wanted for those functions, and of course the CYCLES_PER_LOOP is not necessarily something you can eyeball by counting instructions, but can be measured. This would fit in the 'better than nothing' category, and also probably in many cases 'good enough'.

ataradov · « **Reply #6 on:** May 02, 2021, 11:07:21 pm »

Taking over SysTick to just do a simple short delay loop seems like a waste of resources. There are usually better uses for SysTick. Edit: that specific code does not take over the SysTick.

DWT cycle counter is much better if available. But a simple blocking loop is good enough in most cases.

peter-h · « **Reply #7 on:** May 05, 2021, 10:13:31 am »

Does the ST 32F4 GNU compiler in Cube IDE ever optimise out useless loops?

I see the ST code uses a lot of short delays, which are implemented as a for loop.

I thought that even some CPUs (latest 80x86?) skip stuff like that.

cv007 · « **Reply #8 on:** May 05, 2021, 11:57:17 am »

>I see the ST code uses a lot of short delays, which are implemented as a for loop.

They will make the vars volatile. May not be obvious because they created the var in the first part of the function which includes the __IO modifier (volatile), then you do not see it later and wonder how that can work when the compiler eats these unnecessary loops for breakfast (unless you set the compiler to dumb mode).
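
A sketch of the pattern described above, not ST's actual code: the counter is declared volatile (CMSIS's __IO macro expands to volatile), so the compiler has to load, decrement and store it on every pass and cannot delete the loop. The name spin_delay and the returned pass count are only there to make the example checkable on a PC.

```c
#include <stdint.h>

/*
 * Busy loop kept alive by a volatile counter.
 * Returns the number of passes so the behaviour can be verified; a real
 * delay routine would return void.
 */
uint32_t spin_delay(uint32_t iterations)
{
    volatile uint32_t counter = iterations; /* __IO uint32_t in ST-style code */
    uint32_t passes = 0u;

    while (counter != 0u) {
        counter--;      /* volatile: read, modify, write on every pass */
        passes++;
    }
    return passes;
}
```

The time per pass still depends on the core, the memory the stack lives in and the optimization level, so the iteration count needs to be calibrated rather than guessed.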

ataradov · « **Reply #9 on:** May 05, 2021, 04:16:25 pm »

Yes, unless the variable is volatile, it will be optimized, for sure.

The way I do dumb loops like this is:

Code: [Select]

for (int i = 0; i < 100000; i++)
  asm("nop");

You can add the volatile qualifier to asm() if you want, but in practice no compiler optimizes asm() sections.

peter-h · « **Reply #10 on:** May 05, 2021, 04:45:43 pm »

Presumably instructions which output to pins are never optimised.

The compiler could assume that repeatedly outputting the same value to the pin doesn't do anything, but that would be quite amazing.

I thought NOPs were optimised out by most tools. But maybe not if the optimisation is done only at C source level. Also I thought 80x86 throws them away at execution time, but maybe not these ARM chips.

ataradov · « **Reply #11 on:** May 05, 2021, 05:07:07 pm »

There are no instructions that output to the pins. GPIOs are memory mapped, and generally declared as volatile in the header file. So there is nothing the compiler can optimize here.
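
To illustrate, here is a sketch of how a memory-mapped GPIO register typically looks from C. The struct and function names are hypothetical; real vendor headers define similar structs with volatile members and cast a fixed peripheral address to a pointer to one.

```c
#include <stdint.h>

/* Hypothetical register block with a single output data register. */
typedef struct {
    volatile uint32_t ODR;
} gpio_regs_t;

void gpio_set(gpio_regs_t *port, uint32_t mask)
{
    port->ODR |= mask;   /* volatile read-modify-write, never elided */
}

void gpio_clear(gpio_regs_t *port, uint32_t mask)
{
    port->ODR &= ~mask;
}

/*
 * Writes the same value twice. Because ODR is volatile, the compiler must
 * emit both stores even though the second one looks redundant.
 */
void gpio_write_twice(gpio_regs_t *port, uint32_t value)
{
    port->ODR = value;
    port->ODR = value;
}
```

On a PC the struct can simply live in RAM for testing; on target the pointer would come from the device header.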

Compilers do not output nops, but there is no compiler at the moment that looks inside asm("....") sections and optimizes that. They just assume it is a valid assembly and use it as is.

x86 still needs to fetch the nop and work out that it is a nop. So it still takes time, but due to the complexity of the architecture, it is meaningless to talk about abstract cycles. Nop in x86 is an alias for "xchg eax,eax", which is executed as a normal ALU instruction. On Skylake up to 4 such nops would be executed at the same time.

SiliconWizard · « **Reply #12 on:** May 05, 2021, 05:09:29 pm »

Quote from: ataradov on May 05, 2021, 04:16:25 pm

Yes, unless the variable is volatile, it will be optimized, for sure.

The way I do dumb loops like this is:
Code: [Select]
for (int i = 0; i <100000; i++)
  asm("nop");

You can add volatile specifier to asm() if you want, but in practice no compiler optimizes asm() sections.

I dunno. I wouldn't count on that though, so I always add the volatile qualifier.

Note that the above (with a nop) is usually more "efficient" than the following, which also works:

Code: [Select]

for (volatile int i = 0; i < N; i++) {}

The first version will usually use a register for the loop counter, the second will usually use a variable on stack for 'i', reading it and writing it back on each iteration, thus for a given N, the second version will take significantly more cycles.

Note that those loops may be implemented a little differently depending on optimization level, so not accurate at all and can give you headaches when you wonder why your code behaves differently when optimized.

So I practically never use these kinds of loops and use the DWT counter (or an equivalent register) instead. Except for very small values of N, it's almost cycle-accurate.

ajb · « **Reply #13 on:** May 05, 2021, 08:22:09 pm »

Quote from: peter-h on May 05, 2021, 04:45:43 pm

Presumably instructions which output to pins are never optimised.

Quote from: ataradov on May 05, 2021, 05:07:07 pm

There are no instructions that output to the pins. GPIOs are memory mapped, and generally declared as volatile in the header file.

Tangential aside: Not quite the same thing, but the behavior of the MCU does (potentially) vary depending on the memory address targeted by an instruction according to the device's memory map, and this includes writing to a memory-mapped IO register versus RAM. The ARMv7-M architecture (Cortex-M3, -M4, and -M7) defines Normal, Device, and Strongly-Ordered memory types.

Accesses to Normal memory are expected to not have side effects, so those accesses are not guaranteed to happen in the order or quantity or at the size that the program specifies. "Program" here does not refer to the C source code, but the actual machine instructions being executed by the processor, so this is entirely separate from any compiler considerations. Accesses to Device and Strongly-Ordered memory are assumed to have side effects and are therefore guaranteed to happen in program order, quantity, and size.

In practice, the as-executed memory accesses are more likely to vary from program memory accesses in devices with cache, or more sophisticated devices like the M7, which is dual issue, superscalar, and has a longer pipeline with branch prediction, but the architecture documentation does not guarantee that the M3 or M4 will access memory in any particular way except as required by the memory attributes.

Fortunately most of this is transparent to the programmer, as the memory map for a given device should place all of the registers and memories into parts of the default memory map with the appropriate types, with GPIO and other peripherals in the "Peripheral" block which is defined as "Device" memory, but in some cases you may still need to use memory barriers or MPU attributes to ensure correct behavior.

The relevant documentation for this is the ARMv7-M Architecture Reference Manual, section A3.5 for the curious/masochistic.

