Author Topic: Do i really need to use an RTOS? (Alternatives to finite state machines?)  (Read 15665 times)


Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1719
  • Country: se
Quote
LOVE treating so many things as separate programs. I have no doubt that adding changes will be easier.
QFT, wise words.
These are exactly the two things I (as a hobbyist, remember) find are the best advantage of using an RTOS.

Different functions of the system (e.g. HMI and data processing) can be treated as completely separate - once the right priorities have been assigned and the amount of resources used verified.
I am sure that if a new audio frame has arrived, the processing will happen in time - even if the GUI might slow down.

And, for the same reason, changing/adding functions is easier.

If resources are limited, of course, the overhead might be too much.

Some personal examples:

* Linear PSU - HD44780 4×20 display, 2 encoders, 2 buttons, V, I, P measurements, OCP with programmable delay, I & V DAC control etc.: with an STM32F042K6 (6 KB RAM) I went for a superloop. Did not even try with FreeRTOS. Changes can be made, but they need a lot of care.

* AD9834-based sine generator: on an STM32F072, FreeRTOS allows me to add both interface tasks (parallel 480×320 TFT, one encoder) and real-time tasks without too much hassle (most recently: a couple of hours to add SSB modulation).

* iMX RT1021 SDR: Here FreeRTOS shines, the audio processing chain (FIRs, FFT, various demodulations) is served at the highest priority, UI - two encoders, some buttons, SPI 480×320 TFT - at (almost) the lowest. Other tasks might be in the middle (slow decoders such as RTTY, MORSE etc.). An FFT+waterfall display was easily added without touching the rest.

I found I'm usually more comfortable with pre-emption with no time slicing.
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4040
  • Country: nz
I found I'm usually more comfortable with pre-emption with no time slicing.

That's fine if you have 0 or 1 CPU-bound tasks.

More than 1 and you *need* either time slicing or yield() calls in your loops.

If you already have pre-emption rather than simply interrupt handlers ... i.e. an interrupt handler can cause control to return to a different (higher priority, newly unblocked) task than the one that was interrupted ... then time slicing is nothing more extra than a clock interrupt and a fairness algorithm.
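To make the "more than 1 CPU-bound task" point concrete, here is a toy host-side model, not a real scheduler. It assumes two equal-priority tasks that are always ready and never block or yield; simulate() and NTASKS are made-up names. In FreeRTOS the corresponding switches are configUSE_PREEMPTION and configUSE_TIME_SLICING.

```c
#include <stdbool.h>

#define NTASKS 2   /* hypothetical: two equal-priority, CPU-bound tasks */

/* Run `ticks` scheduler ticks and record how many ticks each task gets.
   The tasks are modelled as always ready: they never block and never yield. */
void simulate(bool time_slicing, int ticks, int run_count[NTASKS])
{
    int current = 0;

    for (int i = 0; i < NTASKS; i++)
        run_count[i] = 0;

    for (int t = 0; t < ticks; t++) {
        run_count[current]++;
        /* Tick interrupt: with time slicing the scheduler rotates among the
           equal-priority ready tasks. Without it, the running task keeps the
           CPU, because nothing of higher priority ever becomes ready. */
        if (time_slicing)
            current = (current + 1) % NTASKS;
    }
}
```

Without time slicing the second task starves completely; with it, both get half of the ticks. A yield() inside the task loop would have the same effect as the rotation above.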
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Quote
you *need* either time slicing or yield() calls in your loops.
Or other calls that implicitly yield...
 
The following users thanked this post: newbrain

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14488
  • Country: fr
Well uh. A preemptive scheduler is uh... preemptive. It must be able to preempt. Sure it can do so by other means than a timer, but you still need something that preempts, externally from the tasks themselves. If tasks are solely responsible for "yielding", that's not preemptive scheduling, that's merely cooperative.

Now of course with a preemptive scheduler, tasks can always "yield" earlier than their allocated time slot if they have nothing more to do (usually waiting on some event) - which is the only way of making a scheduler work at all if the number of tasks times the time slot is greater than 100%. So in that regard, the time slot is the *max* time slot for a given task. But there still needs to be an external event that can preempt. If it's not a timer, it needs to be something else, but not something that would entirely depend on each task, otherwise you're again not in the preemptive territory anymore.

I haven't looked at everything FreeRTOS offered, for instance, but I've implemented a preemptive scheduler with which you can define a different (max) time slot for each task, instead of a fixed one. I haven't checked whether you could do that with FreeRTOS.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8179
  • Country: fi
Quote
LOVE treating so many things as separate programs. I have no doubt that adding changes will be easier.
QFT, wise words.
These are exactly the two things I (as a hobbyist, remember) find are the best advantage of using an RTOS.
...
I found I'm usually more comfortable with pre-emption with no time slicing.

The feelings of you two sound pretty much similar to the enjoyment I was having when I realized, on Cortex-M, that you don't have to set some flags and sequentially process them in a superloop, but you can make a fully ISR-driven design, with pre-emption with IRQ priorities, plus software interrupts to trigger lower priority interrupts from higher ones.

Everything just... works, running in parallel. Modules can be treated as "separate programs", and state machine is implemented as just functions, triggered by something - usually HW timer or other peripheral. No problem doing "long" ISRs - just make more important stuff higher priority, and pre-emption works.

And it doesn't need to be a generic 1ms tick, and you don't need to write any kind of scheduler, at all, the CPU can take care of it: just set a timer peripheral for whatever delay you actually need, and make it trigger the next state function directly.

The most obvious advantages for the OS would be addition of time-slicing, and the fact a trigger can bring the program flow in the middle of a function (wait for event / semaphore call). With simple ISR-based design, while long pieces of code are possible and can be pre-empted, trigger mechanisms always bring you to the start of the function. I don't think this is a bad thing, having trigger mechanisms coupled with handler functions improves readability, like in JavaScript you use a small onClick() handler instead of some massive loop where you wait-for-click.
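A minimal host-side sketch of the "timer triggers the next state function" idea, assuming a single one-shot timer. All names (arm_timer, timer_irq_handler, the LED states) are hypothetical, and the timer register is just a variable here; on real hardware arm_timer() would program a timer peripheral and timer_irq_handler() would sit in the vector table.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*state_fn)(void);

state_fn next_state = NULL;    /* state function the timer interrupt will run */
uint32_t timer_delay_us = 0;   /* stand-in for a one-shot timer register */
int led = 0;                   /* stand-in for a GPIO pin */

/* Program the (simulated) one-shot timer and remember what to run on expiry. */
void arm_timer(uint32_t delay_us, state_fn fn)
{
    timer_delay_us = delay_us;
    next_state = fn;
}

void led_off_state(void);

/* Each state does its work, then schedules exactly the delay it needs. */
void led_on_state(void)
{
    led = 1;
    arm_timer(100000u, led_off_state);   /* stay on for 100 ms */
}

void led_off_state(void)
{
    led = 0;
    arm_timer(900000u, led_on_state);    /* stay off for 900 ms */
}

/* Timer interrupt: always enters the state function at its start. */
void timer_irq_handler(void)
{
    state_fn fn = next_state;
    next_state = NULL;                   /* one-shot: the state must re-arm */
    if (fn)
        fn();
}
```

Note there is no generic tick and no scheduler loop: the only "scheduling" is whatever delay each state asks for.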
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19517
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Quote
LOVE treating so many things as separate programs. I have no doubt that adding changes will be easier.
QFT, wise words.
These are exactly the two things I (as a hobbyist, remember) find are the best advantage of using an RTOS.
...
I found I'm usually more comfortable with pre-emption with no time slicing.

The feelings of you two sound pretty much similar to the enjoyment I was having when I realized, on Cortex-M, that you don't have to set some flags and sequentially process them in a superloop, but you can make a fully ISR-driven design, with pre-emption with IRQ priorities, plus software interrupts to trigger lower priority interrupts from higher ones.

Everything just... works, running in parallel.

Except that they aren't actually running in parallel, and there can be "gotchas" that appear even in very well disciplined high-reliability environments. Start by understanding why the first space shuttle launch attempt (1981-4-10) was scrubbed and the software patch installed before the first launch occurred (1981-4-12).

Quote
Modules can be treated as "separate programs", and state machine is implemented as just functions, triggered by something - usually HW timer or other peripheral. No problem doing "long" ISRs - just make more important stuff higher priority, and pre-emption works.

If and only if you get everything right. If not, livelock (for example) can occur. Understand why the Mars Pathfinder computer repeatedly reset itself, had to be remotely debugged, and had a config bit changed. (That also showed the value of logging events+states in the running system; don't turn debugging off!).

There is a lot of well-understood theory about RTOSs, and practical examples of the subtle intermittent problems that do occur when the theory is ignored.

Summary: no, they don't "just work" - but they might appear to "just work".
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8179
  • Country: fi
Managing shared resources, proving that enough CPU time is available at worst-case event rates, and limiting the rate at which events can happen are problems common to both OS-based and bare-metal designs. Only the available tools and the terminology differ somewhat. There is no silver bullet.

Parallelism is notoriously difficult to get right. And in my opinion, purely event-driven code is easier to understand and get right than time-slicing linear code with a lot of mutexes. And, of course, avoiding sharing of resources by design as much as possible. FIFOs are generally a better idea than sharing a resource with mutexes, for example.

Learning from past mistakes is always a good idea. For example, the classic Apollo "computer overload" incident was basically the equivalent of wiring an interrupt pin of a CPU to an external source that is not proven to limit the minimum interrupt period. The takeaway: if there is any uncertainty about the minimum period (or even about signal integrity causing false edges), a pattern in which that IRQ is temporarily disabled and later re-enabled by a timer IRQ can be used to limit the worst-case rate.
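A minimal host-side sketch of that rate-limiting pattern. The enable flag stands in for the real interrupt mask bit, and all names are hypothetical; on an actual MCU the two handlers would be the external-pin ISR and a hold-off timer ISR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the interrupt-enable bit and the handler's work. */
volatile bool ext_irq_enabled = true;
volatile uint32_t events_handled = 0;

/* ISR for the untrusted external source: handle one event, then mask itself. */
void ext_irq_handler(void)
{
    if (!ext_irq_enabled)
        return;                  /* on real hardware a masked IRQ never gets here */
    events_handled++;
    ext_irq_enabled = false;     /* stay masked until the hold-off timer expires */
}

/* ISR for a timer set to the shortest event period the system can tolerate. */
void holdoff_timer_handler(void)
{
    ext_irq_enabled = true;      /* accept the next external event again */
}
```

However fast the pin toggles, the handler body runs at most once per hold-off period, so the worst-case CPU load from this source is bounded by the timer setting rather than by the outside world.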
« Last Edit: July 08, 2022, 10:29:49 am by Siwastaja »
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19517
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Parallelism is notoriously difficult to get right. And in my opinion, purely event-driven code is easier to understand and get right than time-slicing linear code with a lot of mutexes. And, of course, avoiding sharing of resources by design as much as possible. FIFOs are generally a better idea than sharing a resource with mutexes, for example.

That's pretty much my belief.

Event->interrupt->capture event put in fifo->return from interrupt. Forever loop: wait until event in fifo->process event to completion. Completion can be putting an event in the same queue or sending it to another FSM for processing.

That is simple and, coupled with the half-sync-half-async design pattern, is efficient and predictable. It does require suitable events that can allow "processing event to completion", but that is usually not a major problem in real-world systems.
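A minimal sketch of that structure, assuming one interrupt producer and one main-loop consumer. The event and state names are hypothetical, and a real target may additionally need memory barriers or a brief interrupt mask around the index updates.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_SIZE 8u   /* power of two, so indices wrap with a simple mask */

typedef enum { EV_BUTTON, EV_TIMEOUT } event_t;   /* hypothetical events */
typedef enum { ST_OFF, ST_ON } state_t;           /* hypothetical FSM states */

volatile event_t fifo[FIFO_SIZE];
volatile uint8_t head = 0;   /* advanced only by the interrupt side */
volatile uint8_t tail = 0;   /* advanced only by the main loop */

/* Interrupt side: capture the event, put it in the FIFO, return. */
bool fifo_put(event_t ev)
{
    if ((uint8_t)(head - tail) == FIFO_SIZE)
        return false;                      /* full: the event is dropped */
    fifo[head & (FIFO_SIZE - 1u)] = ev;
    head++;
    return true;
}

/* Main-loop side: fetch the oldest event, if any. */
bool fifo_get(event_t *ev)
{
    if (head == tail)
        return false;                      /* empty */
    *ev = fifo[tail & (FIFO_SIZE - 1u)];
    tail++;
    return true;
}

/* Process one event to completion and return the new state. */
state_t fsm_step(state_t s, event_t ev)
{
    switch (s) {
    case ST_OFF: return (ev == EV_BUTTON) ? ST_ON : ST_OFF;
    case ST_ON:  return (ev == EV_BUTTON || ev == EV_TIMEOUT) ? ST_OFF : ST_ON;
    }
    return s;
}

/* Body of the forever loop: drain the FIFO, one event at a time. */
state_t process_pending(state_t s)
{
    event_t ev;
    while (fifo_get(&ev))
        s = fsm_step(s, ev);
    return s;
}
```

The only shared data is the FIFO itself, and each index has exactly one writer, which is what keeps mutexes out of the picture.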
 

Online tellurium

  • Regular Contributor
  • *
  • Posts: 231
  • Country: ua
Everything just... works, running in parallel. Modules can be treated as "separate programs", and state machine is implemented as just functions, triggered by something - usually HW timer or other peripheral. No problem doing "long" ISRs - just make more important stuff higher priority, and pre-emption works.

I'd like to see a simple project, e.g. blinky with  UART based control (e.g. to change blink intervals, blink counts, or pwm), implemented in 3 paradigms:
1. superloop
2. os
3. ISRs with priorities

Each approach should not use any external dependency and be as small and simple as possible.
Open source embedded network library https://mongoose.ws
TCP/IP stack + TLS1.3 + HTTP/WebSocket/MQTT in a single file
 
The following users thanked this post: Siwastaja

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19517
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Everything just... works, running in parallel. Modules can be treated as "separate programs", and state machine is implemented as just functions, triggered by something - usually HW timer or other peripheral. No problem doing "long" ISRs - just make more important stuff higher priority, and pre-emption works.

I'd like to see a simple project, e.g. blinky with  UART based control (e.g. to change blink intervals, blink counts, or pwm), implemented in 3 paradigms:
1. superloop
2. os
3. ISRs with priorities

Each approach should not use any external dependency and be as small and simple as possible.

While that sounds like good idea, just about any approach, e.g. assembler, is sufficient for a trivial project.

The problem is understanding how architectures do and don't scale as the number of inputs and outputs increases, the number of sources of inputs and outputs increases, interaction between them becomes trickier, processing complexity increases, asynchronous vs synchronous APIs are encountered, extra requirements are added late in the implementation process, years elapse and people forget, new people are added to a project, etc.

The value of "hello world" and "blinky" projects is that they are a first step in implementing something, anything, in your chosen architecture, to gain confidence that you have your toolchain working from end to end. They do not and cannot show the strengths/weaknesses of architectures.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8179
  • Country: fi
Portability is also a point. If you base your project on the premise of having an interrupt controller with pre-emptive prioritized interrupts, then it won't be trivial to port it to a simpler microcontroller that does not have this feature. An OS has the benefit that, as long as the same OS has been ported to the other platform, basic concurrency features work without any porting. Though, the hardware reality still leaks through abstractions: performance can differ a lot depending on what HW resources are available to the OS.
 

Online tellurium

  • Regular Contributor
  • *
  • Posts: 231
  • Country: ua
I actually see a huge value in such a project.

Many people are accustomed to only one or two approaches - they just got familiar with them, and stick to them even if another approach is better for the task. And I know that working examples are a big deal. They're one of the best learning tools in software engineering.

The main goal is to show the code structure and execution flow, because THAT is what differentiates the approaches, not scalability/portability/whatever. Even implemented on a single arch, e.g. on the ubiquitous STM32 bluepill.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8179
  • Country: fi
I actually see a huge value in such a project.

Many people are accustomed to only one or two approaches - they just got familiar with them, and stick to them even if another approach is better for the task. And I know that working examples are a big deal. They're one of the best learning tools in software engineering.

The main goal is to show the code structure and execution flow, because THAT is what differentiates the approaches, not scalability/portability/whatever. Even implemented on a single arch, e.g. on the ubiquitous STM32 bluepill.

It would be a good demonstration, i.e., an explanation in form of code what these paradigms actually mean, but due to points tggzzz raised, obviously not a totally fair comparison. You can't find a "winner" that way.

1 and 3 also tend to mix up to some kind of hybrid. Also note how even within 2 (the OS), there are many options, like superloop/select()/poll() vs. threads.

Personally, I use all three approaches quite equally, but in different types of projects.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6265
  • Country: fi
    • My home page and email address
And it doesn't need to be a generic 1ms tick, and you don't need to write any kind of scheduler, at all, the CPU can take care of it: just set a timer peripheral for whatever delay you actually need, and make it trigger the next state function directly.
For tasks at the same priority level, a single timer and a binary min-heap to hold the next firing time works well.  I use it frequently in Linux, for all sorts of timeouts.

(For those who are unaware, given uniform random keys, such a binary min-heap only does an average of e ~ 2.7 percolations per insert/deleteMin, the data representation is a simple array, and on an X-bit arch, if you can live with say (X-N)-bit timestamps and use the low N bits to indicate the cause, you can treat the combined value as a X-bit timestamp, i.e. no masking needed.  Whenever the actual interrupt fires, you dispatch all tasks that have elapsed thus far, so the implicit ordering of the N-bit causes among same timestamps only orders the dispatching during the same interrupt, which would usually happen anyway.  The only "dirty" bit is that you need to handle the case when the interrupt would fire almost instantly in the previous interrupt, to cater to the interrupt setup overhead, so there is one "critical" interrupts-disabled section for a few cycles, where you check ARM_DWT_CYCCNT and if it has not yet progressed too far, arm the next interrupt –– note that this assumes a single-shot timer with cheap next firing interval setup.  But otherwise the code is simple, clean, lightweight, and concise to implement.)
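A minimal sketch of such a timer queue, assuming 32-bit keys with N = 4 cause bits and a fixed capacity. The names are made up for illustration, and timestamp wraparound is deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAUSE_BITS 4u    /* N: the low bits identify the cause */
#define HEAP_CAP   16u   /* hypothetical fixed capacity */

typedef struct {
    uint32_t key[HEAP_CAP];   /* (timestamp << CAUSE_BITS) | cause */
    size_t   len;
} timer_heap;

/* Pack a timestamp and a cause so that a plain integer compare orders by time. */
uint32_t make_key(uint32_t when, uint32_t cause)
{
    return (when << CAUSE_BITS) | (cause & ((1u << CAUSE_BITS) - 1u));
}

/* Insert: percolate the new key up from the end of the array. */
void heap_push(timer_heap *h, uint32_t key)
{
    assert(h->len < HEAP_CAP);
    size_t i = h->len++;
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (h->key[parent] <= key)
            break;
        h->key[i] = h->key[parent];
        i = parent;
    }
    h->key[i] = key;
}

/* deleteMin: take the root, then percolate the last key down from the top. */
uint32_t heap_pop_min(timer_heap *h)
{
    assert(h->len > 0);
    uint32_t min = h->key[0];
    uint32_t last = h->key[--h->len];
    size_t i = 0;
    for (;;) {
        size_t child = 2 * i + 1;
        if (child >= h->len)
            break;
        if (child + 1 < h->len && h->key[child + 1] < h->key[child])
            child++;                       /* pick the smaller child */
        if (last <= h->key[child])
            break;
        h->key[i] = h->key[child];
        i = child;
    }
    h->key[i] = last;
    return min;
}

/* In the timer interrupt: dispatch everything that has elapsed so far.
   Causes are stored in firing order; returns how many were due. */
size_t dispatch_due(timer_heap *h, uint32_t now, uint32_t *causes, size_t max)
{
    size_t n = 0;
    while (h->len > 0 && n < max && (h->key[0] >> CAUSE_BITS) <= now)
        causes[n++] = heap_pop_min(h) & ((1u << CAUSE_BITS) - 1u);
    return n;
}
```

After dispatch_due() returns, key[0] (if any) holds the next firing time, which is what the single-shot timer would be re-armed with.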

As mentioned by others above, this is very much co-operative time-sharing: each invocation must complete within an acceptable number of cpu cycles, and they do not really run in parallel at all.  It is very much an event-driven asynchronous model, which makes things easier if you approach the development from that angle (instead of thread/task/procedural angle).



Now, the only thing one really needs to be able to switch between tasks with the same privileges ("userspace threads") or coroutines, is to switch the stack and the register file.  C does specify setjmp()/longjmp(), but the jmp_buf structure is opaque.

After recent experimentation (and using terminology where the 'bottom' of the stack is where the stack is empty, and 'top' is where new data is added, even for downwards-growing stacks), I've found that if each 'stack' actually consists of register file storage before/just beyond/outside the bottom of the stack, storing the register state when the stack is not in use, it is very compatible with freestanding C/C++, and you get extremely lightweight same-privilege task switches, and you can use a pointer to the 'stack' (really, the register file storage of that stack) as the task identifier.  Even yield() calls simplify to (store register state to the file before the bottom of the stack, and look for something else to run).  It is not an RTOS, but it might be useful for bare-metal developers for cooperative multitasking and pre-emptive multitasking (if they write their own task scheduler), and also for educational purposes.

You can even add additional fields there, say one describing the priority of whatever the task is doing now (with peripheral accesses bounded by setting that to "critical, please don't yield"), without disabling interrupts.  Then, the scheduler noting that the task is "critical", could simply set another flag, check that the task is not stuck, and let the task progress.  When the next time slice elapses, or the task resets the priority to non-critical, it auto-yields due to the scheduler-set flag.

(You do need to replace some/most of newlib (the base C library you use), though.  Which doesn't bother me, because I do want to replace it with something better; I have a topic about that here somewhere already.)

What has stopped me thus far from even starting this as a proper project, is stack safety/collisions.  I really don't have any tools (except for hardware ones, like MMU or protection registers) to ensure tasks won't overflow their stacks... I just get very unsure when having to just "trust" code to behave well  :-\
 

Offline TC

  • Contributor
  • Posts: 40
  • Country: us
Disclaimer... I haven't read every reply to this topic. But the posted question was about coding state machines.

I found this book to be an excellent resource on the topic...

Practical UML Statecharts in C/C++ - Miro Samek
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
What has stopped me thus far from even starting this as a proper project, is stack safety/collisions.  I really don't have any tools (except for hardware ones, like MMU or protection registers) to ensure tasks won't overflow their stacks... I just get very unsure when having to just "trust" code to behave well  :-\

FreeRTOS can do a stack check before it dispatches a task:

https://www.freertos.org/Stacks-and-stack-overflow-checking.html
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Everything just... works, running in parallel. Modules can be treated as "separate programs", and state machine is implemented as just functions, triggered by something - usually HW timer or other peripheral. No problem doing "long" ISRs - just make more important stuff higher priority, and pre-emption works.

I'd like to see a simple project, e.g. blinky with  UART based control (e.g. to change blink intervals, blink counts, or pwm), implemented in 3 paradigms:
1. superloop
2. os
3. ISRs with priorities

Each approach should not use any external dependency and be as small and simple as possible.

While not exactly what you want, FreeRTOS comes with a LOT of ports and if the board is also 'mbed' compatible there are even more ports at mbed.org.

Try to pick a chip with an ARM NVIC (Nested Vectored Interrupt Controller) peripheral.  Write your interrupt handlers as simple C functions.

Here is the code to set up the NVIC on an LPC1768

Code: [Select]
   
NVIC_SetVector(SPI_IRQn, (uint32_t) &spi_slave_handler);
NVIC_SetPriority(SPI_IRQn, 1);
NVIC_EnableIRQ(SPI_IRQn);

Here is the beginning of the code for 'spi_slave_handler()'.  Note that it is an ordinary function with no special attributes.

Code: [Select]
void spi_slave_handler(void) {
    unsigned char status;
    unsigned char value;
    ...
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Quote
I'd like to see a simple project, e.g. blinky with  UART based control (e.g. to change blink intervals, blink counts, or pwm), implemented in 3 paradigms
The thing is, "simple projects" are really easy.   Short, well-behaved "tasks" with no resource contention - no problem.
It's when they get more complicated that you run into trouble.

For instance, I recently have had cause to look at the "micros()" implementation on rp2040 in Arduino, using the Philhower core vs the Arduino (MBed) core.
As it turns out, there's a hardware timer that counts microseconds in a 64bit register, so all the code really needs to do is return the low 32bits of that counter.
The Philhower core calls the SDK function to get the 64bit time, and returns 32bits of it.  That's ... fair.

The MBed core has decided that "time" is a protected OS resource, and has multiple levels of "protections" to make sure there are no conflicts in accessing it. There may be additional protections to prevent conflicts between the two CPUs.  As a result, micros() takes about 4 us to execute (on a 120MHz CPU)!  That's AWFUL.  And yet, the problem statement "clock values should be consistent across multiple tasks" isn't an obviously awful prerequisite...
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4040
  • Country: nz
As a result, micros() takes about 4 us to execute (on a 120MHz CPU)!  That's AWFUL.

Ugh!  I just tried on an Uno, calling micros() 400 times in a loop and storing the result in an unsigned long array (i.e. 4 bytes).

It normally incremented by 4 us each time, but about once in 60-70 times got the same micros() result twice in a row.

Getting a pointer to the array outside the loop increased the "same twice in a row" frequency to once in 16 times. Storing only the low 8 bits of micros() increased the "same twice in a row" frequency to once in 7 times.

So, yeah, definitely being limited by the 16 MHz CPU and needing 4 instructions to move a 32 bit value around. The loop is taking a little less than 64 clock cycles to execute.

Trying exactly the same sketch on the HiFive1 (FE310 RISC-V from 2016), also running at 16 MHz (but a 32 bit CPU), the result of micros() changes by 1 unit every call, except once in 16 calls the same value is returned twice. So the loop takes very slightly less than 1 us to execute -- 15 clock cycles in fact.

With the clock changed to 256 MHz, 376 out of 400 calls to micros() return the same value as the previous call. The micros() value increments by only 23 in 400 calls, so the loop executes in 57.5 ns or 15 clock cycles.

The loop calling micros() looks like:

Code: [Select]
2040015c:       64040993                addi    s3,s0,1600  // note: s0 was already gp-2020, the address of a[]
20400160:       81c18a13                addi    s4,gp,-2020

20400164:       28f1                    jal     20400240 <micros>
20400166:       00aa2023                sw      a0,0(s4)
2040016a:       0a11                    addi    s4,s4,4
2040016c:       ff3a1ce3                bne     s4,s3,20400164 <loop+0x6c>

And micros() looks like:

Code: [Select]
20400240 <micros>:
20400240:       b8002573                csrr    a0,mcycleh
20400244:       b00027f3                csrr    a5,mcycle
20400248:       b8002773                csrr    a4,mcycleh
2040024c:       fee51ae3                bne     a0,a4,20400240 <micros>
20400250:       01851293                slli    t0,a0,0x18
20400254:       0087d313                srli    t1,a5,0x8
20400258:       0062e533                or      a0,t0,t1
2040025c:       8082                    ret

So that gets the 64 bit cycle count, divides it by 256, and returns the lower 32 bits: the hi 24 bits of the lo word, shifted right by 8, combined with the lo 8 bits of the hi word, shifted left by 24 (0x18).

In total, 12 instructions in 15 cycles because the three CSR reads take 2 cycles each.
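For anyone who wants to check the shift/or arithmetic on a host, here is a small sketch of the same combination in C, compared against a plain 64-bit shift. The function names are made up for illustration; this is not the actual core source.

```c
#include <stdint.h>

/* What the slli/srli/or sequence computes from the two 32-bit halves:
   the low 8 bits of the high word land in bits 24..31, and the high
   24 bits of the low word land in bits 0..23. */
uint32_t micros_from_halves(uint32_t hi, uint32_t lo)
{
    return (hi << 24) | (lo >> 8);
}

/* The same result written as one 64-bit operation:
   divide the cycle count by 256 and keep the lower 32 bits. */
uint32_t micros_from_cycles(uint64_t cycles)
{
    return (uint32_t)(cycles >> 8);
}
```

Both functions return bits 8..39 of the 64-bit count, so they agree for every input.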

My Arduino code (identical for Uno and HiFive1):

Code: [Select]
#define N 400
unsigned long a[N];

void setup() {
  pinMode(LED_BUILTIN, OUTPUT);
  Serial.begin(115200);
  delay(100);
  Serial.println("Starting");
}

int first = 1;

void loop() {
  if (first){
    Serial.println("first");
    unsigned long *p = a;
    int dups = 0;
    for (int i=0; i<N; ++i) p[i] = micros();
    for (int i=0; i<N; ++i){
      Serial.println((unsigned long)p[i]);
      if (i>0 && p[i] == p[i-1]){
        Serial.println("====");
        ++dups;
      }
    }
    Serial.print("Dups = ");
    Serial.println(dups);
    first = 0;
  }
  digitalWrite(LED_BUILTIN, HIGH);   // turn the LED on (HIGH is the voltage level)
  delay(1000);                       // wait for a second
  digitalWrite(LED_BUILTIN, LOW);    // turn the LED off by making the voltage LOW
  delay(1000);                       // wait for a second
}
« Last Edit: July 09, 2022, 04:04:32 am by brucehoult »
 
The following users thanked this post: tellurium

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Quote
I just tried on an Uno ...
It normally incremented by 4 us each time, but about once in 60-70 times got the same micros() result twice in a row.
On an Uno, the micros() function only has a resolution of 4us.  So the function itself (which does 32-bit math on the timer0 overflow count and the timer register) is taking slightly less than 4us to execute.  60-odd clocks including the loop and store doesn't seem unreasonable.
Code: [Select]
unsigned long micros() {
        unsigned long m;
        uint8_t oldSREG = SREG, t;
 3b8:   3f b7           in      r19, 0x3f       ; 63
       
        cli();
 3ba:   f8 94           cli
        m = timer0_overflow_count;
 3bc:   80 91 39 01     lds     r24, 0x0139     ;  <timer0_overflow_count>
 3c0:   90 91 3a 01     lds     r25, 0x013A     ;  <timer0_overflow_count+0x1>
 3c4:   a0 91 3b 01     lds     r26, 0x013B     ;  <timer0_overflow_count+0x2>
 3c8:   b0 91 3c 01     lds     r27, 0x013C     ;  <timer0_overflow_count+0x3>
        t = TCNT0;
 3cc:   26 b5           in      r18, 0x26       ; 38
        if ((TIFR0 & _BV(TOV0)) && (t < 255))
 3ce:   a8 9b           sbis    0x15, 0 ; 21
 3d0:   05 c0           rjmp    .+10            ; 0x3dc <micros+0x24>
 3d2:   2f 3f           cpi     r18, 0xFF       ; 255
 3d4:   19 f0           breq    .+6             ; 0x3dc <micros+0x24>
                m++;
 3d6:   01 96           adiw    r24, 0x01       ; 1
 3d8:   a1 1d           adc     r26, r1
 3da:   b1 1d           adc     r27, r1
        SREG = oldSREG;
 3dc:   3f bf           out     0x3f, r19       ; 63
        return ((m << 8) + t) * (64 / clockCyclesPerMicrosecond());
 3de:   ba 2f           mov     r27, r26
 3e0:   a9 2f           mov     r26, r25
 3e2:   98 2f           mov     r25, r24
 3e4:   88 27           eor     r24, r24
 3e6:   bc 01           movw    r22, r24
 3e8:   cd 01           movw    r24, r26
 3ea:   62 0f           add     r22, r18
 3ec:   71 1d           adc     r23, r1
 3ee:   81 1d           adc     r24, r1
 3f0:   91 1d           adc     r25, r1
 3f2:   42 e0           ldi     r20, 0x02       ; 2
 3f4:   66 0f           add     r22, r22
 3f6:   77 1f           adc     r23, r23
 3f8:   88 1f           adc     r24, r24
 3fa:   99 1f           adc     r25, r25
 3fc:   4a 95           dec     r20
 3fe:   d1 f7           brne    .-12            ; 0x3f4 <micros+0x3c>
}
 400:   08 95           ret
For the same functionality to take the same time on a 32bit processor with a hardware counter counting in the right units is disgraceful!
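To see where the 4us resolution comes from, the return expression above can be replayed on a host. A minimal sketch, assuming the Uno's 16 MHz clock (so clockCyclesPerMicrosecond() is 16); avr_micros and its parameters are hypothetical names, not part of the Arduino core.

```c
#include <stdint.h>

/* Same arithmetic as the return statement of micros():
   ((m << 8) + t) * (64 / clockCyclesPerMicrosecond())
   overflow_count plays the role of m, tcnt0 the role of t. */
uint32_t avr_micros(uint32_t overflow_count, uint8_t tcnt0, uint32_t cycles_per_us)
{
    return ((overflow_count << 8) + tcnt0) * (64u / cycles_per_us);
}
```

At 16 cycles per microsecond the multiplier is 64 / 16 = 4, so one step of the timer register moves the result by 4 us, and one overflow (256 steps) moves it by 1024 us.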
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6265
  • Country: fi
    • My home page and email address
What has stopped me thus far from even starting this as a proper project, is stack safety/collisions.  I really don't have any tools (except for hardware ones, like MMU or protection registers) to ensure tasks won't overflow their stacks... I just get very unsure when having to just "trust" code to behave well  :-\
FreeRTOS can do a stack check before it dispatches a task:
So can I, but 1) how much is enough stack space, and 2) that's at only a point in the task time slice.

Making hand-wavy guesses how much stack is actually needed feels.. icky.  I can instrument the stack (fill with 0xdeadbeef) and provide functions to check how far the stack was meddled with.  It shows that I'm used to having an MMU (and virtual memory)!
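A minimal sketch of that instrumentation, assuming a downward-growing stack stored as an array of 32-bit words (so the lowest addresses are the last to be used). The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xdeadbeefu

/* Paint the whole stack area before the task runs for the first time. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++)
        stack[i] = STACK_FILL;
}

/* Count the words at the low end that still hold the fill pattern.
   With a downward-growing stack this is the observed headroom so far. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL)
        n++;
    return n;
}
```

This only reports the deepest use seen so far, not a guaranteed bound, and a frame that reserves space without writing to it can skip over painted words. So it helps with sizing, but it does not replace a hardware guard.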
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4040
  • Country: nz
Quote
I just tried on an Uno ...
It normally incremented by 4 us each time, but about once in 60-70 times got the same micros() result twice in a row.
On an Uno, the micros() function only has a resolution of 4us.  So the function itself (which does 32-bit math on the timer0 overflow count and the timer register) is taking slightly less than 4us to execute.

That's ... exactly what I said.

Quote
60-odd clocks including the loop and store doesn't seem unreasonable.

As I said, it needs four instructions to manipulate a 32 bit value, so yes that seems reasonable, for an 8 bit CPU at 16 MHz.

Quote
For the same functionality to take the same time on a 32bit processor with a hardware counter counting in the right units is disgraceful!

Depends on the clock rate.

Taking a little under 1 µs (four times faster) on a 32 bit processor (E31) that is running at the same 16 MHz as the AVR seems also fine.

Taking 4 µs on a 32 bit processor running at 120 MHz would indeed be awful. 480 clock cycles. Hard to see how you could even do that.

The E31, as noted, scales perfectly from 15/16 µs at 16 MHz to 15/256 µs at 256 MHz.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6265
  • Country: fi
    • My home page and email address
One can often enable static as well as dynamic / instrumented analysis to look for stack overflows, instead of or in addition to whatever stack safety can be had from the MMU or other memory / page protection mechanisms that may exist:
True.  Because I usually work in a hosted environment (full OS with MMU and virtual memory), I didn't know GCC supports -fstack-limit-symbol=sym, which I can probably synthesize.  Thanks for pointing it out!  Now I have no excuses left...

Welp, I think I next need to do some extensive testing to find out which makes more sense on ARM: limiting the choice of stack alignments to a power of two (so that clearing the low n bits of the current stack pointer, i.e. masking off 2ⁿ-1, yields the smallest allowed stack address), or reserving a register for this (via -ffixed-reg -fstack-limit-register=reg; the register then acts as both the stack limit and a base address (minus a compile-time constant) for the task structure containing the register file).  It definitely sounds intriguing, and the latter seems much more doable; but again, one must recompile all libraries, including the HAL and the C library one uses, with those flags for this to work.
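The power-of-two variant can be sketched in a few lines.  This assumes every stack is exactly 2ⁿ bytes long and aligned to 2ⁿ bytes; the function names are invented for illustration, and real code would read the stack pointer with inline assembly instead of taking it as an argument.

```c
#include <assert.h>
#include <stdint.h>

/* Lowest address of the stack that contains `sp`, assuming each stack
   is 2^n bytes long and aligned to a 2^n byte boundary. */
uintptr_t stack_base(uintptr_t sp, unsigned n)
{
    assert(n < sizeof(uintptr_t) * 8);
    uintptr_t mask = ((uintptr_t)1 << n) - 1;   /* 2^n - 1 */
    return sp & ~mask;                          /* clear the low n bits */
}

/* Bytes still available below `sp` in that stack. */
uintptr_t stack_free(uintptr_t sp, unsigned n)
{
    return sp - stack_base(sp, n);
}
```

One wrinkle: with a full-descending stack, the initial stack pointer of an empty stack sits exactly at base + 2ⁿ, which masks to the base of the next stack, so that corner case needs separate handling.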

Speaking of FSMs, RTOS, embedded systems architecture, et al.:
what kinds of useful tools are people here using, or familiar with, for architecture / design patterns / frameworks / implementation etc. relating to system design and elaboration?
Graphviz has become indispensable for me.  The way it just eats human-readable, easily machine-generated text, and spits out graphs, directed graphs, flowcharts, and so on, has made a big difference for me.  I use it both in verification/debugging/unit testing and in the actual planning stage.  (Plus, since it is just plain text, you can easily support it in your preferred markup language.)
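As an illustration of how little it takes to machine-generate that text, here is a minimal sketch that dumps a state machine's transition table in the DOT language.  The struct and function names are made up for this example; only the DOT syntax itself comes from Graphviz.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One edge of a state machine: on `event`, go from `from` to `to`. */
struct transition {
    const char *from;
    const char *to;
    const char *event;
};

/* Write the table as a Graphviz digraph to `out`.
   Returns the number of edges written. */
size_t fsm_to_dot(FILE *out, const char *name,
                  const struct transition *t, size_t count)
{
    assert(out != NULL && name != NULL && (t != NULL || count == 0));
    fprintf(out, "digraph %s {\n", name);
    for (size_t i = 0; i < count; i++)
        fprintf(out, "    \"%s\" -> \"%s\" [label=\"%s\"];\n",
                t[i].from, t[i].to, t[i].event);
    fprintf(out, "}\n");
    return count;
}
```

Calling fsm_to_dot(stdout, "fsm", table, n) from a test harness and piping the result through dot -Tsvg gives a diagram that always matches the table actually compiled into the firmware.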
« Last Edit: July 09, 2022, 05:56:35 pm by Nominal Animal »
 
The following users thanked this post: evb149

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14488
  • Country: fr
Graphviz is good, but of course it's fully automated (which is a benefit), and doesn't really allow any hand placement of anything. If you want something more flexible, there is yEd: https://www.yworks.com/products/yed

I use it when I need something that can be edited manually. It also supports a number of automatic placement algorithms, but you can further arrange things manually. It's free, but not open-source.
 
The following users thanked this post: evb149

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Quote
Taking 4 µs on a 32 bit processor running at 120 MHz would indeed be awful. 480 clock cycles. Hard to see how you could even do that.
Here it is in all its ugliness!
micros()->(crit)elapsed_time->(crit)slicetime->(crit)ticker_read_us->initialize/(core_crit)/update_present_time->math
Code:
10004484 <micros>:

unsigned long micros() {
10004484:    b507          push    {r0, r1, r2, lr}
  return timer.elapsed_time().count();
10004486:    4903          ldr    r1, [pc, #12]    ; (10004494 <micros+0x10>)
10004488:    4668          mov    r0, sp
1000448a:    f002 fc80     bl    10006d8e <mbed::TimerBase::elapsed_time() const>
}
1000448e:    9800          ldr    r0, [sp, #0]
10004490:    bd0e          pop    {r1, r2, r3, pc}
10004492:    46c0          nop            ; (mov r8, r8)
10004494:    20000fc8     .word    0x20000fc8

---------

10006d8e <mbed::TimerBase::elapsed_time() const>:
10006d8e:    b530          push    {r4, r5, lr}
10006d90:    000d          movs    r5, r1
10006d92:    b085          sub    sp, #20
10006d94:    0004          movs    r4, r0
10006d96:    a803          add    r0, sp, #12
10006d98:    f002 f820     bl    10008ddc <mbed::CriticalSectionLock::CriticalSectionLock()>
10006d9c:    0029          movs    r1, r5
10006d9e:    4668          mov    r0, sp
10006da0:    f7ff ffda     bl    10006d58 <mbed::TimerBase::slicetime() const>
10006da4:    68a8          ldr    r0, [r5, #8]
10006da6:    68e9          ldr    r1, [r5, #12]
10006da8:    9a00          ldr    r2, [sp, #0]
10006daa:    9b01          ldr    r3, [sp, #4]
10006dac:    1812          adds    r2, r2, r0
10006dae:    414b          adcs    r3, r1
10006db0:    a803          add    r0, sp, #12
10006db2:    6022          str    r2, [r4, #0]
10006db4:    6063          str    r3, [r4, #4]
10006db6:    f002 f817     bl    10008de8 <mbed::CriticalSectionLock::~CriticalSectionLock()>
10006dba:    0020          movs    r0, r4
10006dbc:    b005          add    sp, #20
10006dbe:    bd30          pop    {r4, r5, pc}

-------

10006d58 <mbed::TimerBase::slicetime() const>:
10006d58:    b537          push    {r0, r1, r2, r4, r5, lr}
10006d5a:    0004          movs    r4, r0
10006d5c:    a801          add    r0, sp, #4
10006d5e:    000d          movs    r5, r1
10006d60:    f002 f83c     bl    10008ddc <mbed::CriticalSectionLock::CriticalSectionLock()>
10006d64:    2300          movs    r3, #0
10006d66:    2200          movs    r2, #0
10006d68:    6022          str    r2, [r4, #0]
10006d6a:    6063          str    r3, [r4, #4]
10006d6c:    7d6b          ldrb    r3, [r5, #21]
10006d6e:    2b00          cmp    r3, #0
10006d70:    d008          beq.n    10006d84 <mbed::TimerBase::slicetime() const+0x2c>
10006d72:    6928          ldr    r0, [r5, #16]
10006d74:    f001 fff0     bl    10008d58 <ticker_read_us>
10006d78:    682a          ldr    r2, [r5, #0]
10006d7a:    686b          ldr    r3, [r5, #4]
10006d7c:    1a80          subs    r0, r0, r2
10006d7e:    4199          sbcs    r1, r3
10006d80:    6020          str    r0, [r4, #0]
10006d82:    6061          str    r1, [r4, #4]
10006d84:    a801          add    r0, sp, #4
10006d86:    f002 f82f     bl    10008de8 <mbed::CriticalSectionLock::~CriticalSectionLock()>
10006d8a:    0020          movs    r0, r4
10006d8c:    bd3e          pop    {r1, r2, r3, r4, r5, pc}

------

10008d58 <ticker_read_us>:
10008d58:    b570          push    {r4, r5, r6, lr}
10008d5a:    0004          movs    r4, r0
10008d5c:    f7ff ff2e     bl    10008bbc <initialize>
10008d60:    f000 fc24     bl    100095ac <core_util_critical_section_enter>
10008d64:    0020          movs    r0, r4
10008d66:    f7ff fe41     bl    100089ec <update_present_time>
10008d6a:    6863          ldr    r3, [r4, #4]
10008d6c:    6a9c          ldr    r4, [r3, #40]    ; 0x28
10008d6e:    6add          ldr    r5, [r3, #44]    ; 0x2c
10008d70:    f000 fc32     bl    100095d8 <core_util_critical_section_exit>
10008d74:    0029          movs    r1, r5
10008d76:    0020          movs    r0, r4
10008d78:    bd70          pop    {r4, r5, r6, pc}

-------

100089ec <update_present_time>:
100089ec:    b5f7          push    {r0, r1, r2, r4, r5, r6, r7, lr}
100089ee:    6846          ldr    r6, [r0, #4]
100089f0:    0033          movs    r3, r6
100089f2:    3332          adds    r3, #50    ; 0x32
100089f4:    781c          ldrb    r4, [r3, #0]
100089f6:    2c00          cmp    r4, #0
100089f8:    d132          bne.n    10008a60 <update_present_time+0x74>
100089fa:    6803          ldr    r3, [r0, #0]
100089fc:    685b          ldr    r3, [r3, #4]
100089fe:    4798          blx    r3
10008a00:    6a32          ldr    r2, [r6, #32]
10008a02:    0003          movs    r3, r0
10008a04:    4282          cmp    r2, r0
10008a06:    d02b          beq.n    10008a60 <update_present_time+0x74>
10008a08:    1a82          subs    r2, r0, r2
10008a0a:    6930          ldr    r0, [r6, #16]
10008a0c:    6233          str    r3, [r6, #32]
10008a0e:    4010          ands    r0, r2
10008a10:    2233          movs    r2, #51    ; 0x33
10008a12:    56b2          ldrsb    r2, [r6, r2]
10008a14:    2a00          cmp    r2, #0
10008a16:    db24          blt.n    10008a62 <update_present_time+0x76>
10008a18:    0021          movs    r1, r4
10008a1a:    f7f7 feb1     bl    10000780 <__aeabi_llsl>
10008a1e:    2734          movs    r7, #52    ; 0x34
10008a20:    57f7          ldrsb    r7, [r6, r7]
10008a22:    0004          movs    r4, r0
10008a24:    000d          movs    r5, r1
10008a26:    2f00          cmp    r7, #0
10008a28:    d014          beq.n    10008a54 <update_present_time+0x68>
10008a2a:    2100          movs    r1, #0
10008a2c:    6a70          ldr    r0, [r6, #36]    ; 0x24
10008a2e:    1824          adds    r4, r4, r0
10008a30:    414d          adcs    r5, r1
10008a32:    9400          str    r4, [sp, #0]
10008a34:    9501          str    r5, [sp, #4]
10008a36:    428f          cmp    r7, r1
10008a38:    db1b          blt.n    10008a72 <update_present_time+0x86>
10008a3a:    003a          movs    r2, r7
10008a3c:    0020          movs    r0, r4
10008a3e:    0029          movs    r1, r5
10008a40:    f7f7 fe92     bl    10000768 <__aeabi_llsr>
10008a44:    003a          movs    r2, r7
10008a46:    0004          movs    r4, r0
10008a48:    000d          movs    r5, r1
10008a4a:    f7f7 fe99     bl    10000780 <__aeabi_llsl>
10008a4e:    9b00          ldr    r3, [sp, #0]
10008a50:    1a18          subs    r0, r3, r0
10008a52:    6270          str    r0, [r6, #36]    ; 0x24
10008a54:    6ab2          ldr    r2, [r6, #40]    ; 0x28
10008a56:    6af3          ldr    r3, [r6, #44]    ; 0x2c
10008a58:    1912          adds    r2, r2, r4
10008a5a:    416b          adcs    r3, r5
10008a5c:    62b2          str    r2, [r6, #40]    ; 0x28
10008a5e:    62f3          str    r3, [r6, #44]    ; 0x2c
10008a60:    bdf7          pop    {r0, r1, r2, r4, r5, r6, r7, pc}
10008a62:    68b1          ldr    r1, [r6, #8]
10008a64:    0002          movs    r2, r0
10008a66:    0023          movs    r3, r4
10008a68:    0008          movs    r0, r1
10008a6a:    0021          movs    r1, r4
10008a6c:    f7f7 ff34     bl    100008d8 <__aeabi_lmul>
10008a70:    e7d5          b.n    10008a1e <update_present_time+0x32>
10008a72:    68f7          ldr    r7, [r6, #12]
10008a74:    000b          movs    r3, r1
10008a76:    9800          ldr    r0, [sp, #0]
10008a78:    9901          ldr    r1, [sp, #4]
10008a7a:    003a          movs    r2, r7
10008a7c:    f7f7 ff0c     bl    10000898 <__aeabi_uldivmod>
10008a80:    4347          muls    r7, r0
10008a82:    9b00          ldr    r3, [sp, #0]
10008a84:    0004          movs    r4, r0
10008a86:    1bdf          subs    r7, r3, r7
10008a88:    000d          movs    r5, r1
10008a8a:    6277          str    r7, [r6, #36]    ; 0x24
10008a8c:    e7e2          b.n    10008a54 <update_present_time+0x68>

-------

10008ddc <mbed::CriticalSectionLock::CriticalSectionLock()>:
10008ddc:    b510          push    {r4, lr}
10008dde:    0004          movs    r4, r0
10008de0:    f000 fbe4     bl    100095ac <core_util_critical_section_enter>
10008de4:    0020          movs    r0, r4
10008de6:    bd10          pop    {r4, pc}

Disassembly of section .text._ZN4mbed19CriticalSectionLockD2Ev:

10008de8 <mbed::CriticalSectionLock::~CriticalSectionLock()>:
10008de8:    b510          push    {r4, lr}
10008dea:    0004          movs    r4, r0
10008dec:    f000 fbf4     bl    100095d8 <core_util_critical_section_exit>
10008df0:    0020          movs    r0, r4
10008df2:    bd10          pop    {r4, pc}

-------

100095ac <core_util_critical_section_enter>:
100095ac:    b510          push    {r4, lr}
100095ae:    f7ff f979     bl    100088a4 <hal_critical_section_enter>
100095b2:    4a06          ldr    r2, [pc, #24]    ; (100095cc <core_util_critical_section_enter+0x20>)
100095b4:    6813          ldr    r3, [r2, #0]
100095b6:    1c59          adds    r1, r3, #1
100095b8:    d104          bne.n    100095c4 <core_util_critical_section_enter+0x18>
100095ba:    223f          movs    r2, #63    ; 0x3f
100095bc:    4904          ldr    r1, [pc, #16]    ; (100095d0 <core_util_critical_section_enter+0x24>)
100095be:    4805          ldr    r0, [pc, #20]    ; (100095d4 <core_util_critical_section_enter+0x28>)
100095c0:    f7ff ff46     bl    10009450 <mbed_assert_internal>
100095c4:    3301          adds    r3, #1
100095c6:    6013          str    r3, [r2, #0]
100095c8:    bd10          pop    {r4, pc}
100095ca:    46c0          nop            ; (mov r8, r8)
100095cc:    2000aa4c     .word    0x2000aa4c
100095d0:    10013a2f     .word    0x10013a2f
100095d4:    10013a59     .word    0x10013a59

Disassembly of section .text.core_util_critical_section_exit:

100095d8 <core_util_critical_section_exit>:
100095d8:    4a05          ldr    r2, [pc, #20]    ; (100095f0 <core_util_critical_section_exit+0x18>)
100095da:    b510          push    {r4, lr}
100095dc:    6813          ldr    r3, [r2, #0]
100095de:    2b00          cmp    r3, #0
100095e0:    d005          beq.n    100095ee <core_util_critical_section_exit+0x16>
100095e2:    3b01          subs    r3, #1
100095e4:    6013          str    r3, [r2, #0]
100095e6:    2b00          cmp    r3, #0
100095e8:    d101          bne.n    100095ee <core_util_critical_section_exit+0x16>
100095ea:    f7ff f96f     bl    100088cc <hal_critical_section_exit>
100095ee:    bd10          pop    {r4, pc}
100095f0:    2000aa4c     .word    0x2000aa4c

-------

100088a4 <hal_critical_section_enter>:
100088a4:    b510          push    {r4, lr}
100088a6:    f3ef 8010     mrs    r0, PRIMASK
100088aa:    b672          cpsid    i
100088ac:    4a05          ldr    r2, [pc, #20]    ; (100088c4 <hal_critical_section_enter+0x20>)
100088ae:    7813          ldrb    r3, [r2, #0]
100088b0:    2b00          cmp    r3, #0
100088b2:    d105          bne.n    100088c0 <hal_critical_section_enter+0x1c>
100088b4:    2101          movs    r1, #1
100088b6:    000c          movs    r4, r1
100088b8:    4b03          ldr    r3, [pc, #12]    ; (100088c8 <hal_critical_section_enter+0x24>)
100088ba:    4384          bics    r4, r0
100088bc:    701c          strb    r4, [r3, #0]
100088be:    7011          strb    r1, [r2, #0]
100088c0:    bd10          pop    {r4, pc}
100088c2:    46c0          nop            ; (mov r8, r8)
100088c4:    2000adff     .word    0x2000adff
100088c8:    2000adfa     .word    0x2000adfa

Disassembly of section .text.hal_critical_section_exit:

100088cc <hal_critical_section_exit>:
100088cc:    b510          push    {r4, lr}
100088ce:    f3ef 8210     mrs    r2, PRIMASK
100088d2:    2301          movs    r3, #1
100088d4:    4393          bics    r3, r2
100088d6:    d004          beq.n    100088e2 <hal_critical_section_exit+0x16>
100088d8:    2236          movs    r2, #54    ; 0x36
100088da:    4906          ldr    r1, [pc, #24]    ; (100088f4 <hal_critical_section_exit+0x28>)
100088dc:    4806          ldr    r0, [pc, #24]    ; (100088f8 <hal_critical_section_exit+0x2c>)
100088de:    f000 fdb7     bl    10009450 <mbed_assert_internal>
100088e2:    4a06          ldr    r2, [pc, #24]    ; (100088fc <hal_critical_section_exit+0x30>)
100088e4:    7013          strb    r3, [r2, #0]
100088e6:    4b06          ldr    r3, [pc, #24]    ; (10008900 <hal_critical_section_exit+0x34>)
100088e8:    781b          ldrb    r3, [r3, #0]
100088ea:    2b00          cmp    r3, #0
100088ec:    d000          beq.n    100088f0 <hal_critical_section_exit+0x24>
100088ee:    b662          cpsie    i
100088f0:    bd10          pop    {r4, pc}
100088f2:    46c0          nop            ; (mov r8, r8)
100088f4:    10013620     .word    0x10013620
100088f8:    10013651     .word    0x10013651
100088fc:    2000adff     .word    0x2000adff
10008900:    2000adfa     .word    0x2000adfa
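
Reading the listing, core_util_critical_section_enter/exit appear to keep a nesting counter on top of the HAL functions, which save the interrupt state on the outermost entry and restore it on the outermost exit.  Below is a host-runnable sketch of that pattern in C.  It is a reconstruction from the disassembly, not the mbed source: the names are invented, and a plain bool stands in for PRIMASK and the cpsid/cpsie instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool     irq_enabled = true;   /* stand-in for the PRIMASK state  */
static bool     irq_was_enabled;      /* state saved by outermost entry  */
static uint32_t nesting;              /* current critical section depth  */

void critical_enter(void)
{
    bool was = irq_enabled;
    irq_enabled = false;              /* cpsid i on the real hardware    */
    if (nesting == 0)
        irq_was_enabled = was;        /* remember state only once        */
    assert(nesting < UINT32_MAX);     /* counter must not wrap           */
    nesting++;
}

void critical_exit(void)
{
    if (nesting == 0)
        return;                       /* unbalanced exit: ignore         */
    nesting--;
    if (nesting == 0 && irq_was_enabled)
        irq_enabled = true;           /* cpsie i on the real hardware    */
}

bool     irqs_enabled(void)   { return irq_enabled; }
uint32_t critical_depth(void) { return nesting; }
```

In the call chain above, a single micros() call enters this three levels deep (elapsed_time, slicetime, ticker_read_us), paying the call and counter overhead each time even though only the outermost level changes the interrupt state.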
 

