Do I really need to use an RTOS? (Alternatives to finite state machines?)

#125 Reply
Posted by brucehoult on 10 Jul, 2022 09:42
Quote from: westfw on 10 Jul, 2022 08:57
Quote
Taking 4 µs on a 32 bit processor running at 120 MHz would indeed be awful. 480 clock cycles. Hard to see how you could even do that.
Here it is in all its ugliness!

micros()->(crit)elapsed_time->(crit)slicetime->(crit)ticker_read_us->initialize/(core_crit)/update_present_time->math

omg. It's worse than that!

Cutting out the non-call stuff...

Code:
unsigned long micros() {
  1000448a: f002 fc80 bl 10006d8e <mbed::TimerBase::elapsed_time() const>

10006d8e <mbed::TimerBase::elapsed_time() const>:
  10006d98: f002 f820 bl 10008ddc <mbed::CriticalSectionLock::CriticalSectionLock()>
  10006da0: f7ff ffda bl 10006d58 <mbed::TimerBase::slicetime() const>
  10006db6: f002 f817 bl 10008de8 <mbed::CriticalSectionLock::~CriticalSectionLock()>

10006d58 <mbed::TimerBase::slicetime() const>:
  10006d60: f002 f83c bl 10008ddc <mbed::CriticalSectionLock::CriticalSectionLock()>
  10006d74: f001 fff0 bl 10008d58 <ticker_read_us>
  10006d86: f002 f82f bl 10008de8 <mbed::CriticalSectionLock::~CriticalSectionLock()>

10008d58 <ticker_read_us>:
  10008d5c: f7ff ff2e bl 10008bbc <initialize>
  10008d60: f000 fc24 bl 100095ac <core_util_critical_section_enter>
  10008d66: f7ff fe41 bl 100089ec <update_present_time>
  10008d70: f000 fc32 bl 100095d8 <core_util_critical_section_exit>

100089ec <update_present_time>:
  100089fa: 6803 ldr r3, [r0, #0]
  100089fc: 685b ldr r3, [r3, #4]
  100089fe: 4798 blx r3
  10008a1a: f7f7 feb1 bl 10000780 <__aeabi_llsl>
  10008a40: f7f7 fe92 bl 10000768 <__aeabi_llsr>
  10008a4a: f7f7 fe99 bl 10000780 <__aeabi_llsl>
  10008a6c: f7f7 ff34 bl 100008d8 <__aeabi_lmul>
  10008a7c: f7f7 ff0c bl 10000898 <__aeabi_uldivmod>

10008ddc <mbed::CriticalSectionLock::CriticalSectionLock()>:
  10008de0: f000 fbe4 bl 100095ac <core_util_critical_section_enter>

10008de8 <mbed::CriticalSectionLock::~CriticalSectionLock()>:
  10008dec: f000 fbf4 bl 100095d8 <core_util_critical_section_exit>

100095ac <core_util_critical_section_enter>:
  100095ae: f7ff f979 bl 100088a4 <hal_critical_section_enter>
  100095c0: f7ff ff46 bl 10009450 <mbed_assert_internal>

100095d8 <core_util_critical_section_exit>:
  100095ea: f7ff f96f bl 100088cc <hal_critical_section_exit>

100088a4 <hal_critical_section_enter>:
  100088a6: f3ef 8010 mrs r0, PRIMASK
  100088aa: b672 cpsid i

100088cc <hal_critical_section_exit>:
  100088ce: f3ef 8210 mrs r2, PRIMASK
  100088de: f000 fdb7 bl 10009450 <mbed_assert_internal>
  100088ee: b662 cpsie i
So ...

mbed::CriticalSectionLock::CriticalSectionLock and core_util_critical_section_enter and hal_critical_section_enter are all essentially the same thing. Same with the corresponding exit/leave.

elapsed_time() puts a lock/unlock around calling slicetime()
slicetime() puts a lock/unlock around calling ticker_read_us()
ticker_read_us() puts a lock/unlock around calling update_present_time()

update_present_time() probably spends most of its cycles in __aeabi_lmul and __aeabi_uldivmod. That may not be avoidable.

It would be useful to:

- have slicetime_with_lock_held() and ticker_read_us_with_lock_held() functions so you don't have to have multiple levels of lock taking. This would also enable inlining them.

- NOT compile with -O0. There is far too much stuff that should simply be inlined, and too much memory traffic.

But I bet __aeabi_lmul and __aeabi_uldivmod are using a significant chunk of the time anyway. The __aeabi_llsl and __aeabi_llsr are hopefully fast.
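To make the "with lock held" suggestion concrete, here is a minimal sketch in portable C. It is not mbed's code: `crit_enter`, `crit_exit`, the `_locked` helpers and the call counter are hypothetical names, and the critical section is simulated with a nesting counter so the example runs on a workstation. It only shows that the layered style pays for three enter/exit pairs per read, while the flattened style pays for one and leaves the inner helpers small enough to inline.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated critical section (assumption): on a Cortex-M this would save
   PRIMASK and disable interrupts; here it only tracks nesting and calls. */
static unsigned crit_depth;
static unsigned crit_calls;          /* number of enter + exit calls made */
static uint64_t fake_ticks = 1234;   /* stands in for the hardware timer */

static void crit_enter(void) { crit_depth++; crit_calls++; }
static void crit_exit(void)  { assert(crit_depth > 0); crit_depth--; crit_calls++; }

/* Inner helpers: the caller must already hold the lock. */
static inline uint64_t ticker_read_us_locked(void) {
    assert(crit_depth > 0);
    return fake_ticks;
}
static inline uint64_t slicetime_locked(void) {
    assert(crit_depth > 0);
    return ticker_read_us_locked();
}

/* Layered style, as in the disassembly: every level locks again. */
static uint64_t ticker_read_us_nested(void) {
    crit_enter();
    uint64_t t = fake_ticks;
    crit_exit();
    return t;
}
static uint64_t slicetime_nested(void) {
    crit_enter();
    uint64_t t = ticker_read_us_nested();
    crit_exit();
    return t;
}
uint64_t elapsed_time_nested(void) {
    crit_enter();
    uint64_t t = slicetime_nested();
    crit_exit();
    return t;
}

/* Flattened style: one lock at the public entry point only. */
uint64_t elapsed_time_flat(void) {
    crit_enter();
    uint64_t t = slicetime_locked();
    crit_exit();
    return t;
}

unsigned crit_call_count(void) { return crit_calls; }
void crit_call_reset(void)     { crit_calls = 0; }
```

Calling `elapsed_time_nested()` once makes six enter/exit calls in this model, `elapsed_time_flat()` makes two. The 64-bit multiply and divide would still be there in either version.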

BUT WHY ON EARTH DO MANUFACTURERS INSIST ON WRITING HAL CODE SO INEFFICIENTLY? Are they trying to upsell faster chips or something?

Maybe they just all have new grads writing this stuff.

#126 Reply
Posted by AVI-crak on 10 Jul, 2022 12:03
Quote from: tggzzz on 12 Jun, 2022 13:32
As has been known since the 60s, mutexes and semaphores are the fundamental mechanism necessary for RTOSs.
An OS without mutexes and semaphores is possible!!! It is enough to let all tasks execute within a single cycle. To adjust the share of CPU each task gets, use a percentage of time: for example, a task runs at 1/50 of the norm, or 1/2, or 1.0. Time is cut like spaghetti, and switching is unconditional. Each task is obliged to give up its time if it has nothing to do, otherwise everything slows down. Switching is very fast, which lets you eliminate many conditions, checks, and confirmations, just by letting the task check its own state. I guarantee that simple code in a task can run much faster than a generic outer piece of code.
A similar principle is implemented in Segger OS, but I don't like closed projects.
https://github.com/AVI-crak/Rtos_cortex

#127 Reply
Posted by DiTBho on 10 Jul, 2022 12:30
Quote from: Nominal Animal on 08 Jul, 2022 18:56
For tasks at the same priority level, a single timer and a binary min-heap to hold the next firing time works well

That's the same solution I am using in hardware for managing some simple devices.
Not the best, but it's the simplest solution to be implemented and tested, and it does the job!

#128 Reply
Posted by brucehoult on 10 Jul, 2022 13:02
Quote from: DiTBho on 10 Jul, 2022 12:30
Quote from: Nominal Animal on 08 Jul, 2022 18:56
For tasks at the same priority level, a single timer and a binary min-heap to hold the next firing time works well

That's the same solution I am using in hardware for managing some simple devices.
Not the best, but it's the simplest solution to be implemented and tested, and it does the job!

Why is it not the best? What on earth else can you do (in software) that is better? And not only for tasks at the same priority level.

With a suitable definition of the keys being compared, including fields for nominal priority, effective priority (highest priority task that is waiting on a mutex you hold), and adjustments for recent amount of CPU time allocated (for fairness) I don't believe there is anything better at scaling from a couple of tasks to thousands.
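As a rough illustration of the heap-with-composite-key idea, here is a small sketch in C. The key layout is an assumption, not anyone's actual scheduler: entries are ordered by next firing time, with ties broken by effective priority (the field a priority-inheritance scheme would raise while a task holds a mutex that a more urgent task is waiting on). The names `task_key`, `task_heap`, `heap_push` and `heap_pop` are hypothetical, and tick wraparound and the fairness adjustment are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS 64

typedef struct {
    uint8_t  eff_prio;   /* effective priority, higher is more urgent */
    uint32_t next_fire;  /* next firing time in ticks (no wraparound handling) */
    int      id;
} task_key;

typedef struct {
    task_key item[MAX_TASKS];
    size_t   count;
} task_heap;

/* Nonzero if a should be served before b: earlier time first,
   then higher effective priority. Extend this to add more key fields. */
static int runs_before(const task_key *a, const task_key *b) {
    if (a->next_fire != b->next_fire) return a->next_fire < b->next_fire;
    return a->eff_prio > b->eff_prio;
}

/* Insert in O(log n). Returns -1 if the heap is full. */
int heap_push(task_heap *h, task_key k) {
    assert(h != NULL);
    if (h->count == MAX_TASKS) return -1;
    size_t i = h->count++;
    while (i > 0) {                       /* sift up */
        size_t parent = (i - 1) / 2;
        if (!runs_before(&k, &h->item[parent])) break;
        h->item[i] = h->item[parent];
        i = parent;
    }
    h->item[i] = k;
    return 0;
}

/* Remove the next entry to serve in O(log n). Returns -1 if empty. */
int heap_pop(task_heap *h, task_key *out) {
    assert(h != NULL && out != NULL);
    if (h->count == 0) return -1;
    *out = h->item[0];
    task_key last = h->item[--h->count];
    size_t i = 0;
    for (;;) {                            /* sift down */
        size_t child = 2 * i + 1;
        if (child >= h->count) break;
        if (child + 1 < h->count &&
            runs_before(&h->item[child + 1], &h->item[child])) child++;
        if (!runs_before(&h->item[child], &last)) break;
        h->item[i] = h->item[child];
        i = child;
    }
    h->item[i] = last;
    return 0;
}
```

A single hardware timer would then be programmed for the `next_fire` of the root entry. The array-backed layout needs no allocation, which is why it suits both a handful of tasks and a large table.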

#129 Reply
Posted by DiTBho on 10 Jul, 2022 13:06
Quote from: AVI-crak on 10 Jul, 2022 12:03
OS without mutexes and semaphores is possible

Yup, it's possible. There are several ways to solve the problem, as well as constraints and complexity.
- Sequential consistency -> good for implementers and programmers, but offers low performance
- Total store ordering (TSO, as in RISC-V's Ztso extension) -> works great for programmers, hard for implementers
- The IBM POWER way or the Nvidia GPU way -> simple for implementers, hard for programmers
- Other ways ... ?
Yup, there are several memory consistency models(1), a wide range of them, and there is a big cliff with IBM and Nvidia called "multi-copy atomicity".

Anyway, the pipeline of machines like POWER10 and PowerPC is full of problems regarding synchronization; RISC-V has some interesting approaches (from the weak memory ordering of RVWMO to extensions like Zam and Ztso), all interesting, and they make life simpler for the software programmer (a nightmare for the HDL designer), but personally I am researching a "transaction memory consistency model"(2).

All of this because I prefer Mutexes and Semaphores, but also monitors if the language can lend a hand with them.

(1) wait, what is a "memory consistency model"?
Well, something that at least specifies the values that can be returned by loads during concurrency. Mutexes and Semaphores are both sensitive to this.

(2) wait, and what is a "transaction memory consistency model"?
Well, it's like (1), but it also specifies what is permitted and what is forbidden(3), the order, and how loads are presented to other units. Basically it adds metadata and time constraints to the hardware units used to implement Mutexes and Semaphores.

(3) wait, what?
An implementation can do anything it wants under the covers, as long as the load return values satisfy RVWMO, so implementations can speculate past a lot of these rules, as long as they make sure to, e.g., squash and replay whenever the violation might actually become observable

At ISA level there are two common modeling approaches: axiomatic, which defines a set of criteria to be satisfied, and operational, which defines a "golden holy abstract machine model" where some executions are forbidden unless producible when executing this model.

I am with the "operational approach" because it leaves less of a gray area and less obscure code this way.

#130 Reply
Posted by DiTBho on 10 Jul, 2022 13:37
Quote from: brucehoult on 10 Jul, 2022 13:02
Why is it not the best?

A misunderstanding: I was not talking about software, but rather about hardware (an FPGA project), for which I have assumed devices with the "same" priority level in order to simplify the implementation.

So the current approach is a very simple implementation which can potentially saturate the FIFOs in use, and some of the devices can also see significant latency.

To overcome this, I have simply over-dimensioned everything to get the job done decently.

A better approach should also consider that faster devices with long data-stream should be served with higher priority (especially if the FIFO of the other devices is empty), while slower devices can tolerate if they sometimes miss their turn.

This may require preemption, which may cause priority inversion, a bug that occurs when a high priority task is indirectly preempted by a low priority task.

So, you understand why last week I didn't implement preemption, and why I didn't implement different priorities.

I have a new approach in mind. It will certainly require effort and more circuits, and there are things to be careful about, but if done properly it will have the benefit of reducing the FPGA clock frequency and the size of the FIFOs.

#131 Reply
Posted by brucehoult on 10 Jul, 2022 14:47
Quote from: DiTBho on 10 Jul, 2022 13:37
This may require preemption, which may cause priority inversion, a bug that occurs when a high priority task is indirectly preempted by a low priority task.

So, you understand why last week I didn't implement preemption, and why I didn't implement different priorities.

I know what priority inversion is, and my previous message already explains the solution to it.

#132 Reply
Posted by snarkysparky on 10 Jul, 2022 15:28
This is hella good thread guys. Learning a lot.

One time I created two work FIFOs with different priorities. Interrupts placed tasks to be done in the proper FIFOs.

When a task finished I checked the high priority FIFO for items to process first, and if there were none then checked the normal priority FIFO.

It seemed to work fairly well.
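For readers who want to see the shape of this, a minimal sketch in C follows. Everything in it is an assumption rather than the poster's code: the names `job`, `job_fifo`, `post_high`, `post_normal` and `dispatch_one` are hypothetical, the ring buffers are single-producer and single-consumer, and a real interrupt-driven build would need `volatile` or atomic indices and memory barriers suited to the target. The small `trace_job` helper exists only so the ordering can be checked on a workstation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SIZE 8u   /* power of two, so the free-running indices wrap cleanly */

typedef void (*job_fn)(void *arg);
typedef struct { job_fn fn; void *arg; } job;

typedef struct {
    job      items[FIFO_SIZE];
    unsigned head;   /* next slot to read, advanced by the dispatcher */
    unsigned tail;   /* next slot to write, advanced by the producer (ISR) */
} job_fifo;

static job_fifo high_fifo, normal_fifo;

static bool fifo_put(job_fifo *f, job j) {
    if (f->tail - f->head == FIFO_SIZE) return false;   /* full */
    f->items[f->tail % FIFO_SIZE] = j;
    f->tail++;
    return true;
}

static bool fifo_get(job_fifo *f, job *out) {
    if (f->tail == f->head) return false;               /* empty */
    *out = f->items[f->head % FIFO_SIZE];
    f->head++;
    return true;
}

/* Called from interrupt handlers (or from a test harness pretending to be one). */
bool post_high(job_fn fn, void *arg)   { return fifo_put(&high_fifo,   (job){fn, arg}); }
bool post_normal(job_fn fn, void *arg) { return fifo_put(&normal_fifo, (job){fn, arg}); }

/* Queue depths, useful for monitoring. */
unsigned pending_high(void)   { return high_fifo.tail - high_fifo.head; }
unsigned pending_normal(void) { return normal_fifo.tail - normal_fifo.head; }

/* Run one job to completion: high priority first, then normal.
   Returns false when both queues are empty. */
bool dispatch_one(void) {
    job j;
    if (!fifo_get(&high_fifo, &j) && !fifo_get(&normal_fifo, &j)) return false;
    assert(j.fn != NULL);
    j.fn(j.arg);
    return true;
}

/* Host-side demo job: appends one character to a trace string. */
static char   trace[16];
static size_t trace_len;
void trace_job(void *arg) {
    if (trace_len + 1 < sizeof trace) trace[trace_len++] = *(const char *)arg;
}
const char *trace_get(void) { return trace; }
```

The main loop is then just `while (dispatch_one()) {}` followed by a sleep or wait-for-interrupt. Because each job runs to completion, a high priority job still waits for the job currently running, which is the price of having no preemption.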

#133 Reply
Posted by tggzzz on 10 Jul, 2022 17:10
Quote from: snarkysparky on 10 Jul, 2022 15:28
This is hella good thread guys. Learning a lot.

One time I created two work FIFOs with different priorities. Interrupts placed tasks to be done in the proper FIFOs.

When a task finished I checked the high priority FIFO for items to process first, and if there were none then checked the normal priority FIFO.

It seemed to work fairly well.

That can indeed be good, for several reasons. However, beginners might confuse your "task" with an RTOS task; I prefer to call what's in a FIFO a "Job" or "JobFragment".

It can expand beyond interrupts enqueuing items in FIFOs.

Monitoring and management is greatly simplified by simply observing the number of items in each queue.

Your execution engine can run at a single priority, thus simplifying prediction and analysis.

It makes minimal presumptions about what facilities are available in the environment (hardware and software), so portability is easier.

You can emulate and test the system in another environment, e.g. a workstation with excellent debugging facilities. All you have to do is create something that pretends to be an interrupt, and puts things in a FIFO.

#134 Reply
Posted by SiliconWizard on 10 Jul, 2022 18:50
Without mutexes? Sure, never share any memory or other resources between tasks. Without semaphores? Uh, sure, if you have other means of synchronizing tasks, or if you don't even need to synchronize them (which is pretty rare.)

Both can be "solved" using message passing only. Not that this is necessarily the most efficient in all use cases, but it works.

#135 Reply
Posted by tggzzz on 10 Jul, 2022 19:59
Quote from: SiliconWizard on 10 Jul, 2022 18:50
Without mutexes? Sure, never share any memory or other resources between tasks. Without semaphores? Uh, sure, if you have other means of synchronizing tasks, or if you don't even need to synchronize them (which is pretty rare.)

Both can be "solved" using message passing only. Not that this is necessarily the most efficient in all use cases, but it works.

Naked mutexes and shared memory are a fruitful source of errors as applications and the programming team scale and mutate over the years.

A lot has been learned about good design patterns for real time and distributed[1] programming over the decades. While the GoF patterns seem to be taught in university courses, the real time and distributed patterns aren't. Curiously Doug Lea's selection and implementation of them is useful. Unfortunately typical embedded programmers are so hidebound they will reject them out of hand, without understanding them.

Message passing has many advantages. I believe it is the High Performance Computing mob's preferred mechanism.

[1] Most real time applications are distributed, even if one project is implementing functionality on one computer.

#136 Reply
Posted by SiliconWizard on 10 Jul, 2022 20:24
I made a whole thread about message passing a while ago. As I remember, it was an interesting discussion but got quite some "resistance" and a few misconceptions.
I am increasingly resorting to message passing schemes these days. They are much, much easier to get right. Whether they yield better performance than other approaches all depends on a number of factors though, and without at least some hardware support, message passing can be inefficient (in terms of throughput/latency.)

Interestingly, out of security reasons, there's a trend with large applications to distribute the work over a number of *processes* (rather than just threads). In which case, communicating between them requires some form of IPC, so that's often close to message passing.

On general-purpose OSs, processes tend to be a bit heavy though. So there would surely be some benefit for intermediate execution units, some kind of lightweight processes. Some languages (usually through a runtime) and some particular OSs do offer that, but that's still relatively rare.

#137 Reply
Posted by westfw on 10 Jul, 2022 21:15
[rp2040 MBed micros() code]

Quote
It would be useful to :
- NOT compile with -O0. There is far too much stuff that should simply be inlined, and too much memory traffic.
I'm pretty sure that this was NOT compiled with -O0. But MBed is "modular", so that the "time" modules and the "ticker" modules and the "critical section" stuff are in separate areas (the critical Section stuff is done via a C++ constructor - are those ever in-lined?) And it's all distributed as pre-compiled library, so "link time optimization" doesn't help.

Quote
BUT WHY ON EARTH DO MANUFACTURERS INSIST ON WRITING HAL CODE SO INEFFICIENTLY?
There is no "manufacturer" code in there. (I guess unless you can blame NXP for some of the original MBed architecture.) I can't even see where actual rp2040 timer registers are accessed. The rp2040 SDK (from RPi) provides a nice time_us_64() function that is about as efficient as I could ask for (and a time_us_32() function, too.)

It's all MBed and its usage. That's why I brought it up in this RTOS-related discussion. An RTOS aimed at wide distribution will be designed for "portability" - "just change these few bits of code to support your particular chip and it'll all be good!" "Just use this abstracted time API and your application will run on any chip that supports MBed!"
That's a danger of "an RTOS" - that somewhere someone has been a bit lazy on thinking things through, and it ends up shooting you in the foot because you weren't expecting it to be 'so bad'; you use an RTOS to avoid certain problems, and you end up with new problems... You have to pay attention.

#138 Reply
Posted by tggzzz on 10 Jul, 2022 21:38
Quote from: SiliconWizard on 10 Jul, 2022 20:24
I made a whole thread about message passing a while ago. As I remember, it was an interesting discussion but got quite some "resistance" and a few misconceptions.
I am increasingly resorting to message passing schemes these days. They are much, much easier to get right.

Yes. In spades. That should be Writ Large, in Bold Font.

Quote
Whether they yield better performance than other approaches all depends on a number of factors though, and without at least some hardware support, message passing can be inefficient (in terms of throughput/latency.)

Agreed, again.

The classic tradeoff is between message size and processing duration. There is a tendency that if you want to maximise the processing per message, then you have to convey more context in the message, i.e. take more time sending/receiving the message.

Quote
Interestingly, out of security reasons, there's a trend with large applications to distribute the work over a number of *processes* (rather than just threads). In which case, communicating between them requires some form of IPC, so that's often close to message passing.

On general-purpose OSs, processes tend to be a bit heavy though. So there would surely be some benefit for intermediate execution units, some kind of lightweight processes. Some languages (usually through a runtime) and some particular OSs do offer that, but that's still relatively rare.

Again, yes in spades. This is becoming embarrassing.

Message passing works well at many scales, and enables easy scalability without changing the application architecture - all the way from single core processors, through multicore processors, through many processors in the same room, to processors distributed in the cloud and owned by more than one company.

All you have to do is specify an API in terms of the message contents and sequences of messages. That's how telecoms systems work, and they are arguably the biggest computing systems mankind uses.
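A toy sketch of what "an API in terms of message contents and sequences" can look like in C is given below. The protocol is invented for illustration (open, then any number of data messages, then close), and `message`, `session` and `session_handle` are hypothetical names. The point is only that the receiver is a small state machine which accepts a message when it is legal in the current state and counts a protocol error otherwise.

```c
#include <assert.h>
#include <stdint.h>

/* Message contents: a type tag plus a bounded payload. */
typedef enum { MSG_OPEN, MSG_DATA, MSG_CLOSE } msg_type;

typedef struct {
    msg_type type;
    uint16_t len;          /* valid payload bytes, meaningful for MSG_DATA */
    uint8_t  payload[32];
} message;

/* Legal sequences: OPEN, then zero or more DATA, then CLOSE. */
typedef enum { ST_IDLE, ST_OPEN } session_state;

typedef struct {
    session_state state;
    uint32_t bytes_received;
    uint32_t protocol_errors;
} session;

/* Handle one message. Returns 0 if it was legal in the current state,
   -1 if it violated the sequence or carried an oversized length. */
int session_handle(session *s, const message *m) {
    assert(s != 0 && m != 0);
    switch (s->state) {
    case ST_IDLE:
        if (m->type == MSG_OPEN) { s->state = ST_OPEN; return 0; }
        break;
    case ST_OPEN:
        if (m->type == MSG_DATA && m->len <= sizeof m->payload) {
            s->bytes_received += m->len;
            return 0;
        }
        if (m->type == MSG_CLOSE) { s->state = ST_IDLE; return 0; }
        break;
    }
    s->protocol_errors++;
    return -1;
}
```

Nothing in `session_handle` cares whether the message arrived through a FIFO on the same core, a mailbox between cores, or a network socket, which is what lets the same architecture scale without changes.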

#139 Reply
Posted by tellurium on 10 Jul, 2022 22:48
Quote from: tggzzz on 10 Jul, 2022 21:38
All you have to do is specify an API in terms of the message contents and sequences of messages. That's how telecoms systems work, and they are arguably the biggest computing systems mankind uses.

I don't know about the telco system complexity, but I have some idea about the modern search/ads systems, which are incredibly complex and big, with tens of thousands of engineers working on them non-stop. I always thought that those are the most complex/big computing systems in the world. Maybe telco systems are even bigger and more complex.

#140 Reply
Posted by westfw on 10 Jul, 2022 23:41
Consider:
Code:
void show_connections() {
    for (conn = connection_ll_head; conn; conn = conn->next) {  // Traverse our linkedlist
        printf("Name: %s, dest: %s, prot %s\n", conn->name, conn->dest, conn->protocol);
        printf(" DataIn: %d, DataOut %d", conn->incount, conn->outcount);
    }
}
It doesn't look very dangerous, does it? But it is! Worse, "how to fix" is unclear - presumably actually maintaining the list of connections is more important than displaying it, so you don't really want to lock either the list or the individual connection data structures.
It's dangerous even on a cooperative multi-tasking system, because printf() is a thing that is likely to block.

#141 Reply
Posted by brucehoult on 11 Jul, 2022 00:07
Quote from: westfw on 10 Jul, 2022 23:41
Consider:
Code:
void show_connections() {
    for (conn = connection_ll_head; conn; conn = conn->next) {  // Traverse our linkedlist
        printf("Name: %s, dest: %s, prot %s\n", conn->name, conn->dest, conn->protocol);
        printf(" DataIn: %d, DataOut %d", conn->incount, conn->outcount);
    }
}
It doesn't look very dangerous, does it? But it is! Worse, "how to fix" is unclear - presumably actually maintaining the list of connections is more important than displaying it, so you don't really want to lock either the list or the individual connection data structures.
It's dangerous even on a cooperative multi-tasking system, because printf() is a thing that is likely to block.

Lock the list, traverse it while only copying each conn into a local list/array, then unlock it.

Traverse your local copy, checking each conn to see if it is still valid, and if so then print it.

This is a perfect example of why you want to use automatic garbage collection, whether a tracing one or reference-counting, because you don't know who will have the last reference to a conn.
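A minimal sketch of the lock, copy, unlock, then print pattern in C follows. It deliberately swaps in a simpler variant: each connection is copied by value into a local array rather than by pointer, which avoids the "is this conn still valid?" check and the need for garbage collection, at the cost of stack space and a snapshot that may already be stale when printed. `list_lock` and `list_unlock` are stand-ins for whatever mutex or critical section the system provides, and the struct fields beyond those in the original snippet are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct connection {
    struct connection *next;
    char name[16];
    char dest[16];
    int  incount, outcount;
} connection;

static connection *connection_ll_head;

/* Stand-ins for a real mutex or critical section (assumption). */
static int list_locked;
static void list_lock(void)   { assert(!list_locked); list_locked = 1; }
static void list_unlock(void) { assert(list_locked);  list_locked = 0; }

/* Insert at the head of the shared list, under the lock. */
void connection_add(connection *c) {
    list_lock();
    c->next = connection_ll_head;
    connection_ll_head = c;
    list_unlock();
}

/* Copy up to max connections by value while holding the lock.
   Nothing in here blocks, so the lock is held only briefly. */
size_t snapshot_connections(connection *out, size_t max) {
    size_t n = 0;
    list_lock();
    for (const connection *c = connection_ll_head; c && n < max; c = c->next) {
        out[n] = *c;
        out[n].next = NULL;   /* the copy must not point into the live list */
        n++;
    }
    list_unlock();
    return n;
}

/* Print from the private copy. printf() may block here, but the
   shared list is no longer locked while it does. */
size_t show_connections(void) {
    connection snap[8];
    size_t n = snapshot_connections(snap, 8);
    for (size_t i = 0; i < n; i++) {
        printf("Name: %s, dest: %s\n", snap[i].name, snap[i].dest);
        printf(" DataIn: %d, DataOut %d\n", snap[i].incount, snap[i].outcount);
    }
    return n;
}
```

The fixed array of 8 is an arbitrary limit for the sketch; a longer list would be truncated, so a real version would size the snapshot from a maintained count or print in batches.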

I'm curious what the Rustacean way to do this is (or that of any other static ownership cult).

#142 Reply
Posted by tellurium on 11 Aug, 2022 13:26
Quote from: tellurium on 08 Jul, 2022 10:47
I'd like to see a simple project, e.g. blinky with UART based control (e.g. to change blink intervals, blink counts, or pwm), implemented in 3 paradigms:
1. superloop
2. os
3. ISRs with priorities

I actually took a stab and decided to do it, using some micro I have no prior experience with. Glanced over Mouser for an inexpensive devboard with built-in debugger, and chose an XMC 2 Go devboard, https://www.mouser.ie/ProductDetail/Infineon-Technologies/KIT_XMC_2GO_XMC1100_V1?qs=sGAEpiMZZMuqBwn8WqcFUm6zcOGmCo2HUOc17QP4cY7MLfWzpXwh5g%3D%3D by Infineon. It is a nice tiny board with 2 LEDs and a built-in debugger.

So far so good. I have read the manual, got surprised by Infineon's "universal serial" peripheral along the way, and put together the "first approach" - superloop - implementation that just blinks an LED, using only GCC, make and Jlink. All bare metal, everything hand-written and as small as possible: https://github.com/cpq/xmc1100

Then I have noticed that the board heats up significantly... That was surprising. And after some time, an LED stopped blinking - apparently, it blew up! WTF? I can see only two things to blame - either my code gave too much stress to the board, or the LED did not have a resistor to limit the current. I can hardly believe in (2), but I cannot see anything wrong with the code either, it looks innocent.

So there I have it, the effort ended on the very start.

#143 Reply
Posted by JPortici on 11 Aug, 2022 14:54
Quote from: tellurium on 11 Aug, 2022 13:26
Then I have noticed that the board heats up significantly... That was surprising. And after some time, an LED stopped blinking - apparently, it blew up! WTF? I can see only two things to blame - either my code gave too much stress to the board, or the LED did not have a resistor to limit the current. I can hardly believe in (2), but I cannot see anything wrong with the code either, it looks innocent.

So there I have it, the effort ended on the very start.

I -allegedly- know one of the contractors that fabricates demoboards for Infineon and it wouldn't surprise me. About 10% of the boards we had produced from that contractor were defective and required rework (wrong component, wrong orientation, board defective, component not soldered reliably).

#144 Reply
Posted by peter-h on 11 Aug, 2022 14:55
Quote
BUT WHY ON EARTH DO MANUFACTURERS INSIST ON WRITING HAL CODE SO INEFFICIENTLY

Off topic, but "ST HAL" code is auto generated. To make their software a bit less horrible, they don't make it very granular; instead the generated code checks all kinds of config options (by reading back config register bits (!!), or by reading members of one of the countless structs they set up) and then jumps around a load of code.

On the wider topic of RTOS code overhead, I have frequently traced through this stuff, and there isn't much. At 168MHz it is of the order of 1-2us (FreeRTOS). There doesn't need to be much; I wrote some simple RTOSes myself many years ago. Similarly, mutexes are 1-2us (I timed some). What takes quite a bit longer is using message queues; a lot of people regard these as some kind of "RTOS purism", and the hallmark of a professional coder, and they are atomic which is nice, and they probably are delivered in a guaranteed order, but there is a lot more code. And the space for them may well be allocated from a private heap...

But the real issue here is that the software architecture needs to be appropriate for the job. Let's say you are doing high speed 3 phase motor control, and sophisticated algos too, doing ADC sampling at say 1MHz. No way will any RTOS get around fast enough for that to work. You will need a dedicated ADC subsystem, with ADC sampling driven by timer, then interrupts, and very short ISRs. Possibly, use ADC -> DMA because reading RAM is a lot faster than reading ADC registers. Then maybe have an RTOS for the background stuff (like an HTTP server for local config ).

#145 Reply
Posted by tggzzz on 11 Aug, 2022 16:00
For "crap" read slow, flaky, unmaintainable, correct, incorrect, tested, verified, validated, proven, as appropriate.

Some people and companies knowingly generate crap.
Some people and companies unknowingly generate crap.
Some people and companies don't care if they generate crap.
Some people and companies don't generate crap.

When implementing something using libraries (and to a lesser extent hardware), you need to understand
- the meaning of "crap"
- what type of people and companies created the components
When designing and implementing a component, you need to defend against things not working as you think they do.
