Author Topic: FreeRTOS performance penalty (Read 32384 times)

Kjelt · « **Reply #25 on:** September 18, 2014, 12:40:21 pm »

Quote from: dannyf on September 18, 2014, 11:01:56 am

However, it is tricky to ............."share" peripherals in an RTOS environment.

why is this more difficult than in a superloop? You can have both tasks using the same peripheral using a semaphore thus blocking when "in use" and directly starting when "free". While in a superloop you also have to take care of this with your own global flag and the next task (worst case) only gets started at the next round of the loop. Or do you mean something else?
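The superloop side of that comparison can be sketched roughly as below. This is only an illustration, assuming a single-threaded loop in which no interrupt handler touches the flag; `uart_busy`, `uart_try_acquire` and `uart_release` are made-up names, not taken from any library.

```c
#include <stdbool.h>

/* Hypothetical global "in use" flag guarding one shared peripheral. */
static bool uart_busy = false;

/* Claim the peripheral if it is free; never blocks. */
bool uart_try_acquire(void)
{
    if (uart_busy) {
        return false;   /* caller has to retry on a later pass of the loop */
    }
    uart_busy = true;
    return true;
}

/* Hand the peripheral back so another task can claim it. */
void uart_release(void)
{
    uart_busy = false;
}
```

A task that fails to acquire simply returns and tries again on the next round of the loop, which is the worst-case delay mentioned above. With an RTOS semaphore the waiting task would instead block and be made ready as soon as the peripheral is released.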

dannyf · « **Reply #26 on:** September 18, 2014, 03:40:09 pm »

I was more thinking about managing the requests via a queue and dealing with the overflow / underflow of that queue.
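For illustration, a request queue that reports overflow and underflow explicitly could look something like this. It is a sketch under assumptions: the depth of 4, the one-byte request type and all names (`req_queue_t`, `req_push`, `req_pop`) are hypothetical, and nothing here is protected against concurrent access.

```c
#include <stdbool.h>
#include <stdint.h>

#define REQ_QUEUE_LEN 4u   /* hypothetical queue depth */

typedef struct {
    uint8_t items[REQ_QUEUE_LEN];
    uint8_t head;    /* index of the oldest queued request */
    uint8_t count;   /* number of requests currently queued */
} req_queue_t;

/* Returns false on overflow; the new request is rejected, nothing is overwritten. */
bool req_push(req_queue_t *q, uint8_t req)
{
    if (q->count == REQ_QUEUE_LEN) {
        return false;
    }
    q->items[(q->head + q->count) % REQ_QUEUE_LEN] = req;
    q->count++;
    return true;
}

/* Returns false on underflow; *out is left untouched. */
bool req_pop(req_queue_t *q, uint8_t *out)
{
    if (q->count == 0u) {
        return false;
    }
    *out = q->items[q->head];
    q->head = (uint8_t)((q->head + 1u) % REQ_QUEUE_LEN);
    q->count--;
    return true;
}
```

What the caller does with a rejected push (drop, retry, block) is the policy question; the queue itself only reports the condition.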

As to applying (Free)RTOS to a low-spec mcu (like an 8-bitter): if the context switching takes away 300 instructions per switch, and you switch at 1ms intervals, that would be 1000 * 300 instructions per second = 0.3MIPS on switching alone.

You probably need 3MIPS+ to make it less noticeable -> thus my 4MIPS figure provided earlier.

Obviously, you can greatly improve the efficiency by switching less frequently, like every 10ms. However, that may not be fast enough for some "real time" applications.

Fortunately, most modern MCUs run faster than 4MIPS. It is flash / sram space that is more constraining.
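The arithmetic in this post can be written out as a small helper. This is a sketch, assuming exactly one context switch per tick; the function names are made up for the example.

```c
#include <stdint.h>

/* Instructions per second spent purely on context switches. */
uint32_t switch_cost_ips(uint32_t instr_per_switch, uint32_t switch_period_us)
{
    uint32_t switches_per_s = 1000000u / switch_period_us;
    return instr_per_switch * switches_per_s;
}

/* The same cost in tenths of a percent of the CPU's instruction rate. */
uint32_t switch_cost_permille(uint32_t instr_per_switch,
                              uint32_t switch_period_us,
                              uint32_t cpu_ips)
{
    uint64_t cost = switch_cost_ips(instr_per_switch, switch_period_us);
    return (uint32_t)((cost * 1000u) / cpu_ips);
}
```

With 300 instructions per switch and a 1ms period this gives 300,000 instructions per second (0.3MIPS), or 75 per thousand of a 4MIPS part. Stretching the period to 10ms cuts it to 30,000 instructions per second.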

tggzzz · « **Reply #27 on:** September 18, 2014, 04:13:24 pm »

Quote from: Kjelt on September 18, 2014, 12:40:21 pm

Quote from: dannyf on September 18, 2014, 11:01:56 am
However, it is tricky to ............."share" peripherals in an RTOS environment.
why is this more difficult than in a superloop? You can have both tasks using the same peripheral using a semaphore thus blocking when "in use" and directly starting when "free". While in a superloop you also have to take care of this with your own global flag and the next task (worst case) only gets started at the next round of the loop. Or do you mean something else?

For any small-scale simple microbenchmark or microapplication, a "superloop" is probably the best thing. I've used them myself.

Once the microapplication grows organically over time to become a milliapplication, then the supervisory logic tends to grow like topsy. Controlling that mess often requires something equivalent to a very simple RTOS, so the "designer" reinvents the wheel; unfortunately it is usually an elliptical wheel.

This is the embedded equivalent of "any sufficiently complicated C or Fortran program contains an ad hoc informally-specified bug-ridden slow implementation of half of Common Lisp."

miguelvp · « **Reply #28 on:** September 18, 2014, 10:38:17 pm »

Quote from: dannyf on September 18, 2014, 11:01:56 am

Quite a few posters got the numbers right, some closer than others.

So here are the numbers:

Quote
1) What's the incremental flash usage?

As you would expect, the precise number depends on compiler setting, kernel configuration and chips used.

On PIC24F, it goes from 5KB to 8KB (instructions only). On CM3, it goes from 4.5KB to 6KB.

The size of total compiled flash space, with a reasonably sized stack, goes from 10KB to 30KB, however.

Quote
2) What's the incremental (static) ram usage?

This varies greatly. Under the most basic heap management strategy (=no release of ram space from terminated tasks), the smallest is a few hundred bytes + heaps, up to a few thousand bytes - most of it in the heaps you configure for the tasks.

Quote
3) What's the frequency of the flip, as a percentage vs. that of the naked flip?

On an 8MHz PIC24F, the frequency of the naked flip is about 400kHz. Under FreeRTOS, it is about 394kHz -> high 98%. That number dips as you slow down the mcu, to about low 98%. So the switching takes about 80 - 120 instructions per switch. That's roughly in the ballpark of figures I have seen: 150 - 300 instructions, shorter for 32-bit chips and longer for 8-bit chips.

Quote from: miguelvp on September 17, 2014, 02:02:42 am

1) 5KB
2) 256 bytes
3) 100%

But I'm just guessing since I know nothing of the naked pic24f nor fritos (I'm again hungry for a snack)

What did I win? A bag of Fritos?
I'm still craving them.

dannyf · « **Reply #29 on:** September 18, 2014, 10:53:06 pm »

Did a little more testing on a STM32F100RB running at 24Mhz. Standard peripheral library 3.5 used. 7 tasks running, each flipping the same led.

GCC compiler, under different flags.

Code: [Select]

Optimization      O0       O1       O2       O3       Os
Freq_rtos         846.8    1320     1321     1321     1321
Freq_naked        856.9    1333     1333     1333     1333
Ticks on switch   283      234      216      216      216
Code size         7912     5112     5016     5316     3984

Observations:
1) switching cost pretty high, for a 32-bit chip.
2) Lots of potential for code reduction.
3) Not much to be gained beyond O1.
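The posts do not spell out how "Ticks on switch" was derived. One reading that reproduces the figures in the table above is to take the fraction of flips lost under the RTOS and spread it over the context switches per second. The sketch below assumes a 1kHz tick (one switch per millisecond), which is an assumption and not something stated in this post; the function name is made up.

```c
/* Clock ticks lost per context switch, rounded to the nearest tick.
   freq_rtos and freq_naked only need to share a unit (kHz here). */
long ticks_per_switch(double freq_rtos, double freq_naked,
                      double cpu_hz, double switches_per_s)
{
    double lost_fraction = 1.0 - freq_rtos / freq_naked;
    return (long)(lost_fraction * cpu_hz / switches_per_s + 0.5);
}
```

At 24MHz and 1000 switches per second, 846.8 vs. 856.9 gives 283 ticks, 1320 vs. 1333 gives 234, and 1321 vs. 1333 gives 216, matching the O0, O1 and O2 columns.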

dannyf · « **Reply #30 on:** September 18, 2014, 10:55:17 pm »

Note: Freq_rtos is the frequency of led flipping under FreeRTOS; Freq_naked is the frequency of led flipping without FreeRTOS, all in KHz.

RichardBarry · « **Reply #31 on:** September 19, 2014, 08:38:29 am »

I can't help but feel this thread is missing the whole point of an RTOS. It is always advisable to choose the right tool for the right job - and flipping an LED on a PIC24 is not the right place for an RTOS. In a semi complex system, especially with multiple communication interfaces, the performance penalty of an RTOS is *negative*, that is, you will get *much* more power out of your CPU and be able to include *much* more functionality by using an RTOS.

Why? Because when you don't use an RTOS you will have to poll all your interfaces. Polling consumes CPU time for no purpose. You can of course sit in a loop waiting for interrupts, but the same applies, and you get horrible inter-dependencies between different pieces of functionality. When you use an RTOS you can be completely event driven. CPU time is only used when there is actually something to do, and spare time is spent in the idle task. You can use the idle task as an automatic way of minimising power consumption - or use what was idle time to add additional functionality (or just use a smaller cheaper chip).

There are also lots of other reasons, the most important of which are related to maintainability - but really there are sooooooo many discussions of this on the interweb.

Also note when using a GCC based compiler, and some others, most of the code size actually comes from the libraries, not FreeRTOS.

http://www.freertos.org/FAQWhat.html#WhyUseRTOS

dannyf · « **Reply #32 on:** September 19, 2014, 09:17:55 am »

The last readings were done with a 5 digit frequency meter. So some rounding on the last digit may impact the measurements.

Now, I put an 8-digit frequency meter on it to get down to 1Hz resolution.

gcc vs. mdk, FreeRTOS 8.1.2, with the same configuration file + test file used. The chip is the same - a 24MHz STM32F100RB.

Code: [Select]

GCC               O0       O1       O2       O3       Os
Freq_rtos         847      1,321    1,322    1,322    1,322
Freq_naked        857      1,333    1,333    1,333    1,333
Efficiency        98.8%    99.1%    99.1%    99.2%    99.1%
Ticks on switch   283      225      206      204      212
Size              7912     5112     5016     5316     3984

Keil MDK          O0       O1       O2       O3       O3(time)
Freq_rtos         918      1,327    1,327    1,327    1,327
Freq_naked        926      1,338    1,338    1,338    1,338
Efficiency        99.1%    99.2%    99.2%    99.2%    99.2%
Ticks on switch   219      202      195      195      194
Size              3988     3304     3188     3188     3408

Quick observations:

1) the ticks spent on a context switch are fairly consistent, in the low 200s. That is on the high end of the numbers I have seen, as published by the vendors.

A side note: Keil published a 187-tick figure for context switching on a LPC1768, as a maximum figure for RTX.

2) gcc held its own well, in terms of speed. Minimal difference between the two at various compiler settings.

3) Keil does a better job at producing smaller code.

I will try to see what numbers I can get out of RTX/CMSIS-OS, when I get more time.

dannyf · « **Reply #33 on:** September 19, 2014, 09:29:26 am »

Quote

flipping an LED on a PIC24 is not the right place for an RTOS.

I would argue otherwise.

On flipping an led: The goal here is to see how much time is wasted in switching between tasks.

You have two extremes here:
1) each task takes its allotted time fully, so switching cost is minimum. Flipping an led here simulates that situation. From the mcu's and OS's point of view, it doesn't matter if it is flipping an led, doing some math, or idling around, its processing power is going somewhere. Flipping an led here provides a convenient way to measure the processing power dedicated to running those jobs -> the led isn't flipped when the mcu is busy switching tasks.

2) each task takes minimum time and the mcu spends more of its time switching between jobs. Polling for buttons would fall into this category. Switching cost is maximized here -> ie., this is the least efficient way for the mcu.

Of the two, I would argue that reality is closer to 1) than 2).

As to PIC24F, the right chip to run RTOS is the chip that the programmer decides to use for a given task. It may not be the best chip to run a given RTOS. It may not be the best chip from which one can infer the RTOS's performance on other chips.

However, as the tests so far have shown, the pattern of performance carries nicely from the PIC24F test to the STM32F100 test: two vastly different chips, almost identical performance - one is in the mid-high 98% and another in the low 99%.

The PIC24F did not exhibit an ***identical*** performance to the CM3; however, it did exhibit an ***indicative / comparable*** performance to the CM3.

Hope it helps.

Precipice · « **Reply #34 on:** September 19, 2014, 09:42:52 am »

Quote from: dannyf on September 19, 2014, 09:29:26 am

Of the two, I would argue that reality is closer to 1) than 2).

Hmm, unconvinced. Micros I come in contact with tend to spend (wild guess) less than 1% of their time doing stuff. Often far, far less. On my desk is a motor controller that runs hard for 10 seconds at startup, then tends to idle for a month. Of course, if I polled for button presses rather than sleeping and waiting for an edge interrupt, it would be the other way round.
The busiest micros I think I deal with are decoding video, and even then, they've usually got a lot of slack (>50%) because most frames are easier than the hardest frames that the CPU has to be sized to handle.

dannyf · « **Reply #35 on:** September 19, 2014, 10:07:06 am »

FreeRTOS vs. RTX:

Now for a comparison between FreeRTOS and RTX. The same hardware (a 24MHz STM32F100RB) and the same toolchain (Keil MDK, same compiler settings) are used. The only difference here is the RTOS used, and the configuration - largely comparable but not identical, because of the way they are set up.

Code: [Select]

Keil MDK / FreeRTOS   O0       O1       O2       O3       O3(time)
Freq_rtos             918      1,327    1,327    1,327    1,327
Freq_naked            926      1,338    1,338    1,338    1,338
Efficiency            99.1%    99.2%    99.2%    99.2%    99.2%
Ticks on switch       219      202      195      195      194
Size                  3988     3304     3188     3188     3408

Keil MDK / RTX        O0       O1       O2       O3       O3(time)
Freq_rtos             918      1,326    1,326    1,326    1,326
Freq_naked            926      1,338    1,338    1,338    1,338
Efficiency            99.1%    99.1%    99.1%    99.1%    99.1%
Ticks on switch       217      218      219      217      220
Size                  4756     4456     4424     4428     4428

Quick observations:

1) the performance is largely comparable. Both are low 99% efficient, and flip the led at roughly the same frequency.
2) FreeRTOS has slightly lower switching cost, and slightly smaller footprint.
3) both are comparably simple to setup and to use.

It is a toss-up. FreeRTOS offers better portability across toolchains / chips. RTX is tied to Keil's offerings, but comes with better support, at a monetary cost.

dannyf · « **Reply #36 on:** September 19, 2014, 10:58:42 am »

Quote

Keil published a 187-tick for context switching on a LPC1768, as a maximum figure for RTX.

I didn't test an lpc1768 but the figure for STM32F1 (~220 switching cost) is roughly comparable to the 187 official figure.

dannyf · « **Reply #37 on:** September 19, 2014, 11:26:49 am »

CoOS vs. FreeRTOS:

CoIDE has its own rtos, CoOS. Fairly easy to use (1-click away from inclusion into your project).

Identical project (aside from the rtoses used), identical compiler / flags.

Now, the numbers:

Code: [Select]

GCC / CoOS        O0       O1       O2       O3       Os
Freq_rtos         847      1,321    1,321    1,322    1,322
Freq_naked        857      1,333    1,333    1,333    1,333
Efficiency        98.8%    99.1%    99.1%    99.2%    99.1%
Ticks on switch   287      220      220      202      206
Size              11564    7228     7256     7564     5432

GCC / FreeRTOS    O0       O1       O2       O3       Os
Freq_rtos         847      1,321    1,322    1,322    1,322
Freq_naked        857      1,333    1,333    1,333    1,333
Efficiency        98.8%    99.1%    99.1%    99.2%    99.1%
Ticks on switch   283      225      206      204      212
Size              7912     5112     5016     5316     3984

Quick observations:

1) practically identical performance.
2) CoOS takes considerably more space: the comparison isn't exactly fair there for CoOS - it uses fixed ram space for stacks for each individual task.

Difficult to justify using CoOS because of the limited support for chips and cross-toolchains, vs. what it offers over FreeRTOS.

dannyf · « **Reply #38 on:** September 19, 2014, 11:30:49 am »

Quote

CPU time is only used when there is actually something to do, and spare time is spent in the idle task.

That to me is a spin: the "spare time" is cpu time too. It is either spent on doing something (aka tasks) or doing nothing (aka idle task in RTOS or looping around in a naked environment), identical to what the cpu is doing without an RTOS.

Quote

You can use the idle task as an automatic way of minimising power consumption - or use what was idle time to add additional functionality (or just use a smaller cheaper chip).

That can be easily and in my view more simply implemented in a non-RTOS environment.

And in a non-RTOS environment, it is much easier to go down to a much smaller / lower spec'd chip.

dannyf · « **Reply #39 on:** September 19, 2014, 11:35:33 am »

On switching cost:

CooCox quoted a 1.5us/72Mhz number. That translates into 108 ticks, vs. ~200 ticks measured.

FreeRTOS quoted a 84 ticks number, vs. ~200 ticks measured.

Keil quoted a 1.6us/72Mhz number (115 ticks), vs. ~200 ticks measured.

Somehow, the 200-tick figure is pretty good.
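The conversion from a quoted time at a given clock to ticks is just time multiplied by frequency. A minimal helper, with a made-up name and working in nanoseconds and MHz to stay in integers:

```c
#include <stdint.h>

/* Convert a duration in nanoseconds to whole clock ticks at cpu_mhz (truncating). */
uint32_t ns_to_ticks(uint32_t duration_ns, uint32_t cpu_mhz)
{
    return (uint32_t)(((uint64_t)duration_ns * cpu_mhz) / 1000u);
}
```

1.5us at 72MHz comes out at 108 ticks and 1.6us at 72MHz at 115 ticks (115.2 before truncation), the figures used above.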

tggzzz · « **Reply #40 on:** September 19, 2014, 11:59:37 am »

Quote from: dannyf on September 19, 2014, 11:30:49 am

Quote
You can use the idle task as an automatic way of minimising power consumption - or use what was idle time to add additional functionality (or just use a smaller cheaper chip).

That can be easily and in my view more simply implemented in a non-RTOS environment.

That depends on your application, its architectural patterns, and the way in which it has been implemented. The point can be argued either way.

Of course, if you said "I could implement it more easily on my systems", I wouldn't argue.

Quote

And in a non-RTOS environment, it is much easier to go down to a much smaller / lower spec'd chip.

Only if either the chip was grossly overspecified or if the application was so poorly implemented that it spent too much of its time "inside" the RTOS.

Don't be too dogmatic!

dannyf · « **Reply #41 on:** September 19, 2014, 12:11:47 pm »

Quote

What did I win?

You get to feel good about your getting it right.

dannyf · « **Reply #42 on:** September 19, 2014, 12:36:11 pm »

Got CoOS to work on STM32F030F.

Practically no ram left,

But it does blink a pin merrily.

mikerj · « **Reply #43 on:** September 20, 2014, 09:58:15 am »

Are you simply toggling one of the ten LEDs as fast as it will go in each task during its allocated 1ms time slice, i.e. so each LED only toggles for 1ms at a time? Does your non-RTOS code do exactly the same thing?

If so then comparing the toggle frequency is a little pointless; it should be obvious they will either be the same, or at least can be made the same with some optimisation. The actual loop doing the toggling shouldn't need to differ unless you are adding extra functionality, and whilst a task is executing in its time slice the RTOS takes no CPU overhead (provided the task doesn't call an RTOS API function).

Aside from memory overhead, the task switching is really the only relevant performance parameter here i.e. how much time is lost whilst no LEDs are toggling. If this was an important parameter then using an RTOS for such a trivial application would be daft.

Why are you completely avoiding replying to any questions or criticism regarding the implementation or relevance of this test?

dannyf · « **Reply #44 on:** September 20, 2014, 11:36:25 am »

"How about protothreads?"

In the original discussion that started me thinking about RTOS overhead, I mentioned that depending on your definition, an OS can be as simple as a switch/case statement.

That's precisely what protothreads is: a set of switch/case statements. A basic task in protothreads basically looks like this:

Code: [Select]

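/* exit_condition_is_met() and do_something() are placeholders for user code */
extern int exit_condition_is_met(void);
extern void do_something(void);

void task0(void) {
  while (1) {
    if (exit_condition_is_met()) return;  /* hand control back to the caller */
    do_something();
  }
}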

Because of this, it has a few interesting characteristics that made the comparison here difficult (unfair to protothreads):

1) it is immensely portable: any C compiler that supports macros and switch/case would be good to go for protothread;
2) it has practically zero flash / ram footprint: one switch case and some tests is all there is.

Two issues with protothread:
1) if exit_condition isn't met, the execution moves to the real task and there is no mechanism to "interrupt", or "switch away" from that task. If you have a long task, you will not exit it until its execution has ended. For an application with lots of disparately long/short tasks, that's bad.
2) if you have a very short task, then the exit_condition is frequently tested and the mcu spends more of its time, percentage-wise, determining if exit condition has been met.

That (a very short task) is unfortunately where we are. Blinking an led / flipping a pin doesn't take much time. So for each flip, you have to test the exit condition. That's time ***wasted***.
*** unlike in a real OS where time is wasted switching context / jobs, protothreads would have very low cost switching in between tasks -> just the overhead of exiting the previous task and calling the next one. The waste is generated within each thread testing the exit condition.

To conclude, if you are to run the same test comparing protothreads vs. other OSs mentioned earlier, you would expect protothreads to have a low memory footprint, but a big performance hit -> entirely due to how this test is set up.

When I get some time, I will see if I can set it up on a chip and run some numbers.
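For readers who have not seen the switch/case trick, here is a rough sketch of the idea. The macros below are hypothetical and written only for this illustration; they are not the macros from the protothreads library, although they lean on the same `__LINE__`-as-case-label technique.

```c
/* Hypothetical mini-coroutine macros showing the switch/case trick.
   These are NOT the protothreads library's own macros. */
typedef struct { unsigned short line; } co_t;

#define CO_BEGIN(co)  switch ((co)->line) { case 0:
#define CO_WAIT_UNTIL(co, cond)                 \
    do {                                        \
        (co)->line = __LINE__; case __LINE__:   \
        if (!(cond)) return 0;                  \
    } while (0)
#define CO_END(co)    } (co)->line = 0; return 1

int led = 0;     /* stands in for a GPIO pin */
int ticks = 0;   /* stands in for a timer that the caller advances */

/* Returns 0 while waiting, 1 when finished (this task never finishes). */
int blink_task(co_t *co)
{
    CO_BEGIN(co);
    for (;;) {
        CO_WAIT_UNTIL(co, ticks >= 2);  /* condition is re-tested on every call */
        led ^= 1;
        ticks = 0;
    }
    CO_END(co);
}
```

Each call to `blink_task` jumps straight to the saved `case` label and re-tests the condition, which is the per-call cost described above. Local variables do not survive a return from inside the switch, which is why the state here lives in globals.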

tggzzz · « **Reply #45 on:** September 20, 2014, 03:14:56 pm »

Quote from: mikerj on September 20, 2014, 09:58:15 am

Why are you completely avoiding replying to any questions or criticism regarding the implementation or relevance of this test?

Quite. That's why I stopped actively contributing to this thread - I felt the conversation was unidirectional, and the phrase "there's none so deaf as thems won't hear" sprang to mind.

Maybe I'm too pessimistic, but we'll see.

miguelvp · « **Reply #46 on:** September 20, 2014, 05:25:06 pm »

An RTOS has its place for very complex systems that need to be scheduled to meet certain time constraints. OS-9 was the best out there, even better than VxWorks or eCos, but it seems they have been fading out over the last decade even if they did support arm processors.

Anyhow, when you have to do many decoupled tasks, that's when an RTOS will save you greatly on development time. Sure you can do a custom program that will be better performance wise, but as the system gets more complex the time to develop it will increase exponentially and debugging it will take too much time and resources.

Unless you do your own scheduler etc but then you will be running an RTOS.

The real purpose of an RTOS is time to market and development cost.

dannyf · « **Reply #47 on:** September 20, 2014, 06:16:40 pm »

As promised, here is a test between protothreads and CoOS, on the ghetto board (STM32F030F running at 24MHz). Compiler is gcc under CoIDE.

First, CoOS:

Code: [Select]

GCC / CoOS        O0       O1       O2       O3       Os
Freq_rtos         789      1,320    1,319    1,319    1,485
Freq_naked        800      1,332    1,331    1,331    1,498
Efficiency        98.6%    99.1%    99.1%    99.1%    99.1%
Ticks on switch   324      204      215      211      215
Size              7868     4920     4980     5340     4476

Generally in line with the numbers I had for STM32F100RB.

Now, the numbers for protothread:

Code: [Select]

GCC / Protothreads   O0       O1       O2       O3       Os
Freq_rtos            352      705      877      878      1,027
Freq_naked           800      1,332    1,331    1,331    1,498
Efficiency           44.0%    52.9%    65.9%    66.0%    68.5%
Ticks on switch      13,437   11,295   8,185    8,169    7,556
Size                 3084     1796     1928     1912     1636

A few things:
1) the libraries used are identical;
2) the user tasks are "largely" identical - 6 tasks blinking the same led. Because of the way protothread is configured, you have to test the conditions on each run, so the mcu is not blinking the led as fast as it could, and you can see that in the efficiency measurements.
3) The footprint of protothread is minimal, as we had expected. The difference is approximately the size of CoOS.
4) Because of structural differences, "ticks on switch" measurements make no sense for protothread. The more meaningful measurements are efficiency: how much time the mcu is actually doing your task, vs. running the OS: switching context in the case of CoOS or testing exit conditions in the case of protothreads.

In the end, I think a lightweight "OS" like protothreads has value on small devices where your tasks are quite similar in execution time. If that's indeed the case, writing your own scheduler or just sequentializing the tasks isn't a bad idea.

For larger chips, a real OS is likely to be more useful.

mikerj · « **Reply #48 on:** September 20, 2014, 11:23:37 pm »

Quote from: dannyf on September 20, 2014, 06:16:40 pm

In the end, I think a lightweight "OS" like protothreads has value on small devices where your tasks are quite similar in execution time. If that's indeed the case, writing your own scheduler or just sequentializing the tasks isn't a bad idea.

Protothreads are simply a way of implementing state machines via the C pre-processor. State machines are most useful when you spend a reasonable amount of time in any particular state if any significant processing is required e.g. the 1ms or so that a conventional RTOS might use as a time slice. If you were only toggling an LED once per state then the overhead would be quite large.

Hand crafted state machines with proper enumerated states will be more efficient than Protothreads in many situations because the state labels can be made consecutive (permitting easy implementation into a small jump table) and will also be numerically small in the majority of cases, permitting the state value to fit into an 8 bit integer, which can be a very useful saving in time and memory on 8 bit micros where you would typically consider a state machine design. The code may not look as tidy as a Protothreads implementation however.
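A hand-written state machine of the kind described might look roughly like this. It is a sketch only: the blinker, its three states and all names are invented for the example, and whether the `switch` actually becomes a jump table depends on the compiler.

```c
#include <stdint.h>

/* Consecutive, numerically small states, so the value fits in one byte. */
enum { ST_IDLE = 0, ST_LED_ON, ST_LED_OFF };

typedef struct {
    uint8_t state;      /* current state, one of the ST_* values */
    uint8_t led;        /* stands in for a GPIO pin */
    uint8_t remaining;  /* blinks still to do */
} blinker_t;

void blinker_start(blinker_t *b, uint8_t blinks)
{
    b->led = 0;
    b->remaining = blinks;
    b->state = (blinks != 0u) ? ST_LED_ON : ST_IDLE;
}

/* Run one short step; meant to be called once per pass of the main loop. */
void blinker_step(blinker_t *b)
{
    switch (b->state) {
    case ST_LED_ON:
        b->led = 1;
        b->state = ST_LED_OFF;
        break;
    case ST_LED_OFF:
        b->led = 0;
        b->remaining--;
        b->state = (b->remaining != 0u) ? ST_LED_ON : ST_IDLE;
        break;
    case ST_IDLE:
    default:
        break;
    }
}
```

Compared with line-number based case labels, the state values here are chosen by the programmer, which is what allows them to be kept dense and small.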

Obviously I don't expect any kind of response, but maybe this will help someone.

gxti · « **Reply #49 on:** September 21, 2014, 01:12:11 am »

The main benefit to CoOS is that it is permissively licensed -- you can embed it into a proprietary application without worrying about license compliance. FreeRTOS and ChibiOS use a modified GPL license that allows you to link against it, but if you modify them then you have to open-source your modifications. And either way you still need the appropriate legal boilerplate in order to comply with the license.

That said, CoOS is pretty crappy. It mostly gets the job done but the code is stringy, poorly commented, and had at least one crippling race condition bug with semaphores that they claim to have fixed but I'm not entirely sure. I switched to it from chibios due to the license, but now I'm thinking about switching back because I'm hitting some strange scheduler-related bug that may or may not be the OS's fault, and nobody wants to spend time hunting down OS bugs.

