Author Topic: FreeRTOS performance penalty (Read 32330 times)

dannyf · « **Reply #50 on:** September 21, 2014, 10:19:54 am »

Quote

nobody wants to spend time hunting down OS bugs.

That's a plus for a commercial OS.

gmb42 · « **Reply #51 on:** September 21, 2014, 10:25:54 am »

Quote from: dannyf on September 17, 2014, 01:41:46 am

Test: I am comparing two blinkies,

1) no FreeRTOS / naked, flipping PB.0 and measure the frequency of the flip;
2) with FreeRTOS, flipping PB.0 through 10 separate tasks (no messaging between them), and measure the frequency of the flip.

As others have mentioned, the nature of the test is unclear, leading to difficulties in trying to replicate the experiment. Posting code (or pseudo-code) would help.

Are the ten tasks being effectively round-robined to flip the pin on and off at each task cycle or do they just invert the current pin state?

Segger have produced a methodology (here) for measuring context switch times that may be of relevance.

Another point when benchmarking software is permission to do so. See Clause 2 in the FreeRTOS licence, which implies (to me) that the user must obtain permission to publish the results seen in this thread.

dannyf · « **Reply #52 on:** September 21, 2014, 11:07:33 am »

The basic principle is quite simple: when any of the tasks is running, it is flipping an led - how it flips does not really matter, as long as it does flip the pins; during the context switch, the mcu is not running any of the tasks so the pin is not being flipped.

Thus, if you count the number of pin flips during a given period of time (aka measuring its frequency), with and without an OS, you get to measure how much time is wasted during context switches.

Say that the pin is flipped 100K times / second without an OS; and 99K times / second with an OS. You know that you are missing 1k pulses, or 1% of a second spent in context switches, or 10ms per 1 second period.

Since your tasks run on 1ms time slices, you have switched 1000 times in 1 second. Thus your time in each context switch is 10ms / 1000 = 10us.

Pretty trivial.
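The arithmetic above can be written as a small Python sketch. The function name `context_switch_time` and its parameters are illustrative and do not come from the thread. It assumes, as the post does, that every missing flip is due to scheduler time and that exactly one context switch happens per time slice.

```python
def context_switch_time(f_bare_hz, f_rtos_hz, slice_s):
    """Estimate seconds spent per context switch from two flip rates.

    Assumes the only reason for the lower rate is scheduler time, and
    that exactly one switch happens per time slice.
    """
    lost_fraction = (f_bare_hz - f_rtos_hz) / f_bare_hz  # share of each second with no flips
    switches_per_second = 1.0 / slice_s
    return lost_fraction / switches_per_second


# Numbers from the post: 100K flips/s bare, 99K flips/s with an OS, 1 ms slices.
t_switch = context_switch_time(100e3, 99e3, 1e-3)
print(f"{t_switch * 1e6:.1f} us per switch")  # prints "10.0 us per switch"
```

If anything else steals CPU time (interrupts, for instance), this estimate attributes it to the context switch as well.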

Quote

Segger have produced a methodology

The two approaches are identical in that they both rely on the fact that the output does not change during a context switch. In my case, the output is not being flipped during a context switch; in Segger's case, the output remains the same (low) during a context switch.

Segger's approach requires a fast scope and for very fast switches its measurement precision may be limited by the scope's timing resolution. Mine requires simple measurement of frequencies.

tggzzz · « **Reply #53 on:** September 21, 2014, 11:58:20 am »

Quote from: dannyf on September 21, 2014, 11:07:33 am

Thus, if you count the number of pin flips during a given period of time (aka measuring its frequency), with and without an OS, you get to measure how much time is wasted during context switches.

Segger's approach requires a fast scope and for very fast switches its measurement precision may be limited by the scope's timing resolution. Mine requires simple measurement of frequencies.

Finding context switch time is a useful measurement. Your way of doing it is unnecessarily bizarre and quite probably the results are obscured by effects that you haven't considered. For a start consider the effects of caches.

Your comment about scope's time resolution is unlikely to be valid unless either it is an extremely fast processor or you have an extremely slow scope. How fast is your scope?

The scope technique has many useful virtues and can be used for other measurements, e.g. interrupt latency. Your technique has many severe limitations and no overwhelming advantages.

Fine to invent a new technique, but not noting its limitations and why (you think) it is beneficial is very rude: it unnecessarily wastes other people's time.

Kjelt · « **Reply #54 on:** September 21, 2014, 12:46:08 pm »

Quote from: dannyf on September 21, 2014, 11:07:33 am

Say that the pin is flipped 100K times / second without an OS; and 99K times / second with an OS. You know that you are missing 1k pulses, or 1% of a second spent in context switches, or 10ms per 1 second period.

That only holds for a project where your alternative to an OS is a superloop doing only one task, namely flashing a led.
Since there is no use for such a theoretical project handling only one task, in any real project the superloop will contain many more tasks, making the OS relatively less costly.
IMO you chose the worst possible test condition for the OS; not that it is wrong, but it should be considered that in any real project the results for using an OS will be better.

dannyf · « **Reply #55 on:** September 21, 2014, 02:04:14 pm »

Quote

That only holds for a project where your alternative to an OS is a superloop doing only one task, namely flashing a led.

Think of flashing the led as a proxy for the mcu's processing power: at any point, the mcu can either be doing something useful, or switching context.

In this case, the "useful" thing is simulated by flashing the led / flipping a pin. So any time during which the pin is not being flipped is time consumed by switching context.

That's all there is to it.

In the case of a real RTOS, the "frequency" of the pin being flipped is actually identical, with or without the RTOS. Each "task", when it is running, is flipping the pin in a fashion identical to the loop that flips the pin without an OS, for its time slice.

So what you will see is that the pin is being flipped at 100Khz for 1ms, the flipping then stopped for a few us when the mcu switches the context, and then the flipping resumes, at 100Khz, when the next task takes over.
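The chunked waveform described here can be reproduced with a small simulation. This is a sketch under stated assumptions, not the code used for the measurements: all names are hypothetical, times are integer nanoseconds, the pin flips every 10 us (100K flips per second), and the scheduler takes a fixed 10 us at the start of every 1 ms slice.

```python
def simulate_flips(total_ns, slice_ns, switch_ns, flip_ns):
    """Return timestamps (ns) of pin flips for a time-sliced toggle loop.

    Each slice starts with switch_ns of scheduler time (no flips), then the
    running task flips the pin every flip_ns until the slice ends.
    """
    flips = []
    slice_start = 0
    while slice_start < total_ns:
        # The first flip completes one flip period after the task resumes.
        t = slice_start + switch_ns + flip_ns
        slice_end = min(slice_start + slice_ns, total_ns)
        while t <= slice_end:
            flips.append(t)
            t += flip_ns
        slice_start += slice_ns
    return flips


flips = simulate_flips(total_ns=1_000_000_000, slice_ns=1_000_000,
                       switch_ns=10_000, flip_ns=10_000)
gaps = [b - a for a, b in zip(flips, flips[1:])]
print(len(flips))          # flips counted in one simulated second
print(max(gaps) - 10_000)  # widest gap minus one flip period, in ns
```

With these inputs the count is 99,000 flips in the simulated second. The widest gap exceeds the normal flip period by the assumed 10 us switch time, so one flip period has to be subtracted from a gap read off a capture.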

Quote

IMO you chose the worst possible test condition for the OS,

We discussed this earlier and the exact opposite is true - the mcu spends the most of its time running the tasks under this particular test.

You can think of it this way: in the above example, each task runs for 1ms (ie fully utilizing its time slice), and then for another 10us the mcu switches context and no user code is being run during that period of time.

The opposite would be to run a very simple task (flipping a pin) and immediately switch out to the next task <for 1us or so> - one person I think suggested this - after which the mcu spends the next 10us doing context switching. You would observe a very low frequency -> on that particular mcu (PIC24F), the frequency is 17K, vs. 400Khz running naked or 394Khz running full time slice.

So while the time spent in each context switch is the same, if you increase the number of context switches, the efficiency suffers. The test we are doing utilizes the full time slice so it has the highest efficiency possible, ie. the best case scenario.
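The best-case versus worst-case argument can be put into a toy model. The function `efficiency` is a hypothetical helper. The 10 us switch cost and the 1 ms and 1 us run times are the example figures from this post, and the model assumes the switch cost is constant regardless of how long the task ran.

```python
def efficiency(run_s, switch_s):
    """Fraction of CPU time left for task code when every run_s of task
    execution is followed by one context switch costing switch_s."""
    return run_s / (run_s + switch_s)


SWITCH_S = 10e-6  # per-switch cost used as the example figure in the post

print(f"full 1 ms slice:  {efficiency(1e-3, SWITCH_S):.1%}")  # prints 99.0%
print(f"yield after 1 us: {efficiency(1e-6, SWITCH_S):.1%}")  # prints 9.1%
```

The measured PIC24F ratio quoted above (17K against 400K, about 4%) is lower still, so the model is only a qualitative guide to how quickly efficiency drops as tasks yield sooner.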

tggzzz · « **Reply #56 on:** September 21, 2014, 02:18:25 pm »

dannyf: how fast is your scope?

(You previously discounted the standard direct measurement techniques in favour of your strange indirect imprecise technique because "Segger's approach requires a fast scope and for very fast switches its measurement precision may be limited by the scope's timing resolution.")

dannyf · « **Reply #57 on:** September 21, 2014, 02:44:34 pm »

Quote

So what you will see is that the pin is being flipped at 100Khz for 1ms, the flipping then stopped for a few us when the mcu switches the context, and then the flipping resumes, at 100Khz, when the next task takes over.

We talked about this earlier: that those flippings take the form of "chunks", each followed by a period of "silence" due to context switching.

Here is a capture of a series of such "chunks". In the chart below, you will see little gaps separating 1ms of flippings.

dannyf · « **Reply #58 on:** September 21, 2014, 02:46:23 pm »

We can blow up those gaps for closer examination:

As you may have noticed, some of the gaps seem to be slightly wider than others. That's due to us sampling at a fairly low speed of 1Mhz. So at each end, we could be off by a max of 1us.
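Why the same gap can read differently from one capture to the next can be sketched as follows. The model is an assumption, not the analyzer's documented behaviour: the sampler only sees the pin on its own clock ticks, so each edge is reported at the first tick at or after it. The 7917 ns true gap is a hypothetical input chosen for illustration.

```python
def measured_gap_ns(start_ns, gap_ns, sample_period_ns):
    """Gap length as reported by a sampler that only observes the pin on ticks."""
    def first_tick_at_or_after(t_ns):
        return -(-t_ns // sample_period_ns) * sample_period_ns  # integer ceiling
    return first_tick_at_or_after(start_ns + gap_ns) - first_tick_at_or_after(start_ns)


SAMPLE_PERIOD_NS = 1_000  # 1 MHz sampling
TRUE_GAP_NS = 7_917       # hypothetical true gap

# Slide the gap across every possible phase relative to the sample clock.
readings = {measured_gap_ns(start, TRUE_GAP_NS, SAMPLE_PERIOD_NS)
            for start in range(SAMPLE_PERIOD_NS)}
print(sorted(readings))  # prints [7000, 8000]
```

Depending on where the gap falls relative to the sample clock, the same gap is reported as 7 us or 8 us, which would look like gaps of slightly different widths in a capture.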

dannyf · « **Reply #59 on:** September 21, 2014, 02:47:55 pm »

At a higher sample rate, capturing the gap is more difficult but quantifying the gap is much easier.

Here is one at 12Mhz (timing resolution of 0.08us).

The gap here is 7.917us.

dannyf · « **Reply #60 on:** September 21, 2014, 02:53:14 pm »

Quote

The gap here is 7.917us.

That's a STM32F100RB running RTX @ 24Mhz, O3 flag.

The measured efficiency I reported earlier is 99.1%. Or 9us on a 1ms time slice.

The two measurements are fairly consistent, considering the max error of 0.08us * 2 due to sampling.
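A quick cross-check of the two figures, using only numbers quoted in these posts. The variable names are illustrative, and the sampling bound assumes an error of up to one sample period at each edge of the gap.

```python
SAMPLE_RATE_HZ = 12e6  # logic-analyzer sample rate from the post
CPU_CLOCK_HZ = 24e6    # STM32F100RB clock from the post
GAP_S = 7.917e-6       # gap read off the capture
EFFICIENCY = 0.991     # figure from the frequency method
SLICE_S = 1e-3         # time slice

sampling_error_s = 2 / SAMPLE_RATE_HZ    # up to one sample at each edge
overhead_s = (1 - EFFICIENCY) * SLICE_S  # time lost per slice by the frequency method
gap_cycles = GAP_S * CPU_CLOCK_HZ        # gap expressed in CPU clock cycles

print(f"sampling error bound: {sampling_error_s * 1e6:.3f} us")
print(f"frequency method:     {overhead_s * 1e6:.3f} us")
print(f"difference:           {(overhead_s - GAP_S) * 1e6:.3f} us")
print(f"gap in CPU cycles:    {gap_cycles:.0f}")
```

The difference of about 1.08 us is larger than the 0.167 us sampling bound. The efficiency figure is quoted to 0.1%, which is 1 us of a 1 ms slice, so the two numbers agree to roughly the precision of the inputs. The gap corresponds to about 190 clock cycles at 24 MHz.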

dannyf · « **Reply #61 on:** September 21, 2014, 02:56:32 pm »

Hopefully those pictures will help you visualize how the RTOS works in conjunction with the tasks.

hans · « **Reply #62 on:** September 21, 2014, 05:07:36 pm »

To be more correct you need to subtract the low time of a pin toggle, as that's now included in the time measured.
To be completely correct you need to include the overhead of writing to a GPIO port as well, as some chips need to resolve pointers or do read-modify-write in software (hence the SET/CLR registers on some microcontrollers).

At that point you may as well run the same test as Segger does.

With 12MHz sample rate on a 24MHz clock rate you're still accurate to only 2 cycles.
I would suggest underclocking (instruction cache on a M0 or PIC24 is not a big issue I think) or getting a scope. Even a half-crappy USB scope will do 100MS/s, which is decent for 100MHz / 1 clock cycle or 200MHz / 2 clock cycles.
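The resolution figures above follow from dividing the CPU clock by the sample rate. A minimal sketch, with `cycles_per_sample` as a hypothetical helper name:

```python
def cycles_per_sample(cpu_clock_hz, sample_rate_hz):
    """CPU clock cycles that elapse between two consecutive samples."""
    return cpu_clock_hz / sample_rate_hz


# The three cases mentioned in the post.
for cpu_hz, rate_hz in [(24e6, 12e6), (100e6, 100e6), (200e6, 100e6)]:
    print(f"{cpu_hz / 1e6:.0f} MHz CPU at {rate_hz / 1e6:.0f} MS/s: "
          f"{cycles_per_sample(cpu_hz, rate_hz):.0f} cycle(s) per sample")
```

Underclocking the CPU lowers the first argument, which improves the resolution in cycles without changing the capture hardware.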

dannyf · « **Reply #63 on:** September 21, 2014, 05:29:50 pm »

Quote

To be more correct you need to subtract the low time of a pin toggle, as that's now included in the time measured.
To be completely correct you need to include the overhead of writing to a GPIO port as well, as some chips need to resolve pointers or do read-modify-write in software (hence the SET/CLR registers on some microcontrollers).

You may want to think it through.

Quote

At that point you may as well run the same test as Segger does.

Those "issues" you identified earlier apply equally to Segger's approach.

It takes a little bit of brain power to process, but those two approaches are really identical.

Kjelt · « **Reply #64 on:** September 21, 2014, 08:54:01 pm »

Quote from: dannyf on September 21, 2014, 02:53:14 pm

The measured efficiency I reported earlier is 99.1%. Or 9us on a 1ms time slice.

So the RTOS costs 1% cpu time, I can definitely live with that having the luxury of an RTOS taking care of all those other things for me.

dannyf · « **Reply #65 on:** September 21, 2014, 09:34:09 pm »

That was the gist of the tests:

1) all the major RTOSs have comparable performance (context switch): ~200 instructions. I didn't test embOS but the numbers provided by Segger would imply the same.

2) FreeRTOS did surprisingly well, putting aside its Clause 2 prohibiting benchmarking.

The 99% figure, however, needs to be taken with some caution, as it is the "best case" scenario - minimum switching given the time slice. If you have lots of shorter tasks - reading a button for example - your efficiency will suffer. On the flip side, we did the test at 1ms, more on the aggressive end of the scale. Retarding it to 10ms would be more realistic, I think.

In the end, I am unsure about FreeRTOS. I have used uCOS II/III for quite some time and find them quite reliable, with a much bigger footprint. I also have access to RTX so the appeal of FreeRTOS isn't that great for me.

But for someone needing an open source RTOS at a lower price point, you can use FreeRTOS knowing that from the point of context switching, you aren't losing to the big boys.

Jeroen3 · « **Reply #66 on:** September 22, 2014, 07:05:56 am »

The test you performed is unrealistic. You assume each silence in the toggling is caused by a context switch. You forgot about a lot of other factors (as mentioned before):
Caching, pre-fetching, scheduling method, bus wait states and last but not least interrupt jitter, since nested interrupts are not allowed in most applications.
The results of your test will not be valid if you enable any peripheral with an interrupt.

westfw · « **Reply #67 on:** September 22, 2014, 07:28:38 am »

I'm not sure. Yes, the test only captures the performance penalty incurred when a task is blocked due to its run quantum being used up. But all other reasons for preemption would require an interrupt, and an interrupt would mess up the timing of the blink loop without the RTOS as well. So, the test only measures the "minimum" overhead that an RTOS might add. But I think that's what it was supposed to measure...

Precipice · « **Reply #68 on:** September 22, 2014, 07:30:15 am »

Maybe the test is unrealistic but useful, in that it suggests that available RTOSes aren't utter cycle / space hogs, and it'll discourage a few people from rolling their own, which (from observation as a hardware guy watching software projects for decades) _always_ takes longer, and is more buggy / annoying than anyone expects.

If a case can even remotely and slightly wrongly be made for an RTOS underpinning blinky(), then why not make that your default. Learn one, use it. Move on and write your application.
By the time the project grows / changes and needs an RTOS, there'll be one under you already, and you won't suddenly have a huge screeching halt as you need to turn your code inside-out as you move from a superloop to something scheduled.

Or not. I'm just a hardware guy, I've got plenty of other jobs to be getting on with while you reinvent the wheel for the thousandth... frigging... time...

Kremmen · « **Reply #69 on:** September 22, 2014, 07:57:13 am »

I tried to make just this point several posts back. An RTOS is not a silver bullet but it is a very handy tool. If you are already familiar with one the day the real need pops up, you are that far ahead in your project.

tggzzz · « **Reply #70 on:** September 22, 2014, 08:54:52 am »

Quote from: Precipice on September 22, 2014, 07:30:15 am

Maybe the test is unrealistic but useful, in that it suggests that available RTOSes aren't utter cycle / space hogs, and it'll discourage a few people from rolling their own, which (from observation as a hardware guy watching software projects for decades) _always_ takes longer, and is more buggy / annoying than anyone expects.

Quite, but that's a pretty weak and useless statement: people wouldn't be using RTOSs if they were performance hogs. But you knew that!

I'm puzzled why dannyf chose to occupy our time with his obtuse and ambiguous test. He has made statements to effect of not needing "high speed" oscilloscopes, but hasn't quantified "high speed", despite repeated requests.

Quote

If a case can even remotely and slightly wrongly be made for an RTOS underpinning blinky(), then why not make that your default. Learn one, use it.

Always useful to know what tools can/can't do.

Quote

Move on and write your application.
By the time the project grows / changes and needs an RTOS, there'll be one under you already, and you won't suddenly have a huge screeching halt as you need to turn your code inside-out as you move from a superloop to something scheduled.

Just so.

Quote

Or not. I'm just a hardware guy, I've got plenty of other jobs to be getting on with while you reinvent the wheel for the thousandth... frigging... time...

You forgot to mention the reinvented wheel will be elliptical.

dannyf · « **Reply #71 on:** September 22, 2014, 09:59:46 am »

Quote

The test you performed is unrealistic.

Any test is unrealistic, just as any theory / model is unrealistic - that's by design.

The purpose of a test isn't to provide a precise measurement for a particular application, but to provide an indication, sometimes even ballpark indication, for a set of applications.

If one of the test conditions isn't valid for your application, obviously the results of the test are invalid, in the sense that the measurements from the tests are not precisely applicable to you.

However, the results of the test are still valid indications of how your application is likely to perform.

How good of an indication will depend on the materiality and relevance of such test conditions.

Quote

Caching, pre-fetching, scheduling method, bus wait states and last but not least interrupt jitter.

I would argue that all the above are either immaterial or irrelevant.

Take pre-fetch for example. You incur it twice in any context switch, first at fetching the code for the context switch itself and then at fetching the code for the next task. However, unless you have an OS that can pre-determine, with certainty, which piece of code will be executed next and pre-fetch that piece of code into the pipeline, you will always incur that cost.

ie., pre-fetching, however ineffective or costly it may be, has no impact in comparing RTOSs that cannot pre-determine, with certainty, which piece of code will be executed next - as they all incur this cost equally.

and I would say that 100% of the RTOSs that have ever existed and will exist for a loooooooong period of time to come fall into that category, unfortunately.

tggzzz · « **Reply #72 on:** September 22, 2014, 10:59:16 am »

Quote from: dannyf on September 22, 2014, 09:59:46 am

Quote
The test you performed is unrealistic.

Any test is unrealistic, just as any theory / model is unrealistic - that's by design.

Yes, but some tests are more unrealistic than others - and (except for blinkies) your test is unnecessarily obtuse, indirect, unrealistic and therefore unhelpful.

Quote

The purpose of a test isn't to provide a precise measurement for a particular application, but to provide an indication, sometimes even ballpark indication, for a set of applications.

You are aiming far too low! When I write tests they are designed to give precise measurements for a wide range of applications. If not then I don't waste other peoples' time by publishing them.

Why don't you simply use the standard techniques for measuring parameters that provide useful information for a wide range of applications?

Quote

Quote
Caching, pre-fetching, scheduling method, bus wait states and last but not least interrupt jitter.

I would argue that all the above are either immaterial or irrelevant.

Take pre-fetch for example. You incur it twice in any context switch, first at fetching the code for the context switch itself and then at fetching the code for the next task. However, unless you have an OS that can pre-determine, with certainty, which piece of code will be executed next and pre-fetch that piece of code into the pipeline, you will always incur that cost.

That's a silly misdirection: if it is done, it is the hardware that does the prefetching, caching, bus wait states, scheduling bus transactions, and interrupts. The OS schedules tasks.

Quote

ie., pre-fetching, however ineffective or costly it may be, has no impact in comparing RTOSs that cannot pre-determine, with certainty, which piece of code will be executed next - as they all incur this cost equally.

That's strictly true, but misses the useful point. The impact of prefetching, caches etc can be critically important in hard-realtime systems. In addition the data structures used in RTOSs can be more or less friendly to those performance factors.

mikerj · « **Reply #73 on:** September 22, 2014, 01:40:35 pm »

Quote from: dannyf on September 22, 2014, 09:59:46 am

Quote
Caching, pre-fetching, scheduling method, bus wait states and last but not least interrupt jitter.

I would argue that all the above are either immaterial or irrelevant.

Interrupt jitter is certainly relevant when working with an RTOS since it can be higher than it would be in non-RTOS based code. This is because interrupts that call RTOS functions will typically need to be disabled whilst RTOS code is being executed (e.g. during a context switch or some API calls).

Jeroen3 · « **Reply #74 on:** September 22, 2014, 01:51:37 pm »

A badly configured RTOS can kill your very expensive XYZ-machine by responding to the end switch or force sensor a few milliseconds late due to interrupt jitter. Certainly not irrelevant.

