I am not dismissing anything as "merely syntactic sugar". I am pointing out that you can obtain the same functionality without relying on compiler/runtime-library "black magic", which is extremely relevant in the case of embedded code.
10 years ago, most certainly, a no-brainer even. But more recently, that's much more of a vague "sometimes" than it used to be. And almost always at the expense of readability, correctness, and debuggability, as someone else pointed out.
I used to feel much the same about asm over C, to the point of often using C mostly as a thin shell of convenience around my asm. Until one day I realised that the processors had gotten more complex, that the variations you needed to account for had grown broader, and that the sheer host of people putting development effort into these compilers was actually, collectively, smarter than me after all. It's a humbling realisation for any developer, and like most younger people I felt like I was the king of my domain too, but the world continues to turn and grow whether we like it or not, and we all have to grow up too, eventually.
In that case we have neither the CLR from C# nor the Python bytecode interpreter and its optimizations. What you are describing is relevant if you are developing for a desktop machine or a SoC with oodles of RAM and a normal OS, so you can have threads and whatnot. Not bare metal or an RTOS where you are memory/performance sensitive and the APIs are very basic.
I should point out that I was responding to other people's comments as well, which made certain assumptions not directly relevant to the more constrained 8-bitters. But I'd like to point out that, even more so, what the compiler does to the generated binary in the presence of some of those functions, such as yield (not so much with C#'s use of threading), would be a significant boon even in the more constrained 8-bit environments, and compilers are getting rather quickly to the point where you'd be hard pressed to do better by hand without deploying at least a small measure of that syntactic sugary goodness.
I do agree that exception handling and a runtime are excessive baggage in most cases, which makes these improved languages less useful on a small 8-bitter. But even simple devices are now sprouting smaller 32-bitters, because of the incessant need to connect the simplest of devices over the wild and woolly internet with its SSL and RSA and whatnot (think of the Amazon buttons, which I believe have to build those rather complex and crypto-heavy Amazon tokens and then send them over an SSL link). On such devices that issue is disappearing a lot faster than I've ever been particularly comfortable with, and heap management is in fact again a very real thing (if not often mandatory). Back on the 8-bitters, however, compiler support for yield-type functionality is still readily doable without the overhead of heap management, exceptions, or a runtime; it just seems that all the new development is being done in the bigger systems where these features are needed (rather than down the small end where they'd be merely useful), and over at the bigger end of town things like exceptions are pretty much mandatory, and having large monolithic runtimes can actually be helpful (especially if they can be shared between several processes).
A quick detour: over in D (which does have a rather large runtime, though I'd be willing to bet not nearly as nasty as C#'s), they've recently added support for designating functions that disallow anything which would utilise the garbage collector, such as automatic heap allocation (a big thing for D, since it uses a lot of it), through their @nogc function attribute and its compiler-directive version. More recently, they've gotten exceptions working again in the presence of @nogc (previously you had to also tag functions as nothrow for most vaguely useful code to compile, which cut your standard library options down to next to nothing), and I rather suspect that pattern will leak out into other aspects of the language in relatively short order. Now, again, yes, any of the bigger languages, including C# and D, do tend to assume that you have large gobs of memory just sitting there waiting to be used, so if you use them the same way you would on a desktop, you're generally in for a bad time. But as you noted already, an embedded systems developer will often forego the STL, and likewise in C#/D that means foregoing large chunks of the standard library. The hardest part I've found in learning D in particular is getting familiar with which of its various cat-skinning techniques to apply when: you can choose greedy or lazy processing, copying or non-copying semantics, high- or low-level functionality, and generally very finely craft the flow to meet your needs. There are also several people pushing D towards smaller and smaller devices, and as an embedded systems developer targeting a smallish device (it's far from 8-bit friendly still, although I think I remember reading that, with some considerable effort, someone did manage to hack out most of the runtime as an experiment, so it's at least possible), you'd essentially focus on the in-place, non-copying paths (much the same as you do any time you want good tight-loop efficiency, even on a desktop), and on maximising the compiler's ability to reason about (and hence optimise away) the generated binary. If they could get it to shed much of that runtime (if not automatically, then with a compiler directive), I'm fairly confident many an adamant C developer, with a decent and appropriately focused proficiency in D, would be surprised at the results it achieves.
It's also worth mentioning that other big-end features like CTFE have fairly recently made their way back into the C/C++ realms; though I've seen CTFE in C/C++, and it's still a rather hellish affair. I've not done particularly advanced C/C++ in a while, and even a simple example of CTFE took me a little while to reason my way through; but I could immediately see where it could have been applied with great success to several of my past projects. However, my point with Python wasn't so much the presence of the interpreter (and there are plenty of smallish 32-bit processors that run some form of either Python or BASIC), which makes certain things an awful lot simpler at the expense of memory usage, but rather the fact that things are not always quite what you assume them to be, which is the main point at which coroutines often become impractical in static languages. (Who here, honestly, would have assumed that a Python generator call was actually simpler than a regular function call? I know one university professor, whose YouTube clips teaching this stuff I was watching, certainly didn't; he went off on a merry chase trying to figure out where the complexity was buried, only to finally realise, even with me practically screaming at my monitor trying to point him towards it, as you do, that it was the simple function call case that was the more complex one.) What matters is the pattern of its functionality: there's really very little reason C/C++ compilers couldn't take the Python approach to generators, at least, and apply it at the C level also.
(Python generators don't let you mess with the C call stack, since Python function calls each get their own private stack independent of the C one; it's actually allocated as a simple array of PyObject pointers. And like the Python bytecode compiler, a C/C++ compiler is also capable of calculating a function's own stack-depth needs, or of just plain switching to alternative means.) A C++ yield could be implemented essentially by breaking the function up into a class (or more specifically, probably a struct), with the local variables and arguments that need to bridge function sections shifted into the class structure. At first glance you'd think you're going to have overhead, since all those variables are now indirect accesses; but keep in mind that even regular function locals are stack-relative accesses. Now they're just class-instance-relative instead, which is only really a problem if you don't have a spare index register available… Allocating the structure on the heap, or simply placing it on the stack far enough prior to the call that it's in scope for its whole lifetime, will generally make little difference from an efficiency perspective. Even better, if you're able to statically allocate that structure at compile time (which is effectively what you'll typically be doing if you're implementing all this by hand), it might achieve even better results with a decent CTFE/inlining compiler which can recognise that the only "this" pointer the construct ever sees is a constant, essentially converting it all back down to the same thing you'd be doing yourself, but a heck of a lot cleaner.
Apropos: generator-based coroutines in Python using yield have existed at least since Python 2.5 (PEP 342 made yield an expression, so values can be sent into a generator), I believe. Python 3 has only expanded the functionality.
That sounds about right. I'd like to see yield from ported back to 2.7 (or, even better, the cause of my still being on 2.7 stepping up to 3 instead), though, because doing the equivalent in current 2.7 is slightly tedious, and I have no doubt at least a little less efficient. Python 3's async functionality, however, looks awfully tasty; I do deeply miss TCL's coroutines at times.
I do also totally understand why compiled languages like C# go for threads over coroutines; interpreted languages like Python and TCL have the benefit here of an inherently split stack, where C/C++/D and their ilk generally do not. Coroutines in such languages end up as something more commonly called fibres: essentially cooperative threading. Generators, on the other hand, can readily be implemented by a compiler as an efficient state machine, requiring zero stack trickery. Once again, though, this does not reduce the applicability of the more basic higher-level functionalities such as generators (which are essentially a special case of coroutines here); classes and function decomposition can be used to encapsulate the necessary state, especially in these simple cases. It's what you do yourself when implementing a generator the hard way.