EEVblog Electronics Community Forum

Products => Computers => Programming => Topic started by: SiliconWizard on November 03, 2021, 10:30:57 pm

Title: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 03, 2021, 10:30:57 pm
So, I'm looking for a fast dynamic memory allocator for embedded software, with predictable run time and reasonable behavior regarding fragmentation.
Do you have anything to suggest, any experience you wanna share?
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Marco on November 03, 2021, 10:43:27 pm
If it can't move stuff around the only reasonable thing to expect is a fast OOM error. I don't see how you could do a defragmenting allocator in C without some terribly hacky macro shenanigans, but someone probably made one in C++.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Marco on November 04, 2021, 12:30:39 am
With an MMU you can use virtually over-allocated pools of fixed-size blocks in geometrically increasing sizes, and just throw each allocation into the smallest fitting pool. No fragmentation, just some wasted memory, assuming randomly sized allocations.
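
For illustration, a minimal sketch of the size-to-pool routing under that scheme, assuming power-of-two size classes (all names and limits below are made up):
Code: [Select]
#include <stddef.h>

#define MIN_SHIFT 4    /* smallest class: 16 bytes (assumption) */
#define NUM_POOLS 8    /* classes: 16, 32, ... up to 2048 bytes (assumption) */

/* Route a request to the smallest power-of-two class that fits; -1 if too big. */
static int pool_index(size_t size)
{
    for (int i = 0; i < NUM_POOLS; i++)
        if (size <= ((size_t)1 << (MIN_SHIFT + i)))
            return i;
    return -1;
}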
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: T3sl4co1l on November 04, 2021, 02:38:53 am
If it can't move stuff around the only reasonable thing to expect is a fast OOM error. I don't see how you could do a defragmenting allocator in C without some terribly hacky macro shenanigans, but someone probably made one in C++.

Not so impossible, but slower by design -- namely, double-indirect pointers must be used at a minimum.  That is, one big fat table of pointers can be edited by the manager as needed (at any sequence point, i.e. where the targets of those pointers aren't being held -- preferably only the outer pointers are passed between function calls, say), thus allowing the underlying memory to be physically rearranged as needed.

Believe I heard the original Mac did something like this (as a strong preference, or a stipulation of the languages run on the OS?), so it's not without precedent.  But yeah, slow having to double-tap all those pointers, especially on platforms with sucky pointer arithmetic (like, on AVR I can already see that taking another maybe dozen cycles per pointer access).

Also, what's wrong with malloc() (assuming this is a C project, and libc is available)?  Just an unknown?  Ah, but any allocator is an unknown until familiarized with; the real solution is just lots and lots of reading, heh.

And yeah, fragmentation depends entirely on usage pattern; perhaps a closer inspection of your application will lead to useful insights, as far as what you need and when; and then perhaps a less general solution will present itself, maybe something with arrays of structs/unions for a semi-static allocation (i.e. the objects are statically allocated but their contents varies by use).

Mind, I've not touched an allocator before, so, grain of salt, and for the most part I'm just curious to see actual answers, in preparation for when I actually finally need to write/use one. :)

Tim
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: DavidAlfa on November 04, 2021, 12:55:14 pm
I had to deal with the internal fragmentation in the T12 firmware.
For example:
-Allocate 200+200+400 bytes.
-Free the first two 200-byte blocks.
-Allocating 300 bytes didn't reuse those first 400 freed bytes; the new block was placed after the 400-byte block. malloc saw two separate 200-byte holes, not one 400-byte hole.

What I did to fix this was to allocate 100% of the heap at boot time and then release it.
After that, all allocations were packed together when possible.
Also, this could be a poor implementation from ST.
I'm not an expert at this, there's probably a better explanation.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: DiTBho on November 04, 2021, 01:12:26 pm
What I did to fix this was to allocate 100% of the heap at boot time and then release it.

That's equivalent to static allocation, which is safe by design and the fastest solution there is.

RAM is cheap nowadays. I recently saw this working model applied to the firmware of an industrial sewing machine: 512 MB of DRAM, everything statically allocated at compile time.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: David Hess on November 04, 2021, 04:48:03 pm
I have seen memory allocators for real time systems which return results in bounded time.  Wouldn't Knuth's power-of-2 allocator meet this requirement?
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 04, 2021, 05:48:16 pm
I've looked at various allocation strategies. Was curious to see other people's take on this before laying them out.

Just a small "disclaimer" here first: not considering dynamic allocation for targets with very little memory available. To give an idea, I'd start considering this for memory sizes in the MBytes range.
While dynamic allocation is often frowned upon for "embedded" development (the term actually covering a pretty wide range of systems), there are cases for which static allocation *only* wouldn't really cut it, or would at least make your code harder to maintain and not very pretty.

Of course, general-purpose allocators here are to be avoided. The standard malloc(), for instance, is designed to work reasonably well for general-purpose use. The requirements for a desktop application and for embedded software with tighter constraints are absolutely not the same! Also keep in mind that a standard allocator such as malloc() is often actually several layers of allocators. You have the malloc allocator, which in turn will ask for more memory from the underlying OS if needed, which is yet another allocator (or several!). On small embedded targets, this last part is usually very minimal - typically the sbrk() function implements a simple linear allocator.

So yes, using allocators adapted to your particular use is probably the way to go here. We may even have to consider using several different allocators at the same time for different use cases.

One of the simplest allocators is the "linear" allocator. It doesn't even need a header for each block: memory blocks are allocated linearly as they are claimed. Of course you can't free them individually, but you can free them all in O(1). That may, at first look, seem no different from pure static allocation, but it is different! In some use cases, allocating memory dynamically is more elegant, and you can still reclaim memory (although all at once) if needed, while reusing memory allocated statically is a lot clunkier. It's also an interesting solution when not all allocations are known at compile-time - when sizes or counts are known only at run-time.

You can reclaim memory partially using this scheme, using markers. AFAIR, Turbo Pascal had such an allocator, with the Mark and Release procedures. It's usable if your allocations are locally grouped and the groups have a similar lifetime. That actually covers a lot more cases than one may think. Region-based allocators are an extension of this.
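
To make the linear-plus-mark/release idea concrete, here is a minimal sketch in C (all names and sizes are illustrative, not taken from any existing library):
Code: [Select]
#include <stddef.h>
#include <stdint.h>

static uint8_t arena[65536];   /* backing storage; size is arbitrary here */
static size_t  arena_top;      /* offset of the first free byte */

/* Allocate len bytes aligned to 'align' (a power of two); NULL when exhausted. */
void *arena_alloc(size_t len, size_t align)
{
    size_t start = (arena_top + (align - 1)) & ~(align - 1);
    if (start > sizeof arena || len > sizeof arena - start)
        return NULL;
    arena_top = start + len;
    return &arena[start];
}

size_t arena_mark(void)         { return arena_top; }  /* remember a position...   */
void   arena_release(size_t m)  { arena_top = m;    }  /* ...free back to it, O(1) */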

Of course, this is a lot more problematic if you need a multi-threaded allocator, although you can always allocate dedicated per-thread regions. This would be less efficient in terms of memory use, though.

A power-of-two allocator is also an interesting option.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: rstofer on November 04, 2021, 06:08:49 pm
Maybe the FreeRTOS Memory Management page can help:

https://www.freertos.org/a00111.html (https://www.freertos.org/a00111.html)

There are 5 alternatives with increasing capability.  The source code is included with the distribution.

I would do everything possible to avoid having a heap.  I copy my string and conversion functions from "The C Programming Language" book (Kernighan and Ritchie) rather than use those from the C library.  I don't use printf() for the same reason.  I can usually get by with some form of puts() call.

I always look at the linker output and hope to not find _sbrk().  I won't provide one and I hope the environment doesn't provide one for me by default.  If the library doesn't include _sbrk() then it may be in syscalls.c provided by the IDE (e.g. STM32Cube).

I have no idea how you prove correctness when using dynamic allocation and an increasing call stack.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 04, 2021, 06:45:04 pm
I would do everything possible to avoid having a heap.  I copy my string and conversion functions from "The C Programming Language" book (Kernighan and Ritchie) rather than use those from the C library.  I don't use printf() for the same reason.  I can usually get by with some form of puts() call.

I always look at the linker output and hope to not find _sbrk().  I won't provide one and I hope the environment doesn't provide one for me by default.  If the library doesn't include _sbrk() then it may be in syscalls.c provided by the IDE (eg STM32Cube)

Unless I missed your point, looks like you are directly associating dynamic memory allocation with the standard "heap" and the standard allocator.
I think I was clear in that I wouldn't consider using that. So, it would be a fully custom allocator (or set of allocators) having nothing to do with anything standard.
(That said, you're right saying that some C std functions do themselves call malloc() and thus should be avoided in that case.)

I have no idea how you prove correctness when using dynamic allocation and an increasing call stack.

You may want to elaborate on this a bit. Otherwise, as it is, all I understand is that you're concerned with how the stack and heap could grow into each other, something I think we already discussed when talking about stacks.

Keeping the "heap" from growing into the stack, even with the std malloc(), is easy. You just need to write _sbrk() appropriately so allocation can't go past a predefined end of the heap. That means you need to "reserve" space for the stack, even if you don't use it all. That's the only way of making things safe IMHO.
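
For example, a minimal _sbrk() along those lines might look like this. (Sketch only: _end is the usual linker-script symbol, but the heap budget is an arbitrary assumption and details vary per toolchain.)
Code: [Select]
#include <errno.h>
#include <stddef.h>

extern char _end;           /* first address past static data, from the linker script */
#define HEAP_SIZE 0x8000    /* fixed heap budget; everything above stays reserved
                               for the stack -- an assumption for this sketch */

void *_sbrk(ptrdiff_t incr)
{
    static char *brk = &_end;
    if (brk + incr > &_end + HEAP_SIZE) {   /* never grow past the predefined end of heap */
        errno = ENOMEM;
        return (void *)-1;
    }
    char *prev = brk;
    brk += incr;
    return prev;
}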

Keeping the stack from overflowing into the heap, OTOH, is a more severe problem, which we also talked about in threads about stacks. There are various ways to ensure that, some requiring an MMU if you have one, some requiring more "manual" checking inside your code, which always has some added cost.

But even so, that's still thinking about it in very, very standard terms, which is not what I had in mind here. In the very basic standard memory layout, especially on small targets, there's only one "heap" - all memory between the end of the statically allocated memory and the end of the stack - and then the stack, growing downwards.

Of course, a completely different memory layout could be used here.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Jeroen3 on November 04, 2021, 07:19:52 pm
What do you need it for? Can you get away with writing a container class that performs as an adapter between the memory and application code?
For example, the QString classes? When you do operations on them, they internally work as a linked list (read: lots of small allocated blocks).
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 04, 2021, 09:19:07 pm
I forgot to mention pool allocators, of course, which belong in the set of efficient allocators to be used in specific cases. They allocate blocks that are all the same size, which makes allocation and freeing very easy and fast.
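
A minimal sketch of such a pool, using an intrusive free list so that both allocation and freeing are O(1) (block size and count below are arbitrary):
Code: [Select]
#include <stddef.h>

#define BLOCK_SIZE  64    /* payload size; must be >= sizeof(void *) */
#define BLOCK_COUNT 128

static union slot { unsigned char data[BLOCK_SIZE]; union slot *next; }
             pool[BLOCK_COUNT], *free_list;

void pool_init(void)
{
    free_list = NULL;
    for (size_t i = 0; i < BLOCK_COUNT; i++) {   /* push every block onto the free list */
        pool[i].next = free_list;
        free_list = &pool[i];
    }
}

void *pool_alloc(void)              /* O(1): pop the free-list head */
{
    union slot *s = free_list;
    if (s)
        free_list = s->next;
    return s;
}

void pool_free(void *p)             /* O(1): push the block back */
{
    union slot *s = p;
    s->next = free_list;
    free_list = s;
}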

Regarding the relative aversion to dynamic allocation, common in embedded development, we need to consider when it is reasonable and when it is just "magical thinking". The latter is true in many cases.

Like with any tools, we need to use the ones adapted to the application and use them with due care.
A common issue with dynamic allocation is that, due to its dynamic nature, many developers will tend to use it carelessly, because it makes things easier, rather than carefully. A bit like with dynamic typing.

Using the right allocators and the right techniques, I think you can absolutely make it safe and provable - using, for instance, pre-conditions, invariants and post-conditions, as with any other part of software.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 05, 2021, 02:00:26 am
because even philosophically, it has its meaning: "eternalism" vs "presentism"

As you probably figured, you're going a bit far here. ;D

That said, there are underlying questions that can be interesting, as to what we define as dynamic and static from the POV of a program's execution.
Any "useful" program is dynamic in nature. It will take different paths of execution, will set variables and fill memory with different values, will modify the stack in different ways, etc.

Should memory allocation be considered fundamentally differently?
One thing first: using stacks is a form of dynamic allocation. As soon as you have a stack, you can't claim you're not using dynamic allocation at all.
But it's a form of "automatic" allocation. So it feels a bit safer. That should tell us something: it's a lot about the lifetime of allocated objects! Which is why allocators for bounded lifetime objects are often considered as an alternative to general-purpose allocators when performance and safety matters.

Which gets me to the point: the kind of allocators I'm thinking of here are those that deal with bounded-lifetime objects, not only making the allocators simpler and faster, but also avoiding a whole range of issues related to objects lingering in memory with an ill-defined lifetime and the associated risks of leaks, or of using an object that has been freed...

Just some thoughts.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: T3sl4co1l on November 05, 2021, 04:30:29 am
Kind of an interesting question to turn on its head: if your habit is to divide things into "eternal" vs. "present", why is that?  What about things that are intermediate?

Similarly, turn around the question: what's actually dangerous about memory allocation?

The answer may be much more related than you might otherwise think...

It sounds to me like the biggest problem with memory allocation is, no one teaches how to think about, or use or write, allocators, so people just think of them as magic.  (At least, that would seem a possible explanation.  I don't recall taking an undergrad course that discussed the topic; heck, the most relevant course I took used Java. So, uh, yeah?)  "Just go malloc() some memory, it always works. Except when it doesn't, but never mind that, the calls have always returned successfully for me!"  We grant ourselves these carve-outs of responsibility, a willful ignorance of things that aren't necessarily any more complex than anything else we're working with.  And yet we have the audacity to complain when these things do finally go wrong.  Instead we must question such assumptions, and see if there might in fact be some underlying truth we missed: perhaps making a binary distinction was hasty; perhaps the real thing is multidimensional, or continuous, and we've sliced it the wrong way; etc.

But also, not that this is actually answering the question, so I digress. ;D

Tim
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: DiTBho on November 05, 2021, 10:26:40 am
Kind of an interesting question to turn on its head: if your habit is to divide things into "eternal" vs. "present", why is that?

Only because I usually prefer to study things that I can describe in phase space, a space in which all possible states of a system are known and represented, and each possible state corresponds to one unique point in the phase space.

If you want a "predictable" memory allocator, phase space offers the best way to mathematically study and describe it.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: DiTBho on November 05, 2021, 10:45:51 am
what's actually dangerous about memory allocation?

you need to allocate a buffer for a task that must be completed within a deadline, and there is no free memory.

If we talk about the industrial sewing machine above, its control unit has to complete the texture-motion planning within a deadline. This requires allocating a couple of buffers with critical things to do; if malloc returns NULL because all the memory is busy, you can't wait for a chunk of memory to be freed (you can on a Linux desktop), so you can't allocate the buffer in time.

The result is catastrophic: 200 sewing needles draw a funeral line on the curtain, making it garbage.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Marco on November 05, 2021, 11:01:09 am
Not so impossible, but slower by design -- namely that double indirect pointers be used at a minimum.
The issue is the syntactic sugar. With C++ you can add that as a variation of smart pointers, with C it will require a lot of macro hacking and the code becomes even more error prone than standard C.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: DiTBho on November 05, 2021, 11:11:10 am
With C++ you can add that as a variation of smart pointers

any C++ code example of that?  :-//
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Marco on November 05, 2021, 06:24:13 pm
any C++ code example of that?  :-//

Can't easily find any, guess I was wrong. Did find a paper on implementing a defragmenting allocator in C though ... they don't use macros and will thus require even more finicky and error-prone application code.

http://www.fp7-save.eu/papers/SIES2016.pdf (http://www.fp7-save.eu/papers/SIES2016.pdf)

PS. no actual code, but it seems (https://programmerall.com/article/5594199315/) game/console programmers use defragmenting allocators in C++ with smart pointers.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 05, 2021, 06:42:45 pm
what's actually dangerous about memory allocation?

you need to allocate a buffer for a task that must be completed in a deadline, and there is no free memory.

And as T3sl4co1l and I said, this is all a matter of software design, and not an intrinsic problem.
What you mention becomes a problem if you use memory allocation as a magic tool.
What I agree with is that static-only allocation is much safer in the hands of people who haven't mastered memory allocation patterns, and I admit this isn't a simple topic.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 05, 2021, 06:47:25 pm
That said, the kind of allocators I mentioned are commonly used in embedded settings, or just when speed and predictability matter.

The Linux kernel actually implements slab (in the form of power-of-two IIRC) and pool allocators, for instance.

So I'm just going to start with that: implementing linear (also called "bump"), slab and pool allocators. And think about how to best use them and about appropriate allocation patterns. Certainly, it requires a different approach than just calling malloc() frantically every time you need some memory, without giving it another thought.

https://en.wikipedia.org/wiki/Memory_management
https://en.wikipedia.org/wiki/Region-based_memory_management
https://en.wikipedia.org/wiki/Slab_allocation
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: DiTBho on November 05, 2021, 06:51:30 pm
This is all a matter of software design, and not an intrinsic problem.

static allocation is safe by design; dynamic allocation always exposes a probability of disaster, per Murphy's law. It can be demonstrated mathematically.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 05, 2021, 06:55:38 pm
Believe I heard the original Mac did something like this (as a strong preference, or a stipulation of the languages run on the OS?), so it's not without precedent.

It absolutely did that. I programmed for Mac OS at the time. The allocator would return "handles" to memory blocks. You'd get the actual pointer to the block - strictly when needing to access it - using some kind of Lock() function. It would lock the block and return a pointer to it. Then you would unlock it as soon as possible. Once unlocked, the OS could move the block around to defragment memory.
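
For flavor, here is a toy sketch of that handle scheme (purely illustrative: the function names are invented, and the real Memory Manager, with its Handle type and HLock()/HUnlock(), was of course far more involved):
Code: [Select]
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARENA_SIZE 4096
#define MAX_BLOCKS 16

static uint8_t arena[ARENA_SIZE];
static size_t  used;

/* Master pointer table: only the allocator rewrites the .ptr fields. */
static struct block { uint8_t *ptr; size_t len; int live, locked; } master[MAX_BLOCKS];
typedef int Handle;             /* clients hold an index, never a raw pointer */

Handle handle_alloc(size_t len)
{
    if (len == 0 || used + len > ARENA_SIZE)
        return -1;              /* a real manager would compact and retry here */
    for (Handle h = 0; h < MAX_BLOCKS; h++)
        if (!master[h].live) {
            master[h].ptr    = &arena[used];
            master[h].len    = len;
            master[h].live   = 1;
            master[h].locked = 0;
            used += len;
            return h;
        }
    return -1;
}

void handle_free(Handle h) { master[h].live = 0; }

/* Pin the block and get its current address; use it only until unlock. */
uint8_t *handle_lock(Handle h)   { master[h].locked = 1; return master[h].ptr; }
void     handle_unlock(Handle h) { master[h].locked = 0; }

/* Slide movable blocks down in address order, squeezing out the holes.
   Locked blocks stay put; everything else compacts around them. */
void compact(void)
{
    size_t top = 0;
    for (;;) {
        struct block *next = NULL;   /* lowest-addressed live block not yet placed */
        for (int h = 0; h < MAX_BLOCKS; h++)
            if (master[h].live && master[h].ptr >= arena + top &&
                (!next || master[h].ptr < next->ptr))
                next = &master[h];
        if (!next)
            break;
        if (next->locked) {
            top = (size_t)(next->ptr - arena) + next->len;  /* pinned: skip over it */
        } else {
            memmove(&arena[top], next->ptr, next->len);     /* handles stay valid... */
            next->ptr = &arena[top];                        /* ...only .ptr changes  */
            top += next->len;
        }
    }
    used = top;
}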

Remember that the first Mac had only 128 KB of RAM, so that was necessary.

I think a similar approach existed on Windows as well. IIRC, the API provided at least one allocator that would behave like this.

And "kids" these days think that manual memory management with malloc() or the like is hard... :-DD

Of course that "solves" fragmentation issues, but doesn't solve the predictability problem, on the contrary.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 05, 2021, 07:11:43 pm
This is all a matter of software design, and not an intrinsic problem.

static allocation is safe by design; dynamic allocation always exposes a probability of disaster, per Murphy's law. It can be demonstrated mathematically.

And, again, as T3sl4co1l and I said...

- You can also prove mathematically that a given function returning a constant is much safer than a function returning a value based on a set of inputs and a complex calculation.
- You can also prove that the probability of bugs is much lower if your whole program has only one execution path rather than several (but that program would probably not be that useful...)
- Again using a stack is dynamic allocation. Interestingly enough, many people seem to either ignore it, or just think that it can be risky only if using recursion.

So, this is interesting.

Even more interesting is to, OTOH, find use cases for which dynamic allocation (done properly with the right allocators) could actually be safer than just relying on static allocation. Or even if not safer, could at least be the only reasonable way of implementing things.

So, that's some food for thought. That said, my thread is not about finding reasons why dynamic allocation should not be used. Otherwise I wouldn't have started it. If I'm considering it, there is probably a reason. If I were interested in reading all the reasons why it should be avoided, there's no shortage of material about that...

Oh, and I think the example I gave with the Linux kernel is sort of interesting.
Try and rewrite it with static allocation only, and see how far you go and how scalable it will be.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Siwastaja on November 05, 2021, 07:23:36 pm
- Again using a stack is dynamic allocation. Interestingly enough, many people seem to either ignore it, or just think that it can be risky only if using recursion.

This is a fair point. I see it as experienced programmers coming up with a hand-wavy (and indeed unscientific) way of leaving a lot of margin: reserve, say, 32 KB for the stack, use only a few dozen bytes per function, allow no recursion or very deep call chains, and for anything beyond a, say, 16-byte buffer, go for static storage instead.

This can be made quite robust with static analysis tools that can understand worst case code paths. Or, just trusting the experience of the programmers combined with a lot of margin.

I don't think I have had a stack overflow in... 15 years? The strategy of using the stack only for a few local variables, with no larger buffers at all, clearly just works.

Am I wasting more RAM statically allocating buffers? Maybe, that's the obvious downside of static allocations.

Using unions in the "original" fashion is a pretty interesting way of saving memory in the obvious cases where static allocations differ in different parts of the software, between which buffer retention is not needed. The almost grotesquely explicit nature is the safety net here; you need to document it so explicitly that making a mistake is somewhat unlikely.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: DiTBho on November 05, 2021, 08:53:25 pm
To be honest, you are right on one point: mine is a conservative approach that doesn't overestimate my analysis capability on a firmware that is already intrinsically complex for its own algorithms.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: T3sl4co1l on November 05, 2021, 11:04:02 pm
Kind of an interesting question to turn on its head: if your habit is to divide things into "eternal" vs. "present", why is that?

Only because I usually prefer to study things that I can describe in phase space, a space in which all possible states of a system are known and represented, and each possible state corresponds to one unique point in the phase space.

If you want a "predictable" memory allocator, phase space offers the best way to mathematically study and describe it.

To be clear, that should be read in a philosophical, not critical, tone. :)  And, indeed, phase space is a fascinating tool, quite literally sideways to our common thinking, difficult to grasp at first, but powerful once understood.

Downside here is, the space is discrete, the state-transformation function may be highly nonlinear, and the number of points is utterly intractable -- so you'll have to "study and describe it" in some way to hopefully reduce that to a more feasible scale.  (Relatively trivial case: the algorithm should be independent of allocated contents, reducing the space from "total memory" to a few kB or so of tables, depending on algorithm, size, and number of allocations.  ~kB still isn't exactly tractable, but I mean, at least it's not megs.)

Or, conversely: design in that simple, tractable structure from the beginning.  (More formally speaking: prove it by induction, start with a simple case then expand.)

For an analog example, that's how I design circuit boards: by laying everything out in a certain way (ground planes most importantly), I can maintain certain well-behaved assumptions about the EMC characteristics of the system.  It can be reasoned about, it's not a highly coupled system, and it's relatively easy to adjust/fix if better performance is needed (e.g., adding filters or shields).


On a much more tangential note: I feel this is where big numbers actually get scary.  Real numbers -- poorly named as they are, simply aren't real at all, nothing can be known to perfect precision; they are utterly unknowable, intangible.  We only know a few through their relatively trivial relationships with geometric objects (like pi and e), or computable by algorithm to arbitrary precision.  Whereas, we can count to all sorts of large numbers (integers, rationals, whatever, they're equal cardinality), and we use them on, not just a daily basis -- but in a very real sense, we are made of them!  Yet we can fully enumerate systems of only 48 bits or so, and that with difficulty (parallel processing and months of time?).  Much more than that is really not feasible to check exhaustively.  And yet that's a tiny, tiny amount of memory.  And there are only so many things we can do, to tame the combinatorial explosion; beyond which lies true madness, something utterly ungraspable by human mind alone.

Tim
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Doctorandus_P on November 06, 2021, 03:19:14 pm
I also prefer to use static allocation for embedded software.
Guaranteed no fragmentation, even when the program runs for years.

Embedded software also tends to be small, so there is not much need for dynamic memory allocation.

Sometimes it's handy to declare a relatively big array, and then re-use that array for different sorts of data in different parts of the program. You can even use an array of a union if you want to use different types for your data in those different parts.

Edit:
This method does not work if those different parts of the program need that memory at the same time. (I thought this was so obvious that I did not have to mention it separately)
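
For example (a sketch, with made-up names): two mutually exclusive phases of the program share one statically allocated region, and the union makes the aliasing explicit:
Code: [Select]
/* One statically allocated region, shared by two phases of the program
   that never run at the same time. The union documents the aliasing. */
static union {
    struct {                 /* used only while parsing the configuration */
        char line[256];
        char key[64];
    } config;
    struct {                 /* used only during normal operation */
        unsigned char rx_buffer[256];
        unsigned char tx_buffer[64];
    } comms;
} scratch;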
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: PlainName on November 06, 2021, 05:23:08 pm
Quote
and then re-use that array for different sorts of data in different parts of the program

The risk there is that the different parts might get to want the memory at the same time, perhaps after some later rework.

I think we might be flopping between extremes, assuming all allocation is either dynamic or static. Where I use dynamic allocation it is specifically where the design can cope with not getting it on demand (and code specifically written to handle that case). But the same design will have the usual statically allocated memory where it's assumed the memory will always be there. In essence, you use the right approach for the particular feature, not a global all or nothing per design.

I also think that re-use via a union or whatever is worse than dynamic allocation, since there is likely no mechanism to cope with the memory not being available when needed. However, having said that, I've used compilers that did code analysis and then overlaid function variables where those functions couldn't be live at the same time. But this - being done by the compiler - is a lot different, because it will change every time you compile to ensure no conflict, whereas doing it manually means the current programmer has to be au fait with the peculiarities of the design and remember to fix any issue that may or may not occur.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 06, 2021, 06:18:54 pm
Quote
and then re-use that array for different sorts of data in different parts of the program

The risk there is that the different parts might get to want the memory at the same time, perhaps after some later rework.

Absolutely. I was going to say the same.
Funny thing is, this is actually also a form of dynamic memory management. It's just an ad-hoc one.

In the same vein, declaring a static array of objects, and using it to store more or fewer objects in it at run-time *is* a form of dynamic allocation. It's actually equivalent to a pool allocator.

As I said, it looks like many people actually just have prejudices about what dynamic allocation is, and can't see past them. So if you talk about dynamic allocation, all they will picture is the standard heap and unreasoned use of a malloc-like function asking for more memory at any time in any order. Nope. That's just one particular way of doing it - the "easiest" way for general-purpose use. But there are many other ways of using memory dynamically, and as we saw, in some cases, people think they strictly use memory statically while this is actually not quite true.

Using unions for aliasing various data in the same memory area is a good example of that. Oh, I have done it as well, usually in contexts with very little RAM available. But was that without risks? Nope.

I think we might be flopping between extremes, assuming all allocation is either dynamic or static.

For one, yes, because it's always gonna be a mix of both.

But, yes, and not just that, as I tried to explain. Whatever is dynamic seems to also be ill-understood or at least defined in very restricted ways without necessarily looking at the bigger picture.
As I said just above, sometimes it even looks like many will be dead sure they only use "static allocation" as long as they never call malloc() - or equivalent. This isn't the case. The extreme of this approach borders on magical thinking again. Like: "If I'm not using malloc(), I'm safe with memory allocation. If I'm not using recursion, I'm safe with the stack." Sure.

This thread was trying to be about various dynamic allocation strategies, more dedicated to specific use cases than the general malloc(). It's a bit sad that all but a few here have missed the point.
I also gave examples of what to look at, for instance how memory is managed in the Linux kernel, and actually in kernels of most OSs. I also mentioned "embedded" covers a wide range of systems.

Anyway, I talked about lifetime of objects in memory and how I think it's really all what it is about. As we just discussed, statically allocated objects are usually seen as having "permanent" lifetime - so it looks safer, because in this case you don't have to consider lifetime at all - but in actual software, this isn't exactly how things work. In particular when we deal with dynamically-filled arrays or unions.

In the end... talking about the topic a little bit more would be interesting, instead of just ranting about how we should not even talk about it. At least to me. Now if it doesn't interest anyone else here, that's life.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 07, 2021, 07:52:00 pm
So anyway, to wrap it up a little at this point, here is what I've come up with so far. If anyone's interested.

I implemented a linear allocator with a mark/release feature and an alignment request for each allocated block. Very simple, but it lets you control the lifetime of a group of objects and free them all at once.
On top of that, I implemented a pool allocator. This is basically an allocator that can only allocate fixed-size blocks. More useful than one may think. It's very common, especially in "non-general-purpose" software, to have to allocate/free many objects of the same type. This kind of allocator is used in OS kernels, but also commonly in game engines.

Both are O(1). I've written test code for them. Now I'll write some real-use code to see what can really be done with just that, how useful it is and what would be missing.

I might also implement a buddy allocator after that - typically the power-of-two ones. These can also be found in the Linux kernel. The benefit is that it allows allocating and freeing variable-size blocks in any order (as opposed to the linear allocator), and it's also pretty fast; the drawback is that it makes less efficient use of memory. But I'm not even sure this kind of allocator would be needed.

The point is to have the flexibility of dynamic allocation when it is appropriate to use it - typically when you manipulate data that is dynamic in nature - while remaining simple, fast and predictable. Something that can complement simple static allocation while not having most of the drawbacks of general-purpose heap allocators.

For the two classes of examples we mentioned earlier, namely unions for reusing memory areas and fixed-size arrays used dynamically in your code, I think this would offer a more flexible and more elegant solution. One way of seeing it is, as we discussed above, that it's very common in practice to implement ad-hoc dynamic allocation on top of statically allocated memory (while still considering it static), but ad-hoc implementations imply writing the same kind of things over and over again, with the risk of getting it occasionally wrong. So a small, verified library should help limit the number of ad-hoc allocation schemes while improving correctness.

Of course, as I said, it's not meant for very small targets. There is some overhead, and this overhead has to be justified. I'd typically consider using this, as I said, when at least 1 MB of RAM or so is available.

Still interested in getting opinions about this. As long as they are not limited to "don't do it". Or when they are, at least give concrete examples we can further discuss and see what kind of approach would be more problematic.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: brucehoult on November 07, 2021, 10:04:29 pm
Still interested in getting opinions about this. As long as they are not limited to "don't do it". Or when they are, at least give concrete examples we can further discuss and see what kind of approach would be more problematic.

I have a few opinions on this. (Don't I always?)

Relevant experience is writing the memory allocator/GC for a J2ME to native (ARM, mostly ARM7TDMI) compiler for mobile phones in 2006, running Java games (and other apps) on phones with sometimes as little as 390 KB, though 1 to 2 MB was more common.

A couple of quick points:

1) execution can be divided into two phases: app startup, and the event loop. It's good to be able to detect the boundary between them and free up as much as humanly possible of the junk from the startup phase.

2) patterns of allocation in the event loop phase are (duh) cyclical, and it's very very important that repeated patterns of the same allocations and deallocations don't continuously create fragmentation and pull in new, previously untouched memory.

3) part of achieving 2) is: if a large memory block is freed, NEVER split it up and use it for smaller allocations. Assume that there WILL be a later allocation of the same size block that will reuse it.

4) avoid allocations of large randomly-sized memory blocks, especially with sizes determined by runtime data. Set a fixed maximum object size (256 bytes, 1K, 4K, whatever) and split larger arrays and strings into a list or tree of chunks if you can't just avoid them altogether. (simply ban very large structs -- that was no problem compiling Java where structs can contain only pointers to arrays, not embedded arrays)
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 08, 2021, 03:13:03 am
Your input is always appreciated.
You're talking about a rather general purpose allocator, here, as far as I understand. So again that's not quite what I'm after. Sure, we all would (or would we?) like a general-purpose allocator that is the perfect O(1), non-fragmenting allocator and can allocate blocks of arbitrary size, but that just doesn't exist. So I'm considering and focusing on specialized allocators. Of course they do impose particular allocation patterns, which IMO here is a good thing, but it's also a significant constraint, and thus much harder to apply to more general-purpose "computing".

With that said, your points do confirm that it's pretty much all about object lifetime, and favoring fixed-size allocations, as I pointed out before. That's indeed the direction I'm taking.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: brucehoult on November 08, 2021, 04:47:10 am
Your input is always appreciated.
You're talking about a rather general purpose allocator, here, as far as I understand. So again that's not quite what I'm after.

No, not really.

I think there are very strong parallels to be drawn between games on restricted mobile devices and typical embedded applications. Both are structured around an event loop and run, potentially, forever, with cyclic patterns of allocations and deallocations and a fairly fixed upper bound on the total memory actually being used. The thing you absolutely want to avoid is a cyclic pattern repeated thousands or millions of times slowly growing the heap without bound.

This is very different -- and uses a different allocator -- from programs such as compilers, or business batch update tasks, or a process that is started up to service a web request.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Nominal Animal on November 08, 2021, 05:02:49 am
Even more interesting is to, OTOH, find use cases for which dynamic allocation (done properly with the right allocators) could actually be safer than just relying on static allocation. Or even if not safer, could at least be the only reasonable way of implementing things.
One of my favourite examples is fgets() versus getline().  The former uses a fixed-size buffer (or one you can reallocate yourself), and the latter uses a dynamically allocated buffer.  And since the typical use case is
Code: [Select]
    /* getline() is POSIX; this fragment needs <stdio.h> and _POSIX_C_SOURCE >= 200809L. */
    char   *line = NULL;
    size_t  size = 0;
    ssize_t len;

    while (1) {
        len = getline(&line, &size, stdin);
        if (len < 0)
            break;

        /* Use line; the pointer may vary from iteration to iteration.
           If you steal the dynamically allocated buffer, set line=NULL, size=0.
        */
    }

    free(line);
    line = NULL;
    size = 0;

    if (ferror(stdin)) {
        /* Read error occurred */
    }
it's not like it's that much more code, either.  You can even detect embedded nul bytes in the input (strlen(line) != len).

With that said, [brucehoult's] points do confirm that it's pretty much all about object lifetime, and favoring fixed-size allocations, as I pointed out before. That's indeed the direction I'm taking.
I find myself doing that as well.  If I don't have specific knowledge of the object lifetime or size, I just use the default general-purpose one, until I know more.

In the past, I mentioned using pool allocators in service daemons; there, all objects allocated for the purpose of a specific request can be discarded when the request is completed.  For example, Apache uses this approach (see APR (http://apr.apache.org/docs/apr/1.7/group__apr__pools.html) for details).

Fixed-size allocations do not suffer from fragmentation, never need to be split or coalesced; this alone is sufficient for accepting at least some "waste" (using larger fixed-size allocations to accommodate smaller sized requests).

In systems programming, like the getline() example above, realloc() is surprisingly often called.  (Obviously, there are workloads where almost all allocations are final, too; I just mean that it is worth examining how the allocation size changes in a specific workload during its entire lifetime.)  The only common case where an allocation is shrunk is very soon after populating it, with any further changes to the size extremely rare.  (A typical example is reading a file or all input from a descriptor or socket.  As more and more data is read, the contiguous memory chunk is reallocated larger and larger.  When the end of input is reached, the data is padded/terminated with some nul bytes and reallocated to a suitable final size.)
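
A sketch of that read-then-trim pattern in standard C (error handling abbreviated; the function name is made up):
Code: [Select]
#include <stdio.h>
#include <stdlib.h>

/* Read an entire stream: grow the buffer geometrically while reading,
   then trim (and nul-terminate) once at the end. */
char *read_all(FILE *in, size_t *lenp)
{
    char  *buf = NULL, *tmp;
    size_t size = 0, len = 0, n;

    do {
        if (len == size) {                      /* the "likely to grow" phase */
            size = size ? 2 * size : 4096;
            if (!(tmp = realloc(buf, size))) {
                free(buf);
                return NULL;
            }
            buf = tmp;
        }
        n = fread(buf + len, 1, size - len, in);
        len += n;
    } while (n > 0);

    if ((tmp = realloc(buf, len + 1)))          /* the final "shrink once" */
        buf = tmp;
    buf[len] = '\0';
    *lenp = len;
    return buf;
}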

The bucket brigade approach –– for example, splitting larger objects like arrays and strings into a sequence of sub-objects containing a varying amount of data in a fixed-size container –– is more efficient, but programmatically much more complex.  The two main users I'm familiar with are Apache and the seq_file (https://www.kernel.org/doc/html/latest/filesystems/seq_file.html) interface in the Linux kernel.  In the Linux kernel, it's often used with e.g. pseudofiles (the interfaces where the kernel provides run-time information using a file-like interface under /proc (procfs) or /sys (sysfs)).

(When musing about the various approaches to replacing the standard C library with a more useful API, the bucket brigade approach (and generators/iterators in particular) have been on my mind, but I do not think they warrant the increased complexity in the typical use cases.)

I think there are very strong parallels to be drawn between games on restricted mobile devices and typical embedded applications. Both are structured around an event loop and run, potentially, forever, with cyclic patterns of allocations and deallocations and a fairly fixed upper bound on the total memory actually being used. The thing you absolutely want to avoid is a cyclic pattern repeated thousands or millions of times slowly growing the heap without bound.
Fully agreed.  These often have rather unique patterns to the sizes of the allocations, too, so simple rules of thumb are unlikely to work.

Growing the heap (or memory-mapped pools) without bound does not imply a memory leak, either: all you need is a pattern of allocations and releases that interacts badly with your free space coalescing and allocator selection routines, and you end up with holes in your heap that are too small to use for new allocations, with releases only adding new holes (sufficient only for same-sized or smaller allocations).  Nasty.  (This is something I have not investigated myself, and have left for the general-purpose allocator designers and implementers to worry about.)
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: brucehoult on November 08, 2021, 05:56:37 am
Growing the heap (or memory-mapped pools) without bound does not imply a memory leak, either: all you need is a pattern of allocations and releases that interacts badly with your free space coalescing and allocator selection routines, and you end up with holes in your heap that are too small to use for new allocations, with releases only adding new holes (sufficient only for same-sized or smaller allocations).  Nasty.  (This is something I have not investigated myself, and have left for the general-purpose allocator designers and implementers to worry about.)

As I said a couple of posts up: never take a freed memory block and split it up to satisfy a smaller memory allocation. Always assume that whatever needed that large memory block before will need another one the same size again later.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Nominal Animal on November 08, 2021, 07:14:44 am
As I said a couple of posts up: never take a freed memory block and split it up to satisfy a smaller memory allocation. Always assume that whatever needed that large memory block before will need another one the same size again later.
Like I said, I haven't investigated this myself; it could indeed be that simple.

What really, really interests me in this topic is whether usage hints to the allocator could make a measurable difference: in particular, marking a specific allocation/reallocation as almost certainly "final", or as likely to grow; whether it might be useful to obtain the actual allocated size; and whether to let the allocator decide how much to grow an allocation when "more" memory is needed, with a minimum or hint given as a parameter.  The current dynamic memory management interfaces (malloc()/realloc()/free() or new()/delete()) are, in my opinion, rather simplistic.  At least a hint, "will-grow" or "will-shrink", via a flag should be useful; maybe also grouping allocations (for pools or groups that can all be released at once).  Dunno for sure; I just find it interesting to try and work out.

Even an interface that takes an array of pointers, and optimizes those dynamically allocated regions, might come in handy: essentially a user-initiated memory defrag.
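
To make the idea concrete, a hypothetical interface along those lines could look like the following. To be clear: none of these functions exist in any library; every name is invented purely for discussion.
Code: [Select]
#include <stddef.h>

/* Hypothetical allocation hints -- not an existing API. */
enum alloc_hint {
    ALLOC_FINAL,       /* size is unlikely to change again */
    ALLOC_WILL_GROW,   /* expect further grow requests */
    ALLOC_WILL_SHRINK  /* expect a trim once populated */
};

/* Allocate with a lifetime/size-change hint. */
void *alloc_hint(size_t size, enum alloc_hint hint);

/* Grow to at least 'minimum'; the allocator may round up and reports
   the actually reserved size via *actual. */
void *realloc_hint(void *ptr, size_t minimum, size_t *actual,
                   enum alloc_hint hint);

/* User-initiated defrag: the allocator may move the regions and
   rewrite the pointers in ptrs[] accordingly. */
void optimize_regions(void *ptrs[], size_t count);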

So, @SiliconWizard, do you already have an API or interface you want your implementation to conform to?  As shown by Apache development history (and the way Linux kernel seq_file_path() still does not escape backslashes in file names, making it possible to fake paths/mislead users re. paths containing backslashes), it is not always possible to get even the best developers to understand how to use the interfaces properly.

And then you get to the question of how much effort is appropriate here, considering how well already existing Garbage Collectors perform?  (I have no answer or opinion to that myself.)
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: brucehoult on November 08, 2021, 08:21:56 am
The bucket brigade approach –– for example, splitting larger objects like arrays and strings into a sequence of sub-objects containing a varying amount of data in a fixed-size container –– is more efficient, but programmatically much more complex.  The two main users I'm familiar with are Apache and the seq_file (https://www.kernel.org/doc/html/latest/filesystems/seq_file.html) interface in the Linux kernel.  In the Linux kernel, it's often used with e.g. pseudofiles (the interfaces where the kernel provides run-time information using a file-like interface under /proc (procfs) or /sys (sysfs)).

Those are solving different problems. In particular, the main problem being solved by the Apache data structure is efficiently inserting / deleting / replacing sequences of bytes within the "array". That's more like a rope/cord data structure: https://en.wikipedia.org/wiki/Rope_(data_structure) (https://en.wikipedia.org/wiki/Rope_(data_structure))

Both of those are built on top of arbitrary-length arrays, which are probably just allocated with standard malloc() or similar.

The more fundamental task is to efficiently represent arbitrary-length arrays that are "large" with respect to the amount of memory available.

The data structure I've used in several projects to do this is a tree with exactly two levels, which I will call a SegArr:

1) a set of memory blocks containing the actual data, with every block except (possibly) the last one being of a fixed and known size, the largest size contiguous block you're prepared to allocate. All blocks except the last one are full.

2) a single memory block containing an array of pointers to the blocks with the data, plus at least one item of metadata: the number of data bytes currently allocated (which implies the number of data blocks, the size of the last block, and the number of pointers in the root block (and therefore the size of block allocated to hold it))

In any given program the block size and the metadata present will be the same for every SegArr object.

The block size should be big enough to hold enough pointers so the data blocks for an object can use as much of physical or virtual memory as you anticipate needing for the largest array/string you'd ever want. This is in general sqrt(MaxSize * sizeof(char*)).

For example, on a Cortex-M3 with 16 KB of RAM and 4 byte pointers, it might make sense for the block size to be sqrt(16384*4) = sqrt(65536) = 256 bytes. 256 bytes holds 64 four-byte pointers; minus one slot for the metadata, that leaves 63 pointers. Each pointer (except maybe the last one) points to a block of 256 bytes. So the biggest object possible is 63*256 = 16128 bytes.

Your memory allocator would probably support blocks of size 16, 32, 64, 128, and 256 bytes. It would obtain these blocks from pages of 256 bytes, with each page split into blocks of equal sizes. Two bytes of metadata for each page is sufficient to hold a bitmap of which blocks are in use (with 16 byte blocks, there are 16 blocks in a page).



Accessing a single random value in the SegArr is a little more complex than for a normal C array, but not all that much. Ignoring metadata to make it more apples to apples, instead of...

Code: [Select]
char ch = ptr[idx];

... you need ...

Code: [Select]
char ch = ptr[idx/256][idx%256];

For loops that process a lot of data you use an iterator interface that loops over (ptr, len) pairs and you have barely any overhead (lines of code or instructions executed) compared to using a normal C array.
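
A minimal sketch of such an iterator, assuming the two-level layout described above (block size 256; names invented):
Code: [Select]
#include <stddef.h>

enum { BLOCK = 256 };

/* Visit a SegArr as (ptr, len) chunks: full blocks, then the partial tail. */
void seg_for_each(char **blocks, size_t total,
                  void (*fn)(const char *ptr, size_t len))
{
    for (size_t done = 0; done < total; done += BLOCK) {
        size_t len = (total - done < BLOCK) ? total - done : BLOCK;
        fn(blocks[done / BLOCK], len);   /* one contiguous chunk at a time */
    }
}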
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: brucehoult on November 08, 2021, 08:26:07 am
The current dynamic memory management interfaces (malloc()/realloc()/free() or new()/delete()) are, in my opinion, rather simplistic.  At least a hint "will-grow" or "will-shrink" via a flag should be useful; maybe also grouping allocations (for pools or groups that can all be released at once).  Dunno for sure; I just find it interesting to try and work out.

realloc() is evil. Don't use it. Don't try to grow things in place. (except if you know the memory allocator gave you N bytes and you know you're using only M, and you want to grow it to something else also less than N -- but that's for you to know, not the memory allocator.)

Use SegArr, as described in my most recent comment.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Kalvin on November 08, 2021, 09:42:27 am
Without a proper memory consumption analysis and measurement, the system will have either too much RAM or too little. If the RAM size is estimated too low, the selected memory allocator will eventually fail. If the RAM size is estimated far too high, the system may become too expensive or its power consumption too high, or the memory allocator may just fail at some later time. Of course, too much RAM is not a catastrophe from the technical point of view. But not knowing the RAM consumption is a huge system design failure.

Using a traditional malloc() is not a recommended approach for a system that needs to be reliable and deterministic, due to possible memory fragmentation issues.

Measuring and analyzing the overall memory consumption (stack and heap) and the heap's memory block size usage for a typical use-case and for worst-case scenarios will give good engineering information on how much RAM will be needed and what kind of memory allocator is required.

Systems engineering has to be prepared for the unexpected, which may and will happen. A little paranoia will not hurt. It is good engineering practice to expect, and to be prepared for, the memory allocation failing anyway for some unexpected reason at some unknown point in time. Thus, there has to be a plan for how to prevent excessive memory consumption in the first place, how to monitor memory consumption, and how to prevent memory fragmentation. Also, there has to be a way and a policy to detect memory allocation failures, and to decide how the device shall handle those situations. Lastly, there shall be an analysis of what the consequences will be if such an event should ever happen.

I have used static allocation and memory pools successfully in embedded systems. Adding instrumentation to the system's application code is not too difficult, and it gives good insight into the RAM usage of the use-cases that the system has executed. Since my applications have not been mission-critical, in the sense that a system failure will not result in a life-threatening or catastrophic incident, I will typically just reboot the system in a controlled manner, with some diagnostic code saved for later analysis, if something unexpected should happen.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Kalvin on November 08, 2021, 10:07:01 am
Memory allocation failure may provide a surprise similar to the issues with the Boeing 787 Dreamliner and Airbus A350, for example: the 787 Dreamliner is required to be rebooted before the system's uptime exceeds 248 days. Same thing with an Airbus A350 software bug, which forces airlines to turn the planes off and on every 149 hours. These were not due to memory allocation failures, but due to the fact that a 32-bit timer will overflow after a certain uptime, and cause "partial or total loss of some avionics systems or functions".

The common thing between memory allocation problems and the problems with these planes is that something unexpected happens, which was not properly analyzed during the systems design phase, and which went unnoticed until the product was manufactured and delivered to the customer.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Nominal Animal on November 08, 2021, 10:49:10 am
realloc() is evil. Don't use it. Don't try to grow things in place.
Use SegArr, as described in my most recent comment.
I disagree: the SegArr just shifts the complexity to different code.

I would prefer an allocator that would let me make a tentative allocation whose size is likely to change, resize it as needed, and finally fix its size (making it unlikely to change afterwards).  It is a common pattern, for example when reading input (especially from standard input, a pipe, or a socket).

One can do malloc()+free() instead (and I do that for example when resizing hash tables), and expecting oldptr==newptr after newptr=realloc(oldptr,size) is utterly silly (I consider such an expectation a bug), but I refuse to consider realloc() evil per se.

In practice, my processes typically have at most one such growing allocation at a time, with all other realloc()s rare enough that I'd be happy to consider them as malloc()+free() instead, as long as that growing allocation will not cause memory fragmentation etc. issues.  I'd be happy to even let the realloc() round the new size up if it wants to, for efficiency.

I do not know how common those patterns are in embedded environments, though.  I'd expect anything that reads say human-readable configuration files to do something similar.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 08, 2021, 06:25:47 pm
So, @SiliconWizard, do you already have an API or interface you want your implementation to conform to?

Absolutely not, and that's the beauty of it. At the moment I'm just trying to write a set of specialized allocators that can be used each for specific use cases. As you saw, I'm not trying to reinvent the wheel either; I've - at least for now - implemented well known allocators, such as linear (also called "arena") and pool allocators.

Since I'm considering their use in contexts where dynamic allocation is either rarely used - for reasons we already covered - or, if used (sometimes improperly), used just with the standard allocators, I'm pretty much free to implement whatever interface is simple and does the job.

And then you get to the question of how much effort is appropriate here, considering how well already existing Garbage Collectors perform?  (I have no answer or opinion to that myself.)

The effort, actually - as you can guess - is not so much about the implementation. It took me just a couple of hours at this point to get something working well; sure, it's still a WIP. The effort is all about using allocation properly and choosing the "right" allocation patterns.

Sure, using a garbage collector may look like an answer to everything, like 42. You can in theory just allocate things of any size at any time without ever having to worry about lifetime and freeing them. But that's kind of the opposite of what I'm looking for. GC can't be predictable by nature. Sure, there are implementations that allow configuring behavior so that you have more control over them, such as when they'll be collecting garbage, but in that case their real benefit becomes questionable. Keep in mind I'm not thinking about *general-purpose* use cases, for which I may tend to agree that GC is not so bad.

GCs, even the "best" ones, for anything timing-critical is frankly out of the question in general. I just want allocators - at this point at least - that are ideally O(1), the execution of which thus doesn't depend on state, such as what had already been allocated or freed.

The whole point here, again, is more about allocation patterns than about the allocators themselves, in the end. The required allocators are then just the right tools for the allocation patterns we decide to use. And as I said before, while specialized allocators may look very inflexible, that very constraint forces us to think about allocation patterns, which IMO is a good thing. The pitfalls of general-purpose allocators come from their very flexibility: we want to use them without having to reason about object lifetime whatsoever - the worst in that regard being GCs!

Now could this be a drop-in replacement for malloc (/new/...)? Absolutely not.
And yes, I'm not really reinventing any wheel either: this is already what is done in many kernels, game engines, and other applications. I'm just trying to apply it to "common" embedded development - for which, as you saw, dynamic allocation as such is usually rejected, while it seems that the people rejecting it haven't looked at the above use cases - and see how that goes.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 08, 2021, 06:37:01 pm
Your input is always appreciated.
You're talking about a rather general purpose allocator, here, as far as I understand. So again that's not quite what I'm after.

No, not really.

To me, it is. Of course, I haven't seen exactly what you implemented. But any allocator that allows allocating and freeing arbitrary-size blocks without constraints is a general-purpose allocator.
This may of course be a matter of definition.

I think there are very strong parallels to be drawn between games on restricted mobile devices and typical embedded applications.

Of course, and I mentioned them as well!

Both are structured around an event loop and run, potentially, forever, with cyclic patterns of allocations and deallocations and a fairly fixed upper bound on the total memory actually being used. The thing you absolutely want to avoid is a cyclic pattern repeated thousands or millions of times slowly growing the heap without bound.

This is very different -- and uses a different allocator -- from programs such as compilers, or business batch update tasks, or a process that is started up to service a web request.

Again, we're talking about allocation patterns here. The allocator itself may or may not be different, but the main point is precisely the allocation patterns.

If I need to explain it again, I'll say once more that in this thread I'm not looking for a drop-in replacement (whatever that would be) for a standard allocator. I admit this may appear to have shifted a bit from the initial title of the thread, but that's part of the thought process. :)

Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: brucehoult on November 08, 2021, 10:11:21 pm
realloc() is evil. Don't use it. Don't try to grow things in place.
Use SegArr, as described in my most recent comment.
I disagree: the SegArr just shifts the complexity to different code.

That's what engineering is :-)
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 09, 2021, 03:05:28 am
The current dynamic memory management interfaces (malloc()/realloc()/free() or new/delete) are, in my opinion, rather simplistic.  At least a hint, "will-grow" or "will-shrink", via a flag would be useful; maybe also grouping allocations (into pools or groups that can all be released at once).  Dunno for sure; I just find it interesting to try and work out.

realloc() is evil. Don't use it. Don't try to grow things in place. (except if you know the memory allocator gave you N bytes and you know you're using only M, and you want to grow it to something else also less than N -- but that's for you to know, not the memory allocator.)

That's a bit drastic. It all depends on context - use case, underlying system, amount of free memory available, and of course the type of allocator... If you're using the standard allocator on a "general-purpose" system/OS and need a contiguous block, trying to grow in place when free memory is available can make sense.

But the question of resizing allocations (growing/shrinking) efficiently is an interesting one anyway. In the context I'm working on here, that's the kind of allocation scheme I think should be avoided as much as possible though.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Nominal Animal on November 09, 2021, 03:31:50 am
realloc() is evil. Don't use it. Don't try to grow things in place.
Use SegArr, as described in my most recent comment.
I disagree: the SegArr just shifts the complexity to different code.
That's what engineering is :-)
True. I currently just see realloc() as being the proper place of complexity for certain cases –– not for all, by any means.

At the moment I'm just trying to write a set of specialized allocators that can be used each for specific use cases. As you saw, I'm not trying to reinvent the wheel either; I've - at least for now - implemented well known allocators, such as linear (also called "arena") and pool allocators.
Right.  I asked because I'm problem-solving oriented, and wanted to know whether there was a *particular* problem you were trying to solve, or whether you were just building tools for solving a general class of problems.

None of this is reinventing the wheel, in my opinion.  I haven't found anything that discusses which types of allocators are best suited for which kinds of use cases, except in projects that are implementing their own –– and even then it tends to be on the order of "hey, I think *this* works better than what we have right now; here's the patch series".

And then you get to the question of how much effort is appropriate here, considering how well already existing Garbage Collectors perform?  (I have no answer or opinion to that myself.)
The effort, actually - as you can guess - is not so much about the implementation. It took me just a couple of hours at this point to get something working well; sure, it's still a WIP. The effort is all about using allocation properly and choosing the "right" allocation patterns.
I was thinking of entire projects, actually; not just implementing the allocator or GC itself.

And I'm definitely not criticizing, just pointing out that at one end of the toolbox are the GC's, at the other end static allocations only, and a whole plethora of tools in the middle.  I myself am very interested in the tools in the middle, and enjoy the discussion; just thought it was prudent to mention the GC end since the static end has already been mentioned.

I do believe I covered reallocation patterns already: in the use cases I know, there is usually only one allocation that does not have a final size yet, and it would be useful to allocate it such that growing it is cheap and safe; with an API 'promise' that it will either be freed, or resized to a final size, using a separate call.  It does not matter if reallocating it really is a malloc()+copy+free(); what matters is that repeated reallocations of it won't cause undue fragmentation - useless holes in the heap.

Fixed size allocations are nice, because fragmentation is not an issue.

It is tempting to think that using a generic allocator at first, checking what kind of allocations it makes, and then choosing an optimized version dedicated to them should lead to optimum results.  But the fact is, as you and others have already mentioned, we the programmers choose the allocation patterns when we implement our code.  For better results, we should know beforehand what kind of allocator we target.

As an example, let's say I'm writing a smart display appliance based on Teensy 4.x ($20) or NXP MIMXRT1062, running at up to 600 MHz, with 512k of tightly coupled RAM, another 512k on the MCU, let's say 8M of PSRAM, and graphics resources (icons et cetera) on a microSD card (FAT fs, SDIO, fast enough).  I have three 'zones', of which the TCM is more or less performance-critical (so much so that the hottest code is best copied to TCM), with the 512k being "fast" and the 8M "slower".  And microSD cards have basically unpredictable latency.

I should do what companies do, and just rely on the hardware being fast enough that even bare-assed software works for the users... but I cannot.  I don't think I would use a single allocator for the three RAM zones.  I think I would first have to do some test cases to decide how best to use the zones, before doing any systems integration.  The allocation patterns and allocator schemes would be critical for long-term stability, and really need to be designed in before writing the code that relies on them.  It does *feel* a bit bass-ackwards to me, but if that's what stability and efficiency require, then that's what I'd do.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: T3sl4co1l on November 09, 2021, 06:47:28 am
GCs, even the "best" ones, for anything timing-critical is frankly out of the question in general. I just want allocators - at this point at least - that are ideally O(1), the execution of which thus doesn't depend on state, such as what had already been allocated or freed.

How about this:

Take the hash of the data.  Go to that address.  Write the data.

Doesn't even need to know the length of the data!

It's a hash map, so your memory utilization is stochastic, which does limit efficiency.  It also carries a random chance of overwriting previous data, but what are the chances that's going to happen?? ;-DD

Ok, so CS shitposting isn't exactly being helpful, but suffice it to say: I'm guessing you're looking for a deterministic, lossless one as well. :P

I just don't think I've seen this method suggested before; it's not quite a hash map, because a hash map would do a check first, see if the slot (at the given length) is actually available, and do something else if not.  As horrible as it sounds, there are some neat applications of lossy hash maps, for example to cache whether an item has been seen or not.  If the map entry is blank, definitely not; if it's nonzero, not necessarily so.  But that's good enough for a cache.
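
For what it's worth, a minimal sketch of such a lossy "seen" cache (my own toy - essentially a one-hash Bloom filter; hash32() is an arbitrary integer mixer, nothing canonical):

Code: [Select]
#include <stdbool.h>
#include <stdint.h>

#define SEEN_BITS 4096u                 /* 512 bytes of state */
static uint8_t seen_map[SEEN_BITS / 8];

static uint32_t hash32(uint32_t x)      /* any decent mixer will do */
{
    x ^= x >> 16;  x *= 0x7feb352du;
    x ^= x >> 15;  x *= 0x846ca68bu;
    x ^= x >> 16;
    return x;
}

/* Returns false => definitely never seen; true => possibly seen
   (hash collisions give false positives). Marks the item as seen
   as a side effect. */
static bool seen_before(uint32_t item)
{
    uint32_t h   = hash32(item) % SEEN_BITS;
    bool     hit = seen_map[h / 8] & (1u << (h % 8));
    seen_map[h / 8] |= (uint8_t)(1u << (h % 8));
    return hit;
}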

Tim
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Kalvin on November 09, 2021, 06:49:10 am
One could argue that using a single memory allocator (typically malloc()) in a system is sufficient. But using a general-purpose, one-size-fits-all memory allocator like malloc() in an embedded system will not provide optimal results.

If a system implements multiple tasks that use dynamic memory allocation (like malloc()), it would be better to organize the architecture in such a way that each task also manages its own, separate heap. That way, each task can use whatever memory allocation algorithm is suitable for that particular task. The tasks that are time-sensitive and have high determinism requirements can use faster memory allocators, for example.

Separating the heaps will also give some flexibility for the unfortunate situation where one task or subsystem runs out of its heap memory. Typically, if the system runs into a memory allocation problem, it needs to restart itself or enter some sort of reduced operating state, unless it can recover from the heap exhaustion.

If each task manages its own memory heap, it may even be possible to restart some of the less critical tasks. Since each task manages its own heap, each task can also reinitialize its heap without affecting the other tasks. That would not be possible with a single, system-wide heap manager like malloc().

Also, if the system or a critical task runs out of heap memory, that critical task can shut down less critical task(s) and steal some memory from them, in order to shut the system down in a controlled manner or to enter and operate in the reduced operating state. With a system-wide heap manager this would not be an easy operation.

Using separate heaps may increase RAM requirements if not done wisely. On the other hand, with separate heaps it will be easier to measure, control, and fine-tune the memory usage and RAM requirements of each task. Thus, it may even be possible to reduce the RAM requirements, because we can control the RAM usage more closely.
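
A minimal sketch of the idea (names are illustrative, and a trivial bump allocator stands in for whatever per-task algorithm is appropriate): each task owns a heap descriptor and a backing buffer, so one task's heap can be reinitialized without touching the others.

Code: [Select]
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base;   /* task-private backing memory */
    size_t   size;
    size_t   used;   /* bump allocator for brevity; any scheme fits */
} task_heap_t;

static void task_heap_init(task_heap_t *h, void *mem, size_t size)
{
    h->base = mem;
    h->size = size;
    h->used = 0;
}

static void *task_heap_alloc(task_heap_t *h, size_t n)
{
    if (h->used + n > h->size)
        return NULL;              /* only this task's heap is exhausted */
    void *p = h->base + h->used;
    h->used += n;
    return p;
}

/* Two tasks, two independent heaps: restarting the UI task just means
   reinitializing ui_heap; the comms task is unaffected. */
static uint8_t ui_mem[4096], comms_mem[8192];
static task_heap_t ui_heap, comms_heap;

static void tasks_init(void)
{
    task_heap_init(&ui_heap,    ui_mem,    sizeof ui_mem);
    task_heap_init(&comms_heap, comms_mem, sizeof comms_mem);
}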
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Kalvin on November 09, 2021, 08:09:15 am
I am using the practice/policy shown in the code below, which makes it very easy to change the memory allocator used in a module/task. All that is required is creating macros or inline functions for malloc() and free() for each module/task.

Code: [Select]
//
// File mymodule.c
//

#include <stdlib.h>    /* malloc(), free() */
#include <stdbool.h>

#define mymodule_malloc(N) malloc(N)
#define mymodule_free(P)   free(P)

static bool mymodule_malloc_init(void)
{
    /* Nothing to do for the standard malloc() */
    return true;
}

static void mymodule_malloc_cleanup(void)
{
    /* Nothing to do for the standard free() */
}

It is now trivial to add instrumentation and debugging of the memory allocations for each module/task separately, by redefining the module's/task's memory allocation macros/functions.
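
For example (a hypothetical instrumented variant - the counters and the *_traced names are mine, not from any library), the macros can be re-pointed at counting wrappers without touching a single call site:

Code: [Select]
#include <stdlib.h>
#include <stddef.h>

static size_t mymodule_allocs_live;    /* outstanding allocations */
static size_t mymodule_allocs_failed;  /* NULL returns observed */

static void *mymodule_malloc_traced(size_t n)
{
    void *p = malloc(n);
    if (p)
        mymodule_allocs_live++;
    else
        mymodule_allocs_failed++;
    return p;
}

static void mymodule_free_traced(void *p)
{
    if (p)
        mymodule_allocs_live--;
    free(p);
}

/* Same call sites as before, different definitions: */
#define mymodule_malloc(N) mymodule_malloc_traced(N)
#define mymodule_free(P)   mymodule_free_traced(P)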

Using this convention, it is possible to start with the evil general-purpose malloc() and free(), and it is very easy to change the memory allocator later if needed.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: DiTBho on November 09, 2021, 09:38:47 am
That's what engineering is :-)

query("engineering is")
-> "engineering is merely the slow younger brother of physics, watch and learn gentlemen. Do either of you know how to open the tool box?"

(quote, Sheldon Cooper, engineer jokes
 ;D )
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: DiTBho on November 09, 2021, 10:47:41 am
If a system implements multiple tasks that are using dynamic memory allocation (like malloc()), it would be better to organize the architecture in a such way that each task would also manage its own, separate heap.

That's what I did for the industrial sewing machine. But they are all stalloc (~pool) methods. There are three bitmaps to manage the free blocks; inserting, deleting, and searching for free blocks take linear time, O(n), but that's okay because n is small, ~<200. (See the sketch below.)
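
(Something along these lines, I imagine - a single-bitmap sketch with invented names, simplified from the three-bitmap scheme described above:)

Code: [Select]
#include <stddef.h>
#include <stdint.h>

#define NBLOCKS    200u
#define BLOCK_SIZE 64u

static uint8_t  pool[NBLOCKS * BLOCK_SIZE];
static uint32_t used_map[(NBLOCKS + 31) / 32];   /* 1 bit per block */

static void *pool_alloc(void)
{
    for (size_t i = 0; i < NBLOCKS; i++) {       /* O(n) linear scan */
        if (!(used_map[i / 32] & (1u << (i % 32)))) {
            used_map[i / 32] |= 1u << (i % 32);
            return &pool[i * BLOCK_SIZE];
        }
    }
    return NULL;                                  /* pool exhausted */
}

static void pool_free(void *p)
{
    size_t i = (size_t)((uint8_t *)p - pool) / BLOCK_SIZE;
    used_map[i / 32] &= ~(1u << (i % 32));
}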

I also have a Btree-like application that needs to map a large number of texture patterns, about ~35,000 elements, to be passed in turn to the engine. The Btree internally uses a linear stalloc to allocate nodes, which is slow, so I pre-allocated all the items at startup; that way the tree is already balanced, needs no further processing at run time, and the search is ~O(log n) instead of ~O(n).

A hybrid solution, on large pieces of RAM. I also changed the hardware to map 512 Mbyte instead of 128 Mbyte, to have no further compromises.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 09, 2021, 05:38:42 pm
And I'm definitely not criticizing, just pointing out that at one end of the toolbox are the GC's, at the other end static allocations only, and a whole plethora of tools in the middle.  I myself am very interested in the tools in the middle, and enjoy the discussion; just thought it was prudent to mention the GC end since the static end has already been mentioned.

Well, mentioning GCs in the discussion made sense. But I again do not think they are a good idea in the specific context I'm dealing with here. Beyond the predictability issues, another big issue, as I already said, is that they promote a "lazy" kind of approach for allocating memory, while I do think enforcing particular allocation patterns through the use of specific allocators is a better approach here.

As an example, let's say I'm writing a smart display appliance based on Teensy 4.x ($20) or NXP MIMXRT1062, running at up to 600 MHz, with 512k of tightly coupled RAM, another 512k on the MCU, let's say 8M of PSRAM, (...)

That's pretty "funny"... I am actually working on implementing a MIDI Synth on a Teensy 4.1 board, and after implementing all the base functions, I was thinking about ways to make efficient use of the external PSRAM! So that's kind of what triggered the whole thing here. But then of course I began thinking of all the potential applications of using dynamic allocations for typical embedded development while trying to avoid the usual pitfalls.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: T3sl4co1l on November 09, 2021, 06:43:06 pm
Ah yeah, that's another reason library built-ins can't be used, at least as-is: when allocating additional address spaces.

Similar thoughts have crossed my mind for things like simple file storage.  Which might not be too bad embedded, a stack of presets for example can just pack into a struct, slot it into an array and you're good.  Harder if you want to do wear leveling.  If you want to have named, variable length and structured things, I suppose at some point you just want to put in a file system of whatever sort.

Tim
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Nominal Animal on November 10, 2021, 08:47:07 am
I am actually working on implementing a MIDI Synth on a Teensy 4.1 board, and after implementing all the base functions, I was thinking about ways to make efficient use of the external PSRAM! So that's kind of what triggered the whole thing here. But then of course I began thinking of all the potential applications of using dynamic allocations for typical embedded development while trying to avoid the usual pitfalls.
That's why I liked Kalvin's post regarding multiple heaps so much: it nicely tied in my example on the Teensy with my point about using (one) reallocated block, with multiple heaps.

The dynamically resized buffer I occasionally need is better thought of as a separate (temporary) heap.  Even its patterns of use differ, in much the same way as for the different regions of Teensy RAM.

(I'm actually thinking of using 8+8MiB PSRAM on my 4.1 (https://www.pjrc.com/store/psram.html), since I don't need any more Flash, just RAM.  And boy, do I wish I had the skills to do proper MAPBGA designs – never done more than 2-layer boards! – to do a Teensy 4.x variant with 18+ contiguous FlexBus1/2 pins exposed, for parallel DMA I/O.  PJRC sells the preprogrammed MKL02 (https://www.pjrc.com/store/ic_mkl02_t4.html) chips that do all the Teensyduino magic from bootup to firmware updates for $6.25, so for a couple of DIY boards for use as a USB-connected framebuffer display to use with Linux SBCs and routers, it'd be very affordable, with the nasty bootup side already well worked out.  I just haven't got a reply yet as to what they prefer wrt. derivative schematic and board files (https://forum.pjrc.com/threads/68592-PJRC-preference-wrt-Teensy-derivatives-schematics-and-board-files?p=292509#post292509) licensing/openness.)

Similar thoughts have crossed my mind for things like simple file storage.  Which might not be too bad embedded, a stack of presets for example can just pack into a struct, slot it into an array and you're good.  Harder if you want to do wear leveling.
If you have a sufficiently large, say 64-bit, monotonic sector generation counter at the beginning of each sector (or multi-sector struct), it only takes 3+ceil(log2(N)) reads, IIRC, to find the oldest and the newest sector when using the entire N-sector device as a circular buffer. One does not need to read the entire sector either, just the initial data (which doesn't speed up the search, but simplifies the code), unless I remember wrong.
That gives perfect wear levelling, and all you need to do to initialize new media is to clear it to all zeroes or all ones.

If your updates are sector-aligned, you can increment the sector counter one extra time before writing the first sector of each set.  This does mean finding the newest sector set will then have to do additional sector reads to find the initial sector of the set, and you won't know if the oldest sector set is complete or not.

If your updates are sector-aligned and at most M sectors long, you can use the lowest k = ceil(log2(M)) bits of the counter as the index within the set, and increment the counter by 2^k between the first sectors of consecutive sector sets.  This has the benefit of not slowing down the search for the latest sector set, since if you read sector i = (s<<k) + p, you know that the first sector of that set is at i-p.  Finding the oldest sector set, and the end of any sector set, does need 1+k additional sector reads, though, but that should be okay.
(Note: the contents are still in consecutive sectors; it is just the counter that is no longer consecutive.)
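
A sketch of the newest-sector search under my reading of the scheme (assumptions: counters start at 1, erased sectors read back as counter 0, and the very first write ever goes to sector 0; read_counter() is a stand-in for the media driver, and find_newest_sector() is an invented name):

Code: [Select]
#include <stdint.h>
#include <stddef.h>

/* Stand-in for the media driver: return the counter stored at the
   start of sector `idx`; 0 means erased under our assumptions. */
typedef uint64_t (*read_counter_fn)(size_t idx);

/* Find the newest sector of an N-sector circular log in O(log N)
   counter reads. The counters form a circularly rotated increasing
   run, so we binary-search for the end of the run that starts at
   sector 0. Returns (size_t)-1 for blank media. */
static size_t find_newest_sector(read_counter_fn read_counter, size_t nsectors)
{
    const uint64_t first = read_counter(0);
    if (first == 0)
        return (size_t)-1;   /* writes start at sector 0, so: blank */

    size_t lo = 0, hi = nsectors;   /* invariant: counter[lo] >= first */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (read_counter(mid) >= first)
            lo = mid;        /* mid still belongs to the newest run */
        else
            hi = mid;        /* mid is older (or blank): frontier is left */
    }
    return lo;               /* last sector of the rising run = newest */
}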
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: DiTBho on November 10, 2021, 10:27:45 am
using (one) reallocated block, with multiple heaps.

I don't know; personally, to simplify the testing activity, I don't *re-use* memory but rather increase the RAM to keep things isolated and segregated.

I have too many difficulties otherwise :-//
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 10, 2021, 06:42:39 pm
I am actually working on implementing a MIDI Synth on a Teensy 4.1 board, and after implementing all the base functions, I was thinking about ways to make efficient use of the external PSRAM! So that's kind of what triggered the whole thing here. But then of course I began thinking of all the potential applications of using dynamic allocations for typical embedded development while trying to avoid the usual pitfalls.
That's why I liked Kalvin's post regarding multiple heaps so much: it nicely tied in my example on the Teensy with my point about using (one) reallocated block, with multiple heaps.

If you have read carefully what I said all along in this thread, this one point would be implied.
In all my "musings" about allocators, not *once* have I even considered using the "standard" approach with a single "heap" and general-purpose allocators. Actually, I think the best here, in this discussion, would be to try and forget the *heap* and not even call memory available for allocation a "heap". Or even several of them. That would avoid being biased by what we are used to.

Actually, as I said, using different allocators for different use cases would be the way to go, IMO, and each of those allocators could potentially tap into a different area of available memory. One can also, on top of that, "cascade" allocators. For instance, at the moment I'm using a linear (or "arena") allocator as the base allocator for the whole external memory (though I may split it further later on), and pool allocators which themselves allocate from memory handed to them by the former allocator.

We could add even more layers of allocation if required. The whole thing, apart from selecting the right allocators for the kind of data you're manipulating, is once again about controlling the lifetime of objects. Thanks to the base linear allocator with markers, you can precisely define lifetimes (although you obviously have the constraint that lifetimes must nest in "LIFO" order) and release all objects with a given lifetime at once, in O(1). (A sketch of this cascading follows below.)
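
Here is a minimal sketch of that cascade, under my own naming (arena_* and pool_* are illustrative, not from any particular library): a bump arena with markers as the base, and a fixed-size pool whose backing memory comes from the arena.

Code: [Select]
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base;
    size_t   size;
    size_t   used;
} arena_t;

typedef size_t arena_marker_t;

static void *arena_alloc(arena_t *a, size_t n)
{
    size_t off = (a->used + 7u) & ~(size_t)7u;   /* 8-byte alignment */
    if (off + n > a->size)
        return NULL;                             /* arena exhausted */
    a->used = off + n;
    return a->base + off;
}

/* Markers give O(1) bulk release of everything allocated after the
   marker -- which is exactly the LIFO lifetime constraint. */
static arena_marker_t arena_mark(const arena_t *a)      { return a->used; }
static void arena_release(arena_t *a, arena_marker_t m) { a->used = m; }

/* Fixed-size pool: a free list threaded through the unused blocks.
   O(1) alloc and free, no fragmentation by construction. */
typedef struct pool_block { struct pool_block *next; } pool_block_t;
typedef struct { pool_block_t *free_list; } pool_t;

static int pool_init(pool_t *p, arena_t *backing, size_t block_size, size_t count)
{
    if (block_size < sizeof(pool_block_t))
        block_size = sizeof(pool_block_t);       /* room for the link */
    block_size = (block_size + sizeof(void *) - 1)
               & ~(sizeof(void *) - 1);          /* keep blocks aligned */
    uint8_t *mem = arena_alloc(backing, block_size * count);
    if (!mem)
        return -1;
    p->free_list = NULL;
    for (size_t i = 0; i < count; i++) {         /* thread the free list */
        pool_block_t *b = (pool_block_t *)(mem + i * block_size);
        b->next = p->free_list;
        p->free_list = b;
    }
    return 0;
}

static void *pool_alloc(pool_t *p)
{
    pool_block_t *b = p->free_list;
    if (b)
        p->free_list = b->next;
    return b;
}

static void pool_free(pool_t *p, void *blk)
{
    pool_block_t *b = blk;
    b->next = p->free_list;
    p->free_list = b;
}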

Also as evoked earlier, even with standard allocators on "standard" systems - say Linux or Windows - the "heap" and malloc() are just a local "fantasy". ;D
What happens is that it's all *layers* of memory management and allocators, from your local "heap" to the actual memory reserved on the system level.
Sure, if you're using malloc() on a typical small embedded system such as an MCU with a minimal environment, the "heap" will usually just be a single area of available memory determined at link time, and malloc() in the std lib will be the only allocator behind the scenes, but that's not the case at all for more complex systems.

As to "reallocation", it would be interesting to discuss exact use cases and how we can often replace this with another allocation scheme.
So, for instance:
- Do you typically use reallocation for growing blocks, or for both growing and shrinking? (My own use cases of realloc() are almost always only 'growing'.)
- Is that typically to implement dynamic tables?
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Nominal Animal on November 10, 2021, 08:48:43 pm
Actually, I think the best here, in this discussion, would be to try and forget the *heap* and not even call memory available for allocation a "heap". Or even several of them. That would avoid being biased by what we are used to.
Right.  In that case, let me rephrase: the pattern where I do use reallocation is very distinct.

- Do you typically use reallocation for growing blocks, or for both growing and shrinking? (My own use cases of realloc() are almost always only 'growing'.)
Typical cases are buffers for things like files that fit in memory.  I do not use the pattern where one examines the file size and then reads that amount of data, because it is unreliable.  Instead, I read into a dynamically growing buffer, and when complete, (do a whitespace compaction and comment removal pass, and then) shrink the allocation to the size actually used.  The shrinking is useful if additional allocations are made before the file area is freed, for example for key data needed later.  (A sketch of the pattern follows.)
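
A minimal sketch of that grow-then-finalize pattern (my own helper; read_all() is an invented name, and the initial size and geometric growth factor are arbitrary choices):

Code: [Select]
#include <stdio.h>
#include <stdlib.h>

/* Read an entire stream of unknown length into memory: grow the
   buffer geometrically while reading, then shrink it to the size
   actually used once the length is final. */
static char *read_all(FILE *in, size_t *out_len)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (!buf)
        return NULL;

    size_t n;
    while ((n = fread(buf + len, 1, cap - len, in)) > 0) {
        len += n;
        if (len == cap) {                     /* grow geometrically */
            char *tmp = realloc(buf, cap * 2);
            if (!tmp) { free(buf); return NULL; }
            buf = tmp;
            cap *= 2;
        }
    }

    /* Finalize: give back the unused tail. Keeping the old buffer on
       (unlikely) shrink failure is fine -- it is still valid. */
    char *final = realloc(buf, len ? len : 1);
    if (final)
        buf = final;
    *out_len = len;
    return buf;
}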

When splitting the input into separate chunks, the possibility of a chunk ending in the middle of a multibyte token (say, an escape sequence, or between the CR and LF in a file that contains CR LF newlines) is the detail that makes handling chunks annoying.  The simple approaches are "slow", and the fast approach (a true FSM that can be advanced byte by byte) is complex or hard to maintain.

In the getline() pattern, the buffer may grow, but isn't shrunk; it will eventually be freed.  I do not want to pre-allocate a buffer that can contain the largest possible line I want to support, because such lines occur too rarely.  As an example, most C source lines are short (say, under 200 characters), but occasionally you have generated code etc. that can have lines with thousands of characters.

- Is that typically to implement dynamic tables?
No; the common denominator seems to be human readable text or a related format, like config files, JSON, HPGL, G-code, etc., and input of unknown length or complexity.

I often use a binary heap for timeouts, with the heap represented as an array of (time, event_id) pairs.  Each event_id refers to a slot in a separate array that contains the timeout state and an offset back to the entry in the heap.  Percolation is only a bit more complex than in an ordinary binary heap, since the back-reference from the slot array into the heap must also be updated whenever an entry moves in the heap.  That makes it cheap to delete any entry using just the event_id.  The slots can be split into fixed-size chunks trivially, but the heap really does need to be contiguous in memory.  (A sketch follows below.)
However, when the heap array needs to grow, the old one can be freed and a new one allocated from scratch, because the new one can be populated in a single pass over the active slots.  It can be easier to use realloc() instead, but I don't think I really need realloc() to implement a timeout heap efficiently.
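
A minimal fixed-capacity sketch of that structure (my own naming throughout; time comparisons are plain unsigned compares here, so tick wraparound is ignored for brevity):

Code: [Select]
#include <stddef.h>
#include <stdint.h>

#define MAX_EVENTS 64u

typedef struct { uint32_t time; uint16_t id; } heap_entry_t;

static heap_entry_t heap[MAX_EVENTS];
static size_t       heap_len;
static size_t       slot_pos[MAX_EVENTS];  /* event id -> heap index */

static void heap_swap(size_t a, size_t b)
{
    heap_entry_t t = heap[a]; heap[a] = heap[b]; heap[b] = t;
    slot_pos[heap[a].id] = a;      /* keep back-references in sync */
    slot_pos[heap[b].id] = b;
}

static void sift_up(size_t i)
{
    while (i > 0 && heap[(i - 1) / 2].time > heap[i].time) {
        heap_swap(i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

static void sift_down(size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, min = i;
        if (l < heap_len && heap[l].time < heap[min].time) min = l;
        if (r < heap_len && heap[r].time < heap[min].time) min = r;
        if (min == i) break;
        heap_swap(i, min);
        i = min;
    }
}

/* Caller guarantees id < MAX_EVENTS and the heap is not full. */
static void timeout_add(uint16_t id, uint32_t time)
{
    size_t i = heap_len++;
    heap[i].time = time;
    heap[i].id   = id;
    slot_pos[id] = i;
    sift_up(i);
}

/* O(log n) removal of an arbitrary event via its back-reference. */
static void timeout_cancel(uint16_t id)
{
    size_t i = slot_pos[id];
    heap[i] = heap[--heap_len];    /* move last entry into the hole */
    slot_pos[heap[i].id] = i;
    if (i < heap_len) {            /* skip if we removed the last one */
        sift_up(i);
        sift_down(i);
    }
}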

For hash tables, I use malloc()+copy+free(), not realloc().

Images etc. specify their size before the data is read, unlike text/stream formats, so no realloc() needed there either.

An interesting detail: Current implementations for parsing floating-point numbers are slow.  Even on spinny HDDs, the parser tends to be the bottleneck, not the storage I/O speed.  Daniel Lemire's Number Parsing at a Gigabyte per Second (https://arxiv.org/abs/2101.11408) describes the issues and approaches well, but even it gives up and uses the old arbitrary precision approaches for values with more than 19 digits in the significand (decimal part excluding the power of ten exponent).  (Arbitrary/multiprecision support does not need realloc(), because the new size is known when summing or multiplying anyway.)
While this might sound irrelevant to embedded stuff (since HPGL and G-code limit real parameters to a smaller range and precision anyway), your JSON and configuration data can contain decimal-format floating-point values, and the slowness (and memory needs) of parsing these can be surprising.  Speccing max. 19 decimal digits in the significand, and rejecting the input if more are given (or just using a best-guess approximation), can make a big difference in both speed and firmware size.

When I do use realloc(), I almost always have only one "active" region being realloc()ed; it is extremely rare to need more than one such chunk of memory at the same time.
I often do "normal" malloc()s at the same time, with varying lifetimes, but these never get realloc()'d.

Perhaps the best description is that on top of my normal malloc()+free() patterns, I sometimes need a separate buffer whose maximum size should depend on the available memory, but which should not preclude making normal allocations.  I can even live with this buffer being moved or resized when those normal allocations are made; no problem.
This is why I suggested that maybe a separate interface would suit this: the user code specifies the size it currently uses and may modify it at any time, while the allocator dictates the maximum size and may move or resize the buffer to satisfy a normal allocation made at the same time.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: brucehoult on November 10, 2021, 10:15:35 pm
An interesting detail: Current implementations for parsing floating-point numbers are slow.  Even on spinny HDDs, the parser tends to be the bottleneck, not the storage I/O speed.  Daniel Lemire's Number Parsing at a Gigabyte per Second (https://arxiv.org/abs/2101.11408) describes the issues and approaches well, but even it gives up and uses the old arbitrary precision approaches for values with more than 19 digits in the significand (decimal part excluding the power of ten exponent).  (Arbitrary/multiprecision support does not need realloc(), because the new size is known when summing or multiplying anyway.)

Interesting to see someone else has done this now. I did the same thing in a proprietary system in 2006 and wanted to publish, but was not allowed to. Most literals are far smaller than 19 digits, and if you're printing from a floating-point value then you never need to print more than 17 digits to uniquely identify which IEEE double you started with anyway.

Also: you don't need arbitrary precision. Multiple precision, yes, but not arbitrarily large. There is an upper limit that is, if I recall correctly, something like 1024+53+53 bits, or 144 bytes if you round to a multiple of 32/64/128 bits, no matter how long the input string of digits is (it could be millions of digits).

The key is to use the first 17 digits to find the pair of doubles the input value lies between, then start converting the midpoint of those two values back to decimal. You can stop as soon as you find an input digit that is greater than or less than the corresponding digit in the expansion of that midpoint.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Nominal Animal on November 11, 2021, 08:58:47 am
An interesting detail: Current implementations for parsing floating-point numbers are slow.  Even on spinny HDDs, the parser tends to be the bottleneck, not the storage I/O speed.  Daniel Lemire's Number Parsing at a Gigabyte per Second (https://arxiv.org/abs/2101.11408) describes the issues and approaches well, but even it gives up and uses the old arbitrary precision approaches for values with more than 19 digits in the significand (decimal part excluding the power of ten exponent).  (Arbitrary/multiprecision support does not need realloc(), because the new size is known when summing or multiplying anyway.)
Interesting to see someone else has done this now. I did the same thing in a proprietary system in 2006 and wanted to publish, but was not allowed to. Most literals are far smaller than 19 digits, and if you're printing from a floating-point value then you never need to print more than 17 digits to uniquely identify which IEEE double you started with anyway.
That – not getting permission to publish – is really annoying.

A lot of HPC – at least molecular dynamics simulations – still uses text input/output for a number of reasons, and that is artificially slow, because libraries etc. don't want to incorporate the known faster approaches until they've been properly verified in peer-reviewed journals.  Exhaustive brute force testing is not feasible.

Also: you don't need arbitrary precision. Multiple precision, yes, but not arbitrarily large. There is an upper limit that is, if I recall correctly, something like 1024+53+53 bits, or 144 bytes if you round to a multiple of 32/64/128 bits, no matter how long the input string of digits is (it could be millions of digits).
Of course.  The exponent in IEEE 754 Binary64 'double' has 11 bits and the mantissa 53 bits, so it cannot be more than 2048+53+1+1 bits (the extra bits for exact rounding and overflow detection), or 264 bytes.

I meant that the existing C libraries use either multiple-precision or arbitrary-precision libraries to implement this.  GNU C for example uses GMP, the GNU Multiple Precision Arithmetic Library, which is actually an arbitrary-precision library.

The key is to use the first 17 digits to find the pair of doubles the input value lies between, then start converting the midpoint of those two values back to decimal. You can stop as soon as you find an input digit that is greater than or less than the corresponding digit in the expansion of that midpoint.
In general, obtaining the limiting pairs of doubles that bracket the decimal representation, would be useful in many other ways, too: for example, general interval arithmetic (https://en.wikipedia.org/wiki/Interval_arithmetic).

Circling back to the topic of memory allocation: when parsing numbers from a stream-like source (say, a serial connection), to support exact conversion of all numeric formats we do have to (temporarily) store the entire token, since we need to know the power-of-ten exponent (zero, if not specified) before picking the approach.
It is possible to limit that temporary storage requirement to a few dozen bytes by writing your own numeric parser from scratch (handling pathological cases like leading or trailing decimal zero digits), but the code is rather complex and math-intensive.  If you have the entire number as a string in some buffer, even avr-libc provides a nice strtod() you can use to parse it into a double.  (Not to mention that you can do speculative, non-destructive parsing, too.)
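
A minimal sketch of that buffer-the-token-then-strtod() approach (read_double() is my invented helper; the 64-byte limit is an arbitrary "reject overly long tokens" choice, in the spirit of the 19-digit spec above):

Code: [Select]
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Collect one whitespace-delimited token from a stream into a small
   fixed buffer, then let strtod() do the exact conversion.
   Returns 0 on success, -1 on a too-long or non-numeric token. */
static int read_double(FILE *in, double *out)
{
    char   tok[64];
    size_t len = 0;
    int    c;

    while ((c = fgetc(in)) != EOF && isspace(c))
        ;                                   /* skip leading whitespace */
    while (c != EOF && !isspace(c)) {
        if (len + 1 >= sizeof tok)
            return -1;                      /* token too long: reject */
        tok[len++] = (char)c;
        c = fgetc(in);
    }
    if (c != EOF)
        ungetc(c, in);                      /* push back the delimiter */
    tok[len] = '\0';
    if (len == 0)
        return -1;

    char *end;
    *out = strtod(tok, &end);
    return (end == tok + len) ? 0 : -1;     /* require a full-token parse */
}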

This is the tradeoff I see in the cases where I use reallocation to dynamically resize the buffer I need: being able to reallocate/grow just one active buffer when needed, I can avoid nasty complexity elsewhere.  For example, when parsing a flat configuration file with just key-value pairs, I typically implement a get-key function, one or more get-value functions, and a next-key function (to skip any unread values to the beginning of the next key).  The get-key function returns the key identifier, using the hash of the key and the key text to look it up.  The get-value functions parse for example a numeric value, a true/yes/1|false/no/0 boolean, or an identifier of some sort, or say a 2D or 3D point.  Depending on the use, I can allow more than one value per key.  The next-key skips to the beginning of the next key, and reports if it skipped anything that might have been considered a value.  This approach also works well to parse HPGL and G-code streams.
The active buffer is mostly like a scratchpad or cache.  For 99% of the time, the default size is fine.  Yet, for ease of use for the end users, I would prefer it to keep working, even if slower, for the 1% of oddball cases too.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 11, 2021, 06:27:17 pm
Just a thought here - I myself have never used reallocation for such use cases, but as a disclaimer: I may need to see the parsing code, because maybe there's something I'm missing.

Unless you're extremely short on memory - very small targets - and/or you have to concurrently parse a lot of files, allocating a temporary buffer of several tens or even hundreds of bytes to accommodate the largest "token" is usually more than good enough, and much better in terms of performance. I don't think I've ever done otherwise, unless of course the "tokens" that may be encountered are not bounded in size, which would make the non-reallocating approach fail in some cases. But tokens of arbitrary size in text files are pretty rare in practice.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Marco on November 17, 2021, 11:40:56 am
Ultimately, the need to accommodate nearly arbitrary allocations of truly random size, which also need to be linearly addressable, is restricted to game engines streaming in data from other types of storage, in particular with no MMU or a 32-bit one (64-bit won't fragment fast enough to matter). The graphics hardware necessitates the linear addressing; the streaming and the large dynamic range of object sizes necessitate full dynamism. It's the only application I can think of where defragmenting on the fly makes much sense.

Any other time, even if for some reason you need to keep things fully dynamic and with allocations of arbitrary size, it's better to just use a language which can, elegantly and with an easy-to-follow programming style, use growing/shrinking buffers based on linked lists (i.e. not C). The performance impact will be less than that of defragmenting. Even for a sane GC language (i.e. not C+GC), on a processor with no MMU or a 32-bit one, it doesn't make much sense to build on arrays (with a 64-bit MMU you can get away with it again).
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: DiTBho on November 17, 2021, 11:52:44 am
The graphics hardware necessitates the linear addressing

I have recently seen something similar in the Nitro(gen) development system for Nintendo DS.
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: SiliconWizard on November 17, 2021, 06:18:16 pm
To wrap it up a little - even if I think I'm repeating myself here - it's really all about using proper allocation patterns and managing the lifetime of objects, rather than about the allocators themselves (which are just selected as a consequence).

@DiTBho, I don't think we disagree as much as you may have figured. I still very much think arbitrary-size allocations with standard allocators and the standard heap in embedded dev are usually to be avoided at all costs (of course there are always exceptions, and maybe it's also fine if you really, really know what you're doing. At all times.)

As I mentioned earlier, what you find a lot more reliable and more comfortable to deal with is statically allocated memory, which you may still use in a dynamic way in some cases at run time. My point here is that this approach (which we have all used, I'm sure) is a form of dynamic allocation, just an ad-hoc one. And for those ad-hoc solutions, I'm thinking of replacing them with well-tested allocators that you can reuse, instead of writing ad-hoc schemes again and again.

Using various "allocators" instead of ad-hoc code also forces us to take the lifetime of objects much more seriously into account, which I think is a good thing. (No, again not the standard allocators. ;D )

And the benefit is that they can be extended to use memory that may only be "discovered" at run time, instead of being strictly tied to statically allocated memory. Of course, in that use case, you must make sure that having a varying amount of usable memory that you can't know in advance is appropriate, which is also why it can't be used exclusively. A mix of purely statically allocated objects and dynamically allocated ones - in the ways I described earlier - would be the way to go.

@Marco: would be interesting to give more precise examples of truly arbitrary size allocations in game engines, because those are precisely most often using the kind of allocators that I'm talking about in this thread, among which pool allocators (fixed size) is one of the most commonly used. Of course, we should not confuse allocating an "arbitrary" number of objects each with a fixed size (typical pool allocation) with arbitrary size allocations. (Possibly obvious, but just saying.)
Title: Re: Fast, predictable memory allocator for embedded stuff
Post by: Marco on November 17, 2021, 08:56:02 pm
@Marco: would be interesting to give more precise examples of truly arbitrary size allocations in game engines, because those are precisely most often using the kind of allocators that I'm talking about in this thread, among which pool allocators (fixed size) is one of the most commonly used.

AFAICS, the problem with pool allocators, given a 32-bit virtual memory space and GB-range memory, is that you just don't have enough virtual address space to over-provision the pools. They need to be able to shrink/expand the pools (as the balance of, say, texture and geometry changes frame by frame, depending on what's being streamed in). For that, they need to be able to defragment the virtual memory space.

The reason I say game devs use it is the blog post (https://programmerall.com/article/5594199315/) from the Chinese blogger I linked earlier.