What is the use of dynamic memory allocation in C language

#75 Reply
Posted by peter-h on 03 Jul, 2022 19:47
Quote
any valid data below SP will be corrupted the very same instant an IRQ arrives launching the stacking operation.

The arm32 uses a different stack for interrupts.

This question arose in connection with FreeRTOS stacks. Don't they all have to be big enough for interrupt servicing? With say 20 tasks that would be a lot of memory wasted. The answer is NO; a different stack is used for ISRs i.e. the automatic context save for interrupts switches the SP to another stack.

I don't know any more details.

Quote
the stack will be at the bottom of memory, with a fixed maximum size and growing downwards. The heap is above it, also with a fixed maximum size and growing upwards. When the stack reaches its maximum size (SP goes below the 1st address of available RAM), an error fires.

In terms of memory addresses, the stack is normally at the top of memory and grows downwards. Then, starting from the bottom of memory, initialised data, static variables, etc, are allocated. The top of the static stuff is the base of the heap, which grows upwards from there. The heap must not meet the bottom of the stack, and a function called sbrk() is used to tell malloc() whether the next request is allowed. This is my one:

Code: [Select]
caddr_t _sbrk(int incr) { extern char end asm("end"); extern char top asm("_top"); static char *heap_end; char *prev_heap_end; // This test is never met since heap_end is never going to be = 0 // "end" = 0x2000edf8 (see linkfile) if (heap_end == 0) heap_end = &end; prev_heap_end = heap_end; // top = 2001e000 (top of 128k RAM minus 8k stack) if (heap_end + incr > &top) { // write(1, "Heap and stack collision\n", 25); // abort(); errno = ENOMEM; return (caddr_t) -1; } heap_end += incr; return (caddr_t) prev_heap_end; }
There is a different version of _sbrk going around, which works, until it doesn't
https://community.st.com/s/question/0D50X0000AAIEWbSQP/stm32cubemx-generated-sbrk-in-syscallsc-not-working-with-freertos

But I may have everything backwards; people here know way more than I do

Re LWIP, it has various memory models. The one I am using is where you give it a static block to use and it runs its internal heap inside that. Apparently this is supposed to be deterministic because when the connection is closed, that area is all cleaned up.

I would never use a heap in any code I write, other than for grabbing some RAM for optional product features which are started and then never terminate all the time the thing is powered up.

I would like to know which C standard lib functions use malloc() internally. One tends to assume that printf() and its variants does, which will create havoc with any RTOS (fragmentation, or outright crashes unless you mutex malloc and free etc) unless you restrict the printf() usage to just one thread. I would certainly hope that sscanf() does not use the heap. This site
https://nadler.com/embedded/newlibAndFreeRTOS.html
suggests that printf does not use the heap unless you are outputting floats.

#76 Reply
Posted by emece67 on 03 Jul, 2022 23:00
.

#77 Reply
Posted by emece67 on 03 Jul, 2022 23:26
.

#78 Reply
Posted by ejeffrey on 04 Jul, 2022 00:24
Quote from: nctnico on 15 Feb, 2022 11:53
Quote from: brucehoult on 15 Feb, 2022 10:52
Quote from: Nominal Animal on 14 Feb, 2022 18:05
Thus, advise that says it is not a good idea to use dynamic memory management on a microcontroller with very limited amount of RAM, stems from practical reasons and human stuff, and not technical details. You could say very concisely that "most often, it is not cost-effective to use dynamic memory management on a microcontroller in C". It does not mean it cannot be done, just that the effort needed to make it work is usually too much, compared to the rewards.

I look at things another way.

There are tasks which inherently require dynamic memory management, based purely on the nature of their interface with the outside world.

For example, suppose we have a specification for a device that has a communication channel -- UART, WIFI, I2C ... it doesn't matter -- which sits in an event loop accepting the following messages (I will give it in text, but it could just as well be a binary format):

Code: [Select]
PUT key "value" GET key DEL key
"value" can have any length, at least up to some fairly high maximum .. it doesn't matter in principle whether this is a few hundred bytes or a few GB.
key might restricted to something fairly short. Let's say 32 characters.

PUT associates the value with the key for later retrieval by GET. DEL forgets the value associated with key.

You simply *can't* use static memory allocation for this unless you are prepared to limit the values to relatively small ones AND limit the total number of keys to your storage size divided by the maximum size of a value. Which, assuming that the average value is much smaller than the maximum possible one, is very inefficient.
That leads me to these questions:
1) What else are you going to use the memory for?
2) How are you going to guarantee that there is always enough memory to handle the longest value?
3) Do you allow the system to offer limited functionality at times where memory is low?

That's not the point. Bruce is assuming this is potentially the only use of dynamic memory in the system. The point is that if you preallocate the array you have to bound both the number and size of values. If you use dynamic allocation you can use a greater number of small values while allowing storage of an occasional large item. Even with a fixed amount of storage dedicated to this data structure you use it more efficiently with dynamic allocation -- assuming the overhead of dynamic allocation is less than the savings from having more accurate allocation and that you can deal with fragmentation.

Of course either way you need to handle out of memory.

#79 Reply
Posted by peter-h on 04 Jul, 2022 06:13
Quote
The ARM uses the 2nd stack for ISR if allowed to do so

Where is this enabled?

Quote
If in mbed (for some reason I think you are working with mbed now, maybe I'm confused) you can switch to the minimal-printf variant, not using malloc.

What is meant by "mbed"?

Quote
point 5.2.1.1 of the ARM EABI clearly states that a process may only access data placed at SP or above, not below SP:

Yes; you should never store data below the stack, because (in any "normal" system) interrupts will corrupt it. So local variables in a function are stored on a stack frame which sits above the SP.

Quote
I do not know of any bullet-proof method to check for heap-stack collisions

One can prevent heap going too high (because _sbrk checks the requested block size against the predeclared heap top) but the only way to prevent stack going too far down is some sort of hardware trap, but that assumes the said memory will get accessed. It might just get allocated, and perhaps accessed much later

BTW is there a utility which can display the heap?

#80 Reply
Posted by brucehoult on 04 Jul, 2022 07:28
Quote from: peter-h on 04 Jul, 2022 06:13
Quote
point 5.2.1.1 of the ARM EABI clearly states that a process may only access data placed at SP or above, not below SP:

Yes; you should never store data below the stack, because (in any "normal" system) interrupts will corrupt it. So local variables in a function are stored on a stack frame which sits above the SP.

PowerPC and x86_64 are therefore not "normal" systems, because their Linux/Mac ABIs all allow functions to use up to 128 bytes of stack space *below* the stack pointer.

Microsoft Windows similarly allows the use of up to 16 bytes below SP on Itanium, 8 bytes on 32 bit ARM, and 16 bytes on 64 bit ARM. But not on x86.

Consider [1]:

Code: [Select]
int foo(){ int a[10]; return a[3]; }
x86_64: (https://godbolt.org/z/8ajP3PoGP)

Code: [Select]
foo(): mov eax, DWORD PTR [rsp-44] ret
powerpc:

Code: [Select]
.L.foo(): lwa 3,-36(1) blr

[1] This is of course UB, but change the code to, say, the following and exactly the same thing happens, SP is not adjusted and local variable a[] is stored below SP.

Code: [Select]
int foo(){ int a[10]; a[0] = 0; a[1] = 1; for (int i=2; i<10; ++i) a[i] = a[i-1] + a[i-2]; return a[7]; }

#81 Reply
Posted by emece67 on 04 Jul, 2022 07:33
.

#82 Reply
Posted by peter-h on 04 Jul, 2022 07:54
Quote
SP is not adjusted and local variable a[] is stored below SP.

I am not as clever as you but I can't see this. The whole array a[10] is stored on the stack of that function, and SP points at the return address - otherwise the function could not return to the caller.

Quote
Cortex-M has 2 stack pointers. One (the Main Stack Pointer, MSP) is always used in handler mode. The other (Process Stack Pointer, PSP) may be used in thread mode. Whether thread mode uses MSP or PSP is controlled by the SPSEL bit in register CONTROL, setting the bit enables usage of PSP in thread mode. The default situation after a reset is for this bit to be cleared, so the default behavior is to have a single stack.

Thanks. I guess FreeRTOS must set this bit when it starts up. I had a dig around and found this

Code: [Select]
/** \brief Union type to access the Control Registers (CONTROL). */ typedef union { struct { uint32_t nPRIV:1; /*!< bit: 0 Execution privilege in Thread mode */ uint32_t SPSEL:1; /*!< bit: 1 Stack to be used */ uint32_t FPCA:1; /*!< bit: 2 FP extension active flag */ uint32_t _reserved0:29; /*!< bit: 3..31 Reserved */ } b; /*!< Structure used for bit access */ uint32_t w; /*!< Type used for word access */ } CONTROL_Type;
but CONTROL_Type or SPSEL are not referenced other than in .h files.

I can't find where this bit is set. In Cube debug, among the registers is CONTROL and the value in that, with RTOS running, is 100 so SPSEL=0.

#83 Reply
Posted by brucehoult on 04 Jul, 2022 09:57
Quote from: peter-h on 04 Jul, 2022 07:54
Quote
SP is not adjusted and local variable a[] is stored below SP.

I am not as clever as you but I can't see this. The whole array a[10] is stored on the stack of that function,

The array a[10] is stored in the stack region of memory, but BELOW the stack pointer, hence the x86 code returning the value at [SP-44] and the PowerPC code returning the value at -36(r1). (It's just a difference of alignment)

Quote
and SP points at the return address - otherwise the function could not return to the caller.

On x86 yes. On PowerPC the return address is not in RAM but in a register.

#84 Reply
Posted by peter-h on 04 Jul, 2022 12:23
Quote
On x86 yes. On PowerPC the return address is not in RAM but in a register.

Learn something every day

That cannot possibly work with interrupts unless the ISR auto-switches to a different stack.

Regarding the two stack pointers, MSP and PSP, I am seeing reasonable values in both of these, and checked that they are both moving about the right way in both ISRs and RTOS tasks, but nowhere can I see SPSEL=1 actually set.

It is possible that the debugger generates an interrupt in which case the SPSEL bit gets cleared, so always shows as 0 when stepping. However I also can't find where it might be set to 1. For example here
https://github.com/FreeRTOS/FreeRTOS-Kernel/blob/23f641850d2428eac3e164d6e735e6e92dc3914a/portable/GCC/ARM_CM4F/port.c#L256
they explicitly load 0 into it, and sure enough my own code does the same

Code: [Select]
static void prvPortStartFirstTask( void ) { /* Start the first task. This also clears the bit that indicates the FPU is in use in case the FPU was used before the scheduler was started - which would otherwise result in the unnecessary leaving of space in the SVC stack for lazy saving of FPU registers. */ __asm volatile( " ldr r0, =0xE000ED08 \n" /* Use the NVIC offset register to locate the stack. */ " ldr r0, [r0] \n" " ldr r0, [r0] \n" " msr msp, r0 \n" /* Set the msp back to the start of the stack. */ " mov r0, #0 \n" /* Clear the bit that indicates the FPU is in use, see comment above. */ " msr control, r0 \n" " cpsie i \n" /* Globally enable interrupts. */ " cpsie f \n" " dsb \n" " isb \n" " svc 0 \n" /* System call to start first task. */ " nop \n" ); }
The above loads the MSP and I found the code which also loads the PSP

Code: [Select]
void vPortSVCHandler( void ) { __asm volatile ( " ldr r3, pxCurrentTCBConst2 \n" /* Restore the context. */ " ldr r1, [r3] \n" /* Use pxCurrentTCBConst to get the pxCurrentTCB address. */ " ldr r0, [r1] \n" /* The first item in pxCurrentTCB is the task top of stack. */ " ldmia r0!, {r4-r11, r14} \n" /* Pop the registers that are not automatically saved on exception entry and the critical nesting count. */ " msr psp, r0 \n" /* Restore the task stack pointer. */ " isb \n" " mov r0, #0 \n" " msr basepri, r0 \n" " bx r14 \n" " \n" " .align 4 \n" "pxCurrentTCBConst2: .word pxCurrentTCB \n" ); }
but nowhere can I find where it sets SPSEL=1

#85 Reply
Posted by brucehoult on 04 Jul, 2022 13:38
Quote from: peter-h on 04 Jul, 2022 12:23
Quote
On x86 yes. On PowerPC the return address is not in RAM but in a register.

Learn something every day

That cannot possibly work with interrupts unless the ISR auto-switches to a different stack.

Sure it can, and does.

I don't know what they do with x86_64, but with PowerPC, Itanium and some others the interrupt routine simply decrements the stack pointer by 128 bytes before starting to push the return address and processor status and other registers.

What puzzles me, if anything, is why the RISC-V ABI didn't also adopt this common pattern of a "Red Zone" on the stack.

#86 Reply
Posted by tellurium on 04 Jul, 2022 14:08
Quote from: brucehoult on 04 Jul, 2022 13:38
What puzzles me, if anything, is why the RISC-V ABI didn't also adopt this common pattern of a "Red Zone" on the stack.

Maybe to save a precious 128 bytes of memory , keeping in mind embedded use cases?

#87 Reply
Posted by peter-h on 04 Jul, 2022 15:46
Very interesting thread.

Apart from the remaining 32F417 + FreeRTOS mystery of where SPSEL=1 gets set, it is pretty obvious that the heap is pretty useless in an RTOS environment. Even if you mutex malloc() and free() your system will eventually blow up due to fragmentation. So the heap can be used from only one RTOS task, or perhaps from others if you can make sure there is no time overlap and the allocated memory is never freed.

#88 Reply
Posted by ejeffrey on 04 Jul, 2022 16:28
Quote from: peter-h on 04 Jul, 2022 15:46
it is pretty obvious that the heap is pretty useless in an RTOS environment. Even if you mutex malloc() and free() your system will eventually blow up due to fragmentation.

It's neither obvious nor true, and there are several ways to deal with fragmentation. The most generic is to use memory handles which was very common in systems without virtual memory. Pool allocators are useful for transaction processing where you might need to allocate several small pieces of memory for a task but then free them all a once when the task is complete. You can use power of two allocators with pools for each size to limit fragmentation.

Quote
So the heap can be used from only one RTOS task, or perhaps from others if you can make sure there is no time overlap and the allocated memory is never freed.

Also not true. You really have to not think of "heap / dynamic allocations" as "randomly and carelessly calling malloc all over the place." For instance you can have multiple fixed memory pools, one per thread/task. You can preallocate memory for a particular data structure and then dynamically assign subunits as needed. For instance a network stack might have a fixed memory allocated for the whole stack and dynamically allocate packet buffers from it.

There are many RTOS applications where dynamic allocation is unnecessary and best avoided. And it's likely none of these approaches will work for the smallest memory systems. But there are plenty of applications for dynamic memory in embedded systems.

#89 Reply
Posted by peter-h on 04 Jul, 2022 16:41
Quote
For instance a network stack might have a fixed memory allocated for the whole stack and dynamically allocate packet buffers from it.

Sure; this is what LWIP does. It is also what FreeRTOS does although that does it only once, as tasks are created.

That is what I meant when referring to allocating memory for optional features; memory which is never freed. LWIP frees up its private heap when the connection is closed, AIUI.

But private heaps are a bad example because the original block is statically allocated, with a worst-case size.

Quote
You can use power of two allocators with pools for each size to limit fragmentation.

If you do a malloc() of say 2048 bytes, doesn't it grab slightly more than 2048 (due to it being a linked list)? What units does one have to run to avoid fragmentation?

#90 Reply
Posted by emece67 on 04 Jul, 2022 17:42
.

#91 Reply
Posted by peter-h on 04 Jul, 2022 19:39
Ah yes... very clever

#92 Reply
Posted by eutectique on 04 Jul, 2022 21:01
Quote from: peter-h on 04 Jul, 2022 15:46
Apart from the remaining 32F417 + FreeRTOS mystery of where SPSEL=1 gets set

Here:

Code: [Select]
void vPortSVCHandler( void ) { __asm volatile ( " ldr r3, pxCurrentTCBConst2 \n"/* Restore the context. */ " ldr r1, [r3] \n"/* Use pxCurrentTCBConst to get the pxCurrentTCB address. */ " ldr r0, [r1] \n"/* The first item in pxCurrentTCB is the task top of stack. */ " ldmia r0!, {r4-r11, r14} \n"/* Pop the registers that are not automatically saved on exception entry and the critical nesting count. */ " msr psp, r0 \n"/* Restore the task stack pointer. */ " isb \n" " mov r0, #0 \n" " msr basepri, r0 \n" " bx r14 \n" " \n" " .align 4 \n" "pxCurrentTCBConst2: .word pxCurrentTCB \n" ); }
ldmia loads LR with 0xfffffffd. Then bx r14 returns to Thread mode and PSP is used.

#93 Reply
Posted by brucehoult on 05 Jul, 2022 01:07
Quote from: tellurium on 04 Jul, 2022 14:08
Quote from: brucehoult on 04 Jul, 2022 13:38
What puzzles me, if anything, is why the RISC-V ABI didn't also adopt this common pattern of a "Red Zone" on the stack.

Maybe to save a precious 128 bytes of memory , keeping in mind embedded use cases?

"Embedded" covers a very wide range these days!

These days "embedded" covers things such as Pi compute modules (and RISC-V equivalents mostly with the Allwinner D1 chip at the moment) with 256 MB to a few GB DRAM.

Sure, a lot of popular AVRs and PICs have only 512 bytes or 2 K of RAM (and some have much less), but I don't recall seeing a RISC-V core that has -- or even that can be optioned with -- less than 16 KB DTIM. Even at that level a 128 byte buffer between main program stack and an interrupt handler is less than 1% of RAM. Popular embedded chips such as the ESP32-C3 (400 KB) or Bouffalo BL602 (276 KB) have a lot more, let alone the Kendryte K210 with 8 MB.

Also, there can be more than one ABI.

The original (or "Linux") RISC-V ABI is designed to maximise compute speed in the main program code. Having (as is quite standard these days) a mix of caller-save and callee-save registers allows many leaf functions (those with only scalar local variables, and no more than 15 of them) to not have to build a stack frame at all. A "red zone" extends that to functions with a small number of local structs and arrays. It's only two instructions saved, but every little bit helps.

There can also be a specialised "embedded" ABI. Some people have been calling for an ABI with less latency on entering interrupt handlers, by only having to save maybe 6 or 7 registers before entering C code (like Cortex M) instead of 15. A proposal has been in the works for a few years now. I did some work on it in 2019 when I was at SiFive. For example, you might reduce the number of argument registers from 8 to 4 (the same as ARM) and reduce the number of temporary registers from 7 to maybe 2. You could also have the "embedded" ABI not have a Red Zone even if the Linux ABI does.

Actually, I suspect the "embedded" ABI might never be made official. The loud voices that wanted one became quite unhappy when they learned that reducing the number of registers that needed to be saved on an interrupt causes so many more register spills and reloads to the stack during normal execution that it reduces the speed of main program code by 20%. Yes guys, the standard ABI was designed to maximise the speed of the main program -- if you tinker with it then you get slower, by definition. (Unless it was done badly in the first place)

Also, "__attribute_((interrupt))" was added to the C compiler. A function with this decoration saves only as many registers as it will actually use. The full set or registers are saved only if and when it calls out to a C function that doesn't have the interrupt attribute.

And this basically solves the problem of interrupt handlers that are so short that saving 15 registers instead of 6 makes a significant difference. In fact it's better, because a simple "__attribute__((interrupt))" function might save only 1 or 2 or 3 registers.

#94 Reply
Posted by brucehoult on 05 Jul, 2022 01:34
Quote from: ejeffrey on 04 Jul, 2022 16:28
Quote from: peter-h on 04 Jul, 2022 15:46
it is pretty obvious that the heap is pretty useless in an RTOS environment. Even if you mutex malloc() and free() your system will eventually blow up due to fragmentation.

It's neither obvious nor true, and there are several ways to deal with fragmentation. The most generic is to use memory handles which was very common in systems without virtual memory. Pool allocators are useful for transaction processing where you might need to allocate several small pieces of memory for a task but then free them all a once when the task is complete. You can use power of two allocators with pools for each size to limit fragmentation.

It certainly makes sense to have a small number of different sizes, and round each allocation up to the next size. Power of two might be a bit extreme. I've done memory managers for what are essentially embedded software (mobile phones with a few hundred KB of RAM, old 6502 and z80 machines with 64 KB or less). Often having blocks of 1.0 and 1.5 times a power of two works well e.g. 8, 12, 16, 24, 32, 48, 64, 96... In my experience, going more fine-grained than that gets counter-productive.

When such a block is freed, it goes on to a list of free blocks of that size, and can be immediately reused.

Usually I create groups of small blocks by subdividing a large block of maybe 256 bytes or 1K or 4K (fixed, depending on total RAM size) into equal sized small blocks. Then, if an entire large block of small blocks becomes free, it can be reallocated for a different block size if needed. Sorting free blocks of a given size can promote this happening (maybe just a heap, not a full sorted list).

Many systems have periodic memory needs. Often there is a startup phase that is very distinct from the running phase, so it is good to reclaim as much memory as possible at the end of startup. It is good if you can get the application programmer to call a special memory management function at this point.

Pools are good for periodic allocation and mass freeing when the task doing that can be easily separated from other things happening at the same time (or only one thing happens at a time).

One mistake many memory managers have historically made is a large block being freed, and then the memory manager splits it up to service smaller requests. e.g. "first fit" or "worst fit" algorithms. If the demand is periodic then a large block of the same size will very likely be needed soon, and if the original has been split up then a new large block must be found. Repeated many times this can result in a large number of small, unneeded, unusable memory blocks but no contiguous space for a large block to be allocated.

Real memory managers stopped making that mistake decades ago, but alas not people rolling their own by following their old university textbook!

Another thing I do when I have full control over the whole program is to simply forbid individual memory allocations over a certain size. On a machine with a few hundred KB of RAM that might be as small as 256 bytes. On a machine with hundreds of MB to a couple of GB it might be 4 KB (or the same as the VM/TLB page size). If the programmer wants an array bigger than that then they have to use an array of pointers to blocks, or a tree of blocks, or a list. I provide classes for arrays and strings and buffers that do this internally, without complicating the external interface.

#95 Reply
Posted by brucehoult on 05 Jul, 2022 01:38
Quote from: peter-h on 04 Jul, 2022 16:41
Quote
You can use power of two allocators with pools for each size to limit fragmentation.

If you do a malloc() of say 2048 bytes, doesn't it grab slightly more than 2048 (due to it being a linked list)? What units does one have to run to avoid fragmentation?

Only if "it" is stupid. Don't use stupid software.

Competent people writing things such as power of two allocators store metadata for the blocks somewhere else.

Block metadata tends to be the same size regardless of the block size, so can be conveniently stored in an array-like structure somewhere.

There are also advantages because many operations only need to look at or modify the metadata, not the data block itself, so keeping the metadata together reduces cache misses, TLB misses, page faults ...

#96 Reply
Posted by dmsc on 05 Jul, 2022 04:19
Hi!

Quote from: brucehoult on 04 Jul, 2022 13:38
Quote from: peter-h on 04 Jul, 2022 12:23
Quote
On x86 yes. On PowerPC the return address is not in RAM but in a register.

Learn something every day

That cannot possibly work with interrupts unless the ISR auto-switches to a different stack.

Sure it can, and does.

I don't know what they do with x86_64, but with PowerPC, Itanium and some others the interrupt routine simply decrements the stack pointer by 128 bytes before starting to push the return address and processor status and other registers.

x86 is always over-complicated: in 32 bit mode, interrupts are searched in the IDT (interrupt-descriptor-table), where the address (and segment) of the interrupt handler and the protection level is specified, then the stack pointer (and the stack segment) is loaded from the TSS (the task-state-segment, a table with the stacks available for each switch from one protection level to another) if specified. This is one of the reasons that interrupt latency is that big on x86.

In 64 bit mode, there is a new IST (interrupt-stack-table) that can be used to load stacks for interrupts with less indirection.

#97 Reply
Posted by peter-h on 05 Jul, 2022 06:12
Quote
Competent people writing things such as power of two allocators store metadata for the blocks somewhere else.

That means the malloc() function needs an array allocated and that sets a limit on how many blocks can be in use. Or does it use its own allocation mechanism for the array, growing it as necessary?

#98 Reply
Posted by brucehoult on 05 Jul, 2022 11:06
Quote from: peter-h on 05 Jul, 2022 06:12
Quote
Competent people writing things such as power of two allocators store metadata for the blocks somewhere else.

That means the malloc() function needs an array allocated and that sets a limit on how many blocks can be in use. Or does it use its own allocation mechanism for the array, growing it as necessary?

Of course, if it is competently written.

There are many different malloc() libraries in the world.