I made this mistake years ago with an electric kart ... I had said "I don't see what could go wrong, let's do it with the cheapest wrenches", then I lost a wheel in a corner ... and crashed into a hay bale. Now I know why pneumatic ratchet wrenches are better.
I then found out that below -20 °C they break. At -45 °C, that means no fire, no liquid water to drink, no food.
Linus complains that [variable-length arrays on the stack] generate less efficient code than a statically sized array. True, but it's a lot more efficient than a malloc() at the start of the function and a free() at the end.
I think I can safely say that I have never written a dynamically sized array in C. I do however use alloca() sometimes, which amounts to the same thing.
Efficiency counts for nought if you run out of stack.
A scope-assisted no-realloc heap allocator would be very nice. Within each scope, allocating more memory for local data would basically boil down to extending the stack (but that stack pointer probably being at say TLS rather than in a dedicated register). Passing out of a scope would simply reset that data stack pointer to a previous value, essentially freeing everything without any explicit free() type calls. So, it would really boil down into a secondary stack for local data, without a hardware stack pointer register.
I fail to see the point of this, assuming you're not on a 6502 or something.
Appropriately sizing one stack per thread can be a challenge. Figuring out the right size for *two* stacks per thread, and the balance between them, just seems harder.
Ah, yes. As I say, not enough coffee yet.
char c[strlen(a)+strlen(b)+2]; // Urgh! What were they thinking!
Quote from: dunkemhigh on Today at 01:03:23
it's usually possible to know where the limit of the stack is
You still do not seem to be able to differentiate between "heuristic" and "deterministic", it seems.
Appropriately sizing one stack per thread can be a challenge. Figuring out the right size for *two* stacks per thread, and the balance between them, just seems harder.
Quote
You still do not seem to be able to differentiate between "heuristic" and "deterministic", it seems.
I am well aware of the difference.
One advantage I see is that overflow of the separate allocation stack can be easily and reliably detected, and max. usage can be tracked (w/o support from the compiler or hardware).
Furthermore, the "stack" of this allocator could also be segmented (i.e. a list of big chunks), so that it could even grow dynamically by adding additional chunks/pages on demand.
(Depending on the particular use case, dynamic growth may or may not be desired, of course.)
A dynamic-size stack would be pretty inefficient. Obviously depending on the memory layout, it could require being reallocated.
Quote
A dynamic-size stack would be pretty inefficient. Obviously depending on the memory layout, it could require being reallocated.
No, that won't work. We do not know which of the values in the stack are addresses into the stack, so we cannot do a fixup pass. In other words, the stacks must never move.
That doesn't sound practical for small embedded stuff obviously though. It requires a full-fledged MMU with corresponding software support.
On larger systems, that's usually not a big problem to largely oversize the stacks, even if that means some wasted memory.
Quote
Appropriately sizing one stack per thread can be a challenge. Figuring out the right size for *two* stacks per thread, and the balance between them, just seems harder.
The second one grows and shrinks dynamically; it is not sized beforehand.
With that said, there is a common conception that heap allocation would be more "reliable" than allocating stuff on the stack. This just isn't true. It mostly comes from two facts: one, that on most systems these days, return addresses are stored in the same stack as local objects, which is pretty bad (we already discussed this in other threads), and second, that again on most common systems and programming languages, the allocated stack is usually much smaller than the space reserved for the heap, because heap allocation is favored. We can think of almost-fully stack-based languages that would make this irrelevant.
On embedded targets, I'd love to have hardware stack-pointer-relative effective-address checks as a tool for "cancelling" recursive processing when it cannot complete within the currently available stack space. It would rely on the compiler generating code that always uses stack-pointer-relative addressing (even a dummy load) before assigning an address to a pointer, so I guess it would make more sense to work on integrating stack address check support into a compiler one uses instead.
Hey, does anyone have any contacts with clang/llvm AVR developers? If they would not reject such an extension (the function preamble and stack allocations checking the resulting effective address against a runtime limit) as an idea before seeing some patches first, I might just try and find out what they'd think of some suggested patches. As mentioned before, I do not have the ability to push through GCC levels of cliquishness, though.
It would not help with buffer under/overrun bugs, but it would give those of us interested almost complete control over stack overflows. (Almost, because correctness when an interrupt causes the stack overflow is something I have not worked out yet. Is complicated.) And, of course, it would also work as runtime stack use instrumentation.
decide on how far apart to put them.
This is easier for something with a MMU where you can just place things very far apart, especially with a 64 bit address space, but you still have to decide on the maximum stack size beforehand.
Quote
decide on how far apart to put them.
Exactly; but that does NOT determine the size of the stack, only the maximum possible size of the stack.
As a side note, it is interesting to examine how C programs asked the kernel for more data memory prior to memory-mapping interfaces: the brk() function.
Essentially, it maintains the size of the uninitialized data section (or program break, the end of the uninitialized data section) in exactly the way a local data stack would be maintained by functions requiring space for local variables.
Nothing new under the sun, eh?
When I say something like "you have to figure out the size to allocate for the stack" I'm of course talking about the maximum possible size the stack can grow to. The stack starts off empty.
Sure. You can call sbrk() with a negative increment (or call brk() with a smaller value than you previously called it with) and thus use it as a stack. But not if, as is usual, you used that space as a heap and there are objects allocated and not free'd at the end of the heap space.
Sure. Because C "arrays" in general are not arrays, but only allocating space and pointing at it.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

struct Foo {
    int x;
    char y[100];
    int z;
};

struct Bar {
    int x;
    char *y;
    int z;
};

int main(void) {
    printf("Foo: x:%zu z:%zu\n",
           offsetof(struct Foo, x), offsetof(struct Foo, z));
    printf("Bar: x:%zu z:%zu\n",
           offsetof(struct Bar, x), offsetof(struct Bar, z));
    return EXIT_SUCCESS;
}
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int array[10];
    printf("%s", _Generic((&array),
        int *: "int*", int (*)[10]: "int[10]"));
    return EXIT_SUCCESS;
}
void bar(int **ptr);

void foo(void) {
    int array[10];
    bar(&array); /* Constraint violation: &array has type int (*)[10], not int**.
                    There would be no error if `array` were a pointer. */
}
In C++, in which arrays work in a similar manner, it is much easier not to notice, as you can have C++ references to arrays as a type.
Quote
When I say something like "you have to figure out the size to allocate for the stack" I'm of course talking about the maximum possible size the stack can grow to. The stack starts off empty.
When we have virtual memory, and this indeed requires that to be useful, that is not a useful simplification.
We do need to consider the stack a mapping to physical memory here. The current size of this mapping is the current size of the stack. Assuming the stack grows downwards, the stack pointer (roughly, *) points to the smallest address currently in use in the stack. Above that the stack contains data needed by code in the current call chain; below that is unused stack space.
Right now, on AMD64 in Linux, the default stack size is 8 MiB, and typical virtual host environments start at 2 GiB of RAM.
If you use a kernel which initializes and/or fully backs each stack with RAM, you are limited to less than 256 processes and threads, because their stacks alone consume all available RAM.
Quote
With that said, there is a common conception that heap allocation would be more "reliable" than allocating stuff on the stack. [...] We can think of almost-fully stack-based languages that would make this irrelevant.
There is a large and important class of programs which absolutely can not be fully stack based.
This is programs based on an event loop, with an amount of state retained between iterations that is not known in advance.
This includes any interactive program (whether character based or GUI) where the user is creating or maintaining a document or in-memory database of some kind.
Quote
Sure. Because C "arrays" in general are not arrays, but only allocating space and pointing at it.
They are not. Arrays are proper C objects and they are accessible as arrays, not pointers. Don't confuse the object type with the implicit conversions it may undergo.
- The sizeof operator gives you the size of the allocated memory;