Author Topic: Cortex-M3+ GCC unaligned access quirks (Read 6465 times)

abyrvalg · « **on:** September 22, 2023, 11:24:40 am »

Perhaps this could save someone from the same pitfall:

As we know, Cortex-M3 and above can handle unaligned memory accesses. Let's try the following code:

Code: [Select]

//compile with -mcpu=cortex-m3 -O2/-O3/-Os
extern void MyFn(int a, int b);

//Disable for now
//#define WITH_PACK

#ifdef WITH_PACK
    #define PACKED __attribute__((packed))
#else
    #define PACKED
#endif

typedef struct
{
    int a, b;
} PACKED MyStruct;

void DoTest(unsigned char *src)
{
    MyStruct *ptr = (MyStruct *)src;
    MyFn(ptr->a, 1); //OK
    MyFn(4, ptr->b); //OK
    MyFn(ptr->a, ptr->b); //Surprise!
}

- everything looks fine, the core should handle int field accesses even if the src pointer is unaligned, but in reality it would UsageFault on the 3rd MyFn() call

The reason is GCC combining sequential memory accesses into a single LDRD or LDM instruction and those, unlike single LDR, can't handle unaligned accesses.
The solution is to make MyStruct packed, even if it looks perfectly packed already. This doesn't split all accesses into individual bytes, GCC perfectly understands M3's capabilities and just issues two separate LDR instructions.

https://godbolt.org/z/TTKzq5dc1

Upd: -Wcast-align option detects such cases.
But the solution with packing the struct type feels wrong to me (1. it can cause unwanted packing if there are not so nice field types, 2. it breaks LDR->LDM optimizations everywhere, including places where the alignment is correct by design). The problem is not in a struct type itself, but in a more local bad pointer type. Does anyone know a better solution?

brucehoult · « **Reply #1 on:** September 22, 2023, 07:41:51 pm »

Here's the correct approved answer, in which a modern compiler will automatically use unaligned loads IFF the CPU supports them, and otherwise (e.g. on Cortex M0) will actually call memcpy or use aligned loads and shifts and stuff.

https://godbolt.org/z/v8dzv7qfj

Code: [Select]

void DoTest(unsigned char *src)
{
    MyStruct s;
    memcpy(&s, src, sizeof(s));
    MyFn(s.a, s.b); // No surprise!
}

Code: [Select]

DoTest:
        mov     r2, r0
        sub     sp, sp, #8
        mov     r3, sp
        ldr     r0, [r0]  @ unaligned
        ldr     r1, [r2, #4]      @ unaligned
        stmia   r3!, {r0, r1}
        add     sp, sp, #8
        b       MyFn

Note how the compiler explicitly marks that those ldr are unaligned.

I'm not sure why gcc is failing to optimise away the stack frame and stmia. Clang does much better, though the stupid thing should just reorder the loads and avoid the extra mov.

https://godbolt.org/z/ar34s9j8a

Code: [Select]

DoTest:
        ldr     r2, [r0]
        ldr     r1, [r0, #4]
        mov     r0, r2
        b       MyFn

Has godbolt.org been REALLY sluggish for anyone else recently? Do we need to donate to Matt for more servers?

nctnico · « **Reply #2 on:** September 22, 2023, 07:57:45 pm »

Quote from: abyrvalg on September 22, 2023, 11:24:40 am

Does anyone know a better solution?

Yes. Never, ever cast structs onto pointers which are of a different type (especially void or byte-sized pointers). That is bad coding practice and begging for a world of pain on your bare knees. You have a compiler that knows how to align variable types while adhering to the memory access abilities of the platform. Don't throw that out of the window.

gf · « **Reply #3 on:** September 22, 2023, 08:47:27 pm »

Quote from: nctnico on September 22, 2023, 07:57:45 pm

Never, ever cast structs onto pointers which are of a different type (especially void or byte-sized pointers). That is bad coding practice and begging for a world of pain on your bare knees.

How else would you do "type erasure", for instance when you call qsort() in order to sort an array of structs?
Or how can you memcpy() a struct w/o casting the struct pointer to void* ?

SiliconWizard · « **Reply #4 on:** September 22, 2023, 08:47:53 pm »

Quote from: brucehoult on September 22, 2023, 07:41:51 pm

Has godbolt.org been REALLY sluggish for anyone else recently?

Yeah. Some days it's very slow. Right now for me, it is relatively snappy.
As a quick note, I think the fact that it triggers a recompile at each text change in the editor is a very bad idea. There should be a "compile" button instead, to avoid unnecessarily using server resources.

brucehoult · « **Reply #5 on:** September 22, 2023, 09:05:59 pm »

Yes, I think the live recompile is past its use-by date and must increase server load by a factor of 10 or more.

nctnico · « **Reply #6 on:** September 22, 2023, 10:09:55 pm »

Quote from: gf on September 22, 2023, 08:47:27 pm

Quote from: nctnico on September 22, 2023, 07:57:45 pm
Never, ever cast structs onto pointers which are of a different type (especially void or byte-sized pointers). That is bad coding practice and begging for a world of pain on your bare knees.

How else would you do "type erasure", for instance when you call qsort() in order to sort an array of structs?
Or how can you memcpy() a struct w/o casting the struct pointer to void* ?

When you memcpy, you'd copy one struct to another struct of the same type, so the compiler deals with alignment. Same for qsort. Neither deals with the actual mapping of struct members to memory, so alignment isn't an issue. But it is better to just tell the compiler struct A = struct B and let the compiler decide how to copy the contents. When it comes to qsort: doing a bubble sort is quite easy in plain C, so I'd write the loop myself and keep using the right types. C invites people into bad coding practices at many points. The mere fact that something is possible doesn't mean it is a good idea.

But the case the OP shows, is a typical example where things can go horribly wrong by taking a random pointer and mapping it onto a struct. Needing the 'packed' attribute is a red herring.

brucehoult · « **Reply #7 on:** September 23, 2023, 02:49:01 am »

Quote from: gf on September 22, 2023, 08:47:27 pm

Quote from: nctnico on September 22, 2023, 07:57:45 pm
Never, ever cast structs onto pointers which are of a different type (especially void or byte-sized pointers). That is bad coding practice and begging for a world of pain on your bare knees.

How else would you do "type erasure", for instance when you call qsort() in order to sort an array of structs?
Or how can you memcpy() a struct w/o casting the struct pointer to void* ?

In the qsort compare function you know where the pointer came from so you know it is properly aligned if the pointer you passed in to qsort in the first place was, even though the type information on the pointer isn't sufficient to indicate that. Usually the call to qsort and the compare function it uses are located very close to each other in the source code, so you can see what is going on. In a slightly better C you'd be able to declare the compare function inline in the actual call to qsort, but oh well.

Code: [Select]

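// Hypothetical "slightly better C" with inline lambdas; not valid standard C.
void sortByBs(MyStruct *p, int n){
    // qsort takes (base, element count, element size, comparator)
    qsort(p, n, sizeof(*p),
        [](const void *x, const void *y) -> int {
            int bx = (*(const MyStruct*)x).b, by = (*(const MyStruct*)y).b;
            return (bx > by) - (bx < by); // negative, zero or positive
        }
    );
}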

And, yes, once you go this far it would be better to just take p as a capture in the lambda and have qsort's arguments be only n and the lambda (and start index for generality) but that's a whole different library, not just a more convenient syntax.
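For comparison, here is how the same sort might look in today's standard C, where the comparator has to be a named function. This is only a sketch: it assumes the two-int MyStruct from the opening post, and the names compare_by_b and sort_by_b are made up for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int a, b;
} MyStruct;

/* qsort comparator: returns negative, zero or positive, ordering by field b. */
static int compare_by_b(const void *x, const void *y)
{
    /* The void pointers originate from a properly aligned MyStruct array,
       so casting them back to MyStruct pointers is safe here. */
    const MyStruct *lhs = (const MyStruct *)x;
    const MyStruct *rhs = (const MyStruct *)y;
    return (lhs->b > rhs->b) - (lhs->b < rhs->b);
}

/* Sort n elements of p in ascending order of b. */
void sort_by_b(MyStruct *p, size_t n)
{
    qsort(p, n, sizeof(*p), compare_by_b);
}
```

The comparison is written as a difference of two comparisons rather than a subtraction of the fields, so it cannot overflow for extreme int values.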

Perkele · « **Reply #8 on:** September 23, 2023, 12:23:48 pm »

Now try to compile it with -O0 and look at the generated output.
With "-Os" the compiler is doing exactly what you told it to do: it's trying to reduce the size of the compiled program.

Latest Keil MDK (Clang) generates almost exactly the same code with "-Oz".

Any decent static analysis tool should at least give you a warning for doing this kind of cast.

abyrvalg · « **Reply #9 on:** September 23, 2023, 04:17:50 pm »

I'm not blaming the initial (w/o PACKED) behavior, it matches the standard (wide type pointers are assumed to be aligned by default), sequential access optimizations are great, Cortex-M has LDM/STM for even longer sequences and I'm glad to see them being used.
The question is how to tell the compiler "this pointer is unaligned, do the best you can on this arch" with a minimal overhead. Although the memcpy approach proposed above solves the alignment problem, it is a trick relying on implicit things, possibly leading to serious overhead (imagine a half-KB sized struct from which we are pulling 3 last ints instead of my shorter example).

For people who get triggered by struct casts, here is an array version (imagine it came via UART as a part of some packet) producing an unaligned LDRD: https://godbolt.org/z/P6oETcMc7

A side note: it is quite, ehm, interesting to see such devotion to portability for code that is most probably (taking into account the forum section where we are) receiving the data in question from some completely non-portable peripheral. UARTs and DMA controllers are non-portable but we continue to use them somehow, writing nice (or not so nice) drivers when we want to abstract them. Why not just treat the core's "data aligner" as yet another hw feature requiring some "driver"?
Personally I'd be totally ok with some #ifdef COMPILER_A #define UNALIGNED this_way #elif ... #else #error "Sorry, I haven't thought of your CellBE, QDSP6 or whatever" #endif (otherwise I'd be programming in Java or MicroPython).

nctnico · « **Reply #10 on:** September 23, 2023, 05:50:38 pm »

Think about testing code on a PC and how much certainty that testing gives you that your code will work as intended. The less code is platform dependent, the more certainty you have it will work on your platform because you have greater test coverage. A typical embedded platform is much harder to debug software on compared to a PC. So having ifdefs for platform-dependent stuff is only a last resort. If you need some fields from a 3kB struct, then use accessor functions that take a byte pointer + offset and shuffle byte-per-byte into the type you want to retrieve from the struct. That will never fail and thus gives you certainty that the code will always work as intended regardless of the platform. Very handy if you work on protocol stacks where a myriad of other things can go wrong.
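Such an accessor could look roughly like the sketch below. The name read_le32 is invented here, and the little-endian byte order is an assumption; the thread does not say which order the data arrives in.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value starting at buf[offset], one byte at a
   time. Only byte accesses are used, so any alignment is fine and the result
   does not depend on the byte order of the CPU running the code. */
uint32_t read_le32(const uint8_t *buf, size_t offset)
{
    const uint8_t *p = buf + offset;
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Because nothing in it is platform specific, the same function can be unit-tested on a PC and then used unchanged on the target.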

SiliconWizard · « **Reply #11 on:** September 23, 2023, 07:59:16 pm »

All this would of course depend on the project, complexity and performance requirements.

Personally, in most cases if I use packed structs (for a good reason), I'll arrange to make all fields naturally aligned. In particular for embedded dev, I almost never allow unaligned accesses, whether they are supported by the target or not.
If they can't be aligned, then I'll implement accessors / or a transformation of the data of some kind rather than direct access.

As always when it comes to non-portable code, if you absolutely have to write some, separate it as much as you can from the portable code, keep it at a minimum and document what is not portable and would need porting effort.

gf · « **Reply #12 on:** September 23, 2023, 08:22:45 pm »

Quote from: abyrvalg on September 23, 2023, 04:17:50 pm

The question is how to tell the compiler "this pointer is unaligned, do the best you can on this arch" with a minimal overhead.

You cannot tell the compiler. A valid pointer value must always have the same alignment as the type to which the pointer points, otherwise dereferencing the pointer is UB. Only a pointer variable pointing to a type with alignment=1 (i.e. pointing to void, char, signed char, unsigned char, or to a packed struct) can contain an unaligned value. Consequently, the only thing you can do is choose a packed struct as the pointer target. Note however, that a regular struct and a packed struct with the same members are considered two different types, which do not necessarily have the same size, and which are not assignment compatible with each other.

abyrvalg · « **Reply #13 on:** September 23, 2023, 10:39:26 pm »

More fun, GCC's __attribute__((packed)) seems to be inapplicable to a non-struct pointer type at all: https://godbolt.org/z/MzY6o7vbT
all combinations resulted in ldm r0, {r0, r1, r2}. I should just switch back to Keil, which allows things like uint32_t __packed *ptr

nctnico · « **Reply #14 on:** September 23, 2023, 11:10:40 pm »

Quote from: abyrvalg on September 23, 2023, 10:39:26 pm

More fun, GCC's __attribute__((packed)) seems to be inapplicable to a non-struct pointer type at all: https://godbolt.org/z/MzY6o7vbT
all combinations resulted in ldm r0, {r0, r1, r2}. I should just switch back to Keil, which allows things like uint32_t __packed *ptr

Check the output window. It says the packed attribute is ignored. IIRC the packed attribute can only be assigned to variables, not to types.

SiliconWizard · « **Reply #15 on:** September 23, 2023, 11:31:51 pm »

Quote from: nctnico on September 23, 2023, 11:10:40 pm

Quote from: abyrvalg on September 23, 2023, 10:39:26 pm
More fun, GCC's __attribute__((packed)) seems to be inapplicable to a non-struct pointer type at all: https://godbolt.org/z/MzY6o7vbT
all combinations resulted in ldm r0, {r0, r1, r2}. I should just switch back to Keil, which allows things like uint32_t __packed *ptr
IIRC the packed attribute can only be assigned to variables, not to types.

No, it can be. But to struct types.
A "packed" uint32_t has no meaning that I know of.
If the intent is to force the compiler to do what it takes to handle a potentially unaligned access, I don't think you can outside the context of a struct.

abyrvalg · « **Reply #16 on:** September 24, 2023, 12:02:25 am »

Quote from: SiliconWizard on September 23, 2023, 11:31:51 pm

If the intent is to force the compiler to do what it takes to handle a potentially unaligned access, I don't think you can outside the context of a struct.

Yes, looks like that’s the only possible packed use with GCC. Found an example of wrapping a single uint32_t into a struct type just to be able to add that attribute. In contrast, both IAR and ARMCC allow adding a __packed prefix to almost any data pointer type, marking the pointer as "possibly unaligned".
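The wrapper trick mentioned above might be sketched like this for GCC and Clang. The names unaligned_u32 and load_u32_unaligned are made up for illustration, and the may_alias attribute is an addition of this sketch (not something from the thread) so that reading a byte buffer through the wrapper does not run into strict aliasing.

```c
#include <stdint.h>

/* GCC/Clang extension: a packed struct has alignment 1, so a pointer to it
   may hold any address. may_alias additionally lets it overlay other types. */
typedef struct __attribute__((packed, may_alias)) {
    uint32_t value;
} unaligned_u32;

_Static_assert(_Alignof(unaligned_u32) == 1, "wrapper must have alignment 1");

/* Load a uint32_t (in host byte order) from a possibly unaligned address.
   The compiler picks whatever the target allows: a plain LDR where unaligned
   word loads are supported, byte loads and shifts where they are not. */
static inline uint32_t load_u32_unaligned(const void *p)
{
    return ((const unaligned_u32 *)p)->value;
}
```

Since the attribute sits on this tiny wrapper type and not on the application's structs, properly aligned accesses elsewhere are presumably left free to use LDRD/LDM.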

brucehoult · « **Reply #17 on:** September 24, 2023, 12:11:45 am »

Quote from: abyrvalg on September 23, 2023, 04:17:50 pm

Although the memcpy approach proposed above solves the alignment problem, it is a trick relying on implicit things, possibly leading to serious overhead (imagine a half-KB sized struct from which we are pulling 3 last ints instead of my shorter example).

You didn't try it, did you?

https://godbolt.org/z/sxKbMa1TE

Code: [Select]

typedef struct
{
    int junk[1000];
    int a, b;
} MyStruct;

void DoTest(unsigned char *src)
{
    MyStruct s;
    memcpy(&s, src, sizeof(s));
    MyFn(s.a, s.b); //Surprise!
}

Gives ...

Code: [Select]

DoTest:
        ldr.w   r2, [r0, #4000]
        ldr.w   r1, [r0, #4004]
        mov     r0, r2
        b       MyFn

And, in this case, if you switch to Cortex M0 it doesn't do the whole memcpy, but does individual byte loads and shifts and adds to load the 8 bytes of interest into register variables. It needs 23 instructions (46 bytes) but that's only going to be half a µs at typical M0 clock speeds.

For sure a lot faster than actually doing a memcpy on a 4K object then throwing most of it away.

If you don't want to trust the compiler *that* much, you can still simply do:

Code: [Select]

void DoTest(unsigned char *src)
{
    int a, b;
    MyStruct *s = (MyStruct*)src; 
    memcpy(&a, &s->a, sizeof(a));
    memcpy(&b, &s->b, sizeof(b));
    MyFn(a, b);
}

... which gives ...

Code: [Select]

DoTest:
        ldr     r1, [r0, #4004]   @ unaligned
        ldr     r0, [r0, #4000]   @ unaligned
        b       MyFn

I'm not seeing the problem here, at least if you're using a standard, sensible, free, easily available compiler such as gcc or clang.

westfw · « **Reply #18 on:** September 24, 2023, 05:23:54 am »

Quote

the memcpy approach

Having to use memcpy and rely on compiler optimizations to not actually do the memcpy, in order to do "type punning", really annoys me.
(frankly, having the compiler replace ANY function call as part of optimization because it "knows" what that function does is annoying. C used to be so "pure" in that sense - a whole language definition with no pre-written functions! Sigh. As someone who spent a long time programming in assembly language, I tend to think in terms of generic pointers - a memory address that can point to anything, and be treated appropriately. And I don't understand why the language people seem so resistant to providing that service. Clearly the memcpy-that-isn't shows that the compiler can defeat optimization issues when it wants to - why can't I have a type-punning mechanism that does that in a more intuitively efficient manner?)

Quote

A valid pointer value must always have the same alignment as the type to which the pointer points

Surely not? A pointer to a 96byte struct doesn't have to have 96 (or 128) byte alignment...

Siwastaja · « **Reply #19 on:** September 24, 2023, 05:30:39 am »

Yeah, casting struct over a char* buffer is problematic. Choose one of the following strategies:

(1) just don't do it. Use memcpy as nctnico suggests,
(2) only use packed structs when overlaying a struct over a buffer. Compilers then make no assumptions about alignment and create code which always works, but with a performance penalty,
(3) know all the features and assumptions built in the CPU and compiler. Code won't be very portable and requires care to write, but if the programmer knows what they are doing, performance and code size could be better than in (2).

Good opening post because it demonstrates, in one go, how (3) fails (because even though CPU supposedly "supports" unaligned access, clearly this support has non-obvious limitations) and (2) succeeds (as it always does for everyone except nctnico). (1) was not tried, but it would obviously work.

gf · « **Reply #20 on:** September 24, 2023, 07:11:07 am »

Quote from: westfw on September 24, 2023, 05:23:54 am

Quote
A valid pointer value must always have the same alignment as the type to which the pointer points
Surely not? A pointer to a 96byte struct doesn't have to have 96 (or 128) byte alignment...

Alignment of a struct is not a matter of its size, but is the maximum of the alignments of its components. The size of a struct is padded to an integral multiple of its alignment in order that each element is still properly aligned in an array of structs. The declaration

Code: [Select]

typedef struct { int i; double d; } S1;
typedef struct { ...whatsoever... } __attribute__((packed)) S2;
S1* ptr1;

has the following implications

Code: [Select]

_Alignof(S2) == 1    // packed

Code: [Select]

_Alignof(S1) == max(_Alignof(int),_Alignof(double))

Code: [Select]

sizeof(S1) % _Alignof(S1) == 0

and the following condition must be met in order that dereferencing ptr1 is not UB

Code: [Select]

ptr1 != NULL && (uintptr_t)ptr1 % _Alignof(S1) == 0
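The conditions listed above can be written down as compile-time and run-time checks. This is a sketch for GCC/Clang in C11 mode; S2 is given the same members as S1 purely for illustration, and s1_pointer_is_aligned is a made-up helper name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int i; double d; } S1;
typedef struct { int i; double d; } __attribute__((packed)) S2; /* GCC/Clang extension */

#define MAX_OF(a, b) ((a) > (b) ? (a) : (b))

/* The three implications, verified at compile time. */
_Static_assert(_Alignof(S2) == 1, "a packed struct has alignment 1");
_Static_assert(_Alignof(S1) == MAX_OF(_Alignof(int), _Alignof(double)),
               "struct alignment is the maximum of its members' alignments");
_Static_assert(sizeof(S1) % _Alignof(S1) == 0,
               "struct size is a multiple of its alignment");

/* Run-time form of the alignment condition for an S1 pointer. Passing this
   check is necessary for a valid dereference, but it does not prove that a
   live S1 object actually exists at that address. */
bool s1_pointer_is_aligned(const void *p)
{
    return p != NULL && (uintptr_t)p % _Alignof(S1) == 0;
}
```

A check like this can serve as a debug assertion at the point where a byte buffer is about to be reinterpreted as a struct.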

hans · « **Reply #21 on:** September 24, 2023, 09:19:34 am »

I'm not saying you should use the following (because it's still encouraging pointer casts), but it does mitigate the underlying issue on GCC.

GCC assumes that a pointer to MyStruct refers to storage that is aligned the way a statically allocated object would be. If it's stored in a byte buffer, that assumption is invalid. Copying to a local variable/struct fixes this by relying on memcpy() to deal with the misalignment.

I've found GCC has "__builtin_assume_aligned", which can be used to state a minimum alignment (which doesn't help, as we need to annotate a maximum alignment of up to 4 bytes, as that's the maximum width that can be loaded with unaligned pointers). However, there is also a variant where you can say which byte alignment the struct has on a certain modulo of bytes.
So:

Code: [Select]

void DoTest(unsigned char *src)
{
    MyStruct* pS = (MyStruct*) (src);
    pS = __builtin_assume_aligned(pS, 8, 1);
    MyFn(pS->a, pS->b); //No Surprise!
}

"""Fixes""" this issue by telling GCC that for sure the pointer is aligned to 1 byte offset out of 8.

However, I'm not sure why Clang ignores this annotation and will still go for the ldrd or ldm instruction even though this code is explicitly saying that will cause problems. So it does not seem to be a very stable 'fix', and honestly, I personally wouldn't want to go around my code and telling the compiler which pointers might be crooked.

So I agree with nctnico and bruce. Just use memcpy to make sure the object allocation is sound.
Unfortunately, GCC seems to blindly copy the whole object with memcpy.

Interestingly, when I use in GCC 13.x:

Code: [Select]

typedef struct
{
    int junk[1000];
    int a, b;
} /* __attribute__((packed)) */ MyStruct;

void DoTest(unsigned char *src)
{
    MyStruct s = *(MyStruct*)src;
    MyFn(s.a, s.b); //Surprise!
}

I get:

Code: [Select]

DoTest:
        ldr     r1, [r0, #4004]
        ldr     r0, [r0, #4000]
        b       MyFn

The same happens with at least one other 'int' field before a, b.
But with every preceding field removed, it uses LDM again.
Same behaviour on clang.

gf · « **Reply #22 on:** September 24, 2023, 01:21:26 pm »

Quote from: hans on September 24, 2023, 09:19:34 am

I've found GCC has the "__builtin_assume_aligned", which can be used as a minimum alignment (which doesn't help, as we need to annotate a maximum alignment of up to 4 bytes as thats the maximum width that can be loaded with unaligned pointers). However, there is also a variant where you can say which byte alignment the struct has on a certain modulo of bytes.

My understanding of the doc is, that __builtin_assume_aligned(ptr,a,b) lets GCC assume ((uintptr_t)(ptr)-b)%a==0.
So I wonder why 4,1 and 8,1 work, and 2,1 does not.
Shouldn't the latter assume (address-1)%2==0, i.e. odd addresses, which are not suitable for ldrd or ldm either?
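To make that reading of the documentation concrete, the promise behind __builtin_assume_aligned(ptr, a, b) can be expressed as a plain predicate and tried on a few addresses. This is only an illustration of the arithmetic; alignment_claim_holds is a made-up name and says nothing about how GCC uses the information internally.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the promise made by __builtin_assume_aligned(ptr, align, misalign)
   really holds for the given address, i.e. (addr - misalign) % align == 0.
   align is expected to be a power of two, as it is for the builtin. */
bool alignment_claim_holds(uintptr_t addr, uintptr_t align, uintptr_t misalign)
{
    return (addr - misalign) % align == 0;
}
```

For the address 0x1001 the claims (2,1), (4,1) and (8,1) all hold, while for 0x1003 only (2,1) does. So (2,1) promises nothing beyond "the address is odd", and (8,1) is only a true statement for buffers that really start one byte past an 8-byte boundary.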

Seems that GCC's __builtin_assume_aligned() attaches the given alignment to a value (not to a type or variable), and the property is then propagated through the SSA representation of the function to derived values -- as far as GCC can trace the data flow statically. Since data flow traceability is limited, you can only rely on correct propagation to a limited extent. Check for instance the difference between lines 20 and 21. Both are the same expression, but GCC cannot determine whether p2[idx] still has the same value in line 21 as it had in line 20. Therefore the alignment tag is lost in line 21. Or when MyFn2() is called, the tag is (silently) discarded, too.

https://godbolt.org/z/ErqdYnPdx

peter-h · « **Reply #23 on:** September 24, 2023, 05:02:16 pm »

You can also configure whether a F4 faults an unaligned access or not, and the compiler won't know what you have selected.
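For reference, the setting in question is presumably the UNALIGN_TRP bit in the core's Configuration and Control Register. The sketch below is written from the ARMv7-M architecture documentation, not from anything in this thread: the register address 0xE000ED14 and bit position 3 should be checked against the manual for the actual part, and set_unaligned_trap is a made-up helper name.

```c
#include <stdint.h>

/* ARMv7-M Configuration and Control Register (SCB->CCR in CMSIS terms).
   Address and bit position are assumptions taken from the architecture
   manual; verify them for your device. */
#define SCB_CCR_ADDRESS      0xE000ED14u
#define CCR_UNALIGN_TRP_BIT  (1u << 3)

/* Set or clear UNALIGN_TRP in the register that ccr points at. Taking the
   register as a parameter keeps the helper testable off-target. */
void set_unaligned_trap(volatile uint32_t *ccr, int enable)
{
    if (enable) {
        *ccr |= CCR_UNALIGN_TRP_BIT;   /* every unaligned word/halfword access faults */
    } else {
        *ccr &= ~CCR_UNALIGN_TRP_BIT;  /* single LDR/STR accesses may be unaligned */
    }
}
```

On the target this would be called as set_unaligned_trap((volatile uint32_t *)SCB_CCR_ADDRESS, 1). Enabling the trap during development is one way to catch unaligned accesses that would otherwise work silently.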

This thread is interesting but confirms that by totally avoiding pointers (except when using &variable when passing params to functions, etc.) I am saving myself a load of hassle.

Can you imagine somebody picking up your code a few years later? I sometimes include EEVBLOG thread URLs in my comments, and much more often in project documentation, but still...

DiTBho · « **Reply #24 on:** September 25, 2023, 04:29:58 am »

Quote from: gf on September 24, 2023, 07:11:07 am

Alignment of a struct is not a matter of its size, but is the maximum of the alignments of its components. The size of a struct is padded to an integral multiple of its alignment in order that each element is still properly aligned in an array of structs. The declaration

Code: [Select]

typedef struct { int i; double d; } S1;
typedef struct { ...whatsoever... } __attribute__((packed)) S2;
S1* ptr1;

has the following implications

Code: [Select]

_Alignof(S2) == 1    // packed

Code: [Select]

_Alignof(S1) == max(_Alignof(int),_Alignof(double))

Code: [Select]

sizeof(S1) % _Alignof(S1) == 0

and the following condition must be met in order that dereferencing ptr1 is not UB

Code: [Select]

ptr1 != NULL && (uintptr_t)ptr1 % _Alignof(S1) == 0

this is also what my-c does, and that condition must be met, otherwise (for UB avoidance) it refuses to compile.

DiTBho · « **Reply #25 on:** September 25, 2023, 04:38:52 am »

Quote from: westfw on September 24, 2023, 05:23:54 am

frankly, having the compiler replace ANY function call as part of optimization because it "knows" what that function does is annoying. C used to be so "pure" in that sense - a whole language definition with no pre-written functions! Sigh. As someone who spent a long time programming in assembly language, I tend to think in terms of generic pointers - a memory address that can point to anything, and be treated appropriately. And I don't understand why the language people seem so resistant to providing that service. Clearly the memcpy-that-isn't shows that the compiler can defeat optimization issues when it wants to - why can't I have a type-punning mechanism that does that in a more intuitively efficient manner?

that's one reason i wrote my own c-like compiler: tired of that stuff, tired of asking for it not to be done and being ignored.

gf · « **Reply #26 on:** September 25, 2023, 08:17:26 am »

Quote from: DiTBho on September 25, 2023, 04:29:58 am

Quote from: gf on September 24, 2023, 07:11:07 am
and the following condition must be met in order that dereferencing ptr1 is not UB
Code: [Select]
ptr1 != NULL && (uintptr_t)ptr1 % _Alignof(S1) == 0

this is also what my-c does, and that condition must be met, otherwise (for UB avoidance) it refuses to compile.

The problem is, as long as reinterpret casts are allowed, you can only check that at compile time if you can track the data flow statically to the origin of the pointer value. In many (most?) cases you cannot.

Jeroen3 · « **Reply #27 on:** September 25, 2023, 08:37:45 am »

This topic is a nice summary of why C is both awesome and awful at the same time.

brucehoult · « **Reply #28 on:** September 25, 2023, 08:48:12 am »

Quote from: Jeroen3 on September 25, 2023, 08:37:45 am

This topic is a nice summary of why C is both awesome and awful at the same time.

C is not perfect by any means, but it has evolved and been battle-hardened for 50 years to be the best option for software that should be at the same time portable to everything and also high performance.

DiTBho · « **Reply #29 on:** September 25, 2023, 01:45:27 pm »

Quote from: gf on September 25, 2023, 08:17:26 am

Quote from: DiTBho on September 25, 2023, 04:29:58 am
Quote from: gf on September 24, 2023, 07:11:07 am
and the following condition must be met in order that dereferencing ptr1 is not UB
Code: [Select]
ptr1 != NULL && (uintptr_t)ptr1 % _Alignof(S1) == 0

this is also what my-c does, and that condition must be met, otherwise (for UB avoidance) it refuses to compile.

The problem is, as long as reinterpret casts are allowed, you can only check that at compile time if you can track the data flow statically to the origin of the pointer value. In many (most?) cases you cannot.

That's why casting is not allowed in my-c and you are always forced to define a datatype with its explicit alignment.

It may seem limiting, but in the long run it helps to make fewer messes.

mikerj · « **Reply #30 on:** September 25, 2023, 04:10:30 pm »

Quote from: peter-h on September 24, 2023, 05:02:16 pm

by totally avoiding pointers (except when using the &variable when passing parms to functions, etc) I am saving myself a load of hassle

Totally avoid using any pointers except for when you use a pointer, got it

Tbh you'd have to jump through a lot of hoops to not use pointers on any reasonably complex program, and doing so is likely to make it more complex and very probably slower.

SiliconWizard · « **Reply #31 on:** September 25, 2023, 09:34:30 pm »

Quote from: Jeroen3 on September 25, 2023, 08:37:45 am

This topic is a nice summary of why C is both awesome and awful at the same time.

Given that a significant chunk of this thread is about non-standard C (like using compiler-specific attributes to twist UB), everything goes.
So maybe one could change your statement to "the reason why C is both awesome and awful at the same time is that compilers add various extensions that make it an easily moving target unless one strictly sticks to the standard".
(In which case, people realize that C is more portable, but a lot less flexible than what they thought.)

As for me, I think the reason C is "awesome" was summed up by brucehoult above, and the reason it is "awful" is mainly that it's one of the least well taught programming languages of all times.

Siwastaja · « **Reply #32 on:** September 26, 2023, 07:55:17 am »

Quote from: mikerj on September 25, 2023, 04:10:30 pm

Totally avoid using any pointers except for when you use a pointer, got it

This is a repeating discussion with peter-h.

"I don't use any pointers, at all, anywhere"
"How can you do anything at all then? Your code must be complete mess."
"Oh, I just use pointers to do this and that."

I agree with his idea of avoiding excess or "too clever" pointer use when not truly necessary, but I don't understand why it has to be presented this way. C programming without pointers at all is, if not impossible, painful enough. A pointer is a truly useful feature in computers and in heavy use starting from assembly, and nearly all programming languages have pointers, or "pointers" under a different name (possibly with more advanced features providing safety, but still fundamentally pointers).

Siwastaja · « **Reply #33 on:** September 26, 2023, 08:04:52 am »

Quote from: SiliconWizard on September 25, 2023, 09:34:30 pm

and the reason it is "awful" is mainly that it's one of the least well taught programming languages of all times.

Bingo. I went to uni in 2005 and programming courses were basically this:

"Language does not matter at all. We are not teaching languages but concepts. We could use any synthetic language."
what utter bullshit! Of course it pays off to teach languages. We also had natural language (English, Swedish) courses. Math courses taught the math notation, not just "concepts", or not a synthetic separate "for teaching" math notation.

"C is obsolete, not worth teaching"
still bullshit, even more bullshit in 2005. Think about GNU and the linux kernel, what a huge ecosystem, mostly in C, not to mention embedded.

"C is difficult and error-prone"
Maybe this could be addressed by... drumroll... teaching? >90% of the "error-proneness" of C comes from following some weird practices copypasted from example code written in the 1980's to early 1990's (e.g., random and unnecessary use of memcpy/memset, unnecessary type punning via pointers), which can be easily replaced with more robust and modern practices (e.g., use normal high level language features present in C since C89, like the assignment operator).

"C++ is much easier and so much better in design, so we spend month after month teaching the weird intricacies of C++ instead of the general language-agnostic concepts we advertised at the beginning"
what utter bullshit.

Specifically I'm intrigued by how teachers and other pedants give zero value to the fact that C is, in fact, a well standardized language, unlike some random hackjobs like Python where "anything goes" and every version is incompatible with the others.

But likely the biggest disgrace is that C ain't object oriented. That was a big no-no in academy starting from late 1990's.

westfw · « **Reply #34 on:** September 26, 2023, 05:18:05 pm »

Quote

Given that a significant chunk of this thread is about non-standard C (like using compiler-specific attributes to twist UB)

One (Another!) of my complaints is the apparent race to specify certain "used for a long time" coding patterns as "undefined behavior."
It's not like people weren't using unions to accomplish type punning for decades before it was "outlawed" for reasons I don't really understand.

nctnico · « **Reply #35 on:** September 26, 2023, 05:25:45 pm »

Quote from: westfw on September 26, 2023, 05:18:05 pm

Quote
Given that a significant chunk of this thread is about non-standard C (like using compiler-specific attributes to twist UB)

One (Another!) of my complaints is the apparent race to specify certain "used for a long time" coding patterns as "undefined behavior."
It's not like people weren't using unions to accomplish type punning for decades before it was "outlawed" for reasons I don't really understand.

Maybe it has to do with the fact that most platforms C was running on were either 8 bit or x86 in the old days. In both cases you won't have alignment issues. Just a speed penalty. Nowadays the 16, 32 and 64 bit platforms are everywhere and many are less forgiving when it comes to alignment. So I kind of understand why type-punning is considered to be bad.

But there are also stupid things that get broken. Yesterday I spent a couple of hours getting gcc 4.9.x to compile on a recent Debian install, and ended up needing to patch a few files to change some attributes.

gf · « **Reply #36 on:** September 26, 2023, 05:43:58 pm »

Quote from: westfw on September 26, 2023, 05:18:05 pm

One (Another!) of my complaints is the apparent race to specify certain "used for a long time" coding patterns as "undefined behavior."
It's not like people weren't using unions to accomplish type punning for decades before it was "outlawed" for reasons I don't really understand.

Some rules, e.g. strict aliasing, were introduced to help compilers perform certain optimizations that are not possible with the traditional semantics. The compiler writers are happy, and the programmers are rather confused.

ejeffrey · « **Reply #37 on:** September 26, 2023, 06:25:26 pm »

Quote from: westfw on September 26, 2023, 05:18:05 pm

Quote
Given that a significant chunk of this thread is about non-standard C (like using compiler-specific attributes to twist UB)

One (Another!) of my complaints is the apparent race to specify certain "used for a long time" coding patterns as "undefined behavior."
It's not like people weren't using unions to accomplish type punning for decades before it was "outlawed" for reasons I don't really understand.

Ok but that "race" was over two decades ago.

And it's an incredibly powerful optimization tool. Without it, the compiler basically has to assume almost any two pointers could point to the same memory and must reload from memory into registers after every store. It was probably the leading reason C was slower than Fortran which had much stricter aliasing rules.
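A minimal sketch of what the aliasing rules let the compiler assume. The function names are made up for illustration, and whether a given compiler actually drops the reload depends on its version and optimization level.

```c
/* Hypothetical illustration. int and float are incompatible types, so the
 * compiler may assume *i and *f never overlap. It is then allowed to return
 * the constant 1 without reloading *i after the store to *f. */
int store_int_then_float(int *i, float *f)
{
    *i = 1;
    *f = 2.0f;
    return *i;
}

/* Both pointers have the same type here, so they may alias. The compiler
 * has to reload *a after the store through b. */
int store_int_then_int(int *a, int *b)
{
    *a = 1;
    *b = 2;
    return *a;
}
```

Comparing the generated code for the two functions at -O2 is a quick way to see the difference: the second one needs a load after the second store, the first one usually does not.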

peter-h · « **Reply #38 on:** September 27, 2023, 04:37:46 pm »

Quote

"I don't use any pointers, at all, anywhere"
"How can you do anything at all then? Your code must be complete mess."
"Oh, I just use pointers to do this and that."

I never said the above and all here know exactly what I mean.

DiTBho · « **Reply #39 on:** September 28, 2023, 12:47:06 pm »

Quote from: nctnico on September 26, 2023, 05:25:45 pm

Maybe it has to do with the fact that most platforms C was running on were either 8 bit or x86 in the old days. In both cases you won't have alignment issues. Just a speed penalty.

Yes, in 1994, and specifically for my IDT MIPS board, I see from the notes on my paper manuals a few mentions of collaborations between Algorithmics and IDT to provide patches on Gcc to support their MIPS platforms.

It seems that many of these patches were for alignment and other things that were problematic on neither x86 nor m68k, so they were "extra work" outside the official gcc branch until they were finally merged in-tree.

Nominal Animal · « **Reply #40 on:** September 28, 2023, 01:44:24 pm »

Quote from: westfw on September 26, 2023, 05:18:05 pm

It's not like people weren't using unions to accomplish type punning for decades before it was "outlawed" for reasons I don't really understand.

Type punning via unions is in standard C, and works as you expect it to. It is type punning via pointers that is unreliable/"outlawed".

But yes. In mid-to-late nineties the C standard codified the common ground among C compilers, up to C99.
Then it shifted into lets-dictate-what-compiler-writers-need-to-implement mode, including all the Annex K nonsense by MS, and MS' continuing refusal to acknowledge/support C99.

It is only in the latest few years (let's say since 2015 or so) that there has been a shift back to the nineties track that did work, with C17 being just corrections and clarifications to the weird misstep that was C11, but C23 will show whether we're on a better track now, back to codifying what has already been found to work in practice, instead of trying to dictate how compilers and users should use C.
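To make the union-versus-pointer distinction concrete, here is a minimal sketch. The helper name float_bits is hypothetical, and the code assumes float is 32 bits wide.

```c
#include <stdint.h>

/* Hypothetical helper: reinterpret the bits of a float as a 32-bit unsigned
 * integer by writing one union member and reading another.
 * Assumes sizeof (float) == sizeof (uint32_t). */
static inline uint32_t float_bits(float value)
{
    union { float f; uint32_t u; } pun;
    pun.f = value;
    return pun.u;
}

/* The pointer-cast variant is the unreliable form. It accesses a float
 * object through an lvalue of an incompatible type, so it is shown only
 * as a comment:
 *
 *     uint32_t bad = *(uint32_t *)&value;
 */
```

On a target with IEEE-754 single-precision floats, float_bits(1.0f) yields 0x3F800000.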

abyrvalg · « **Reply #41 on:** September 28, 2023, 03:04:55 pm »

I found myself being afraid to increase the -std=xx number w/o strong reasons - no hope to gain something significant, but fear of breaking something silently. On the contrary, upgrading Python looks like fun

DiTBho · « **Reply #42 on:** September 28, 2023, 06:15:16 pm »

Quote from: abyrvalg on September 28, 2023, 03:04:55 pm

upgrading Python looks like fun

On Gentoo/Linux, the Portage system is based on Python, and upgrading Python, as well as Perl, can be catastrophic, meaning the entire system could stop working and explode into broken parts that can no longer compile anything, making your userland as useful as a solid brick.

abyrvalg · « **Reply #43 on:** September 28, 2023, 06:50:15 pm »

No no, not the system one of course

Nominal Animal · « **Reply #44 on:** September 29, 2023, 09:02:30 am »

Quote from: abyrvalg on September 28, 2023, 03:04:55 pm

I found myself being afraid to increase -std=xx number w/o strong reasons - no hope to gain something significant, but fear to break something silently.

Yup, it is one of those things that are, and have to be, designed in from the get go; and not meddled with without a good reason or analysis/verification afterwards.

Most useful things the C compilers can provide tend to be provided via extensions first, before being codified as part of standard C.

For example, most C compilers now support at least the two-parameter form of __builtin_assume_aligned(ptr_expr, minimum_alignment), which helps with the opposite situation compared to the one at hand. It seems that the inverse, __builtin_assume_unaligned(ptr_expr), would be really useful here. Unfortunately, none such exists, as far as I know.
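For reference, a minimal sketch of the existing builtin in its two-parameter form. It is a GCC and clang extension; the function name sum_aligned and the choice of 8-byte alignment are only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch for GCC and clang. __builtin_assume_aligned() returns its first
 * argument and promises the compiler that the address is a multiple of 8.
 * Passing a pointer that breaks the promise is undefined behaviour. */
int32_t sum_aligned(const int32_t *data, size_t count)
{
    const int32_t *p = __builtin_assume_aligned(data, 8);
    int32_t sum = 0;

    for (size_t i = 0; i < count; i++)
        sum += p[i];

    return sum;
}
```

The caller is responsible for the alignment, for example by declaring the array with _Alignas(8).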

SiliconWizard · « **Reply #45 on:** September 30, 2023, 01:58:37 am »

Quote from: abyrvalg on September 28, 2023, 03:04:55 pm

I found myself being afraid to increase -std=xx number w/o strong reasons - no hope to gain something significant, but fear to break something silently.

While the successive C standard revisions have a rather good level of backward compatibility, at least compared to many other languages, it makes no sense per se to compile a given code base with a more recent revision than what it was coded against. This is not how standards work. As a general rule, restrict the std revision of your compiler to the revision the code was meant to be compliant with, nothing else.

There is usually and by nature nothing to gain by forcing a compiler to compile code that was written for an older std revision as a newer std revision. That doesn't make sense.

OTOH, if you really write code compliant with a newer revision and use a useful feature that wasn't present in older revisions, then it can definitely be useful and should certainly be considered.
Just do not throw a random C std rev at your code and see what sticks.

Quote from: abyrvalg on September 28, 2023, 03:04:55 pm

On the contrary, upgrading Python looks like fun

Python is the last thing I would have cited as an example of painless transition for the same code base to a newer version of the language. Odd. But you didn't say painless, you said "fun". So, whatever's fun.

bson · « **Reply #46 on:** October 02, 2023, 07:43:45 pm »

This is actually a real problem, thanks for pointing it out. I would categorize it as a compiler bug or problem that could be solved with an alignment attribute.

memcpy/memmove is a total non-solution in many cases. For example if you receive a frame from an ethernet controller it's going to be 16-bit aligned because the frame header is 14 bytes. IP and TCP/UDP/ICMP headers then contain multiple 32-bit fields that are unaligned, but also smaller fields that might result in ldm/stm instructions. Needless to say, a memmove to align all ethernet traffic is a non-starter on such a small processor. A workaround can be to 16-bit align DMA addresses for transfers, but this may not always be possible. Some past MIPS32 SoC Ethernet controllers were notorious in this regard and could only do transfers to 256-byte aligned addresses, which misaligned protocol headers, resulting in millions of emulation alignment traps per second and abysmal network performance. (Disabling unaligned access emulation would cause the kernel to panic during boot when it tries to create the first interface.) Adding memmoves to the driver was the workaround for those. But that fail was hardware-induced... there's no reason a compiler should be causing the same.
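The offset arithmetic behind that can be sketched as follows. The struct and enum names are hypothetical, and the layouts are reduced to untagged Ethernet II plus an IPv4 header without options.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal header layouts, only to show the offset arithmetic.
 * Every field sits at a multiple of its own size, so the natural layout
 * already matches the wire format without a packed attribute. */
struct eth_header {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t ethertype;
};

struct ipv4_header {
    uint8_t  version_ihl;
    uint8_t  tos;
    uint16_t total_length;
    uint16_t identification;
    uint16_t flags_fragment;
    uint8_t  ttl;
    uint8_t  protocol;
    uint16_t checksum;
    uint32_t src_addr;
    uint32_t dst_addr;
};

/* Byte offsets of the 32-bit IPv4 addresses from the start of the frame. */
enum {
    IPV4_SRC_OFFSET = sizeof (struct eth_header) + offsetof(struct ipv4_header, src_addr),
    IPV4_DST_OFFSET = sizeof (struct eth_header) + offsetof(struct ipv4_header, dst_addr)
};
```

With a 14-byte Ethernet header the two addresses land at frame offsets 26 and 30. If the frame itself starts on a 4-byte boundary, both 32-bit fields are therefore only 2-byte aligned.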

I think this might actually be the cause of some mysterious bugs I've seen in the past, that just randomly disappeared when I've debugged or instrumented code in vain attempts to hunt them down...

Nominal Animal · « **Reply #47 on:** October 03, 2023, 07:25:10 pm »

One simple alternative is to use inline accessor functions. Compiler Explorer example for Cortex-M3.

Because of the various issues with packed and/or unaligned structures, I do tend to use such accessors instead of structures, when transferring information. (I do use packed structures, often with anonymous unions as in the above example, for example for type punning and to simplify access to memory-mapped peripherals. In other words, I'm not saying one thing is bad and another is good, I'm only describing my preferred tool choices here.)

Because pointer arithmetic is not well defined for void pointers, I find it easiest to use an unsigned char pointer to the beginning of the buffer, even though the inline accessor functions do take a void pointer, i.e. their interface is basically type get_type(const void *const ptr), or occasionally type get_type(const void *const ptr, const unsigned char byte_order_change), where each bit i in byte_order_change corresponds to swapping 2ⁱ-byte groups, allowing run-time adaptation to different byte orders (in files or messages) based on known prototype multibyte values. (For each type requiring 2^k bytes of storage, there are also 2^k possible byte orders, even though only two, in order of ascending or descending mathematical significance, i.e. little or big endian, are normally encountered.)
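A minimal sketch of that interface for a 32-bit type. The names get_u32 and get_u32_order are hypothetical, and this version uses memcpy and a byte-wise copy rather than whatever the linked example does.

```c
#include <stdint.h>
#include <string.h>

/* Read a native-byte-order uint32_t from any address. A memcpy() with a
 * small constant size is normally compiled to plain loads on targets that
 * allow unaligned access. */
static inline uint32_t get_u32(const void *const ptr)
{
    uint32_t value;
    memcpy(&value, ptr, sizeof value);
    return value;
}

/* Same, with a run-time byte order change. Bit 0 swaps adjacent bytes and
 * bit 1 swaps adjacent 16-bit halves, so 3 reverses all four bytes.
 * Swapping 2^i-byte groups is the same as XORing the byte index with 2^i. */
static inline uint32_t get_u32_order(const void *const ptr,
                                     const unsigned char byte_order_change)
{
    const unsigned char *const src = ptr;
    unsigned char tmp[sizeof (uint32_t)];
    uint32_t value;

    for (size_t i = 0; i < sizeof tmp; i++)
        tmp[i] = src[i ^ (byte_order_change & 3u)];

    memcpy(&value, tmp, sizeof value);
    return value;
}
```

The byte_order_change value would typically be found once, by reading a known prototype value from the file or message and trying the four possibilities.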

(Note that in C function declarations, static ≡ static inline. I personally have the habit of marking internal/local functions as static, and accessor functions as static inline, that's all. It helps in code maintenance, as it helps me structure the code better, and reduces the cognitive load with the ubiquitous "hey, what does this function do?" moments.)

I suspect, but have not verified, that

Code: [Select]

typedef struct {
    union {
        uint16_t  u16;
        int16_t   i16;
    };
} __attribute__((__packed__)) unaligned_16;

typedef struct {
    union {
        uint32_t  u32;
        int32_t   i32;
        float     f32;
    };
} __attribute__((__packed__)) unaligned_32;

typedef struct {
    union {
        uint64_t  u64;
        int64_t   i64;
        double    f64;
    };
} __attribute__((__packed__)) unaligned_64;

on current C compilers does yield 16-, 32-, and 64-bit number types one can use for unaligned accesses safely, with the extra "cost" of having to type (normal reference to structure member).i32 or whatever subtype one wants.

In other words, instead of using a packed structure with standard types, you can use a structure with unaligned members, defined as packed structures with effectively just one member. (I need type punning between types of the same size so often that I prefer to bake that in to these 'single effective member packed structures', as they cost nothing. Even if you used a single member, you'd still need to refer to that member; the anonymous union just lets you type-pun at that time if you wish.)

(EDITED to clarify: the above still requires global and static structures to be accessed via a temporary pointer. Just declaring static or global structures with aforementioned unaligned_nn type members does not stop the compiler from combining accesses to their members; as the static or global declaration effectively "neutralizes" the packed attribute.)
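A minimal sketch of how such a type might be used through a temporary pointer. The record type and the record_sum function are hypothetical, the packed attribute is a GCC/clang extension, and reading a byte buffer through another type this way relies on compiler behaviour rather than on the C standard.

```c
#include <stdint.h>

/* Same shape as unaligned_32 above, repeated so the sketch compiles alone. */
typedef struct {
    union {
        uint32_t  u32;
        int32_t   i32;
        float     f32;
    };
} __attribute__((__packed__)) unaligned_32;

/* Hypothetical record with two adjacent 32-bit fields. */
typedef struct {
    unaligned_32 a, b;
} record;

/* The access goes through a temporary volatile-qualified pointer, as the
 * EDITED note requires. buf may have any alignment. */
static inline int32_t record_sum(const unsigned char *buf)
{
    const volatile record *const rec = (const volatile record *)buf;
    return rec->a.i32 + rec->b.i32;
}
```

Because every member has alignment 1, record itself has alignment 1 and a size of 8 bytes, so it can be overlaid at any byte offset of a receive buffer.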

With C11 _Generic, this could be extended to a macro that expands to an accessor call ensuring exactly one unaligned-safe access is done to the parameter object, i.e.
typedef struct { int a, b; } MyStruct;
MyStruct *src;
MyFn(UNALIGNED_DEREF_ONCE(&(src->a)), UNALIGNED_DEREF_ONCE(&(src->b)));
although folding the address-of into the macro would shorten the code (but "violates" the normal passing-by-value logic in C):
MyFn(UNALIGNED_ONCE(src->a), UNALIGNED_ONCE(src->b));
The idea is that _Generic is used to choose the size, byte order, and return type, by choosing a specific inline accessor function. The inline accessor functions take a pointer to the object whose value they need to return. For native byte order types, they cast that pointer to a (volatile) pointer to the suitable packed structure type, and obtain the value.

As to why the "exactly one access", that (in conjunction with the C abstract machine rules) should ensure multiple unaligned accesses are not folded into a single access. You ensure that (as much as one can in C, without resorting to assembly) by having the accessor function use a pointer with volatile type to the anonymous union structure type. The reason for it working is that the parameter to the accessor function is not volatile itself. This behaviour is not dictated by the C standard in any way, and is just a consequence of how current compilers implement the abstract machine described in the C standard, and specifically how they perform optimizations (especially at the abstract syntax tree level).

For non-native byte order types with "exactly one access", volatile is replaced by a memory copy to an array of unsigned char (or a structure with/or union that contains the desired type and an array of unsigned char); the unsigned char array is permuted according to the byte order change, and then the resulting type is returned via type-punned union access.

In any case, I personally prefer a nested structure of single-member unaligned/packed structures instead, i.e.
typedef struct { unaligned_32 a, b; } MyStruct;
MyStruct *src;
MyFn((src->a).i32, (src->b).i32);
noting that whether MyStruct is declared packed or not is irrelevant here; it is the members that are declared "unaligned". This is as close to *(type)__builtin_unaligned(pointer) as I can currently get. (EDITED to add: the use of a pointer for the access is mandatory. If you declare MyStruct foo; then calling MyFn(foo.a.i32, foo.b.i32) may lead to combined loads. To fix, use const volatile MyStruct *src = &foo; and MyFn((src->a).i32, (src->b).i32);.)

Experimenting with this on Compiler Explorer suggests this works for all the compilers that it supports –– although I only tested a few examples I personally care about, and would thus appreciate a heads-up if anyone finds a counterexample (an architecture where the above, or the test example, fails to generate safe-unaligned-access code). The only non-standard feature needed is the packed attribute support.

bson · « **Reply #48 on:** October 03, 2023, 10:32:02 pm »

I don't see how this will help you at all. ~~An optimizing compiler~~ gcc can still combine multiple 32-bit loads and stores into ldm/stm.

Nominal Animal · « **Reply #49 on:** October 04, 2023, 12:04:41 am »

Quote from: bson on October 03, 2023, 10:32:02 pm

I don't see how this will help you at all. ~~An optimizing compiler~~ gcc can still combine multiple 32-bit loads and stores into ldm/stm.

Did you actually verify that? I am claiming gcc (and gcc-compatible compilers like clang and sdcc) does not combine multiple 32-bit loads into ldm if you use the code patterns I showed, and generates 'ldr rn, [rm] @ unaligned' and 'ldr rn, [rm, offset] @ unaligned' (i.e., unaligned-safe loads on Cortex-M3) instead.

gf · « **Reply #50 on:** October 04, 2023, 01:59:05 pm »

Quote from: Nominal Animal on October 04, 2023, 12:04:41 am

Quote from: bson on October 03, 2023, 10:32:02 pm
I don't see how this will help you at all. ~~An optimizing compiler~~ gcc can still combine multiple 32-bit loads and stores into ldm/stm.
Did you actually verify that? I am claiming gcc (and gcc-compatible compilers like clang and sdcc) does not combine multiple 32-bit loads into ldm if you use the code patterns I showed, and generates 'ldr rn, [rm] @ unaligned' and 'ldr rn, [rm, offset] @ unaligned' (i.e., unaligned-safe loads on Cortex-M3) instead.

It depends on the context, particularly on what the compiler happens to know about the addresses from data flow analysis.

Example: https://godbolt.org/z/b71o8Ks9K

Here, the compiler knows that it places buffer[] at a >= 4-byte aligned address (it is the first object in .bss, and .bss is even 8-byte aligned). Therefore, in test1(), the compiler can still (safely) combine the two loads at buffer+16 and buffer+20 using "ldrd", knowing that these addresses are also properly aligned, even though the two int32_t objects are accessed as members of a packed struct.

However, in test2(), the compiler does not have information about the address stored in ptr. As a result, this optimization is not possible, leading to unaligned "ldr" instructions being emitted (due to accessing the two objects as members of a packed struct).
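A sketch of the kind of code this describes, written from the description above rather than copied from the linked example. The _Alignas(8) is an added assumption that makes the buffer alignment explicit instead of relying on where the linker places it.

```c
#include <stdint.h>

typedef struct {
    int32_t i32;
} __attribute__((__packed__)) S;

/* Defined in this translation unit, so its alignment is known at compile
 * time. */
_Alignas(8) unsigned char buffer[64];

/* Address known and provably aligned: the two adjacent loads may be merged
 * (for example into ldrd on Cortex-M3) despite the packed struct. */
int32_t test1(void)
{
    const S *p = (const S *)(buffer + 16);
    return p[0].i32 + p[1].i32;
}

/* Address unknown: the compiler has to assume the worst case and emit
 * separate unaligned-safe loads. */
int32_t test2(const unsigned char *ptr)
{
    const S *p = (const S *)ptr;
    return p[0].i32 + p[1].i32;
}
```

Both functions return the same value for the same bytes; only the generated load instructions differ.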

Nominal Animal · « **Reply #51 on:** October 04, 2023, 02:43:32 pm »

Quote from: gf on October 04, 2023, 01:59:05 pm

It depends on the context, particularly on what the compiler happens to know about the addresses from data flow analysis.

Example: https://godbolt.org/z/b71o8Ks9K

I was unclear: you need the exactly once approach to avoid load combining on Cortex-M3, and you get that using a temporary (differently-qualified or different-type) pointer to the data, that ensures exact individual loads with current gcc and clang optimization engines:

Code: [Select]

int32_t test1(void)
{
    const volatile S *const ref = (const volatile S *)buffer;
    return (ref+16)->i32 +
           (ref+20)->i32;
}

With trunk gcc or clang on Cortex-M3, -Os gives you unaligned individual loads, and -O2 and -O3 individual aligned loads.

(Note: volatile may not be necessary in all cases. It is the type-cast pointer use that acts as the barrier to load merging, and volatile is the logically correct qualifier for the access here ("compiler, do not cache the value of this access"). As there is type or qualifier difference between the original reference and the reference used for the access, the optimization engines will not ignore the temporary pointer, and will thus use the "rules" set by the temporary pointer for the access.)

Essentially, rewriting your example code using my approach, you get

Code: [Select]

#include <stdint.h>

typedef struct {
    int32_t i32;
} __attribute__((__packed__)) S;

unsigned char buffer[64];

int32_t test1(void)
{
    const volatile S *const src = (const volatile S *)buffer;
    return (src+16)->i32 +
           (src+20)->i32;
}

int32_t test2(const unsigned char *buf)
{
    const volatile S *const src = (const volatile S *)buf;
    return (src+16)->i32 +
           (src+20)->i32;
}

which compiles, depending on the compiler and optimization level, to essentially

Code: [Select]

test1:
    ldr     r3, .L3
    ldr     r0, [r3, #64]
    ldr     r3, [r3, #80]
    add     r0, r0, r3
    bx      lr

.L3:
    .word   .LANCHOR0

test2:
    ldr     r2, [r0, #64]     @ unaligned
    ldr     r0, [r0, #80]     @ unaligned
    add     r0, r0, r2
    bx      lr

buffer:

with the second and third lines (the two ldr instructions after the address load in test1) sometimes, depending on optimization, using the unaligned-safe instruction variant marked @ unaligned; as the compiler chooses the location of buffer, technically those two loads are never unaligned. In no case are the loads combined.

On OS ABI versions (you can run e.g. Linux on Cortex-M3), the buffer address is typically relative to PC, so the buffer address calculation will differ on those; but the actual loads will still be via individual ldr instructions, instead of a combined one.

I added an EDITED paragraph and an EDITED sentence to my post, to hopefully clarify this. Let me know if any of you object, or find a case where this pattern fails.

gf · « **Reply #52 on:** October 04, 2023, 04:47:52 pm »

Quote from: Nominal Animal on October 04, 2023, 02:43:32 pm

I was unclear: you need the exactly once approach to avoid load combining on Cortex-M3, and you get that using a temporary pointer to (volatile) data

Completely agree, accessing the i32 member via a volatile S* pointer has of course volatile semantics, preventing load combining.

Btw, your version of test1

Code: [Select]

int32_t test1(void)
{
    const volatile S * const ref = buffer;
    return (ref+16)->i32 + (ref+20)->i32;
}

cannot combine the loads at all, because the two objects are not adjacent, but have an offset of 16 bytes. So you always get individual ldr instructions, both with and without volatile.

Quote from: Nominal Animal on October 04, 2023, 02:43:32 pm

Note: volatile may not be necessary in all cases. It is the type-cast pointer use that acts as the barrier to load merging...

A type-cast to a non-volatile pointer obviously does not prevent load merging: https://godbolt.org/z/dh8xb8Y81
Why should it?

[ ~~Yet another reason to renounce load merging is, of course, if the compiler cannot prove adjacency and proper alignment of the addresses at compile time. If the compiler is unsure, it must not merge.~~
EDIT: Yet another reason to renounce load merging is, of course, if the compiler cannot prove adjacency or if it can disprove proper alignment of the addresses at compile time. ]

EDIT: A store can, of course, also act as a load merging barrier (-> strict aliasing): https://godbolt.org/z/Pb4GMTdsY

Nominal Animal · « **Reply #53 on:** October 04, 2023, 06:32:56 pm »

Quote from: gf on October 04, 2023, 04:47:52 pm

Btw, your version of test1 cannot combine the loads at all, because the two objects are not adjacent, but have an offset of 16 bytes. So you always get individual ldr instructions, both with and without volatile.

D'oh! True. To fix, replace (src+16) and (src+20) with (src+4) and (src+5).

This has no effect on my argument, as no combining will occur with the fix applied, either.

Quote from: gf on October 04, 2023, 04:47:52 pm

Quote from: Nominal Animal on October 04, 2023, 02:43:32 pm
Note: volatile may not be necessary in all cases. It is the type-cast pointer use that acts as the barrier to load merging...
A type-cast to a non-volatile pointer obviously does not prevent load merging: https://godbolt.org/z/dh8xb8Y81

Load merging will not occur if the buffer data is volatile, i.e. when you have volatile char buffer[64];, nor when the pointer address is based on a char pointer whose value is unknown at compile time, i.e. int32_t test1(char *buffer).

volatile is the correct qualifier to use in all cases, but one should not be surprised if in certain cases the compiler will generate unaligned non-merged accesses even when volatile is omitted. (Those certain cases are when the compiler cannot make any assumptions about the address being referenced, and cannot cache or use a cached value at that address, because there is no reuse.)

Quote from: gf on October 04, 2023, 04:47:52 pm

[ Yet another reason to renounce load merging is, of course, if the compiler cannot prove adjacency and proper alignment of the addresses at compile time. If the compiler is unsure, it must not merge. ]

True. Here, the sequence point rules (specifically, order and separateness of accesses to volatile-qualified objects) are even more important, because they stop the compiler from merging even when it can prove adjacency and alignment at compile time.

Hmm.. this should mean that declaring the unaligned_nn member(s) volatile would achieve the same; and quick testing implies this is the case.

Therefore, for unaligned_nn access structures, perhaps I should suggest something like the following, instead of the ones I showed earlier?

Code: [Select]

#include <stdint.h>

typedef struct {
    union {
        volatile uint16_t    u16;
        volatile int16_t     i16;
    };
} __attribute__((__packed__)) unaligned_16;

typedef struct {
    union {
        volatile uint32_t    u32;
        volatile int32_t     i32;
        volatile float       f32;
    };
} __attribute__((__packed__)) unaligned_32;

typedef struct {
    union {
        volatile uint64_t    u64;
        volatile int64_t     i64;
        volatile double      f64;
    };
} __attribute__((__packed__)) unaligned_64;

It generates separate (unaligned-allowed) loads even for
typedef struct { unaligned_32 a, b; } MyStruct;
MyStruct example;
int32_t test(void) { return example.a.i32 + example.b.i32; }
which I assume can be used as a litmus test here.

(I have not thought of any real downsides to the volatile-ness. If the same unaligned structure field is used multiple times, I always recommend using an explicit temporary local variable to hold the value. Optimizing away local temporary variables is a very common practice, so the compiler is very good at it, whereas optimizing multiple accesses to the same member of a structure takes more optimization smarts; better code is generated when one uses an explicit temporary local variable. Plus, if one uses a logically descriptive name for the temporary variable, even the code tends to be easier to maintain that way.)

gf · « **Reply #54 on:** October 04, 2023, 09:18:46 pm »

Quote from: Nominal Animal on October 04, 2023, 06:32:56 pm

Load merging will not occur if the buffer data is volatile, ie. when you have volatile char buffer[64];

Indeed, gcc apparently propagates volatile as a hidden attribute of values from buffer to derived values, so that the load becomes volatile at the end as well, even if the temporary pointer declaration does not contain a volatile keyword. However, I think this kind of propagation is not guaranteed by the C standard, so a portable program should possibly not rely on it.

Quote

volatile is the correct qualifier to use in all cases, but one should not be surprised if in certain cases the compiler will generate unaligned non-merged accesses even when volatile is omitted.

Sure, there are various other possible reasons that have nothing to do with volatile.

Quote

Therefore, for unaligned_nn access structures, perhaps I should suggest something like the following, instead of the ones I showed earlier?

It is certainly a way to achieve that any load/store from/to these members has volatile semantics.

Nominal Animal · « **Reply #55 on:** October 05, 2023, 06:44:49 pm »

Quote from: gf on October 04, 2023, 09:18:46 pm

I think this kind of propagation is not guaranteed by the C standard, so a portable program should possibly not rely on it.

As even the packed attribute is a C extension, and not part of the C standard, none of this is "guaranteed"; and it basically depends on current compilers' optimization strategies. (I use one of -O2, -Os, -Oz (clang), or -Og, plus any specific optimization features I happen to need.)
For GCC use, I would personally add a note in the build documentation about relying on this, and a simple example program for one to verify the machine code generated, when changing GCC major version, or between GCC and clang, for example. Even when I trust, I do try to keep verification easy. Litmus tests with simple functions that generate easily determined machine code patterns are good for this, in my opinion.

As mentioned previously, all of this would be easily solved by a __builtin_unaligned(pointer_expression) built-in, or even an unaligned C qualifier keyword.

I do need unaligned accesses most often when transferring information between other hosts or devices (microcontrollers or computers), and there the accessor function approach with specific endianness does tend to generate near-optimal code. (See this post in a previous thread about 'packed' attribute for a code example.)

For the macro approach –– if that is preferable over the accessor function approach ––, considering the very useful feedback from gf and others, I would suggest something along the following:

Code: [Select]

// SPDX-License-Identifier: CC0-1.0
#ifndef  UNALIGNED
#define  UNALIGNED(ptr)  ((_Generic(                             (ptr)                                                   \
                                   ,                unsigned long long *: unaligned_unsigned_long_long_at                \
                                   ,          const unsigned long long *: unaligned_const_unsigned_long_long_at          \
                                   ,       volatile unsigned long long *: unaligned_volatile_unsigned_long_long_at       \
                                   , const volatile unsigned long long *: unaligned_const_volatile_unsigned_long_long_at \
                                   ,                         long long *: unaligned_long_long_at                         \
                                   ,                   const long long *: unaligned_const_long_long_at                   \
                                   ,                volatile long long *: unaligned_volatile_long_long_at                \
                                   ,          const volatile long long *: unaligned_const_volatile_long_long_at          \
                                   ,                     unsigned long *: unaligned_unsigned_long_at                     \
                                   ,               const unsigned long *: unaligned_const_unsigned_long_at               \
                                   ,            volatile unsigned long *: unaligned_volatile_unsigned_long_at            \
                                   ,      const volatile unsigned long *: unaligned_const_volatile_unsigned_long_at      \
                                   ,                              long *: unaligned_long_at                              \
                                   ,                        const long *: unaligned_const_long_at                        \
                                   ,                     volatile long *: unaligned_volatile_long_at                     \
                                   ,               const volatile long *: unaligned_const_volatile_long_at               \
                                   ,                      unsigned int *: unaligned_unsigned_int_at                      \
                                   ,                const unsigned int *: unaligned_const_unsigned_int_at                \
                                   ,             volatile unsigned int *: unaligned_volatile_unsigned_int_at             \
                                   ,       const volatile unsigned int *: unaligned_const_volatile_unsigned_int_at       \
                                   ,                               int *: unaligned_int_at                               \
                                   ,                         const int *: unaligned_const_int_at                         \
                                   ,                      volatile int *: unaligned_volatile_int_at                      \
                                   ,                const volatile int *: unaligned_const_volatile_int_at                \
                                   ,                    unsigned short *: unaligned_unsigned_short_at                    \
                                   ,              const unsigned short *: unaligned_const_unsigned_short_at              \
                                   ,           volatile unsigned short *: unaligned_volatile_unsigned_short_at           \
                                   ,     const volatile unsigned short *: unaligned_const_volatile_unsigned_short_at     \
                                   ,                             short *: unaligned_short_at                             \
                                   ,                       const short *: unaligned_const_short_at                       \
                                   ,                    volatile short *: unaligned_volatile_short_at                    \
                                   ,              const volatile short *: unaligned_const_volatile_short_at              \
                                   ,                     unsigned char *: unaligned_unsigned_char_at                     \
                                   ,               const unsigned char *: unaligned_const_unsigned_char_at               \
                                   ,            volatile unsigned char *: unaligned_volatile_unsigned_char_at            \
                                   ,      const volatile unsigned char *: unaligned_const_volatile_unsigned_char_at      \
                                   ,                       signed char *: unaligned_signed_char_at                       \
                                   ,                 const signed char *: unaligned_const_signed_char_at                 \
                                   ,              volatile signed char *: unaligned_volatile_signed_char_at              \
                                   ,        const volatile signed char *: unaligned_const_volatile_signed_char_at        \
                                   ,                              char *: unaligned_char_at                              \
                                   ,                        const char *: unaligned_const_char_at                        \
                                   ,                     volatile char *: unaligned_volatile_char_at                     \
                                   ,               const volatile char *: unaligned_const_volatile_char_at               \
                                   )(ptr))->value)

#define  PACKED_STRUCT(type)  struct { volatile type value; } __attribute__((__packed__))

typedef  PACKED_STRUCT(const unsigned long long)  unaligned_const_unsigned_long_long_struct;
typedef  PACKED_STRUCT(unsigned long long)        unaligned_unsigned_long_long_struct;
typedef  PACKED_STRUCT(const long long)           unaligned_const_long_long_struct;
typedef  PACKED_STRUCT(long long)                 unaligned_long_long_struct;
typedef  PACKED_STRUCT(const unsigned long)       unaligned_const_unsigned_long_struct;
typedef  PACKED_STRUCT(unsigned long)             unaligned_unsigned_long_struct;
typedef  PACKED_STRUCT(const long)                unaligned_const_long_struct;
typedef  PACKED_STRUCT(long)                      unaligned_long_struct;
typedef  PACKED_STRUCT(const unsigned int)        unaligned_const_unsigned_int_struct;
typedef  PACKED_STRUCT(unsigned int)              unaligned_unsigned_int_struct;
typedef  PACKED_STRUCT(const int)                 unaligned_const_int_struct;
typedef  PACKED_STRUCT(int)                       unaligned_int_struct;
typedef  PACKED_STRUCT(const unsigned short)      unaligned_const_unsigned_short_struct;
typedef  PACKED_STRUCT(unsigned short)            unaligned_unsigned_short_struct;
typedef  PACKED_STRUCT(const short)               unaligned_const_short_struct;
typedef  PACKED_STRUCT(short)                     unaligned_short_struct;
typedef  PACKED_STRUCT(const unsigned char)       unaligned_const_unsigned_char_struct;
typedef  PACKED_STRUCT(unsigned char)             unaligned_unsigned_char_struct;
typedef  PACKED_STRUCT(const signed char)         unaligned_const_signed_char_struct;
typedef  PACKED_STRUCT(signed char)               unaligned_signed_char_struct;
typedef  PACKED_STRUCT(const char)                unaligned_const_char_struct;
typedef  PACKED_STRUCT(char)                      unaligned_char_struct;

#undef   PACKED_STRUCT

#define  HELPER_FUNCTION  __attribute__((__unused__)) static inline volatile

HELPER_FUNCTION        unaligned_unsigned_long_long_struct        *unaligned_unsigned_long_long_at                               (unsigned long long *ref) { return (volatile       unaligned_unsigned_long_long_struct       *)ref; }
HELPER_FUNCTION        unaligned_unsigned_long_long_struct        *unaligned_volatile_unsigned_long_long_at             (volatile unsigned long long *ref) { return (volatile       unaligned_unsigned_long_long_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_long_long_struct  *unaligned_const_unsigned_long_long_at                   (const unsigned long long *ref) { return (volatile const unaligned_const_unsigned_long_long_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_long_long_struct  *unaligned_const_volatile_unsigned_long_long_at (const volatile unsigned long long *ref) { return (volatile const unaligned_const_unsigned_long_long_struct *)ref; }

HELPER_FUNCTION        unaligned_long_long_struct        *unaligned_long_long_at                               (long long *ref) { return (volatile       unaligned_long_long_struct       *)ref; }
HELPER_FUNCTION        unaligned_long_long_struct        *unaligned_volatile_long_long_at             (volatile long long *ref) { return (volatile       unaligned_long_long_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_long_long_struct  *unaligned_const_long_long_at                   (const long long *ref) { return (volatile const unaligned_const_long_long_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_long_long_struct  *unaligned_const_volatile_long_long_at (const volatile long long *ref) { return (volatile const unaligned_const_long_long_struct *)ref; }

HELPER_FUNCTION        unaligned_unsigned_long_struct        *unaligned_unsigned_long_at                               (unsigned long *ref) { return (volatile       unaligned_unsigned_long_struct       *)ref; }
HELPER_FUNCTION        unaligned_unsigned_long_struct        *unaligned_volatile_unsigned_long_at             (volatile unsigned long *ref) { return (volatile       unaligned_unsigned_long_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_long_struct  *unaligned_const_unsigned_long_at                   (const unsigned long *ref) { return (volatile const unaligned_const_unsigned_long_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_long_struct  *unaligned_const_volatile_unsigned_long_at (const volatile unsigned long *ref) { return (volatile const unaligned_const_unsigned_long_struct *)ref; }

HELPER_FUNCTION        unaligned_long_struct        *unaligned_long_at                               (long *ref) { return (volatile       unaligned_long_struct       *)ref; }
HELPER_FUNCTION        unaligned_long_struct        *unaligned_volatile_long_at             (volatile long *ref) { return (volatile       unaligned_long_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_long_struct  *unaligned_const_long_at                   (const long *ref) { return (volatile const unaligned_const_long_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_long_struct  *unaligned_const_volatile_long_at (const volatile long *ref) { return (volatile const unaligned_const_long_struct *)ref; }

HELPER_FUNCTION        unaligned_unsigned_int_struct        *unaligned_unsigned_int_at                               (unsigned int *ref) { return (volatile       unaligned_unsigned_int_struct       *)ref; }
HELPER_FUNCTION        unaligned_unsigned_int_struct        *unaligned_volatile_unsigned_int_at             (volatile unsigned int *ref) { return (volatile       unaligned_unsigned_int_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_int_struct  *unaligned_const_unsigned_int_at                   (const unsigned int *ref) { return (volatile const unaligned_const_unsigned_int_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_int_struct  *unaligned_const_volatile_unsigned_int_at (const volatile unsigned int *ref) { return (volatile const unaligned_const_unsigned_int_struct *)ref; }

HELPER_FUNCTION        unaligned_int_struct        *unaligned_int_at                               (int *ref) { return (volatile       unaligned_int_struct       *)ref; }
HELPER_FUNCTION        unaligned_int_struct        *unaligned_volatile_int_at             (volatile int *ref) { return (volatile       unaligned_int_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_int_struct  *unaligned_const_int_at                   (const int *ref) { return (volatile const unaligned_const_int_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_int_struct  *unaligned_const_volatile_int_at (const volatile int *ref) { return (volatile const unaligned_const_int_struct *)ref; }

HELPER_FUNCTION        unaligned_unsigned_short_struct        *unaligned_unsigned_short_at                               (unsigned short *ref) { return (volatile       unaligned_unsigned_short_struct       *)ref; }
HELPER_FUNCTION        unaligned_unsigned_short_struct        *unaligned_volatile_unsigned_short_at             (volatile unsigned short *ref) { return (volatile       unaligned_unsigned_short_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_short_struct  *unaligned_const_unsigned_short_at                   (const unsigned short *ref) { return (volatile const unaligned_const_unsigned_short_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_short_struct  *unaligned_const_volatile_unsigned_short_at (const volatile unsigned short *ref) { return (volatile const unaligned_const_unsigned_short_struct *)ref; }

HELPER_FUNCTION        unaligned_short_struct        *unaligned_short_at                               (short *ref) { return (volatile       unaligned_short_struct       *)ref; }
HELPER_FUNCTION        unaligned_short_struct        *unaligned_volatile_short_at             (volatile short *ref) { return (volatile       unaligned_short_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_short_struct  *unaligned_const_short_at                   (const short *ref) { return (volatile const unaligned_const_short_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_short_struct  *unaligned_const_volatile_short_at (const volatile short *ref) { return (volatile const unaligned_const_short_struct *)ref; }

HELPER_FUNCTION        unaligned_unsigned_char_struct        *unaligned_unsigned_char_at                               (unsigned char *ref) { return (volatile       unaligned_unsigned_char_struct       *)ref; }
HELPER_FUNCTION        unaligned_unsigned_char_struct        *unaligned_volatile_unsigned_char_at             (volatile unsigned char *ref) { return (volatile       unaligned_unsigned_char_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_char_struct  *unaligned_const_unsigned_char_at                   (const unsigned char *ref) { return (volatile const unaligned_const_unsigned_char_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_char_struct  *unaligned_const_volatile_unsigned_char_at (const volatile unsigned char *ref) { return (volatile const unaligned_const_unsigned_char_struct *)ref; }

HELPER_FUNCTION        unaligned_signed_char_struct        *unaligned_signed_char_at                               (signed char *ref) { return (volatile       unaligned_signed_char_struct       *)ref; }
HELPER_FUNCTION        unaligned_signed_char_struct        *unaligned_volatile_signed_char_at             (volatile signed char *ref) { return (volatile       unaligned_signed_char_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_signed_char_struct  *unaligned_const_signed_char_at                   (const signed char *ref) { return (volatile const unaligned_const_signed_char_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_signed_char_struct  *unaligned_const_volatile_signed_char_at (const volatile signed char *ref) { return (volatile const unaligned_const_signed_char_struct *)ref; }

HELPER_FUNCTION        unaligned_char_struct        *unaligned_char_at                               (char *ref) { return (volatile       unaligned_char_struct       *)ref; }
HELPER_FUNCTION        unaligned_char_struct        *unaligned_volatile_char_at             (volatile char *ref) { return (volatile       unaligned_char_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_char_struct  *unaligned_const_char_at                   (const char *ref) { return (volatile const unaligned_const_char_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_char_struct  *unaligned_const_volatile_char_at (const volatile char *ref) { return (volatile const unaligned_const_char_struct *)ref; }

#undef  HELPER_FUNCTION

#endif /* UNALIGNED() */

which requires C11 _Generic support and packed attribute support to work. It generates no global or static symbols or functions, and generates no machine code if not used. Specifically, the __attribute__((unused)) attribute and static qualifier should tell the compiler that these helper functions are local to the current compilation unit (C source file) and can/should be inlined, but no code nor any warning needs to be generated if they are not used.

It exports an UNALIGNED(pointer_expression) macro, which dereferences the supplied pointer expression using the pointer target type, without requiring the pointer to be properly aligned. It can be used for assignments also. For example:
struct { int a, b; } foo;
int foo_sum(void) { return UNALIGNED(&foo.a) + UNALIGNED(&foo.b); }
void foo_clear(void) { UNALIGNED(&foo.a) = 0; UNALIGNED(&foo.b) = 0; }
will generate separate ldr and str instructions for the two fields on Cortex-M3.

Note that it should support the proper qualifiers, i.e. with
const struct { int a, b; } foo = { .a = 1, .b = 2 };
or with
const struct { const int a; int b; } foo = { .a = 1, .b = 2 };
foo_sum() will work, but foo_clear() will fail at compile time (because it tries to modify a read-only object).

It also works for the
int test1(char *buf) { return UNALIGNED((int *)(buf + 16)) + UNALIGNED((int *)(buf + 20)); }
and
char buffer[32];
int test2(void) { return UNALIGNED((int *)(buffer + 16)) + UNALIGNED((int *)(buffer + 20)); }
cases, generating separate loads for each int.

For current C compilers, the fixed-width types defined in <stdint.h> are based on the standard types, so the above should work for those types also.
If some compiler defines them as distinct types, they are trivial to add afterwards. They cannot be added in advance just in case, because duplicate type specifications in the _Generic selection cause warnings/errors. Similarly, adding float, double, and long double floating-point type support should be straightforward.

(Each additional type adds four lines (type:, const type:, volatile type:, and const volatile type:) into the _Generic, two PACKED_STRUCT structures (one const, the other non-const), and four HELPER_FUNCTION helper functions.)
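As a rough illustration of that recipe, here is a stand-alone sketch for float only. It is deliberately separate from the UNALIGNED() macro above: the UNALIGNED_FLOAT name and the load_float_at()/store_float_at() wrappers are made up for this example, and in practice the four associations would simply be merged into the existing _Generic list.

```c
#include <stddef.h>

/* Two packed wrapper structs: one non-const, one const. */
typedef struct { volatile float value; }       __attribute__((__packed__)) unaligned_float_struct;
typedef struct { volatile const float value; } __attribute__((__packed__)) unaligned_const_float_struct;

#define  HELPER_FUNCTION  __attribute__((__unused__)) static inline volatile

/* Four helpers, one per qualifier combination of the pointer target. */
HELPER_FUNCTION        unaligned_float_struct        *unaligned_float_at(float *ref) { return (volatile unaligned_float_struct *)ref; }
HELPER_FUNCTION        unaligned_float_struct        *unaligned_volatile_float_at(volatile float *ref) { return (volatile unaligned_float_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_float_struct  *unaligned_const_float_at(const float *ref) { return (volatile const unaligned_const_float_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_float_struct  *unaligned_const_volatile_float_at(const volatile float *ref) { return (volatile const unaligned_const_float_struct *)ref; }

#undef  HELPER_FUNCTION

/* Hypothetical float-only variant of the macro: four _Generic associations. */
#define  UNALIGNED_FLOAT(ptr)  ((_Generic((ptr)                                         \
                                   ,                float *: unaligned_float_at                \
                                   ,       volatile float *: unaligned_volatile_float_at       \
                                   ,          const float *: unaligned_const_float_at          \
                                   , const volatile float *: unaligned_const_volatile_float_at \
                                   )(ptr))->value)

/* Illustrative wrappers: read/write a float at an arbitrary byte offset. */
static float load_float_at(unsigned char *buf, size_t offset)
{
    return UNALIGNED_FLOAT((float *)(buf + offset));
}

static void store_float_at(unsigned char *buf, size_t offset, float v)
{
    UNALIGNED_FLOAT((float *)(buf + offset)) = v;
}
```

Like the integer versions, this relies on the GCC/clang packed attribute and C11 _Generic, so it is no more portable than the rest.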

Note that the char type is special, because signed char is not necessarily the same type as char. For short, int, long (= long int), and long long (= long long int), signed itype is the same as itype.

SiliconWizard · « **Reply #56 on:** October 05, 2023, 07:11:21 pm »

Quote from: Nominal Animal on October 05, 2023, 06:44:49 pm

Quote from: gf on October 04, 2023, 09:18:46 pm
I think this kind of propagation is not guaranteed by the C standard, so a portable program should possibly not rely on it.
As even the packed attribute is a C extension, and not part of the C standard, none of this is "guaranteed"; and it basically depends on current compilers' optimization strategies. (I use one of -O2, -Os, -Oz (clang), or -Og, plus any specific optimization features I happen to need.)

Yep. It is not standard, and thus not portable.

Quote from: Nominal Animal on October 05, 2023, 06:44:49 pm

For GCC use, I would personally add a note in the build documentation about relying on this,

Absolutely, something I have recommended as well. Document whatever part of your code is not portable, why, and how to port it if needed.

As I said, I personally try to avoid unaligned accesses as much as possible in general, rather than having to rely on compiler specifics. In other words, the main case where I would use packed structs is when I can guarantee the natural alignment of all members - which obviously is only if I can control the layout of said structs.

When dealing with packed structs with unaligned members, I'll usually use a memcpy() approach or similar, rather than compiler "tricks" to handle unaligned accesses properly.
But of course, like all rules, there are exceptions, and in this case your suggestions above look useful.
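For reference, the memcpy() approach usually looks something like the sketch below. The load_u32()/store_u32() names are just illustrative. Compilers commonly turn a fixed-size memcpy() like this into a plain load or store where the target tolerates unaligned access, but that is an optimization and not a guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Load a 32-bit value (native byte order) from a possibly unaligned address. */
static inline uint32_t load_u32(const void *src)
{
    uint32_t value;
    memcpy(&value, src, sizeof value);
    return value;
}

/* Store a 32-bit value (native byte order) to a possibly unaligned address. */
static inline void store_u32(void *dst, uint32_t value)
{
    memcpy(dst, &value, sizeof value);
}
```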

Nominal Animal · « **Reply #57 on:** October 05, 2023, 08:39:37 pm »

Quote from: SiliconWizard on October 05, 2023, 07:11:21 pm

When dealing with packed structs with unaligned members, I'll usually use a memcpy() approach or similar, rather than compiler "tricks" to handle unaligned accesses properly.

Yep; as I said, for multi-byte integer values I tend to use accessor functions that take an unsigned char buffer, and calculate the multi-byte value explicitly. While they do not always generate optimal or even near-optimal code, they never bug out on me.
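A minimal sketch of what I mean by such accessors, here for a 32-bit little-endian value (the get_u32_le()/put_u32_le() names are only for illustration):

```c
#include <stdint.h>

/* Assemble a little-endian 32-bit value from four bytes; works at any alignment. */
static inline uint32_t get_u32_le(const unsigned char *buf)
{
    return  (uint32_t)buf[0]
         | ((uint32_t)buf[1] << 8)
         | ((uint32_t)buf[2] << 16)
         | ((uint32_t)buf[3] << 24);
}

/* Split a 32-bit value into four little-endian bytes. */
static inline void put_u32_le(unsigned char *buf, uint32_t value)
{
    buf[0] = (unsigned char)(value & 0xFFu);
    buf[1] = (unsigned char)((value >> 8) & 0xFFu);
    buf[2] = (unsigned char)((value >> 16) & 0xFFu);
    buf[3] = (unsigned char)((value >> 24) & 0xFFu);
}
```

Because the byte order is spelled out, the result does not depend on the endianness of the machine running the code.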

The macro approach is an alternative to those who refuse to use or cannot use inline accessor functions for some reason.

My above macro suggestion does "fail" for RISC-V (rv32gc) using clang, whenever the compiler can determine at compile time the address will be properly aligned.

Renaming the macro as MAYBE_UNALIGNED() would make sense, because then it intuitively works correctly even on rv32gc: it dereferences the specified pointer expression to the pointed-to value (if it is one of the base integer types, currently) without combining loads, using unaligned accesses unless the compiler can infer the pointer is aligned at compile time.

That is,
struct { int a, b; } foo;
int foo_sum(void) { return UNALIGNED(&foo.a) + UNALIGNED(&foo.b); }
will generate aligned loads on rv32gc when compiled with clang. Even
typedef struct { int a, b; } __attribute__((packed)) foo;
int foo_sum(foo *f) { return UNALIGNED(&(f->a)) + UNALIGNED(&(f->b)); }
foo f;
int f_sum(void) { return foo_sum(&f); }
generates aligned loads for f_sum(). Of course, foo_sum() when called from a different compilation unit (and whenever the compiler cannot determine if f is aligned or not), will construct the value byte-by-byte, as one would expect.

This should not be an issue for real world code, though. Even for rv32gc, the typical use case, e.g.
int test(char *buf) { return UNALIGNED((int *)(buf + 16)) + UNALIGNED((int *)(buf + 20)); }
generates proper unaligned loads (byte by byte). Analogously, for
char buffer1[32] __attribute__((aligned (1)));
char buffer2[32] __attribute__((aligned (2)));
char buffer4[32] __attribute__((aligned (4)));
int sum1(void) { return UNALIGNED((int *)(buffer1 + 16)) + UNALIGNED((int *)(buffer1 + 20)); }
int sum2(void) { return UNALIGNED((int *)(buffer2 + 16)) + UNALIGNED((int *)(buffer2 + 20)); }
int sum4(void) { return UNALIGNED((int *)(buffer4 + 16)) + UNALIGNED((int *)(buffer4 + 20)); }
function sum1() generates eight byte loads, sum2() generates four halfword loads, and sum4() two word loads, because clang can determine the alignment (within the same compilation unit) at compile time here.

The macro definition is thus quite useful even on rv32gc, because by casting some pointer expression into a pointer to int, clang assumes you know the result will be aligned to a four-byte boundary. That is, given
int get_int(const volatile void *ptr) { return *(const volatile int *)ptr; }
clang will generate a single load word instruction on rv32gc, but with
int get_int(const volatile void *ptr) { return UNALIGNED((const volatile int *)ptr); }
clang will load the value using four byte loads (again, unless it can infer at compile time that ptr is aligned to a 2-byte or 4-byte boundary).
Which is generally what we programmers want.

gf · « **Reply #58 on:** October 05, 2023, 09:49:27 pm »

Quote from: Nominal Animal on October 05, 2023, 06:44:49 pm

As mentioned previously, all of this would be easily solved by a __builtin_unaligned(pointer_expression) built-in,

Analogous to __builtin_assume_aligned(), the unaligned property of values would, of course, still be lost at boundaries that are not (or cannot be) crossed by the data flow analysis (such as function boundaries of non-inlined functions).

Quote

or even an unaligned C qualifier keyword.

You mean as a type qualifier like const, volatile, restrict?

Quote from: SiliconWizard on October 05, 2023, 07:11:21 pm

When dealing with packed structs with unaligned members, I'll usually use a memcpy() approach or similar, rather than compiler "tricks" to handle unaligned accesses properly.

The memcpy() approach is only required with regular structs, having an alignment > 1. Packed structs work out of the box with unaligned pointer values. The alignment of packed structs, as well as the alignment of their members, is by definition 1. Just access any member (e.g. p->member1, where p is a pointer to a packed struct), and the access is considered unaligned by the compiler anyway. No "tricks" required besides declaring the struct with __attribute__((packed)).

[ Nevertheless, if the compiler can prove at compile time that a particular unaligned access can safely be replaced by an aligned one, it is free to optimize -- as with all other optimizations. If the compiler cannot prove it, it must not optimize. You should not need to care whether it does -- the result should be the same. ]

However, one thing is dangerous: Don't take the address of a packed struct's member of type T and assign it to a T* pointer!
Subsequently, it is not safe to dereference this T* pointer (unless T is a character type, or again a packed struct). Don't ignore the warning

Quote

warning: taking address of packed member of 'struct <anonymous>' may result in an unaligned pointer value [-Waddress-of-packed-member]
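To make the pitfall concrete, here is a small sketch with a made-up struct record (the struct and function names are only for illustration). The risky pattern is shown as a comment only; the two functions show ways to get at the member without creating a misaligned uint32_t pointer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: 'value' sits at offset 1, so it is never 4-byte aligned
   relative to the start of the struct. */
struct record {
    uint8_t  tag;
    uint32_t value;
} __attribute__((packed));

/* Risky (this is what the warning is about):
 *     uint32_t *p = &r->value;   // p may be misaligned
 *     use(*p);                   // compiler assumes 4-byte alignment here
 */

/* Safe: access the member through the packed struct type itself. */
static uint32_t record_value(const struct record *r)
{
    return r->value;
}

/* Also safe: copy the bytes out when a separate uint32_t object is needed. */
static void record_copy_value(const struct record *r, uint32_t *out)
{
    memcpy(out, (const unsigned char *)r + offsetof(struct record, value), sizeof *out);
}
```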

Nominal Animal · « **Reply #59 on:** October 06, 2023, 12:00:14 am »

Quote from: gf on October 05, 2023, 09:49:27 pm

Analogous to __builtin_assume_aligned(), the unaligned property of values would, of course, still be lost at boundaries that are not (or cannot be) crossed by the data flow analysis (such as function boundaries of non-inlined functions).

True. In my own use cases, unaligned-ness is a property of the access I do, and not a property of a variable or type.

That is, I never need to mark a variable unaligned; I only need to mark certain pointers as unaligned just before I dereference them.

Quote from: gf on October 05, 2023, 09:49:27 pm

Quote
or even an unaligned C qualifier keyword.
You mean as a type qualifier like const, volatile, restrict?

Yes.

Quote from: gf on October 05, 2023, 09:49:27 pm

Packed structs work out of the box with unaligned pointer values.

Yes, in the sense that
typedef struct { int a, b; } __attribute__((packed)) mystruct;
int mystruct_sum(mystruct *m) { return m->a + m->b; }
will generate unaligned loads.

But, as you indicated, if you then add
mystruct foo;
int foo_sum(void) { return mystruct_sum(&foo); }
both gcc and clang use aligned loads in foo_sum() when inlining the mystruct_sum() call, and on Cortex-M3, depending on optimization and compiler, usually combine the loads (LDRD or LDM). Replacing (mystruct *m) above with (volatile mystruct *m) will stop the combining, due to C rules wrt. volatile object accesses.

Thus, there are two separate details here: one is alignment (or lack thereof), and another is load/store combining on architectures like Cortex-M3.
Using the packed attribute on the structure allows safe unaligned access to members of that structure.
Using the volatile qualifier for the accessed object (structure and/or member) stops load/store combining.
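Putting the two together, a minimal sketch: packed covers the alignment side, and volatile on the pointed-to object keeps the two member reads as separate accesses. Which instructions actually come out still depends on the compiler, target, and options.

```c
/* packed: the struct and its members have alignment 1. */
typedef struct { int a, b; } __attribute__((packed)) mystruct;

/* volatile: each member read is its own access, so the compiler should not merge them. */
static int mystruct_sum(volatile mystruct *m)
{
    return m->a + m->b;
}
```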

Quote from: gf on October 05, 2023, 09:49:27 pm

However, one thing is dangerous: Don't take the address of a packed struct's member of type T and assign it to a T* pointer!

This is very important!

When you cast an expression into a pointer to some type T, you also promise that it is aligned sufficiently for that type T.
When you take the address of a member in a packed structure, it may not be sufficiently aligned for a pointer to that type!

Casting a pointer expression to a pointer to char (or unsigned char or signed char) is special, as it is the way to access the object's storage representation, the in-memory data corresponding to the value of that object.

gf · « **Reply #60 on:** October 06, 2023, 09:25:52 am »

Quote from: Nominal Animal on October 06, 2023, 12:00:14 am

Thus, there are two separate details here: one is alignment (or lack thereof), and another is load/store combining on architectures like Cortex-M3.
Using the packed attribute on the structure allows safe unaligned access to members of that structure.
Using the volatile qualifier for the accessed object (structure and/or member) stops load/store combining.

Exactly, these are separate issues. But one should not have to care explicitly about load/store combining at all. It's just an optimization. If the compiler combines unexpectedly, then the program is already wrong with respect to semantics and guarantees provided by the language (or language extension). What I mean is, one should not think in terms of generated assembly and how you can outwit the compiler to generate the desired assembly code, but rather think in terms of C semantics (and only C semantics) in the first place. The compiler will (hopefully - assuming no bugs) do the right thing when it maps the abstract C semantics of the program to machine code.

Quote

Quote from: gf on October 05, 2023, 09:49:27 pm
However, one thing is dangerous: Don't take the address of a packed struct's member of type T and assign it to a T* pointer!
This is very important!

When you cast an expression into a pointer to some type T, you also promise that it is aligned sufficiently for that type T.
When you take the address of a member in a packed structure, it may not be sufficiently aligned for a pointer to that type!

The pitfall is, of course, that if the type of the member is (say) int and the pointer type is int*, then no type cast is required, and at first glance you might think the assignment is fine, although it is not. Luckily, gcc issues a warning.

Nominal Animal · « **Reply #61 on:** October 06, 2023, 04:20:44 pm »

Quote from: gf on October 06, 2023, 09:25:52 am

What I mean is, one should not think in terms of generated assembly and how you can outwit the compiler to generate the desired assembly code, but rather think in terms of C semantics (and only C semantics) in the first place. The compiler will (hopefully - assuming no bugs) do the right thing when it maps the abstract C semantics of the program to machine code.

Absolutely true.

The main case where load combining would be an issue is with specific memory-mapped peripherals where certain registers need to be accessed in a specific order. There, one should use volatile anyway –– for example, for the packed structure that represents the register fields, or its members –– which stops the load/store combining.
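As a sketch of the peripheral case: the register block below is entirely made up (layout, names, and the ready bit are assumptions for illustration, not any real device). The point is only that with volatile members, the status read and the data read stay separate accesses, in program order.

```c
#include <stdint.h>

/* Hypothetical peripheral register block; not a real device. */
typedef struct {
    volatile uint32_t status;   /* bit 0: data ready (assumed) */
    volatile uint32_t data;     /* received byte in the low 8 bits (assumed) */
} demo_uart_regs;

#define DEMO_UART_STATUS_READY  (1u << 0)

/* Returns 1 and stores the received byte if data was ready, 0 otherwise.
   The status register is always read before the data register. */
static int demo_uart_try_read(demo_uart_regs *regs, uint8_t *out)
{
    if (!(regs->status & DEMO_UART_STATUS_READY))
        return 0;
    *out = (uint8_t)(regs->data & 0xFFu);
    return 1;
}
```

On real hardware regs would point at the peripheral's base address; for testing on a host, an ordinary demo_uart_regs variable in RAM works just as well.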

The other cases I can think of –– spinlocks, generation counters, etc. –– are either similarly solved by volatile, or in rare extreme cases –– locking, or lockless access primitives, based on ll/sc (load-linked, store-conditional) where exact machine instruction sequences are needed –– with gcc-compatible architecture-specific extended inline assembly. Unlike pure assembly sources, when written as static inline accessor functions, using proper machine constraints, these optimize extremely well into their surrounding C code.

Quote from: gf on October 06, 2023, 09:25:52 am

Luckily, gcc issues a warning.

Yes, which is one more reason to always enable warnings by default when writing code!

DiTBho · « **Reply #62 on:** October 07, 2023, 10:43:13 am »

Quote from: Nominal Animal on October 06, 2023, 04:20:44 pm

Quote from: gf on October 06, 2023, 09:25:52 am
What I mean is, one should not think in terms of generated assembly and how you can outwit the compiler to generate the desired assembly code, but rather think in terms of C semantics (and only C semantics) in the first place. The compiler will (hopefully - assuming no bugs) do the right thing when it maps the abstract C semantics of the program to machine code.
Absolutely true.

but that's why threads cannot be safely implemented as a C library, nor can you assume C semantics will produce the right assembly code for trmem, and the more you push {space, speed} optimization, the more likely it won't

That's why I always suggest using assembly directly for this stuff: because you have full control of it!

With my-c you are always guaranteed to get the correct assembly code, but!

but it's only aimed at MIPS5++(2) and doesn't consider other architectures, and I'm sure if I started supporting ARM Cortex things would start to get worse
heavily uses monads and monad semantics to describe the desired behavior(1)
which usually means robust and portable code, and right assembly, but excessive glue code
which means slow performance and large final binary file size

if you think about it, my-c works better precisely because the optimizer works worse

Actually, to be honest, the my-C optimizer does almost nothing, and it's a great thing for me as you don't even have to care about the problems for which in C you often have to use "volatile" to ensure that the optimizer won't "asphalt your code" like a stone-crushing vehicle driven by a monkey.

(1) so, "alignment" is entirely handled by monads, you can create one that takes this into account, you can define a type that does this and use it with a simple typedef. No special pragma or compiler-specific magic directive is needed, and even the pun is handled by monad operators, so exceptions are again handled by monads. Everything is monad oriented, so you can do everything as long as you can define it with monad semantic.

In fact, if you want you can also define monads for "casting", instead of using wilder but very fast "unchecked conversion" (unsafe, but 10x faster!!!) methods.

So you have code that reacts automatically and in the way you want it to react when it accidentally passes a NULL pointer, or something for which in C you would definitely get "undefined behavior". And if you really don't know what to do, you can create a simple monad that points to a panic() in case things don't look right.

(2) with a clever use for specific instructions to access data out of alignment, access data on shared memory, access data on transactional memory, directly manage the cache, directly manage the pipeline etc

Nominal Animal · « **Reply #63 on:** October 07, 2023, 08:09:24 pm »

Quote from: DiTBho on October 07, 2023, 10:43:13 am

but that's why threads cannot be safely implemented as a C library, nor can you assume C semantics will produce the right assembly code for trmem, and the more you push {space, speed} optimization, the more likely it won't

That's why I always suggest using assembly directly for this stuff: because you have full control of it!

If we define 'this stuff' as these hardware-details (transactional memory, spinlocks, ll/sc-based locks or lockless accessors), we are in absolute agreement.

With GCC and clang, correctly written extended inline assembly will use machine constraints and "references" for the registers used, so that the C compiler can optimize the (inlined) code for each use site. For full functions that will always be called and not inlined, external assembly is absolutely fine.

Quote from: DiTBho on October 07, 2023, 10:43:13 am

If you think about it, my-c works better precisely because the optimizer works worse

This is a key observation. The more limited or stricter the optimization strategy for a C-like language, the better the control over the exact code generated, but the worse the portability becomes (especially wrt. compilers generating code for a different hardware architecture from the same source).

While "new" languages like Rust, Julia, etc. are developed to hopefully avoid the core concepts of C that cause that effect, only time will tell, really.

Quote from: DiTBho on October 07, 2023, 10:43:13 am

Actually, to be honest, the my-C optimizer does almost nothing, and it's a great thing for me as you don't even have to care about the problems for which in C you often have to use "volatile" to ensure that the optimizer won't "asphalt your code" like a stone-crushing vehicle driven by a monkey.

Have you noticed how I always accompany my suggested code snippets that include volatile with a specific explanation along the lines of "it stops the compiler from caching and inferring the value from surrounding code, as the value of such variables can change or be modified by code not seen by the compiler", exactly because it is such a heavy hammer? It is way too common for C programmers to simply sprinkle them on variables semi-randomly, until the code seems to work; the often described "let's throw spaghetti at the wall to see what sticks" approach.
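A minimal sketch of the kind of case where volatile is actually warranted, assuming a hosted C environment with <signal.h>. The names got_signal, on_signal() and wait_for_flag() are made up for this illustration: the flag is written by a signal handler the compiler cannot see being called, so the loop has to re-read it on every iteration.

```c
#include <signal.h>

/* Written by the signal handler, read by normal code. 'volatile' tells the
   compiler the value may change outside the code it can see, so each read
   in the loop below has to be a real load. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signum)
{
    (void)signum;
    got_signal = 1;
}

/* Hypothetical helper: installs the handler, raises SIGINT at ourselves,
   waits for the flag, and returns its value (or -1 if signal() failed). */
int wait_for_flag(void)
{
    got_signal = 0;
    if (signal(SIGINT, on_signal) == SIG_ERR)
        return -1;

    raise(SIGINT);      /* the handler runs before raise() returns */

    while (!got_signal)
        ;               /* without volatile, this load could be hoisted out of the loop */

    signal(SIGINT, SIG_DFL);
    return (int)got_signal;
}
```

Note that volatile only addresses the "do not cache or infer this value" part; it does not make accesses atomic or ordered with respect to other threads.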

Base C is a very simple language with very complex optimization engines bolted on top. To use it effectively and to write portable, efficient code, one needs to understand a lot about the language, its theoretical model (the abstract machine the language specification uses), as well as existing machine architectures and their differences.

One of my pet peeves is the ubiquitous opendir()/readdir()/closedir() example/exercise/use case, which is wrong on most current operating systems, because the directory tree may be modified during scanning, and none of the existing examples take that into account. The proper solution is to use POSIX scandir(), glob(), or nftw(), or the fts family of functions from BSD and derivatives, which are supposed to work even when the directory tree is concurrently modified. To implement these, you need to either use a helper process (fts family, using its current working directory to walk the directory tree), or so-called "atfile" support (as e.g. standardized in POSIX via openat(), fstatat(), etc.). Exactly why involves understanding how file systems are implemented, and their access properties (what is atomic, what is not, and so on).
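As a rough illustration of leaving the directory scan to the C library, here is a sketch using POSIX scandir(). The helper names skip_dots() and count_entries() are invented for this example, and how well the scan copes with concurrent modification depends on the C library and filesystem in use.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdlib.h>

/* Filter callback for scandir(): keep everything except "." and "..". */
static int skip_dots(const struct dirent *entry)
{
    const char *n = entry->d_name;
    return !(n[0] == '.' && (n[1] == '\0' || (n[1] == '.' && n[2] == '\0')));
}

/* Hypothetical helper: number of entries in 'path' (excluding "." and ".."),
   or -1 if the directory could not be scanned. */
int count_entries(const char *path)
{
    struct dirent **list = NULL;
    int n = scandir(path, &list, skip_dots, alphasort);
    if (n < 0)
        return -1;

    /* scandir() allocates each entry and the array itself; release both. */
    for (int i = 0; i < n; i++)
        free(list[i]);
    free(list);
    return n;
}
```

In real code you would walk list[i]->d_name before freeing the entries; the point is that the whole listing is produced by one library call instead of a hand-written readdir() loop.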

With experience, that understanding distills into rules of thumb (like using memcpy() or accessor functions, instead of 'tricks' like the UNALIGNED() macros I showed above, to ensure correct machine instruction level access patterns), which often end up "codified" in programming howtos, guides, and books; but when those rules are applied without true understanding, they easily lead to misuse and inefficient or buggy code.
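For reference, the memcpy()-based accessor rule of thumb can be sketched like this. The names load_u32() and store_u32() are only illustrative; the values are in native byte order.

```c
#include <stdint.h>
#include <string.h>

/* Load a 32-bit value (native byte order) from an address of any alignment.
   The compiler is free to turn the memcpy() into a single load where the
   target allows unaligned access, or into byte accesses where it does not. */
static inline uint32_t load_u32(const void *src)
{
    uint32_t value;
    memcpy(&value, src, sizeof value);
    return value;
}

/* Store a 32-bit value (native byte order) to an address of any alignment. */
static inline void store_u32(void *dst, uint32_t value)
{
    memcpy(dst, &value, sizeof value);
}
```

Because no misaligned pointer to a wider type is ever formed, the compiler has no licence to assume alignment and merge neighbouring accesses into instructions that require it.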

A good example of that is using low-level POSIX/Unix/BSD I/O from <unistd.h>, i.e. read() and write(). Ages ago, operating systems never returned short counts for normal files, and the belief that they never do still persists today, even though it is absolutely false. On POSIXy systems, a signal delivered to a userspace handler installed without the SA_RESTART flag can cause them to fail with errno==EINTR; some filesystems, like Linux userspace (FUSE) ones, can return short reads or writes whenever they want, even for local files; slow network connections can cause short reads from shared network folders; and pipes and sockets often return short reads or writes. I've had dozens of arguments about this with otherwise very proficient C programmers, with their argument basically boiling down to "that doesn't happen to me, so I don't care".
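A minimal sketch of a read loop that copes with both short reads and EINTR, assuming a blocking descriptor. The name read_full() is invented for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until 'len' bytes have arrived,
   end of input is reached, or a real error occurs.
   Returns the number of bytes read (less than 'len' only at end of input),
   or -1 with errno set; on error, any bytes already read are not reported. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    size_t have = 0;

    while (have < len) {
        ssize_t n = read(fd, p + have, len - have);
        if (n > 0)
            have += (size_t)n;      /* possibly a short read: keep going */
        else if (n == 0)
            break;                  /* end of input */
        else if (errno != EINTR)
            return -1;              /* real error */
        /* EINTR: interrupted by a signal before any data arrived; retry */
    }
    return (ssize_t)have;
}
```

The same pattern applies to write(): loop until everything has been written, retrying on EINTR and advancing past whatever a short write managed to transfer.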

As to security aspects, don't even get me started.

This leads to an annoying dichotomy on my own part. With threads like this one, where a specific detail is discussed, I do not usually even consider whether there are real use cases for applying the detail or not; I just discuss what I know about it, because I tend to suspect there is a use case, or the OP would have discussed the problem they're trying to solve via that detail. With threads like this one about opendir()/chdir() on this forum, my response will be severely annoying (sorry, MikeK) even if/when they are explicitly useful/correct. See my "original" answer to that question in 2015 at StackOverflow, read the comments, and note how it was not the correct answer to the asker. To me, it really feels like seeing babies draw on the kitchen cabinets with their own poop.

I suspect that something like that dichotomy, mixed with experience that goes beyond the book examples and single architectures, and experience that is based on the book examples and having found that sufficient, is the underlying reason why so many threads about C details branch out and get a bit 'flame-y'.
Now, add to that useful pieces from domain- and hardware-specific variants of C like DiTBho's my-c, or my own that makes arrays base-level objects (allowing buffer overrun detection and tracking at compile time through function hierarchies), and conflagration is nearly assured.

Add to that the high-level concepts like monads (or even threads!) that can be used to sidestep many of the issues in real-world code, and the discussion will vary from friendly to heated, from practical to theoretical, and so on. I for one like to try and be useful, and find all of those aspects interesting, but that leads to walls of text like this post. :-\

Apologies for this and the preceding over-long posts. I'm still trying to learn how to be more concise.

ejeffrey · « **Reply #64 on:** October 09, 2023, 02:03:17 pm »

Quote from: DiTBho on October 07, 2023, 10:43:13 am

Actually, to be honest, the my-C optimizer does almost nothing, and it's a great thing for me as you don't even have to care about the problems for which in C you often have to use "volatile" to ensure that the optimizer won't "asphalt your code" like a stone-crushing vehicle driven by a monkey.

I guess that can be helpful in some situations, but storing values in registers is a pretty important and fundamental optimization. Omitting that is fine for something that's basically a "low level scripting language" where you don't really care about performance, but it's not great for a general purpose language. You can definitely have a "register" keyword that means the opposite of volatile and use it for the most performance critical values, but I think you end up with the same problem.

Also if you want to run on multi core systems (and maybe you don't if the whole premise is that performance isn't a goal) this doesn't really help you for all the reasons that C programmers discovered in the 1990s and 2000s.

DiTBho · « **Reply #65 on:** October 09, 2023, 02:58:08 pm »

Quote from: ejeffrey on October 09, 2023, 02:03:17 pm

I guess that can be helpful in some situations but storing values in registers is a pretty important and fundamental optimization

my-c is helpful because, in my opinion, monads let you avoid weird confusing keywords ("register" is deprecated, d'oh), "magic" flags that you have to pass to GCC among the various compiler flags, or worse, pragmas.

The datatype encloses the behavior, which in turn models the code in the machine layer without further steps. It is the monad that literally dictates the assembly code, and no one can alter what it dictates.

So, in my-c what you write as HL code is exactly what you get as assembly, and you don't even need to constantly peek at the assembly spit out by cc1. This was/is my point: the optimization layer does not distort what you expect the code to do. It does not "unroll loops", it does not "replace anything with memcpy", it never allows itself to remove dead code, nor to eliminate a loop simply because it assumes that the loop condition is always False. It does absolutely none of this; it does nothing that has not been explicitly requested through a meticulous behavioral description that passes through the datatype!

Which means that if you want a variable to be handled not on the stack but in a register, you have to declare it with a monad that does exactly that, and what you get is, in order:

(1) verification that what you ask for is possible; if it is not, the compiler refuses to compile, stating concisely (for now barely intelligibly) why you cannot get what you ask for
(2) the implementation of what you expressly requested!

what you write is what you get: { nothing | job done }

(I think this is the dream HL compiler of every hardware developer who finds themselves writing software...)

This whole project was born precisely because I have to support a multicore MIPS5++, for which I have to be sure that the things I write in HL are implemented exactly as requested.
(this, because the hardware is really bizarre ...)

