Topic: Cortex-M3+ GCC unaligned access quirks

gf · « **Reply #50 on:** October 04, 2023, 01:59:05 pm »

Quote from: Nominal Animal on October 04, 2023, 12:04:41 am

Quote from: bson on October 03, 2023, 10:32:02 pm
I don't see how this will help you at all. ~~An optimizing compiler~~ gcc can still combine multiple 32-bit loads and stores into ldm/stm.
Did you actually verify that? I am claiming gcc (and gcc-compatible compilers like clang and sdcc) does not combine multiple 32-bit loads into ldm if you use the code patterns I showed, and generates 'ldr rn, [rm] @ unaligned' and 'ldr rn, [rm, offset] @ unaligned' (i.e., unaligned-safe loads on Cortex-M3) instead.

It depends on the context, particularly on what the compiler happens to know about the addresses from data flow analysis.

Example: https://godbolt.org/z/b71o8Ks9K

Here, the compiler knows that it places buffer[] at a >= 4-byte aligned address (it is the first object in .bss, and .bss is even 8-byte aligned). Therefore, in test1(), the compiler can still (safely) combine the two loads at buffer+16 and buffer+20 using "ldrd", knowing that these addresses are also properly aligned, even though the two int32_t objects are accessed as members of a packed struct.

However, in test2(), the compiler does not have information about the address stored in ptr. As a result, this optimization is not possible, leading to unaligned "ldr" instructions being emitted (due to accessing the two objects as members of a packed struct).
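
The Compiler Explorer source is not reproduced in the thread, so the following is only a sketch of the kind of code being described. It assumes a packed struct S with a single int32_t member and a global buffer[]; the names follow the later posts, and the exact casts are a guess. Like everything else in this thread, it relies on GCC/clang behaviour for packed structs, not on anything ISO C guarantees.

```c
#include <stdint.h>

/* Packed wrapper: tells GCC/clang the member may sit at any byte address. */
typedef struct {
    int32_t i32;
} __attribute__((__packed__)) S;

/* The compiler places this object itself, so it knows its alignment. */
unsigned char buffer[64];

/* Addresses are known at compile time: the compiler may prove that
   buffer+16 and buffer+20 are aligned and adjacent, and merge the loads. */
int32_t test1(void)
{
    return ((const S *)(buffer + 16))->i32
         + ((const S *)(buffer + 20))->i32;
}

/* Nothing is known about ptr: the compiler has to assume the two
   words may be unaligned and emit loads that tolerate that. */
int32_t test2(const unsigned char *ptr)
{
    return ((const S *)(ptr + 16))->i32
         + ((const S *)(ptr + 20))->i32;
}
```

Both functions return the same value for the same data; only the generated load instructions are expected to differ.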

Nominal Animal · « **Reply #51 on:** October 04, 2023, 02:43:32 pm »

Quote from: gf on October 04, 2023, 01:59:05 pm

It depends on the context, particularly on what the compiler happens to know about the addresses from data flow analysis.

Example: https://godbolt.org/z/b71o8Ks9K

I was unclear: you need the exactly once approach to avoid load combining on Cortex-M3, and you get that using a temporary (differently-qualified or different-type) pointer to the data, that ensures exact individual loads with current gcc and clang optimization engines:

Code: [Select]

int32_t test1(void)
{
    const volatile S *const ref = (const volatile S *)buffer;
    return (ref+16)->i32 +
           (ref+20)->i32;
}

With trunk gcc or clang on Cortex-M3, -Os gives you unaligned individual loads, and -O2 and -O3 individual aligned loads.

(Note: volatile may not be necessary in all cases. It is the type-cast pointer use that acts as the barrier to load merging, and volatile is the logically correct qualifier for the access here ("compiler, do not cache the value of this access"). As there is a type or qualifier difference between the original reference and the reference used for the access, the optimization engines will not ignore the temporary pointer, and will thus use the "rules" set by the temporary pointer for the access.)

Essentially, rewriting your example code using my approach, you get

Code: [Select]

#include <stdint.h>

typedef struct {
    int32_t i32;
} __attribute__((__packed__)) S;

unsigned char buffer[64];

int32_t test1(void)
{
    const volatile S *const src = (const volatile S *)buffer;
    return (src+16)->i32 +
           (src+20)->i32;
}

int32_t test2(const unsigned char *buf)
{
    const volatile S *const src = (const volatile S *)buf;
    return (src+16)->i32 +
           (src+20)->i32;
}

which compiles, depending on the compiler and optimization level, to essentially

Code: [Select]

test1:
    ldr     r3, .L3
    ldr     r0, [r3, #64]
    ldr     r3, [r3, #80]
    add     r0, r0, r3
    bx      lr

.L3:
    .word   .LANCHOR0

test2:
    ldr     r2, [r0, #64]     @ unaligned
    ldr     r0, [r0, #80]     @ unaligned
    add     r0, r0, r2
    bx      lr

buffer:

with the second and third ldr instructions in test1 sometimes (depending on optimization) using the unaligned-safe variant marked @ unaligned (since the compiler chooses the location of buffer, those two loads are technically never unaligned), but never combining the loads.

On OS ABI versions (you can run e.g. Linux on Cortex-M3), the buffer address is typically relative to PC, so the buffer address calculation will differ on those; but the actual loads will still be via individual ldr instructions, instead of a combined one.

I added an EDITED paragraph and an EDITED sentence to my post, to hopefully clarify this. Let me know if any of you object, or find a case where this pattern fails.

gf · « **Reply #52 on:** October 04, 2023, 04:47:52 pm »

Quote from: Nominal Animal on October 04, 2023, 02:43:32 pm

I was unclear: you need the exactly once approach to avoid load combining on Cortex-M3, and you get that using a temporary pointer to (volatile) data

Completely agree, accessing the i32 member via a volatile S* pointer has of course volatile semantics, preventing load combining.

Btw, your version of test1

Code: [Select]

int32_t test1(void)
{
    const volatile S * const ref = (const volatile S *)buffer;
    return (ref+16)->i32 + (ref+20)->i32;
}

cannot combine the loads at all, because the two objects are not adjacent, but have an offset of 16 bytes. So you always get individual ldr instructions, both with and without volatile.

Quote from: Nominal Animal on October 04, 2023, 02:43:32 pm

Note: volatile may not be necessary in all cases. It is the type-cast pointer use that acts as the barrier to load merging...

A type-cast to a non-volatile pointer obviously does not prevent load merging: https://godbolt.org/z/dh8xb8Y81
Why should it?

[ ~~Yet another reason to renounce load merging is, of course, if the compiler cannot prove adjacency and proper alignment of the addresses at compile time. If the compiler is unsure, it must not merge.~~
EDIT: Yet another reason to renounce load merging is, of course, if the compiler cannot prove adjacency or if it can disprove proper alignment of the addresses at compile time. ]

EDIT: A store can, of course, also act as a load merging barrier (-> strict aliasing): https://godbolt.org/z/Pb4GMTdsY

Nominal Animal · « **Reply #53 on:** October 04, 2023, 06:32:56 pm »

Quote from: gf on October 04, 2023, 04:47:52 pm

Btw, your version of test1 cannot combine the loads at all, because the two objects are not adjacent, but have an offset of 16 bytes. So you always get individual ldr instructions, both with and without volatile.

D'oh! True. To fix, replace (src+16) and (src+20) with (src+4) and (src+5).

This has no effect on my argument, as no combining will occur with the fix applied, either.
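
With that fix applied, a self-contained version of test1 might look like the sketch below. The explicit cast is added here so that it compiles without an incompatible-pointer diagnostic. Because sizeof(S) is 4, src+4 and src+5 correspond to byte offsets 16 and 20, i.e. two adjacent words, which is the case where merging would otherwise be possible.

```c
#include <stdint.h>

typedef struct {
    int32_t i32;
} __attribute__((__packed__)) S;

/* Pointer arithmetic on S* advances in units of sizeof(S). */
_Static_assert(sizeof(S) == 4, "packed S must be exactly 4 bytes");

unsigned char buffer[64];

int32_t test1(void)
{
    /* Temporary pointer to volatile data: each access is performed
       exactly once and separately, so the loads are not merged. */
    const volatile S *const src = (const volatile S *)buffer;
    return (src + 4)->i32 +   /* bytes 16..19 */
           (src + 5)->i32;    /* bytes 20..23 */
}
```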

Quote from: gf on October 04, 2023, 04:47:52 pm

Quote from: Nominal Animal on October 04, 2023, 02:43:32 pm
Note: volatile may not be necessary in all cases. It is the type-cast pointer use that acts as the barrier to load merging...
A type-cast to a non-volatile pointer obviously does not prevent load merging: https://godbolt.org/z/dh8xb8Y81

Load merging will not occur if the buffer data is volatile, i.e. when you have volatile char buffer[64];, nor when the pointer address is based on a char pointer whose value is unknown at compile time, i.e. int32_t test1(char *buffer).

volatile is the correct qualifier to use in all cases, but one should not be surprised if in certain cases the compiler will generate unaligned non-merged accesses even when volatile is omitted. (Those certain cases are when the compiler cannot make any assumptions about the address being referenced, and cannot cache or use a cached value at that address, because there is no reuse.)

Quote from: gf on October 04, 2023, 04:47:52 pm

[ Yet another reason to renounce load merging is, of course, if the compiler cannot prove adjacency and proper alignment of the addresses at compile time. If the compiler is unsure, it must not merge. ]

True. Here, the sequence point rules (specifically, order and separateness of accesses to volatile-qualified objects) are even more important, because they stop the compiler from merging even when it can prove adjacency and alignment at compile time.

Hmm... this should mean that declaring the unaligned_nn member(s) volatile would achieve the same; and quick testing suggests this is the case.

Therefore, for unaligned_nn access structures, perhaps I should suggest something like the following, instead of the ones I showed earlier?

Code: [Select]

#include <stdint.h>

typedef struct {
    union {
        volatile uint16_t    u16;
        volatile int16_t     i16;
    };
} __attribute__((__packed__)) unaligned_16;

typedef struct {
    union {
        volatile uint32_t    u32;
        volatile int32_t     i32;
        volatile float       f32;
    };
} __attribute__((__packed__)) unaligned_32;

typedef struct {
    union {
        volatile uint64_t    u64;
        volatile int64_t     i64;
        volatile double      f64;
    };
} __attribute__((__packed__)) unaligned_64;

It generates separate (unaligned-allowed) loads even for
typedef struct { unaligned_32 a, b; } MyStruct;
MyStruct example;
int32_t test(void) { return example.a.i32 + example.b.i32; }
which I assume can be used as a litmus test here.

(I have not thought of any real downsides to the volatile-ness. If the same unaligned structure field is used multiple times, I have always recommended using an explicit temporary local variable to hold the value: optimizing away local temporary variables is so common that compilers are very good at it, whereas optimizing multiple accesses to the same member of a structure takes more optimization smarts, so better code is generated with an explicit temporary local variable. Plus, if one uses a logically descriptive name for the temporary variable, the code tends to be easier to maintain as well.)
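
As an illustration of the temporary-variable advice, here is a minimal sketch. The helper clamp_length() and its parameters are made up for this example; only unaligned_32 comes from the post above. The field is read once into a descriptively named local, and every later use refers to the local, so the volatile member is accessed exactly once.

```c
#include <stdint.h>

typedef struct {
    union {
        volatile uint32_t    u32;
        volatile int32_t     i32;
        volatile float       f32;
    };
} __attribute__((__packed__)) unaligned_32;

/* Hypothetical helper: clamp a possibly unaligned length field to a limit. */
static inline uint32_t clamp_length(const unaligned_32 *field, uint32_t limit)
{
    /* Single (possibly unaligned) load; the value is used twice below. */
    const uint32_t length = field->u32;

    return (length > limit) ? limit : length;
}
```

Writing field->u32 in both places instead would force two loads, because each access to a volatile object must actually be performed.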

gf · « **Reply #54 on:** October 04, 2023, 09:18:46 pm »

Quote from: Nominal Animal on October 04, 2023, 06:32:56 pm

Load merging will not occur if the buffer data is volatile, i.e. when you have volatile char buffer[64];

Indeed, gcc apparently propagates volatile as a hidden attribute of values from buffer to derived values, so that the load ends up volatile as well, even if the temporary pointer declaration does not contain the volatile keyword. However, I think this kind of propagation is not guaranteed by the C standard, so a portable program should possibly not rely on it.

Quote

volatile is the correct qualifier to use in all cases, but one should not be surprised if in certain cases the compiler will generate unaligned non-merged accesses even when volatile is omitted.

Sure, there are various other possible reasons that have nothing to do with volatile.

Quote

Therefore, for unaligned_nn access structures, perhaps I should suggest something like the following, instead of the ones I showed earlier?

It is certainly a way to achieve that any load/store from/to these members has volatile semantics.

Nominal Animal · « **Reply #55 on:** October 05, 2023, 06:44:49 pm »

Quote from: gf on October 04, 2023, 09:18:46 pm

I think this kind of propagation is not guaranteed by the C standard, so a portable program should possibly not rely on it.

As even the packed attribute is a C extension, and not part of the C standard, none of this is "guaranteed"; and it basically depends on current compilers' optimization strategies. (I use one of -O2, -Os, -Oz (clang), or -Og, plus any specific optimization features I happen to need.)
For GCC use, I would personally add a note in the build documentation about relying on this, and a simple example program for one to verify the machine code generated, when changing GCC major version, or between GCC and clang, for example. Even when I trust, I do try to keep verification easy. Litmus tests with simple functions that generate easily determined machine code patterns are good for this, in my opinion.
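
Such a litmus test could be as small as the following sketch. The file name litmus.c and the litmus_* names are made up for illustration; the command in the comment uses standard GCC options, but the cross-compiler name depends on the toolchain installed.

```c
/* litmus.c: are two adjacent packed, volatile fields loaded separately?
 *
 * Generate assembly with something like
 *     arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -O2 -S litmus.c
 * and check that litmus_sum contains two separate ldr instructions
 * for the fields (no ldrd, no ldm). Repeat for each optimization level
 * and compiler version the project is built with.
 */
#include <stdint.h>

typedef struct {
    volatile int32_t i32;
} __attribute__((__packed__)) litmus_i32;

/* Two adjacent fields: the worst case, where merging would be tempting. */
typedef struct {
    litmus_i32 a, b;
} litmus_pair;

litmus_pair litmus_example;

int32_t litmus_sum(void)
{
    return litmus_example.a.i32 + litmus_example.b.i32;
}
```

Keeping the function this small makes the expected instruction pattern easy to spot, or to grep for in a build script.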

As mentioned previously, all of this would be easily solved by a __builtin_unaligned(pointer_expression) built-in, or even an unaligned C qualifier keyword.

I do need unaligned accesses most often when transferring information between other hosts or devices (microcontrollers or computers), and there the accessor function approach with specific endianness does tend to generate near-optimal code. (See this post in a previous thread about 'packed' attribute for a code example.)
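
The linked post is not reproduced here, so the following is only a sketch of what such accessor functions typically look like; the names get_u32le, set_u32le, and get_u16be are assumptions made for this example. Each function touches the buffer one byte at a time, so alignment never matters, and the byte order is fixed by the code, not by the host.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value from any byte address. */
static inline uint32_t get_u32le(const unsigned char *src)
{
    return  (uint32_t)src[0]
         | ((uint32_t)src[1] << 8)
         | ((uint32_t)src[2] << 16)
         | ((uint32_t)src[3] << 24);
}

/* Write a 32-bit value in little-endian byte order to any byte address. */
static inline void set_u32le(unsigned char *dst, uint32_t value)
{
    dst[0] = (unsigned char)(value & 0xFFu);
    dst[1] = (unsigned char)((value >> 8) & 0xFFu);
    dst[2] = (unsigned char)((value >> 16) & 0xFFu);
    dst[3] = (unsigned char)((value >> 24) & 0xFFu);
}

/* Read a 16-bit big-endian (network order) value from any byte address. */
static inline uint16_t get_u16be(const unsigned char *src)
{
    return (uint16_t)(((unsigned int)src[0] << 8) | (unsigned int)src[1]);
}
```

On targets with matching endianness, current compilers often recognize these patterns and reduce them to a single load or store where that is safe.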

For the macro approach (if that is preferable over the accessor function approach), considering the very useful feedback from gf and others, I would suggest something along the following lines:

Code: [Select]

// SPDX-License-Identifier: CC0-1.0
#ifndef  UNALIGNED
#define  UNALIGNED(ptr)  ((_Generic(                             (ptr)                                                   \
                                   ,                unsigned long long *: unaligned_unsigned_long_long_at                \
                                   ,          const unsigned long long *: unaligned_const_unsigned_long_long_at          \
                                   ,       volatile unsigned long long *: unaligned_volatile_unsigned_long_long_at       \
                                   , const volatile unsigned long long *: unaligned_const_volatile_unsigned_long_long_at \
                                   ,                         long long *: unaligned_long_long_at                         \
                                   ,                   const long long *: unaligned_const_long_long_at                   \
                                   ,                volatile long long *: unaligned_volatile_long_long_at                \
                                   ,          const volatile long long *: unaligned_const_volatile_long_long_at          \
                                   ,                     unsigned long *: unaligned_unsigned_long_at                     \
                                   ,               const unsigned long *: unaligned_const_unsigned_long_at               \
                                   ,            volatile unsigned long *: unaligned_volatile_unsigned_long_at            \
                                   ,      const volatile unsigned long *: unaligned_const_volatile_unsigned_long_at      \
                                   ,                              long *: unaligned_long_at                              \
                                   ,                        const long *: unaligned_const_long_at                        \
                                   ,                     volatile long *: unaligned_volatile_long_at                     \
                                   ,               const volatile long *: unaligned_const_volatile_long_at               \
                                   ,                      unsigned int *: unaligned_unsigned_int_at                      \
                                   ,                const unsigned int *: unaligned_const_unsigned_int_at                \
                                   ,             volatile unsigned int *: unaligned_volatile_unsigned_int_at             \
                                   ,       const volatile unsigned int *: unaligned_const_volatile_unsigned_int_at       \
                                   ,                               int *: unaligned_int_at                               \
                                   ,                         const int *: unaligned_const_int_at                         \
                                   ,                      volatile int *: unaligned_volatile_int_at                      \
                                   ,                const volatile int *: unaligned_const_volatile_int_at                \
                                   ,                    unsigned short *: unaligned_unsigned_short_at                    \
                                   ,              const unsigned short *: unaligned_const_unsigned_short_at              \
                                   ,           volatile unsigned short *: unaligned_volatile_unsigned_short_at           \
                                   ,     const volatile unsigned short *: unaligned_const_volatile_unsigned_short_at     \
                                   ,                             short *: unaligned_short_at                             \
                                   ,                       const short *: unaligned_const_short_at                       \
                                   ,                    volatile short *: unaligned_volatile_short_at                    \
                                   ,              const volatile short *: unaligned_const_volatile_short_at              \
                                   ,                     unsigned char *: unaligned_unsigned_char_at                     \
                                   ,               const unsigned char *: unaligned_const_unsigned_char_at               \
                                   ,            volatile unsigned char *: unaligned_volatile_unsigned_char_at            \
                                   ,      const volatile unsigned char *: unaligned_const_volatile_unsigned_char_at      \
                                   ,                       signed char *: unaligned_signed_char_at                       \
                                   ,                 const signed char *: unaligned_const_signed_char_at                 \
                                   ,              volatile signed char *: unaligned_volatile_signed_char_at              \
                                   ,        const volatile signed char *: unaligned_const_volatile_signed_char_at        \
                                   ,                              char *: unaligned_char_at                              \
                                   ,                        const char *: unaligned_const_char_at                        \
                                   ,                     volatile char *: unaligned_volatile_char_at                     \
                                   ,               const volatile char *: unaligned_const_volatile_char_at               \
                                   )(ptr))->value)

#define  PACKED_STRUCT(type)  struct { volatile type value; } __attribute__((__packed__))

typedef  PACKED_STRUCT(const unsigned long long)  unaligned_const_unsigned_long_long_struct;
typedef  PACKED_STRUCT(unsigned long long)        unaligned_unsigned_long_long_struct;
typedef  PACKED_STRUCT(const long long)           unaligned_const_long_long_struct;
typedef  PACKED_STRUCT(long long)                 unaligned_long_long_struct;
typedef  PACKED_STRUCT(const unsigned long)       unaligned_const_unsigned_long_struct;
typedef  PACKED_STRUCT(unsigned long)             unaligned_unsigned_long_struct;
typedef  PACKED_STRUCT(const long)                unaligned_const_long_struct;
typedef  PACKED_STRUCT(long)                      unaligned_long_struct;
typedef  PACKED_STRUCT(const unsigned int)        unaligned_const_unsigned_int_struct;
typedef  PACKED_STRUCT(unsigned int)              unaligned_unsigned_int_struct;
typedef  PACKED_STRUCT(const int)                 unaligned_const_int_struct;
typedef  PACKED_STRUCT(int)                       unaligned_int_struct;
typedef  PACKED_STRUCT(const unsigned short)      unaligned_const_unsigned_short_struct;
typedef  PACKED_STRUCT(unsigned short)            unaligned_unsigned_short_struct;
typedef  PACKED_STRUCT(const short)               unaligned_const_short_struct;
typedef  PACKED_STRUCT(short)                     unaligned_short_struct;
typedef  PACKED_STRUCT(const unsigned char)       unaligned_const_unsigned_char_struct;
typedef  PACKED_STRUCT(unsigned char)             unaligned_unsigned_char_struct;
typedef  PACKED_STRUCT(const signed char)         unaligned_const_signed_char_struct;
typedef  PACKED_STRUCT(signed char)               unaligned_signed_char_struct;
typedef  PACKED_STRUCT(const char)                unaligned_const_char_struct;
typedef  PACKED_STRUCT(char)                      unaligned_char_struct;

#undef   PACKED_STRUCT

#define  HELPER_FUNCTION  __attribute__((__unused__)) static inline volatile

HELPER_FUNCTION        unaligned_unsigned_long_long_struct        *unaligned_unsigned_long_long_at                               (unsigned long long *ref) { return (volatile       unaligned_unsigned_long_long_struct       *)ref; }
HELPER_FUNCTION        unaligned_unsigned_long_long_struct        *unaligned_volatile_unsigned_long_long_at             (volatile unsigned long long *ref) { return (volatile       unaligned_unsigned_long_long_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_long_long_struct  *unaligned_const_unsigned_long_long_at                   (const unsigned long long *ref) { return (volatile const unaligned_const_unsigned_long_long_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_long_long_struct  *unaligned_const_volatile_unsigned_long_long_at (const volatile unsigned long long *ref) { return (volatile const unaligned_const_unsigned_long_long_struct *)ref; }

HELPER_FUNCTION        unaligned_long_long_struct        *unaligned_long_long_at                               (long long *ref) { return (volatile       unaligned_long_long_struct       *)ref; }
HELPER_FUNCTION        unaligned_long_long_struct        *unaligned_volatile_long_long_at             (volatile long long *ref) { return (volatile       unaligned_long_long_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_long_long_struct  *unaligned_const_long_long_at                   (const long long *ref) { return (volatile const unaligned_const_long_long_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_long_long_struct  *unaligned_const_volatile_long_long_at (const volatile long long *ref) { return (volatile const unaligned_const_long_long_struct *)ref; }

HELPER_FUNCTION        unaligned_unsigned_long_struct        *unaligned_unsigned_long_at                               (unsigned long *ref) { return (volatile       unaligned_unsigned_long_struct       *)ref; }
HELPER_FUNCTION        unaligned_unsigned_long_struct        *unaligned_volatile_unsigned_long_at             (volatile unsigned long *ref) { return (volatile       unaligned_unsigned_long_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_long_struct  *unaligned_const_unsigned_long_at                   (const unsigned long *ref) { return (volatile const unaligned_const_unsigned_long_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_long_struct  *unaligned_const_volatile_unsigned_long_at (const volatile unsigned long *ref) { return (volatile const unaligned_const_unsigned_long_struct *)ref; }

HELPER_FUNCTION        unaligned_long_struct        *unaligned_long_at                               (long *ref) { return (volatile       unaligned_long_struct       *)ref; }
HELPER_FUNCTION        unaligned_long_struct        *unaligned_volatile_long_at             (volatile long *ref) { return (volatile       unaligned_long_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_long_struct  *unaligned_const_long_at                   (const long *ref) { return (volatile const unaligned_const_long_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_long_struct  *unaligned_const_volatile_long_at (const volatile long *ref) { return (volatile const unaligned_const_long_struct *)ref; }

HELPER_FUNCTION        unaligned_unsigned_int_struct        *unaligned_unsigned_int_at                               (unsigned int *ref) { return (volatile       unaligned_unsigned_int_struct       *)ref; }
HELPER_FUNCTION        unaligned_unsigned_int_struct        *unaligned_volatile_unsigned_int_at             (volatile unsigned int *ref) { return (volatile       unaligned_unsigned_int_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_int_struct  *unaligned_const_unsigned_int_at                   (const unsigned int *ref) { return (volatile const unaligned_const_unsigned_int_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_int_struct  *unaligned_const_volatile_unsigned_int_at (const volatile unsigned int *ref) { return (volatile const unaligned_const_unsigned_int_struct *)ref; }

HELPER_FUNCTION        unaligned_int_struct        *unaligned_int_at                               (int *ref) { return (volatile       unaligned_int_struct       *)ref; }
HELPER_FUNCTION        unaligned_int_struct        *unaligned_volatile_int_at             (volatile int *ref) { return (volatile       unaligned_int_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_int_struct  *unaligned_const_int_at                   (const int *ref) { return (volatile const unaligned_const_int_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_int_struct  *unaligned_const_volatile_int_at (const volatile int *ref) { return (volatile const unaligned_const_int_struct *)ref; }

HELPER_FUNCTION        unaligned_unsigned_short_struct        *unaligned_unsigned_short_at                               (unsigned short *ref) { return (volatile       unaligned_unsigned_short_struct       *)ref; }
HELPER_FUNCTION        unaligned_unsigned_short_struct        *unaligned_volatile_unsigned_short_at             (volatile unsigned short *ref) { return (volatile       unaligned_unsigned_short_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_short_struct  *unaligned_const_unsigned_short_at                   (const unsigned short *ref) { return (volatile const unaligned_const_unsigned_short_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_short_struct  *unaligned_const_volatile_unsigned_short_at (const volatile unsigned short *ref) { return (volatile const unaligned_const_unsigned_short_struct *)ref; }

HELPER_FUNCTION        unaligned_short_struct        *unaligned_short_at                               (short *ref) { return (volatile       unaligned_short_struct       *)ref; }
HELPER_FUNCTION        unaligned_short_struct        *unaligned_volatile_short_at             (volatile short *ref) { return (volatile       unaligned_short_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_short_struct  *unaligned_const_short_at                   (const short *ref) { return (volatile const unaligned_const_short_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_short_struct  *unaligned_const_volatile_short_at (const volatile short *ref) { return (volatile const unaligned_const_short_struct *)ref; }

HELPER_FUNCTION        unaligned_unsigned_char_struct        *unaligned_unsigned_char_at                               (unsigned char *ref) { return (volatile       unaligned_unsigned_char_struct       *)ref; }
HELPER_FUNCTION        unaligned_unsigned_char_struct        *unaligned_volatile_unsigned_char_at             (volatile unsigned char *ref) { return (volatile       unaligned_unsigned_char_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_char_struct  *unaligned_const_unsigned_char_at                   (const unsigned char *ref) { return (volatile const unaligned_const_unsigned_char_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_unsigned_char_struct  *unaligned_const_volatile_unsigned_char_at (const volatile unsigned char *ref) { return (volatile const unaligned_const_unsigned_char_struct *)ref; }

HELPER_FUNCTION        unaligned_signed_char_struct        *unaligned_signed_char_at                               (signed char *ref) { return (volatile       unaligned_signed_char_struct       *)ref; }
HELPER_FUNCTION        unaligned_signed_char_struct        *unaligned_volatile_signed_char_at             (volatile signed char *ref) { return (volatile       unaligned_signed_char_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_signed_char_struct  *unaligned_const_signed_char_at                   (const signed char *ref) { return (volatile const unaligned_const_signed_char_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_signed_char_struct  *unaligned_const_volatile_signed_char_at (const volatile signed char *ref) { return (volatile const unaligned_const_signed_char_struct *)ref; }

HELPER_FUNCTION        unaligned_char_struct        *unaligned_char_at                               (char *ref) { return (volatile       unaligned_char_struct       *)ref; }
HELPER_FUNCTION        unaligned_char_struct        *unaligned_volatile_char_at             (volatile char *ref) { return (volatile       unaligned_char_struct       *)ref; }
HELPER_FUNCTION  const unaligned_const_char_struct  *unaligned_const_char_at                   (const char *ref) { return (volatile const unaligned_const_char_struct *)ref; }
HELPER_FUNCTION  const unaligned_const_char_struct  *unaligned_const_volatile_char_at (const volatile char *ref) { return (volatile const unaligned_const_char_struct *)ref; }

#undef  HELPER_FUNCTION

#endif /* UNALIGNED() */

which requires C11 _Generic support and packed attribute support to work. It generates no global or static symbols or functions, and generates no machine code if not used. Specifically, the __attribute__((__unused__)) attribute and the static inline specifiers should tell the compiler that these helper functions are local to the current compilation unit (C source file) and can/should be inlined, but no code nor any warning needs to be generated if they are not used.

It exports an UNALIGNED(pointer_expression) macro, which dereferences the supplied pointer expression using the pointer target type, without requiring the pointer to be properly aligned. It can be used for assignments also. For example:
struct { int a, b; } foo;
int foo_sum(void) { return UNALIGNED(&foo.a) + UNALIGNED(&foo.b); }
void foo_clear(void) { UNALIGNED(&foo.a) = 0; UNALIGNED(&foo.b) = 0; }
will generate separate ldr and str instructions for the two fields on Cortex-M3.

Note that it should support the proper qualifiers, i.e. with
const struct { int a, b; } foo = { .a = 1, .b = 2 };
or with
const struct { const int a; int b; } foo = { .a = 1, .b = 2 };
foo_sum() will work, but foo_clear() will fail at compile time (because of trying to modify a read-only object).

It also works for the
int test1(char *buf) { return UNALIGNED((int *)(buf + 16)) + UNALIGNED((int *)(buf + 20)); }
and
char buffer[32];
int test2(void) { return UNALIGNED((int *)(buffer + 16)) + UNALIGNED((int *)(buffer + 20)); }
cases, generating separate loads for each int.

For current C compilers, the fixed-width types defined in <stdint.h> are typedefs of the standard types, so the above should work for those types also.
If one of them turns out not to be covered, it is trivial to add afterwards. Such types cannot be added just in case, because duplicate types in the _Generic association list cause warnings/errors. Similarly, adding float, double, and long double floating-point type support should be straightforward.

(Each additional type adds four lines (type:, const type:, volatile type:, and const volatile type:) into the _Generic, two PACKED_STRUCT structures (one const, the other non-const), and four HELPER_FUNCTION helper functions.)
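
To make that recipe concrete, here is a reduced, self-contained sketch that applies it to float only. It is not part of the header above: the macro is named UNALIGNED_FLOAT here (an assumption) so that it compiles on its own; in the real header the four association lines would go into the existing _Generic list instead.

```c
/* Two packed structures: one non-const, one const. */
typedef struct { volatile float value; } __attribute__((__packed__))
        unaligned_float_struct;
typedef struct { volatile const float value; } __attribute__((__packed__))
        unaligned_const_float_struct;

#define HELPER_FUNCTION  __attribute__((__unused__)) static inline volatile

/* Four helpers, one per qualifier combination of the pointer target. */
HELPER_FUNCTION unaligned_float_struct *
unaligned_float_at(float *ref)
{ return (volatile unaligned_float_struct *)ref; }

HELPER_FUNCTION unaligned_float_struct *
unaligned_volatile_float_at(volatile float *ref)
{ return (volatile unaligned_float_struct *)ref; }

HELPER_FUNCTION const unaligned_const_float_struct *
unaligned_const_float_at(const float *ref)
{ return (volatile const unaligned_const_float_struct *)ref; }

HELPER_FUNCTION const unaligned_const_float_struct *
unaligned_const_volatile_float_at(const volatile float *ref)
{ return (volatile const unaligned_const_float_struct *)ref; }

#undef HELPER_FUNCTION

/* Four association lines; the selected helper is called with ptr,
   and the result is dereferenced through the packed struct. */
#define UNALIGNED_FLOAT(ptr)  ((_Generic((ptr)                                    \
            ,                float *: unaligned_float_at                \
            ,          const float *: unaligned_const_float_at          \
            ,       volatile float *: unaligned_volatile_float_at       \
            , const volatile float *: unaligned_const_volatile_float_at \
            )(ptr))->value)
```

Usage is the same as for the integer types: UNALIGNED_FLOAT(p) reads the float at p, and UNALIGNED_FLOAT(p) = x stores to it when p does not point to const.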

Note that the char type is special, because char is always a distinct type from both signed char and unsigned char, even though it behaves like one of them. For short, int, long (= long int), and long long (= long long int), signed itype is the same as itype.
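
The distinction can be checked with _Generic itself. The macro names CHAR_KIND and IS_INT below are made up for this sketch; the point is that all three character types can appear in one association list, whereas listing both int and signed int would be rejected as a duplicate.

```c
/* 0 = char, 1 = signed char, 2 = unsigned char, -1 = anything else.
   Listing all three is valid because they are three distinct types. */
#define CHAR_KIND(expr)  _Generic((expr)       \
            ,          char: 0                 \
            ,   signed char: 1                 \
            , unsigned char: 2                 \
            ,       default: -1)

/* For int, "signed int" names the very same type, so one entry covers both.
   Adding a separate "signed int:" line here would not compile. */
#define IS_INT(expr)  _Generic((expr), int: 1, default: 0)

/* Example use: classify a plain char object. */
int plain_char_kind(void)
{
    char c = 'x';
    return CHAR_KIND(c);
}
```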

SiliconWizard · « **Reply #56 on:** October 05, 2023, 07:11:21 pm »

Quote from: Nominal Animal on October 05, 2023, 06:44:49 pm

Quote from: gf on October 04, 2023, 09:18:46 pm
I think this kind of propagation is not guaranteed by the C standard, so a portable program should possibly not rely on it.
As even the packed attribute is a C extension, and not part of the C standard, none of this is "guaranteed"; and it basically depends on current compilers' optimization strategies. (I use one of -O2, -Os, -Oz (clang), or -Og, plus any specific optimization features I happen to need.)

Yep. It is not standard, and thus not portable.

Quote from: Nominal Animal on October 05, 2023, 06:44:49 pm

For GCC use, I would personally add a note in the build documentation about relying on this,

Absolutely, something I have recommended as well. Document whatever part of your code is not portable, why, and how to port it if needed.

As I said, I personally try to avoid unaligned accesses as much as possible in general, rather than having to rely on compiler specifics. In other words, the main case where I would use packed structs is when I can guarantee the natural alignment of all members - which obviously is only if I can control the layout of said structs.

When dealing with packed structs with unaligned members, I'll usually use a memcpy() approach or similar, rather than compiler "tricks" to handle unaligned accesses properly.
But of course, like all rules, there are exceptions, and in this case your suggestions above look useful.
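The memcpy() approach mentioned here can be sketched roughly as follows. The helper names load_u32() and store_u32() are made up for this illustration and do not come from the thread; the only assumption is that the compiler is free to lower the fixed-size memcpy() into whatever access sequence is safe for the target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load a 32-bit value, in native byte order, from an
   address with no alignment guarantee. memcpy() has no alignment
   requirement on its arguments, so this is valid for any address. */
static inline uint32_t load_u32(const void *src)
{
    uint32_t value;
    memcpy(&value, src, sizeof value);
    return value;
}

/* Hypothetical helper: the matching store. */
static inline void store_u32(void *dst, uint32_t value)
{
    memcpy(dst, &value, sizeof value);
}
```

For example, store_u32(buf + 1, x) followed by load_u32(buf + 1) round-trips x for any unsigned char buffer buf that is large enough, regardless of how buf itself is aligned.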

Nominal Animal · « **Reply #57 on:** October 05, 2023, 08:39:37 pm »

Quote from: SiliconWizard on October 05, 2023, 07:11:21 pm

When dealing with packed structs with unaligned members, I'll usually use a memcpy() approach or similar, rather than compiler "tricks" to handle unaligned accesses properly.

Yep; as I said, for multi-byte integer values I tend to use accessor functions that take an unsigned char buffer, and calculate the multi-byte value explicitly. While they do not always generate optimal or even near-optimal code, they never bug out on me.
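A minimal sketch of such accessor functions, assuming little-endian data in the buffer. The names get_u32le() and put_u32le() are illustrative only and are not the accessors actually used by the poster.

```c
#include <stdint.h>

/* Hypothetical accessor: build a 32-bit value from four bytes stored
   least significant byte first. Only byte accesses are used, so the
   address needs no particular alignment. */
static inline uint32_t get_u32le(const unsigned char *src)
{
    return  (uint32_t)src[0]
         | ((uint32_t)src[1] << 8)
         | ((uint32_t)src[2] << 16)
         | ((uint32_t)src[3] << 24);
}

/* Hypothetical accessor: store a 32-bit value least significant byte first. */
static inline void put_u32le(unsigned char *dst, uint32_t value)
{
    dst[0] = (unsigned char)(value & 0xFFu);
    dst[1] = (unsigned char)((value >> 8) & 0xFFu);
    dst[2] = (unsigned char)((value >> 16) & 0xFFu);
    dst[3] = (unsigned char)((value >> 24) & 0xFFu);
}
```

Because the byte order is spelled out in the code, the result is the same on little-endian and big-endian hosts, which is a side benefit over a native-order load.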

The macro approach is an alternative to those who refuse to use or cannot use inline accessor functions for some reason.

My above macro suggestion does "fail" for RISC-V (rv32gc) using clang, whenever the compiler can determine at compile time the address will be properly aligned.

Renaming the macro as MAYBE_UNALIGNED() would make sense, because then it intuitively works correctly even on rv32gc: it dereferences the specified pointer expression to the pointed-to value (if it is one of the base integer types, currently) without combining loads, using unaligned accesses unless the compiler can infer the pointer is aligned at compile time.

That is,
struct { int a, b; } foo;
int foo_sum(void) { return UNALIGNED(&foo.a) + UNALIGNED(&foo.b); }
will generate aligned loads on rv32gc when compiled with clang. Even
typedef struct { int a, b; } __attribute__((packed)) foo;
int foo_sum(foo *f) { return UNALIGNED(&(f->a)) + UNALIGNED(&(f->b)); }
foo f;
int f_sum(void) { return foo_sum(&f); }
generates aligned loads for f_sum(). Of course, foo_sum() when called from a different compilation unit (and whenever the compiler cannot determine if f is aligned or not), will construct the value byte-by-byte, as one would expect.

This should not be an issue for real world code, though. Even for rv32gc, the typical use case, e.g.
int test(char *buf) { return UNALIGNED((int *)(buf + 16)) + UNALIGNED((int *)(buf + 20)); }
generates proper unaligned loads (byte by byte). Analogously, for
char buffer1[32] __attribute__((aligned (1)));
char buffer2[32] __attribute__((aligned (2)));
char buffer4[32] __attribute__((aligned (4)));
int sum1(void) { return UNALIGNED((int *)(buffer1 + 16)) + UNALIGNED((int *)(buffer1 + 20)); }
int sum2(void) { return UNALIGNED((int *)(buffer2 + 16)) + UNALIGNED((int *)(buffer2 + 20)); }
int sum4(void) { return UNALIGNED((int *)(buffer4 + 16)) + UNALIGNED((int *)(buffer4 + 20)); }
function sum1() generates eight byte loads, sum2() generates four halfword loads, and sum4() two word loads, because clang can determine the alignment (within the same compilation unit) at compile time here.

The macro definition is thus quite useful even on rv32gc, because by casting some pointer expression into a pointer to int, clang assumes you know the result will be aligned to a four-byte boundary. That is, given
int get_int(const volatile void *ptr) { return *(const volatile int *)ptr; }
clang will generate a single load word instruction on rv32gc, but
int get_int(const volatile void *ptr) { return UNALIGNED((const volatile int *)ptr); }
clang will load the value using four byte loads (again, unless it can infer at compile time that ptr is aligned to a 2-byte or 4-byte boundary).
Which is generally what we programmers want.

gf · « **Reply #58 on:** October 05, 2023, 09:49:27 pm »

Quote from: Nominal Animal on October 05, 2023, 06:44:49 pm

As mentioned previously, all of this would be easily solved by a __builtin_unaligned(pointer_expression) built-in,

Analogous to __builtin_assume_aligned(), the unaligned property of values would, of course, still be lost at boundaries that are not (or cannot be) crossed by the data flow analysis (such as function boundaries of non-inlined functions).
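For comparison, this is roughly how the existing __builtin_assume_aligned() built-in (a GCC and clang extension) is used. The function name sum_two_aligned() is invented for this sketch. Note how the alignment promise is attached to the pointer value returned by the built-in, inside this one function.

```c
#include <stdint.h>

/* Hypothetical example: the caller promises that ptr is 4-byte aligned.
   The promise is attached to the returned pointer p, so it only helps
   code that the compiler can see using p. */
static int32_t sum_two_aligned(const void *ptr)
{
    const int32_t *p = __builtin_assume_aligned(ptr, 4);
    return p[0] + p[1];
}
```

If ptr is not actually 4-byte aligned, the behavior is undefined, so this is a promise made by the programmer and not a check performed by the compiler.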

Quote

or even an unaligned C qualifier keyword.

You mean as a type qualifier like const, volatile, restrict?

Quote from: SiliconWizard on October 05, 2023, 07:11:21 pm

When dealing with packed structs with unaligned members, I'll usually use a memcpy() approach or similar, rather than compiler "tricks" to handle unaligned accesses properly.

The memcpy() approach is only required with regular structs, having an alignment > 1. Packed structs work out of the box with unaligned pointer values. The alignment of packed structs, as well as the alignment of their members, is by definition 1. Just access any member (e.g. p->member1, where p is a pointer to a packed struct), and the access is considered unaligned by the compiler anyway. No "tricks" required besides declaring the struct with __attribute__((packed)).

[ Nevertheless, if the compiler can prove at compile time that a particular unaligned access can safely be replaced by an aligned one, it is free to optimize -- as with all other optimizations. If the compiler cannot prove it, it must not optimize. You should not need to care whether it does -- the result should be the same. ]
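As a small illustration of the point about packed structs, here is a sketch using the gcc/clang packed attribute. The struct record layout and the function record_length() are hypothetical and chosen only so that the 32-bit member lands at an odd offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed layout: no padding, so length sits at offset 1. */
struct record {
    uint8_t  tag;
    uint32_t length;
    uint16_t checksum;
} __attribute__((packed));

_Static_assert(offsetof(struct record, length) == 1, "length is at offset 1");
_Static_assert(sizeof(struct record) == 7, "no padding in packed struct");

/* Member access through a pointer to the packed struct: the compiler
   treats r->length as potentially unaligned and picks a safe access. */
static uint32_t record_length(const struct record *r)
{
    return r->length;
}
```

No memcpy() or macro is involved; the packed attribute on the type is what tells the compiler that the member may sit at any address.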

However, one thing is dangerous: Don't take the address of a packed struct's member of type T and assign it to a T* pointer!
Subsequently, it is not safe to dereference this T* pointer (unless T is a character type, or again a packed struct). Don't ignore the warning

Quote

warning: taking address of packed member of 'struct <anonymous>' may result in an unaligned pointer value [-Waddress-of-packed-member]
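The pitfall and two ways around it can be sketched like this. The struct sample type and both function names are made up for the illustration; the risky pattern is left as a comment so that the snippet itself stays free of the problem.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical packed struct: value sits at offset 1. */
struct sample {
    uint8_t flag;
    int32_t value;
} __attribute__((packed));

/* Risky pattern, shown only as a comment:
 *     int32_t *p = &s->value;   // triggers -Waddress-of-packed-member
 *     return *p;                // p may be misaligned for int32_t
 */

/* Safe: access the member through the packed struct itself. */
static int32_t sample_value(const struct sample *s)
{
    return s->value;
}

/* Also safe: copy the member's bytes out via a character pointer. */
static void sample_value_copy(const struct sample *s, int32_t *out)
{
    const unsigned char *bytes =
        (const unsigned char *)s + offsetof(struct sample, value);
    memcpy(out, bytes, sizeof *out);
}
```

Both safe variants keep the "may be unaligned" information visible to the compiler, which is exactly what is lost once the address is stored in a plain int32_t pointer.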

Nominal Animal · « **Reply #59 on:** October 06, 2023, 12:00:14 am »

Quote from: gf on October 05, 2023, 09:49:27 pm

Analogous to __builtin_assume_aligned(), the unaligned property of values would, of course, still be lost at boundaries that are not (or cannot be) crossed by the data flow analysis (such as function boundaries of non-inlined functions).

True. In my own use cases, unaligned-ness is a property of the access I do, and not a property of a variable or type.

That is, I never need to mark a variable unaligned; I only need to mark certain pointers as unaligned just before I dereference them.

Quote from: gf on October 05, 2023, 09:49:27 pm

Quote
or even an unaligned C qualifier keyword.
You mean as a type qualifier like const, volatile, restrict?

Yes.

Quote from: gf on October 05, 2023, 09:49:27 pm

Packed structs work out of the box with unaligned pointer values.

Yes, in the sense that
typedef struct { int a, b; } __attribute__((packed)) mystruct;
int mystruct_sum(mystruct *m) { return m->a + m->b; }
will generate unaligned loads.

But, as you indicated, if you then add
mystruct foo;
int foo_sum(void) { return mystruct_sum(&foo); }
both gcc and clang use aligned loads in foo_sum() when inlining the mystruct_sum() call, and on Cortex-M3, depending on optimization and compiler, usually combine the loads (LDRD or LDM). Replacing (mystruct *m) above with (volatile mystruct *m) will stop the combining, due to C rules wrt. volatile object accesses.

Thus, there are two separate details here: one is alignment (or lack thereof), and another is load/store combining on architectures like Cortex-M3.
Using the packed attribute on the structure allows safe unaligned access to members of that structure.
Using the volatile qualifier for the accessed object (structure and/or member) stops load/store combining.
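Side by side, the two details look roughly like this. mystruct and mystruct_sum() are taken from the post above; mystruct_sum_volatile() is a name invented here for the volatile-qualified variant.

```c
/* Packed: members may be at any address, so accesses are unaligned-safe. */
typedef struct { int a, b; } __attribute__((packed)) mystruct;

/* Packed only: safe for unaligned data, but when the compiler can prove
   alignment (for example after inlining) it may use aligned loads and
   may combine them. */
int mystruct_sum(mystruct *m) { return m->a + m->b; }

/* Packed and volatile: each member read is a volatile access that the
   compiler has to perform individually, so the two loads are not merged. */
int mystruct_sum_volatile(volatile mystruct *m) { return m->a + m->b; }
```

Both functions return the same value; the difference is only in which machine instructions the compiler is allowed to emit. Note that the order in which the two operands of + are read is not fixed by the expression, so code that depends on a specific order should read them in separate statements.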

Quote from: gf on October 05, 2023, 09:49:27 pm

However, one thing is dangerous: Don't take the address of a packed struct's member of type T and assign it to a T* pointer!

This is very important!

When you cast an expression into a pointer to some type T, you also promise that it is aligned sufficiently for that type T.
When you take the address of a member in a packed structure, it may not be sufficiently aligned for a pointer to that type!

Casting a pointer expression to a pointer to char (or unsigned char or signed char) is special, as it is the way to access the object's storage representation, the in-memory data corresponding to the value of that object.

gf · « **Reply #60 on:** October 06, 2023, 09:25:52 am »

Quote from: Nominal Animal on October 06, 2023, 12:00:14 am

Thus, there are two separate details here: one is alignment (or lack thereof), and another is load/store combining on architectures like Cortex-M3.
Using the packed attribute on the structure allows safe unaligned access to members of that structure.
Using the volatile qualifier for the accessed object (structure and/or member) stops load/store combining.

Exactly, these are separate issues. But one should not have to care explicitly about load/store combining at all. It's just an optimization. If the compiler combines unexpectedly, then the program is already wrong with respect to semantics and guarantees provided by the language (or language extension). What I mean is, one should not think in terms of generated assembly and how you can outwit the compiler to generate the desired assembly code, but rather think in terms of C semantics (and only C semantics) in the first place. The compiler will (hopefully - assuming no bugs) do the right thing when it maps the abstract C semantics of the program to machine code.

Quote

Quote from: gf on October 05, 2023, 09:49:27 pm
However, one thing is dangerous: Don't take the address of a packed struct's member of type T and assign it to a T* pointer!
This is very important!

When you cast an expression into a pointer to some type T, you also promise that it is aligned sufficiently for that type T.
When you take the address of a member in a packed structure, it may not be sufficiently aligned for a pointer to that type!

The pitfall is, of course, if the type of the member is (say) int and the pointer type is int*, then no type cast is required, and at the first glance you might think the assignment is fine, although it is not. Luckily, gcc issues a warning.

Nominal Animal · « **Reply #61 on:** October 06, 2023, 04:20:44 pm »

Quote from: gf on October 06, 2023, 09:25:52 am

What I mean is, one should not think in terms of generated assembly and how you can outwit the compiler to generate the desired assembly code, but rather think in terms of C semantics (and only C semantics) in the first place. The compiler will (hopefully - assuming no bugs) do the right thing when it maps the abstract C semantics of the program to machine code.

Absolutely true.

The main case where load combining would be an issue is with specific memory-mapped peripherals where certain registers need to be accessed in a specific order. There, one should use volatile anyway –– for example, for the packed structure that represents the register fields, or its members –– which stops the load/store combining.
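A rough sketch of the memory-mapped peripheral case. The register block struct uart_regs, its two fields, and the function uart_read() are entirely hypothetical; on real hardware the pointer would refer to the peripheral's fixed address, which is not shown here.

```c
#include <stdint.h>

/* Hypothetical register block: status must be read before data. */
struct uart_regs {
    volatile uint32_t status;
    volatile uint32_t data;
};

/* Read status first, then data. Each volatile read is a separate
   statement, so the accesses happen in program order and are not
   merged into a single multi-word load. */
static uint32_t uart_read(struct uart_regs *uart, uint32_t *status_out)
{
    uint32_t status = uart->status;
    uint32_t data   = uart->data;
    *status_out = status;
    return data;
}
```

For testing on a host machine, an ordinary struct uart_regs object can stand in for the hardware, since the function only needs a pointer to the register block.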

The other cases I can think of –– spinlocks, generation counters, etc. –– are either similarly solved by volatile, or in rare extreme cases –– locking, or lockless access primitives, based on ll/sc (load-linked, store-conditional) where exact machine instruction sequences are needed –– with gcc-compatible architecture-specific extended inline assembly. Unlike pure assembly sources, when written as static inline accessor functions, using proper machine constraints, these optimize extremely well into their surrounding C code.

Quote from: gf on October 06, 2023, 09:25:52 am

Luckily, gcc issues a warning.

Yes, which is one more reason to always enable warnings by default when writing code!

DiTBho · « **Reply #62 on:** October 07, 2023, 10:43:13 am »

Quote from: Nominal Animal on October 06, 2023, 04:20:44 pm

Quote from: gf on October 06, 2023, 09:25:52 am
What I mean is, one should not think in terms of generated assembly and how you can outwit the compiler to generate the desired assembly code, but rather think in terms of C semantics (and only C semantics) in the first place. The compiler will (hopefully - assuming no bugs) do the right thing when it maps the abstract C semantics of the program to machine code.
Absolutely true.

But that's why threads cannot be safely implemented as a C library, nor can you assume C semantics will produce the right assembly code for trmem; and the more you push {space, speed} optimization, the more likely it won't.

That's why I always suggest directly using assembly for this stuff: because you have full control of it!

With my-c you are always guaranteed to get the correct assembly code, but:

- it's only aimed at MIPS5++(2) and doesn't consider other architectures, and I'm sure if I started supporting ARM Cortex things would start to get worse
- it heavily uses monads and monad semantics to describe the desired behavior(1)
- which usually means robust and portable code, and right assembly, but excessive glue code
- which means slow performance and large final binary file size

if you think about it, my-c works better precisely because the optimizer works worse

Actually, to be honest, the my-C optimizer does almost nothing, and it's a great thing for me as you don't even have to care about the problems for which in C you often have to use "volatile" to ensure that the optimizer won't "asphalt your code" like a stone-crushing vehicle driven by a monkey.

(1) so, "alignment" is entirely handled by monads, you can create one that takes this into account, you can define a type that does this and use it with a simple typedef. No special pragma or compiler-specific magic directive is needed, and even the pun is handled by monad operators, so exceptions are again handled by monads. Everything is monad oriented, so you can do everything as long as you can define it with monad semantic.

In fact, if you want you can also define monads for "casting", instead of using wilder but very fast "unchecked conversion" (unsafe, but 10x faster!!!) methods.

So you have code that reacts automatically and in the way you want it to react when it accidentally passes a NULL pointer, or something for which in C you would definitely get "undefined behavior". And if you really don't know what to do, you can create a simple monad that points to a panic() in case things don't look right.

(2) with a clever use for specific instructions to access data out of alignment, access data on shared memory, access data on transactional memory, directly manage the cache, directly manage the pipeline etc

Nominal Animal · « **Reply #63 on:** October 07, 2023, 08:09:24 pm »

Quote from: DiTBho on October 07, 2023, 10:43:13 am

But that's why threads cannot be safely implemented as a C library, nor can you assume C semantics will produce the right assembly code for trmem; and the more you push {space, speed} optimization, the more likely it won't.

That's why I always suggest directly using assembly for this stuff: because you have full control of it!

If we define 'this stuff' as these hardware-details (transactional memory, spinlocks, ll/sc-based locks or lockless accessors), we are in absolute agreement.

With GCC and clang, correctly written extended inline assembly will use machine constraints and "references" for the registers used, so that the C compiler can optimize the (inlined) code for each use site. For full functions that will always be called and not inlined, external assembly is absolutely fine.

Quote from: DiTBho on October 07, 2023, 10:43:13 am

If you think about it, my-c works better precisely because the optimizer works worse

This is a key observation. The more limited or stricter the optimization strategy for a C-like language, the better the control over the exact code generated, but the worse the portability (especially wrt. compilers generating code for a different hardware architecture, from the same source) becomes.

While "new" languages like Rust, Julia, etc. are developed to hopefully avoid the core concepts of C that cause that effect, only time will tell, really.

Quote from: DiTBho on October 07, 2023, 10:43:13 am

Actually, to be honest, the my-C optimizer does almost nothing, and it's a great thing for me as you don't even have to care about the problems for which in C you often have to use "volatile" to ensure that the optimizer won't "asphalt your code" like a stone-crushing vehicle driven by a monkey.

Have you noticed how I always accompany my suggested code snippets including volatile with a specific explanation along the lines of "it stops the compiler from caching and inferring the value from surrounding code, as the value of such variables can change or be modified by code not seen by the compiler", exactly because it is such a heavy hammer? It is way too common for C programmers to simply sprinkle them on variables semi-randomly, until the code seems to work; the often described 'lets throw spaghetti at the wall to see what sticks' -approach.

Base C is a very simple language with very complex optimization engines bolted on top. To use it effectively and to write portable, efficient code, one needs to understand a lot about the language, its theoretical model (the abstract machine the language specification uses), as well as existing machine architectures and their differences. One of my pet peeves is the ubiquitous opendir()/readdir()/closedir() example/exercise/use case, which is wrong on most current operating systems, because the directory tree may be modified during scanning, and none of the existing examples take that into account. The proper solution is to use POSIX scandir(), glob(), or nftw(), or fts family of functions from BSD and derivatives, which are supposed to work even when the directory tree is concurrently modified. To implement these, you need to either use a helper process (fts family, using its current working directory to walk the directory tree), or so-called "atfile" support (as e.g. standardized in POSIX via openat(), fstatat(), etc.). Exactly why, involves understanding how file systems are implemented, and their access properties (what is atomic, what is not, and so on).
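As a minimal sketch of the scandir() route mentioned above, the following counts the entries of one directory. The function name count_entries() is invented for this example, and it only demonstrates the call pattern and cleanup; it makes no claim about how any particular C library implements scandir() internally.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdlib.h>

/* Hypothetical helper: return the number of entries scandir() reports
   for path, or -1 if the directory cannot be scanned. */
static int count_entries(const char *path)
{
    struct dirent **list = NULL;
    int n = scandir(path, &list, NULL, alphasort);
    if (n < 0)
        return -1;

    /* scandir() allocates each entry and the array; free both. */
    for (int i = 0; i < n; i++)
        free(list[i]);
    free(list);
    return n;
}
```

The whole listing is collected by the library in one call, so the caller never holds an open directory stream while doing its own work on the entries.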

With experience, that understanding distills into rules of thumb –– like using memcpy() or accessor functions, instead of 'tricks' like the UNALIGNED() macros I showed above, to ensure correct machine instruction level access patterns ––, often ending up "codified" in programming howtos and guides and books; but when used without the true understanding, easily leads to misuse and inefficient/buggy code.

A good example of that is when using low-level POSIX/Unix/BSD I/O from <unistd.h>, i.e. read() and write(). Ages ago, operating systems never returned short counts for normal files. This belief still persists today, even though it is absolutely false. First, on POSIXy systems a signal delivery to a userspace handler installed without the SA_RESTART flag can cause them to fail with errno==EINTR; some filesystems, like Linux userspace (FUSE) ones, can return short reads or writes whenever they want, even for local files; slow network connections can cause short reads from shared network folders; and pipes and sockets often return short reads or writes. I've had dozens of arguments about this with otherwise very proficient C programmers, with their argument basically boiling down to "that doesn't happen to me, so I don't care".
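A minimal sketch of a read loop that copes with short reads and EINTR. The function name read_full() is made up for this example. As a simplification, an error after some data has already been read returns -1 and discards the partial count.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until len bytes have been
   read, end of input is reached, or a real error occurs.
   Returns the number of bytes read, or -1 with errno set. */
static ssize_t read_full(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    size_t have = 0;

    while (have < len) {
        ssize_t n = read(fd, p + have, len - have);
        if (n > 0)
            have += (size_t)n;      /* short read: loop for the rest */
        else if (n == 0)
            break;                  /* end of input */
        else if (errno != EINTR)
            return -1;              /* real error */
        /* n < 0 with errno == EINTR: interrupted, just retry */
    }
    return (ssize_t)have;
}
```

A return value smaller than len then means end of input was reached, not that the call happened to return early.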

As to security aspects, don't even get me started.

This leads to an annoying dichotomy on my own part. With threads like this one, where a specific detail is discussed, I do not usually even consider whether there are real use cases for applying the detail or not; I just discuss what I know about it, because I tend to suspect there is a use case, or the OP would have discussed the problem they're trying to solve via that detail. With threads like this one about opendir()/chdir() on this forum, my response will be severely annoying (sorry, MikeK) even if/when they are explicitly useful/correct. See my "original" answer to that question in 2015 at StackOverflow, read the comments, and note how it was not the correct answer to the asker. To me, it really feels like seeing babies draw on the kitchen cabinets with their own poop.

I suspect that something like that dichotomy, mixed with experience that goes beyond the book examples and single architectures, and experience that is based on the book examples and having found that sufficient, is the underlying reason why so many threads about C details branch out and get a bit 'flame-y'.
Now, add to that useful pieces from domain- and hardware-specific variants of C like DiTBho's my-c, or my own that makes arrays base-level objects (allowing buffer overrun detection and tracking at compile time through function hierarchies), and conflagration is nearly assured.

Add to that the high-level concepts like monads (or even threads!) that can be used to sidestep many of the issues in real-world code, and the discussion will vary from friendly to heated, from practical to theoretical, and so on. I for one like to try and be useful, and find all of those aspects interesting, but that leads to walls of text like this post. :-\

Apologies for this and the preceding over-long posts. I'm still trying to learn how to be more concise.

ejeffrey · « **Reply #64 on:** October 09, 2023, 02:03:17 pm »

Quote from: DiTBho on October 07, 2023, 10:43:13 am

Actually, to be honest, the my-C optimizer does almost nothing, and it's a great thing for me as you don't even have to care about the problems for which in C you often have to use "volatile" to ensure that the optimizer won't "asphalt your code" like a stone-crushing vehicle driven by a monkey.

I guess that can be helpful in some situations but storing values in registers is a pretty important and fundamental optimization. Omitting that is fine for something that's basically a "low level scripting language" where you don't really care about performance, but it's not great for a general purpose language. You can definitely have a "register" keyword that means the opposite of volatile and use it for the most performance critical values but I think you end up with the same problem.

Also if you want to run on multi core systems (and maybe you don't if the whole premise is that performance isn't a goal) this doesn't really help you for all the reasons that C programmers discovered in the 1990s and 2000s.

DiTBho · « **Reply #65 on:** October 09, 2023, 02:58:08 pm »

Quote from: ejeffrey on October 09, 2023, 02:03:17 pm

I guess that can be helpful in some situations but storing values in registers is a pretty important and fundamental optimization

my-c is helpful because, in my opinion, monads let you avoid weird, confusing keywords ("register" is deprecated, d'oh), "magic" flags (GCC) that you have to pass among the various compiler flags, or worse, pragmas.

The datatype encloses the behavior, which in turn models the code in the machine layer without further steps. It is the monad that literally dictates the assembly code, and no one can alter what it dictates.

So, in my-c what you write as HL code is exactly what you get as assembly, and you don't even need to constantly peek at the assembly spit out by cc1, since - this was/is my point - the optimization layer does not distort what you expect the code to do, it does not "unroll loops", it does not "replace anything with memcopy", it never allows itself to remove dead code, nor to eliminate a loop simply because it assumes that the loop condition is always False, it does absolutely none of this, it does nothing that has not been explicitly requested through a meticulous behavioral description that passes through the datatype!

Which means that if you want a variable to be handled not on the stack but in a register, you have to declare it with a monad that does exactly that, and what you get is, in order:

1. verification that what you ask for is possible; if it is not, the compiler refuses to compile, specifying in a concise (for now barely intelligible) way why you cannot get what you ask for
2. the implementation of what you expressly requested!

what you write is what you get: { nothing | job done }

(I think this is the dream HL compiler
of every hardware developer
who finds themselves writing software...)

This whole project was born precisely because I have to support a multicore MIPS5++, of which I have to be sure that the things I write in HL are implemented exactly as requested.
(this, because the hardware is really bizarre ...)

