This code is so broken it some how fixes itself.


--- Quote from: cv007 on February 01, 2023, 02:43:45 pm ---Obviously lots of options and whether this one is better or not, I like to eliminate casting if there is a better option. In this case I would rather deal with the union than to have to (re)think about how the underlying data fits together when casting.

--- End quote ---

True.  I was looking into unions for something else yesterday, and I can see how they can help with the constant duality of types here, between signed 16-bit and unsigned 32-bit.

It also depends on how the compiler optimises things.

That part of the code will change, as I have to try 24bit and you can't just "cast" right justified signed 24bit as easily as 16.


--- Quote ---I have to try 24bit and you can't just "cast" right justified signed 24bit as easily as 16.
--- End quote ---

--- Code: ---typedef union {
    volatile uint32_t dr_u32;
    volatile int16_t eq_i16;
    volatile int32_t eq_i24 : 24;
} sai_buffer_t;
--- End code ---

When used, the compiler will sign extend the eq_i24 to an int32_t.
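
A minimal sketch of that sign extension, assuming GCC or Clang on a little-endian target (bit-field layout and the signedness of an int32_t bit-field are implementation-defined). The names sai_word_t and sai_sample24() are made up for illustration, and volatile is left out to keep the example short:

```c
#include <stdint.h>

/* Hypothetical helper type: one 32-bit word viewed two ways. */
typedef union {
    uint32_t dr_u32;        /* raw 32-bit data register value */
    int32_t  eq_i24 : 24;   /* low 24 bits viewed as a signed value */
} sai_word_t;

/* Returns the low 24 bits of raw, sign-extended to 32 bits.
   Assumes the bit-field occupies the least significant bits,
   as it does with GCC and Clang on little-endian targets. */
static inline int32_t sai_sample24(uint32_t raw)
{
    sai_word_t w = { .dr_u32 = raw };
    return w.eq_i24;   /* reading the bit-field does the sign extension */
}
```

With this, a raw value of 0x00FFFFFF reads back as -1, and whatever is in the top byte is ignored.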

You could also have a single member that works the same for i16/i24, each sign-extending to i32 when used:

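--- Code: ---// As a bit-field width this works in C++; in plain C a const variable is not a
// constant expression, so use a #define or an enum constant instead.
static const unsigned SAI_DATA_BITS = 16;

union sai_buffer_t {
    volatile uint32_t dr_u32;
    volatile int32_t data : SAI_DATA_BITS;
};
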
--- End code ---

Nominal Animal:
I habitually use

--- Code: ---typedef struct {
    union {
        double          d;
        float           f;
        uint64_t        u64;
        int64_t         i64;
        uint32_t        u32[2];
        int32_t         i32[2];
        uint16_t        u16[4];
        int16_t         i16[4];
        uint8_t         u8[8];
        int8_t          i8[8];
        unsigned char   uc[8];
        signed char     sc[8];
        char            c[8];
    };
} word64;

typedef struct {
    union {
        float           f;
        uint32_t        u32;
        int32_t         i32;
        uint16_t        u16[2];
        int16_t         i16[2];
        uint8_t         u8[4];
        int8_t          i8[4];
        unsigned char   uc[4];
        signed char     sc[4];
        char            c[4];
    };
} word32;

--- End code ---
to examine the various parts of data elements, and reinterpret the storage as another type.  For example, ((word32){ .f = value }).u32 returns the 32-bit storage representation of float value (and the code kinda-sorta assumes IEEE 754 Binary64 and Binary32 floating point formats).
When optimized, the machine code simplifies either to nothing (ARMv7-a with hardware floating point support) or a single register move (x86-64), too.
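
A minimal sketch of that reinterpretation, using a cut-down two-member union (f32_bits, float_to_u32() and u32_to_float() are names made up for this example) and assuming float is IEEE 754 Binary32:

```c
#include <stdint.h>

/* Hypothetical cut-down version of word32: just the two views needed here. */
typedef union {
    float    f;
    uint32_t u32;
} f32_bits;

/* Storage representation of a float, assuming IEEE 754 Binary32. */
static inline uint32_t float_to_u32(float value)
{
    return ((f32_bits){ .f = value }).u32;
}

/* The reverse: build a float from its 32-bit storage representation. */
static inline float u32_to_float(uint32_t bits)
{
    return ((f32_bits){ .u32 = bits }).f;
}
```

For instance, 1.0f has the storage representation 0x3F800000, so float_to_u32(1.0f) yields that value.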

If you have a buffer with 24-bit signed integer or fixed-point samples, you can easily expand it to 32-bit signed integer or fixed point using

--- Code: ---static inline void i24_to_i32_quad(int32_t *dst, const uint32_t src0, const uint32_t src1, const uint32_t src2)
{
    *(dst++) = ((int32_t)( src0 << 8 )) >> 8;
    *(dst++) = ((int32_t)( (src1 << 16) | ((src0 >> 24) << 8) )) >> 8;
    *(dst++) = ((int32_t)( ((src1 >> 16) << 8) | (src2 << 24) )) >> 8;
    *(dst++) = ((int32_t)( src2 )) >> 8;
}

void i24_to_i32(int32_t *buf, const size_t count)
{
    const size_t  quads = (count / 4) + !!(count & 3);

    const uint32_t *src = (uint32_t *)buf + 3*quads;
    int32_t *dst = buf + 4*quads;

    while (dst > buf) {
        dst -= 4;
        src -= 3;
        i24_to_i32_quad(dst, src[0], src[1], src[2]);
    }
}

--- End code ---
but the buffer size must be padded with zeroes to a multiple of 16 bytes (or count a multiple of 4), and have room for 4 bytes per 24-bit value.  As it progresses backwards from end of array to beginning of array, the conversion is done in-place, in units of 12 bytes in, 16 bytes out.  This does do sign extension correctly, too.

Note that it does expect (negative integer) >> N to be arithmetic shift right; it is on GCC and Clang.
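
If relying on arithmetic shift right is a worry, here is a slower but more portable sketch of the same in-place, back-to-front expansion. It is not the code above: it works one sample at a time on bytes, assumes the packed samples are little-endian, and the names sign_extend24() and i24_to_i32_bytewise() are made up for this example. It needs no padding to a multiple of four samples, only room for count 32-bit values:

```c
#include <stddef.h>
#include <stdint.h>

/* Sign-extend the low 24 bits of x without shifting a negative value. */
static inline int32_t sign_extend24(uint32_t x)
{
    x &= 0xFFFFFFu;
    return (int32_t)(x ^ 0x800000u) - 0x800000;
}

/* buf initially holds count packed little-endian 24-bit samples (3 bytes each)
   and must have room for count int32_t values.  Works from the last sample
   towards the first, so no unread source byte is ever overwritten. */
void i24_to_i32_bytewise(int32_t *buf, size_t count)
{
    const unsigned char *src = (const unsigned char *)buf;
    size_t i = count;

    while (i-- > 0) {
        const unsigned char *p = src + 3 * i;
        /* Read all three source bytes before storing the result. */
        const uint32_t raw = (uint32_t)p[0]
                           | ((uint32_t)p[1] << 8)
                           | ((uint32_t)p[2] << 16);
        buf[i] = sign_extend24(raw);
    }
}
```

I would expect the quad version to be noticeably faster; this one is mostly useful as a reference to test it against.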

With GCC and Clang, one can also use

--- Code: ---typedef struct {
    union {
       unsigned int  u24:24;
       int           i24:24;
       uint8_t       u8[3];
       int8_t        i8[3];
       char          c[3];
    } __attribute__((packed));
} __attribute__ ((packed)) word24;

typedef struct {
    union {
        uint32_t    u32[3];
        int32_t     i32[3];
        word24      w24[4];
        uint16_t    u16[6];
        int16_t     i16[6];
        uint8_t     u8[12];
        int8_t      i8[12];
        char        c[12];
    };
} word24x4;

void unpack_word24x4(int32_t out[4], const uint32_t in[3])
{
    const word24x4  temp = { .u32 = { in[0], in[1], in[2] } };
    out[0] = temp.w24[0].i24;
    out[1] = temp.w24[1].i24;
    out[2] = temp.w24[2].i24;
    out[3] = temp.w24[3].i24;
}

--- End code ---
but this relies on the packed type attribute limiting the size of the word24 type to exactly three bytes, with byte alignment; this is true for GCC and Clang, but is not "standard" C.  (Plus right shift with signed integer types being arithmetic shift right.)
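
Since that size assumption is compiler-specific, one way to guard it is a compile-time check. A sketch, assuming GCC or Clang in C11 mode on a little-endian target; the type is a trimmed copy of word24, and word24_to_i32() is a name made up for this example:

```c
#include <stdint.h>

/* Trimmed copy of the packed 24-bit type (GCC/Clang extension). */
typedef struct {
    union {
        unsigned int  u24:24;
        int           i24:24;
        uint8_t       u8[3];
    } __attribute__((packed));
} __attribute__((packed)) word24;

/* Fail the build, rather than silently misbehave, if the compiler
   does not lay the type out as three bytes with byte alignment. */
_Static_assert(sizeof (word24) == 3, "word24 must be exactly three bytes");
_Static_assert(_Alignof (word24) == 1, "word24 must have byte alignment");

/* Three little-endian bytes to a sign-extended 32-bit value.
   Assumes plain int bit-fields are signed, as with GCC and Clang defaults. */
static inline int32_t word24_to_i32(const uint8_t bytes[3])
{
    word24 w;
    w.u8[0] = bytes[0];
    w.u8[1] = bytes[1];
    w.u8[2] = bytes[2];
    return w.i24;
}
```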

As shown, unpack_word24x4() takes 12 bytes (in three 32-bit unsigned integers), and expands them to four signed 32-bit integers (with sign extension as expected).

Note that i24_to_i32_quad() is faster than unpack_word24x4() on 32-bit ARMv7 at least, if one examines the code at Compiler Explorer.
That is,

--- Code: ---void unpack_word24x4(int32_t out[4], const uint32_t in[3])
{
    out[0] = ((int32_t)( in[0] << 8 )) >> 8;
    out[1] = ((int32_t)( (in[1] << 16) | ((in[0] >> 24) << 8) )) >> 8;
    out[2] = ((int32_t)( ((in[1] >> 16) << 8) | (in[2] << 24) )) >> 8;
    out[3] = ((int32_t)( in[2] )) >> 8;
}

--- End code ---
is faster but harder to maintain than the other version above, on ARMv7 at least; both do yield the exact same results.

paulca:
Most of that went over my head.  However, the basic filters / processing I am using are only there because they were the first ones I found in a palatable (thievable) form.

I really should try and get biquad equivalents and use the arm math library with DSP optimisations.  I might for example get a wider range of high order filters.

So I will return and review the data packing/unpacking at the same time.

For now I think I'm going to do a blunt, ignorant shift-and-OR like the HAL code does :)

On the optimisation...  my assumption was that the C compiler would include things like hardware DP FPU and DSP extensions if they existed with no optimisations.

However, now I'm wondering if it actually disables all of that so you can step over some floating point calcs in the debugger, whereas if they are offloaded to the FPU that becomes difficult?

I mean, how much "extra" stuff is turned on for -O3, -Ofast?

Nominal Animal:

--- Quote from: paulca on February 01, 2023, 05:44:40 pm ---Most of that went over my head.
--- End quote ---
Sorry!  :-[

To summarise, if you have arrays of 24-bit samples, three bytes per sample, I showed that it is best to handle them in groups of four samples (12 bytes), and very efficiently sign-expand to 32-bit.  The rest was details, and two different approaches you can choose from, although one has better performance.

--- Quote from: paulca on February 01, 2023, 05:44:40 pm ---On the optimisation...  my assumption was that the C compiler would include things like hardware DP FPU and DSP extensions if they existed with no optimisations.
--- End quote ---
STM32F4 series has an ARM Cortex-M4 core, and thus ARMv7E-M architecture.
ST's AN4841: Digital signal processing for STM32 microcontrollers using CMSIS describes its DSP features (but also includes some of the STM32F7 series).
Since GCC and Clang try to implement ARM C Language Extensions, ACLE (ihi0053) is also useful.

In general, C is not an easy language to automatically SIMD-vectorize.  For STM32F4, there are instructions that add, subtract, and/or multiply using pairs of signed or unsigned 16-bit values in each register, or quartets of 8-bit signed or unsigned values in each register.

As of right now, you do need to use the ACLE macros to SIMD-vectorize your C code.
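
To see what a given set of compiler options actually enables, ACLE defines feature-test macros you can check at compile time. A small sketch (the function names are made up for this example); as far as I know, these macros follow the target selection options (-mcpu, -march, -mfpu, -mfloat-abi) rather than the -O level:

```c
/* Each function returns 1 if the feature is enabled for this build, else 0. */

/* 32-bit SIMD: the dual 16-bit and quad 8-bit instructions such as SMLALD. */
static inline int target_has_simd32(void)
{
#if defined(__ARM_FEATURE_SIMD32)
    return 1;
#else
    return 0;
#endif
}

/* DSP extension (saturating arithmetic, 16-bit multiplies). */
static inline int target_has_dsp(void)
{
#if defined(__ARM_FEATURE_DSP)
    return 1;
#else
    return 0;
#endif
}

/* Hardware single precision: bit 2 (0x04) of __ARM_FP. */
static inline int target_has_hw_float(void)
{
#if defined(__ARM_FP) && (__ARM_FP & 0x04)
    return 1;
#else
    return 0;
#endif
}

/* Hardware double precision: bit 3 (0x08) of __ARM_FP. */
static inline int target_has_hw_double(void)
{
#if defined(__ARM_FP) && (__ARM_FP & 0x08)
    return 1;
#else
    return 0;
#endif
}
```

On a non-ARM host all four simply return 0, so the same source still compiles there.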

For example, let's say you have an array of 16-bit signed integer samples, data, and another array of 16-bit signed integer coefficients, coeff, both having count elements (with count even) and both being 32-bit aligned (__attribute__((aligned (4)))).  You can then obtain the 64-bit sum of products (which cannot overflow, since each product is between -0x3FFF8000 and +0x40000000, i.e. 31 bits):

--- Code: ---#include <arm_acle.h>

typedef struct {
    union {
        uint32_t   u32;
        int32_t    i32;
        uint16_t   u16[2];
        int16_t    i16[2];
        uint16x2_t u16x2;
        int16x2_t  i16x2;
        uint8_t    u8[4];
        int8_t     i8[4];
        uint8x4_t  u8x4;
        int8x4_t   i8x4;
        char       c[4];
    };
} word32;

int64_t sum_i16xi16(const int16_t *data, const int16_t *coeff, const uint32_t count)
{
    const word32 *dz = (const word32 *)data + (count / 2);
    const word32 *d = (const word32 *)data;
    const word32 *c = (const word32 *)coeff;
    int64_t  result = 0;

    while (d < dz)
        result = __smlald((*(d++)).i16x2, (*(c++)).i16x2, result);

    return result;
}

--- End code ---
using -O2 -march=armv7e-m+fp -mtune=cortex-m4 -mthumb (with gcc; use -Os instead of -O2 with clang).

The inner loop will compile to
        ldr     r0, [r3], #4        ; 2 cycles
        ldr     r2, [r1], #4        ; 2 cycles
        cmp     ip, r3              ; 1 cycle
        smlald  r4, r5, r2, r0      ; 1 cycle
        bhi     .L3                 ; 2-5 cycles when taken
which by my count should yield 8 cycles per iteration; since each iteration handles two samples, that is 4 cycles per sample, or equivalently 0.25 samples per cycle.
Furthermore, the result is trivial to shift and then clamp to the proper range, so it is just a few additional cycles per buffer to adjust for the fractional bits in samples and/or coefficients.  Me Like Dis.
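
A sketch of that shift-and-clamp step, assuming the result should end up as a 16-bit sample; acc_to_i16() and frac_bits are names made up for this example, and like the earlier code it expects >> on a negative value to be an arithmetic shift (true on GCC and Clang):

```c
#include <stdint.h>

/* Drop frac_bits fractional bits from a 64-bit accumulator and
   saturate the result to the int16_t range. */
static inline int16_t acc_to_i16(int64_t acc, unsigned frac_bits)
{
    acc >>= frac_bits;              /* arithmetic shift assumed */
    if (acc > INT16_MAX)
        return INT16_MAX;
    if (acc < INT16_MIN)
        return INT16_MIN;
    return (int16_t)acc;
}
```

For Q15 samples and Q15 coefficients the products are Q30, so frac_bits would be 15 to get back to Q15.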

If count is odd, the last sample and coefficient will be ignored.  I suggest you ensure your array sizes are even, and set the final element of both arrays to zero, so you can round count up to the next even number without affecting the result.

The inner loop is equivalent to

--- Code: ---    for (uint32_t i = 0; i < count/2; i++)
        result = __smlald(d[i].i16x2, c[i].i16x2, result);

--- End code ---
but is written in a form that both gcc and clang can optimize to the above code or equivalent.
Each loop iteration computes two 16-bit products, summing them to result.
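
For checking the results on a PC, here is a plain C reference that should give the same sum without any ACLE intrinsics. It is only a sketch for testing (sum_i16xi16_ref() is a name made up here), and it ignores the last element when count is odd, just like the version above:

```c
#include <stddef.h>
#include <stdint.h>

/* Portable reference: sum of data[i] * coeff[i], taken in pairs.
   Each product is widened to 64 bits before being added, so the
   intermediate sum of two products cannot overflow. */
int64_t sum_i16xi16_ref(const int16_t *data, const int16_t *coeff, size_t count)
{
    int64_t result = 0;

    for (size_t i = 0; i + 1 < count; i += 2) {
        result += (int64_t)data[i]     * coeff[i];
        result += (int64_t)data[i + 1] * coeff[i + 1];
    }

    return result;
}
```

Running both versions over the same buffers on the target and comparing the two 64-bit results is a cheap sanity check before trusting the fast one.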

