Most of that went over my head.

Sorry!

To summarise, if you have arrays of 24-bit samples, three bytes per sample, I showed that it is best to handle them in groups of four samples (12 bytes), and very efficiently sign-expand to 32-bit. The rest was details, and two different approaches you can choose from, although one has better performance.
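For reference, the idea in plain portable C (a sketch only; the helper name is mine, and the optimized version I described works on whole 32-bit words rather than individual bytes):

```c
#include <stdint.h>

/* Sketch (name is mine): sign-extend four packed little-endian 24-bit
   samples (12 bytes) to four int32_t values. Shifting the 24-bit value
   to the top of a 32-bit word and arithmetic-shifting it back down
   replicates the sign bit; the signed right shift is arithmetic on all
   compilers you will meet on ARM. */
static void expand4_s24(const uint8_t *src, int32_t *dst)
{
    for (int i = 0; i < 4; i++) {
        uint32_t u = (uint32_t)src[3*i]
                   | ((uint32_t)src[3*i + 1] << 8)
                   | ((uint32_t)src[3*i + 2] << 16);
        dst[i] = (int32_t)(u << 8) >> 8;
    }
}
```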

On the optimisation: my assumption was that the C compiler would use things like the hardware DP FPU and DSP extensions if they exist, even with no optimisations enabled.

STM32F4 series has an ARM Cortex-M4 core, and thus ARMv7E-M architecture.

ST's AN4841, *Digital signal processing for STM32 microcontrollers using CMSIS*, describes its DSP features (but also covers some of the STM32F7 series).

Since GCC and Clang try to implement the ARM C Language Extensions, the ACLE specification (IHI 0053) is also useful.

In general, C is not an easy language to automatically SIMD-vectorize. For the STM32F4, there are instructions that add, subtract, and/or multiply pairs of signed or unsigned 16-bit values in each register, or quartets of signed or unsigned 8-bit values in each register.

As of right now, you do need to use the ACLE macros to SIMD-vectorize your C code.
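To see what "pairs of 16-bit values in each register" means, here is a plain-C model (the function name is mine) of what the SADD16 instruction computes: two independent signed 16-bit additions within one 32-bit word, each half wrapping modulo 2^16 with no carry between halves. On target you would use the `__sadd16` intrinsic instead:

```c
#include <stdint.h>

/* Plain-C model (name is mine) of SADD16: add the low halves and the
   high halves of two 32-bit words independently, wrapping each 16-bit
   lane without carrying into the other. */
static uint32_t model_sadd16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xFFFFu;                  /* low 16-bit lane  */
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;  /* high 16-bit lane */
    return (hi << 16) | lo;
}
```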

For example, let's say you have an array of 16-bit signed integer samples `data`, another array of 16-bit signed integer coefficients `coefficient`, both arrays having `count` elements (with `count` even, i.e. a multiple of two), and both arrays being 32-bit aligned (`__attribute__((aligned (4)))`). You can then obtain the 64-bit sum (which cannot overflow, since each product is between -0x3FFF8000 and +0x40000000, i.e. 31 bits):

```c
#include <stdint.h>
#include <arm_acle.h>

typedef struct {
    union {
        uint32_t   u32;
        int32_t    i32;
        uint16_t   u16[2];
        int16_t    i16[2];
        uint16x2_t u16x2;
        int16x2_t  i16x2;
        uint8_t    u8[4];
        int8_t     i8[4];
        uint8x4_t  u8x4;
        int8x4_t   i8x4;
        char       c[4];
    };
} word32;

int64_t sum_i16xi16(const int16_t *data, const int16_t *coeff, const uint32_t count)
{
    const word32 *dz = (const word32 *)data + (count / 2);
    const word32 *d  = (const word32 *)data;
    const word32 *c  = (const word32 *)coeff;
    int64_t result = 0;
    while (d < dz)
        result = __smlald((*(d++)).i16x2, (*(c++)).i16x2, result);
    return result;
}
```

when compiled with `-O2 -march=armv7e-m+fp -mtune=cortex-m4 -mthumb` (with GCC; use `-Os` instead of `-O2` with Clang).

The inner loop will compile to

```
.L3:
        ldr     r0, [r3], #4    ; 2 cycles
        ldr     r2, [r1], #4    ; 2 cycles
        cmp     ip, r3          ; 1 cycle
        smlald  r4, r5, r2, r0  ; 1 cycle
        bhi     .L3             ; 2-5 cycles when taken
```

which by my count should yield 8 cycles, or 4 cycles per sample, or equivalently 0.25 samples per cycle.

Furthermore, the result is trivial to shift and then clamp to the proper range, so it is just a few additional cycles *per buffer* to adjust for the fractional bits in the samples and/or coefficients.

*Me Like Dis.*

If `count` is odd, the last sample and coefficient will be ignored. I suggest you ensure your array sizes are even, and set the final element of both arrays to zero, so you can round `count` up to the next even number without affecting the result.
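You can check the zero-padding trick against a plain scalar reference (a sketch; the function name is mine): a trailing zero element contributes a 0 × 0 product, so the padded and unpadded sums are identical.

```c
#include <stdint.h>

/* Scalar reference (name is mine) for the pairwise multiply-accumulate:
   sum of d[i] * c[i] over n elements, accumulated in 64 bits. With the
   final element of both arrays set to zero, rounding an odd count up to
   the next even number only adds a zero term. */
static int64_t dot_s16(const int16_t *d, const int16_t *c, uint32_t n)
{
    int64_t sum = 0;
    for (uint32_t i = 0; i < n; i++)
        sum += (int32_t)d[i] * (int32_t)c[i];
    return sum;
}
```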

The inner loop is equivalent to

```c
for (uint32_t i = 0; i < count/2; i++)
    result = __smlald(d[i].i16x2, c[i].i16x2, result);
```

but is written in a form that both GCC and Clang can optimize to the above code or equivalent.

Each loop iteration computes two 16-bit products, summing them into `result`.
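For clarity, here is a plain-C model (name is mine) of what `__smlald` computes per iteration, assuming the usual little-endian half-word packing: both signed 16-bit halves are multiplied pairwise and both products are accumulated into the 64-bit sum.

```c
#include <stdint.h>

/* Plain-C model (name is mine) of the SMLALD instruction / __smlald
   intrinsic: multiply the signed 16-bit halves of x and y pairwise and
   add both products to a 64-bit accumulator. */
static int64_t model_smlald(uint32_t x, uint32_t y, int64_t acc)
{
    int16_t xlo = (int16_t)(x & 0xFFFFu), xhi = (int16_t)(x >> 16);
    int16_t ylo = (int16_t)(y & 0xFFFFu), yhi = (int16_t)(y >> 16);
    return acc + (int32_t)xlo * ylo + (int32_t)xhi * yhi;
}
```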