Author Topic: ARM CMSIS DSP biquad Q31 vs F32 performance (Read 6523 times)

bobaruni · « **on:** May 02, 2017, 12:30:31 pm »

Hi All,
I have put together an STM32F4 based USB host audio player (FLAC, WAV + more CODECs later) that feeds into an ES9018K2M DAC via I2S.
I am currently using Q31 biquads (arm_biquad_cas_df1_32x64_q31 and arm_biquad_cascade_df1_q31) for a 5 band EQ and am quickly running out of CPU time to do much else (player stutters badly when debug is enabled).
Source data can be 16-24 bit stereo at up to 192K.

As I'm new to CMSIS DSP and not finding much performance info on the web apart from this https://community.nxp.com/thread/327833, I'm just trying to get an idea of other people's experience with:
1. Which data type should take less time on an STM32F4 with built-in FPU, Q31 or F32?
2. Are there any advantages in terms of precision, S/N ratio, THD, DNR etc. from using Q31 or F32?

Thanks in advance!

blkdev2 · « **Reply #1 on:** May 07, 2017, 11:41:53 pm »

Have you seen this app note from ST?

AN4814

Seems to be negligible difference in speed between fixed and float on the F4.

bobaruni · « **Reply #2 on:** May 08, 2017, 04:51:44 am »

Awesome, thanks!
So it seems F32 is very slightly faster than Q31 on FIR. That looks promising, so I will change the code to F32 and profile to see how much better it is in my application.
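
For anyone following along, here is a minimal plain-C sketch of a direct form I float biquad cascade, which is the structure the F32 biquad functions in CMSIS DSP compute. The names biquad_t and biquad_cascade_process are hypothetical and not part of CMSIS; the feedback coefficients are assumed to be stored already negated (so they are added), as CMSIS expects.

```c
#include <stddef.h>

/* Hypothetical helper type, not part of CMSIS-DSP. */
typedef struct {
    float b0, b1, b2;   /* feed-forward coefficients */
    float a1, a2;       /* feedback coefficients, stored negated (added below) */
    float x1, x2;       /* previous two inputs */
    float y1, y2;       /* previous two outputs */
} biquad_t;

/* Run one block of samples in place through `stages` cascaded biquads. */
static void biquad_cascade_process(biquad_t *bq, size_t stages,
                                   float *buf, size_t n)
{
    for (size_t s = 0; s < stages; s++) {
        biquad_t *q = &bq[s];
        for (size_t i = 0; i < n; i++) {
            float x = buf[i];
            /* Direct form I difference equation for one stage. */
            float y = q->b0 * x + q->b1 * q->x1 + q->b2 * q->x2
                    + q->a1 * q->y1 + q->a2 * q->y2;
            q->x2 = q->x1; q->x1 = x;   /* shift input history */
            q->y2 = q->y1; q->y1 = y;   /* shift output history */
            buf[i] = y;
        }
    }
}
```

A 5 band EQ for one channel would use five stages; stereo needs a separate state array per channel.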

ehughes · « **Reply #3 on:** May 11, 2017, 08:56:43 pm »

With q31, GCC needs to be at -O2 or -O3 optimization levels to get the integer DSP instructions. -O0 effectively puts the chip into "m3" mode.

The high precision biquad function with q31 is nice as it uses the SMLAL instruction (at the correct optimization levels). This means you have 64-bit precision in the state variables and accumulator.
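
To illustrate why the 64-bit accumulator matters, here is a small sketch (function names are hypothetical, not CMSIS API) comparing a Q31 multiply-accumulate that keeps the full 64-bit products against one that truncates each product back to Q31 before adding.

```c
#include <stdint.h>

/* Accumulate full 64-bit products (Q2.62) and shift once at the end.
   At -O2/-O3 GCC can map the inner statement to SMLAL on a Cortex-M4. */
static int32_t dot_q31_acc64(const int32_t *a, const int32_t *b, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];
    return (int32_t)(acc >> 31);   /* back to Q31; no saturation in this sketch */
}

/* Truncate every product to Q31 before adding: each term loses its low 31 bits. */
static int32_t dot_q31_acc32(const int32_t *a, const int32_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)(((int64_t)a[i] * b[i]) >> 31);
    return acc;
}
```

With four products of 2^30 each, the first version returns 2 while the second returns 0, because every individual product rounds down to zero before it is summed. In an IIR filter this kind of loss feeds back through the state variables.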

https://github.com/ehughes/ESC-M4

Go to the ESC 2017 branch.

I have quite a bit of testing on the M4/M7.

Some Highlights:

1.) Use f32 over q31 if you have the floating point unit; it is always faster. If you want integer and speed, use q15.
2.) Compile with the hard-float ABI, especially if you are function call heavy.
3.) GCC needs to be at -O2 or -O3 for DSP code.
4.) GCC beats Keil in about 80% of test cases of the CMSIS library when comparing speed (clock cycles).
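
One way to sanity-check points 2 and 3 is to ask the compiler what it thinks it is building. The sketch below uses the ACLE predefined macros and GCC's __OPTIMIZE__; the function and enum names are hypothetical. For a Cortex-M4F, the flags would typically be something like -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -O2, but treat that as an assumption about your toolchain setup.

```c
#include <stdio.h>

enum { CFG_OPTIMIZED = 1, CFG_DSP = 2, CFG_HW_FP = 4, CFG_HARD_ABI = 8 };

/* Collect what the compiler reports about the current build. */
static unsigned build_config_flags(void)
{
    unsigned flags = 0;
#if defined(__OPTIMIZE__)       /* GCC/Clang: defined at -O1 and above */
    flags |= CFG_OPTIMIZED;
#endif
#if defined(__ARM_FEATURE_DSP)  /* ACLE: DSP extension instructions available */
    flags |= CFG_DSP;
#endif
#if defined(__ARM_FP)           /* ACLE: hardware floating point available */
    flags |= CFG_HW_FP;
#endif
#if defined(__ARM_PCS_VFP)      /* ACLE: hard-float calling convention in use */
    flags |= CFG_HARD_ABI;
#endif
    return flags;
}

/* Print the flags as 0/1 values, e.g. over a debug UART or semihosting. */
static void print_build_config(void)
{
    unsigned f = build_config_flags();
    printf("optimized=%d dsp=%d hw_fp=%d hard_abi=%d\n",
           !!(f & CFG_OPTIMIZED), !!(f & CFG_DSP),
           !!(f & CFG_HW_FP), !!(f & CFG_HARD_ABI));
}
```

Note that __OPTIMIZE__ only says optimization is on; it does not distinguish -O1 from -O2 or -O3, so the disassembly is still the real test.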

To see the effect of the optimization level on integers:

(The rand() and Printfs are nice markers)

Code: [Select]

#include "arm_math.h"
#include "stdio.h"
#include "stdlib.h"

int main()
{

	q63_t a=0;
	q31_t b=0;
	q31_t c=0;

	for(int i =0 ; i< 64 ; i++)
	{
		b = rand();
		c = rand();
		a+=(q63_t)b*c;
	}

	//Need this so the compiler Generates what we need!
	printf("%lld",(long long)a);

return 0;

}

-O0

1a0003e0 <main>:
1a0003e0:	b5b0      	push	{r4, r5, r7, lr}
1a0003e2:	b086      	sub	sp, #24
1a0003e4:	af00      	add	r7, sp, #0
1a0003e6:	f04f 0300 	mov.w	r3, #0
1a0003ea:	f04f 0400 	mov.w	r4, #0
1a0003ee:	e9c7 3404 	strd	r3, r4, [r7, #16]
1a0003f2:	2300      	movs	r3, #0
1a0003f4:	60bb      	str	r3, [r7, #8]
1a0003f6:	2300      	movs	r3, #0
1a0003f8:	607b      	str	r3, [r7, #4]
1a0003fa:	2300      	movs	r3, #0
1a0003fc:	60fb      	str	r3, [r7, #12]
1a0003fe:	e01f      	b.n	1a000440 <main+0x60>
1a000400:	f001 fc94 	bl	1a001d2c <rand>
1a000404:	60b8      	str	r0, [r7, #8]
1a000406:	f001 fc91 	bl	1a001d2c <rand>
1a00040a:	6078      	str	r0, [r7, #4]
1a00040c:	68bb      	ldr	r3, [r7, #8]
1a00040e:	4619      	mov	r1, r3
1a000410:	ea4f 72e1 	mov.w	r2, r1, asr #31
1a000414:	687b      	ldr	r3, [r7, #4]
1a000416:	ea4f 74e3 	mov.w	r4, r3, asr #31
1a00041a:	fb03 f502 	mul.w	r5, r3, r2
1a00041e:	fb01 f004 	mul.w	r0, r1, r4
1a000422:	4428      	add	r0, r5
1a000424:	fba1 3403 	umull	r3, r4, r1, r3
1a000428:	1902      	adds	r2, r0, r4
1a00042a:	4614      	mov	r4, r2
1a00042c:	e9d7 1204 	ldrd	r1, r2, [r7, #16]
1a000430:	185b      	adds	r3, r3, r1
1a000432:	eb44 0402 	adc.w	r4, r4, r2
1a000436:	e9c7 3404 	strd	r3, r4, [r7, #16]
1a00043a:	68fb      	ldr	r3, [r7, #12]
1a00043c:	3301      	adds	r3, #1
1a00043e:	60fb      	str	r3, [r7, #12]
1a000440:	68fb      	ldr	r3, [r7, #12]
1a000442:	2b3f      	cmp	r3, #63	; 0x3f
1a000444:	dddc      	ble.n	1a000400 <main+0x20>
1a000446:	e9d7 2304 	ldrd	r2, r3, [r7, #16]
1a00044a:	4804      	ldr	r0, [pc, #16]	; (1a00045c <main+0x7c>)
1a00044c:	f000 fd2e 	bl	1a000eac <printf>


-O3
1a0003fc <main>:
1a0003fc:	b5f8      	push	{r3, r4, r5, r6, r7, lr}
1a0003fe:	2440      	movs	r4, #64	; 0x40
1a000400:	2600      	movs	r6, #0
1a000402:	2700      	movs	r7, #0
1a000404:	f001 fc76 	bl	1a001cf4 <rand>
1a000408:	4605      	mov	r5, r0
1a00040a:	f001 fc73 	bl	1a001cf4 <rand>
1a00040e:	3c01      	subs	r4, #1
1a000410:	fbc0 6705 	smlal	r6, r7, r0, r5
1a000414:	d1f6      	bne.n	1a000404 <main+0x8>
1a000416:	4632      	mov	r2, r6
1a000418:	463b      	mov	r3, r7
1a00041a:	4802      	ldr	r0, [pc, #8]	; (1a000424 <main+0x28>)
1a00041c:	f000 fd2a 	bl	1a000e74 <printf>

Sal Ammoniac · « **Reply #4 on:** May 11, 2017, 11:36:18 pm »

Quote from: ehughes on May 11, 2017, 08:56:43 pm

4.) GCC beats Keil in about 80% of test cases of the CMSIS library when comparing speed (clock cycles).

Which Keil? The old compiler or the new one that's based on Clang/LLVM?

ehughes · « **Reply #5 on:** May 12, 2017, 10:05:11 pm »

5.06 (this is what 99 percent of existing customers are running)

I will be running tests eventually with LLVM once it is a bit more stable. The linker and other binary tools aren't quite ready. I have seen people generate object code with LLVM and then use GCC ld. The end to end toolchain was a bit rough last I checked (about 6 months ago). Maybe it is time to look again.

bobaruni · « **Reply #6 on:** May 22, 2017, 03:48:58 pm »

Ok, finally got some time to profile the difference between Q31 and F32 biquads.
Test was using GCC bare metal (EmBitz 1.10) with __FPU_PRESENT defined on an STM32F407 @ 168 MHz.
Final input and output buffers are int32_t.
Even though there is conversion to and from float when using F32 biquads, and no conversion (only casting) when using Q31 biquads:
Overall, F32 takes 20% fewer clock cycles (and therefore less time) than Q31.

I expected fixed point to be faster but the FPU hardware really helps.
Got to be pretty happy with that, thanks very much for the tips guys.
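
For anyone wanting to repeat this kind of measurement, one common approach on Cortex-M4 is the DWT cycle counter. The sketch below is an assumption about how such profiling could be done, not necessarily how the numbers above were taken; the function names are hypothetical, the register addresses are the standard ARMv7-M ones, and the host fallback exists only so the code compiles off-target.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_7EM__)
/* Cortex-M4/M7 debug registers (standard ARMv7-M addresses). */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

static void cycles_init(void)
{
    DEMCR     |= (1u << 24);  /* TRCENA: enable the DWT unit */
    DWT_CYCCNT = 0;
    DWT_CTRL  |= 1u;          /* CYCCNTENA: start the cycle counter */
}
static uint32_t cycles_now(void) { return DWT_CYCCNT; }
#else
/* Host fallback: not cycle accurate, only keeps the sketch buildable. */
#include <time.h>
static void cycles_init(void) {}
static uint32_t cycles_now(void) { return (uint32_t)clock(); }
#endif

/* Unsigned subtraction stays correct across a single counter wrap. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

Call cycles_init() once at startup, read cycles_now() before and after the filter call, and compare cycles_elapsed() for the Q31 and F32 builds at the same block size.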

bobaruni · « **Reply #7 on:** May 23, 2017, 03:19:45 pm »

Update!
By changing the loop that unpacks the interleaved int32 sample data to also convert to float and scale in the same pass, float is now more than 50% faster overall than Q31 with my biquads.
arm_q31_to_float, arm_scale_f32 and arm_float_to_q31 are no longer required.

Code: [Select]

// Unpack interleaved L and R samples, convert to float32 and scale 0.5:

int32_t *AudBuffOut = AudBuffIn;
uint32_t s;

for (s = 0; s < BLOCKSIZE; s++)
{
        outputF32L[s] = (float32_t)*AudBuffIn++ / 4294967296;
        outputF32R[s] = (float32_t)*AudBuffIn++ / 4294967296;
}

//DSP code here

// Convert float32 to int32 and interleave samples so that they are formatted for DMA to I2S transfer:
for (s = 0; s < BLOCKSIZE; s++)
{
        *AudBuffOut++ = outputF32L[s] * 2147483648;
        *AudBuffOut++ = outputF32R[s] * 2147483648;
}
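
One caveat on the multiply-and-cast in the output loop: it relies on every sample staying inside the int32 range after scaling. If an EQ boost pushes a sample past full scale, the float to int conversion is undefined behaviour in C. A minimal sketch of a clamping helper, assuming samples are nominally in the range -1.0 to +1.0 (the name float_to_i32_sat is hypothetical):

```c
#include <stdint.h>

/* Convert a float sample to int32 full scale with saturation. */
static int32_t float_to_i32_sat(float x)
{
    const float scaled = x * 2147483648.0f;           /* scale by 2^31 */
    if (scaled != scaled)          return 0;          /* NaN: output silence */
    if (scaled >= 2147483648.0f)   return INT32_MAX;  /* would overflow int32 */
    if (scaled <= -2147483648.0f)  return INT32_MIN;
    return (int32_t)scaled;                           /* in range: truncates toward zero */
}
```

The extra comparisons cost some cycles per sample, so it is worth profiling whether the headroom from the 0.5 input scaling already makes clipping impossible for the gains in use.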

