Author Topic: ARM CMSIS DSP biquad Q31 vs F32 performance  (Read 6523 times)


Offline bobaruni (Topic starter)

  • Regular Contributor
  • *
  • Posts: 156
  • Country: au
ARM CMSIS DSP biquad Q31 vs F32 performance
« on: May 02, 2017, 12:30:31 pm »
Hi All,
I have put together an STM32F4 based USB host audio player (FLAC, WAV + more CODECs later) that feeds into an ES9018K2M DAC via I2S.
I am currently using Q31 biquads (arm_biquad_cas_df1_32x64_q31 and arm_biquad_cascade_df1_q31) for a 5 band EQ and am quickly running out of CPU time to do much else (player stutters badly when debug is enabled).
Source data can be 16-24 bit stereo at up to 192 kHz.

As I'm new to CMSIS DSP and can't find much performance info on the web apart from this https://community.nxp.com/thread/327833, I'm trying to get an idea of other people's experience with:
1.  Which data type should take less time on an STM32F4 with built-in FPU, Q31 or F32?
2.  Are there any advantages in terms of precision, S/N ratio, THD, DNR etc. from using Q31 or F32?

Thanks in advance!
« Last Edit: May 02, 2017, 12:40:24 pm by bobaruni »
 

Offline blkdev2

  • Newbie
  • Posts: 6
  • Country: us
Re: ARM CMSIS DSP biquad Q31 vs F32 performance
« Reply #1 on: May 07, 2017, 11:41:53 pm »
Have you seen this app note from ST?

AN4814

Seems to be negligible difference in speed between fixed and float on the F4.
 
The following users thanked this post: bobaruni

Offline bobaruni (Topic starter)

  • Regular Contributor
  • *
  • Posts: 156
  • Country: au
Re: ARM CMSIS DSP biquad Q31 vs F32 performance
« Reply #2 on: May 08, 2017, 04:51:44 am »
Awesome, thanks!
So it seems F32 is very slightly faster than Q31 on FIR. That looks promising, so I will change the code to F32 and profile to see how much better it is in my application.
 

Offline ehughes

  • Frequent Contributor
  • **
  • Posts: 409
  • Country: us
Re: ARM CMSIS DSP biquad Q31 vs F32 performance
« Reply #3 on: May 11, 2017, 08:56:43 pm »
With q31, GCC needs to be at -O2 or -O3 optimization levels to get the integer DSP instructions.   -O0 effectively puts the chip into "m3" mode.

The high precision biquad function with q31 is nice as it uses the SMLAL instruction (at the correct optimization levels).   This means you have 64-bit precision in the state variables and accumulator.
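
To illustrate why the 64-bit accumulator matters, here is a minimal host-side sketch. It is not the CMSIS implementation, and the function names `dot_q31_acc64` and `dot_q31_acc32` are made up for this example. It compares summing Q31 products at full 64-bit width and scaling once at the end against truncating every product to 32 bits before adding. The 64-bit loop body is the kind of pattern a compiler can map onto SMLAL.

```c
#include <stdint.h>

/* Hypothetical helper: Q31 dot product with a 64-bit accumulator.
   Products are summed at full width and scaled back to Q31 only once.
   No saturation is applied; the final cast assumes the result fits.
   Right-shifting a negative value is implementation-defined in C
   (arithmetic shift on GCC and Clang). */
int32_t dot_q31_acc64(const int32_t *a, const int32_t *b, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int64_t)a[i] * b[i];   /* 32x32 -> 64-bit multiply-accumulate */
    }
    return (int32_t)(acc >> 31);
}

/* For comparison: truncate every product to Q31 before adding.
   The low bits of each product are thrown away on every tap. */
int32_t dot_q31_acc32(const int32_t *a, const int32_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t)(((int64_t)a[i] * b[i]) >> 31);
    }
    return acc;
}
```

With four inputs of 1 LSB each multiplied by 0.5 (0x40000000), the 64-bit version returns 2 while the truncating version returns 0, because every individual product is smaller than one Q31 LSB.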

https://github.com/ehughes/ESC-M4

Go to the ESC 2017 branch.

I have quite a bit of testing on the M4/M7.


Some Highlights:

1.)   Use f32 over Q31 if you have the floating point unit; it is always faster.   If you want integer and speed, use q15.
2.)   Compile with hard ABI, especially if you are function call heavy.
3.)   GCC needs to be at -O2 or -O3 for DSP code.
4.)   GCC beats Keil in about 80% of test cases of the CMSIS library when comparing speed (clock cycles).
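
One way to confirm that a build actually picked up the settings in points 2 and 3 is to check the compiler's predefined macros. The sketch below assumes GCC or Clang: `__OPTIMIZE__` is a GCC/Clang macro, and `__ARM_FEATURE_DSP`, `__ARM_FP` and `__ARM_PCS_VFP` come from the ARM C Language Extensions. The function name `build_config` and the `CFG_*` flags are made up for this example. For an STM32F4 the matching GCC options would typically be something like `-mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -O2`.

```c
/* Hypothetical flag names for this sketch. */
enum {
    CFG_OPTIMIZED = 1 << 0,  /* compiled with optimization enabled (not -O0) */
    CFG_DSP_EXT   = 1 << 1,  /* target advertises the DSP extension */
    CFG_FPU       = 1 << 2,  /* hardware floating point is available */
    CFG_HARD_ABI  = 1 << 3   /* floats are passed in FPU registers (hard ABI) */
};

/* Report how this translation unit was compiled, based only on
   predefined macros. On a non-ARM host the ARM bits stay clear. */
unsigned build_config(void)
{
    unsigned flags = 0;
#if defined(__OPTIMIZE__)
    flags |= CFG_OPTIMIZED;
#endif
#if defined(__ARM_FEATURE_DSP)
    flags |= CFG_DSP_EXT;
#endif
#if defined(__ARM_FP)
    flags |= CFG_FPU;
#endif
#if defined(__ARM_PCS_VFP)
    flags |= CFG_HARD_ABI;
#endif
    return flags;
}
```

Printing the result once at startup, or turning the checks into `#error` directives, makes it obvious when a debug configuration has quietly dropped back to -O0 or the soft-float ABI.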


To see the effect of the optimization level on integers:

(The rand() and printf calls are nice markers)

Code: [Select]
#include "arm_math.h"
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    q63_t a = 0;
    q31_t b = 0;
    q31_t c = 0;

    for (int i = 0; i < 64; i++)
    {
        b = rand();
        c = rand();
        a += (q63_t)b * c;   // 32x32 -> 64-bit multiply-accumulate
    }

    // Need this so the compiler generates what we need!
    printf("%lld", (long long)a);

    return 0;
}

-O0

1a0003e0 <main>:
1a0003e0: b5b0      push {r4, r5, r7, lr}
1a0003e2: b086      sub sp, #24
1a0003e4: af00      add r7, sp, #0
1a0003e6: f04f 0300 mov.w r3, #0
1a0003ea: f04f 0400 mov.w r4, #0
1a0003ee: e9c7 3404 strd r3, r4, [r7, #16]
1a0003f2: 2300      movs r3, #0
1a0003f4: 60bb      str r3, [r7, #8]
1a0003f6: 2300      movs r3, #0
1a0003f8: 607b      str r3, [r7, #4]
1a0003fa: 2300      movs r3, #0
1a0003fc: 60fb      str r3, [r7, #12]
1a0003fe: e01f      b.n 1a000440 <main+0x60>
1a000400: f001 fc94 bl 1a001d2c <rand>
1a000404: 60b8      str r0, [r7, #8]
1a000406: f001 fc91 bl 1a001d2c <rand>
1a00040a: 6078      str r0, [r7, #4]
1a00040c: 68bb      ldr r3, [r7, #8]
1a00040e: 4619      mov r1, r3
1a000410: ea4f 72e1 mov.w r2, r1, asr #31
1a000414: 687b      ldr r3, [r7, #4]
1a000416: ea4f 74e3 mov.w r4, r3, asr #31
1a00041a: fb03 f502 mul.w r5, r3, r2
1a00041e: fb01 f004 mul.w r0, r1, r4
1a000422: 4428      add r0, r5
1a000424: fba1 3403 umull r3, r4, r1, r3
1a000428: 1902      adds r2, r0, r4
1a00042a: 4614      mov r4, r2
1a00042c: e9d7 1204 ldrd r1, r2, [r7, #16]
1a000430: 185b      adds r3, r3, r1
1a000432: eb44 0402 adc.w r4, r4, r2
1a000436: e9c7 3404 strd r3, r4, [r7, #16]
1a00043a: 68fb      ldr r3, [r7, #12]
1a00043c: 3301      adds r3, #1
1a00043e: 60fb      str r3, [r7, #12]
1a000440: 68fb      ldr r3, [r7, #12]
1a000442: 2b3f      cmp r3, #63 ; 0x3f
1a000444: dddc      ble.n 1a000400 <main+0x20>
1a000446: e9d7 2304 ldrd r2, r3, [r7, #16]
1a00044a: 4804      ldr r0, [pc, #16] ; (1a00045c <main+0x7c>)
1a00044c: f000 fd2e bl 1a000eac <printf>


-O3
1a0003fc <main>:
1a0003fc: b5f8      push {r3, r4, r5, r6, r7, lr}
1a0003fe: 2440      movs r4, #64 ; 0x40
1a000400: 2600      movs r6, #0
1a000402: 2700      movs r7, #0
1a000404: f001 fc76 bl 1a001cf4 <rand>
1a000408: 4605      mov r5, r0
1a00040a: f001 fc73 bl 1a001cf4 <rand>
1a00040e: 3c01      subs r4, #1
1a000410: fbc0 6705 smlal r6, r7, r0, r5
1a000414: d1f6      bne.n 1a000404 <main+0x8>
1a000416: 4632      mov r2, r6
1a000418: 463b      mov r3, r7
1a00041a: 4802      ldr r0, [pc, #8] ; (1a000424 <main+0x28>)
1a00041c: f000 fd2a bl 1a000e74 <printf>





 
The following users thanked this post: bobaruni

Offline Sal Ammoniac

  • Super Contributor
  • ***
  • Posts: 1672
  • Country: us
Re: ARM CMSIS DSP biquad Q31 vs F32 performance
« Reply #4 on: May 11, 2017, 11:36:18 pm »
4.)   GCC beats Keil in about 80% of test cases of the CMSIS library when comparing speed (clock cycles).

Which Keil? The old compiler or the new one that's based on Clang/LLVM?
Complexity is the number-one enemy of high-quality code.
 

Offline ehughes

  • Frequent Contributor
  • **
  • Posts: 409
  • Country: us
Re: ARM CMSIS DSP biquad Q31 vs F32 performance
« Reply #5 on: May 12, 2017, 10:05:11 pm »
5.06 (this is what 99 percent of existing customers are running)


I will be running tests eventually with LLVM once it is a bit more stable.    The linker and other binary tools aren't quite ready.   I have seen people generate object code with LLVM and then use GCC ld.    The end to end toolchain was a bit rough last I checked (about 6 months ago).   Maybe it is time to look again.
« Last Edit: May 12, 2017, 10:21:05 pm by ehughes »
 

Offline bobaruni (Topic starter)

  • Regular Contributor
  • *
  • Posts: 156
  • Country: au
Re: ARM CMSIS DSP biquad Q31 vs F32 performance
« Reply #6 on: May 22, 2017, 03:48:58 pm »
Ok, finally got some time to profile the difference between Q31 and F32 biquads.
Test was using GCC bare metal (EmBitz 1.10) with __FPU_PRESENT defined on an STM32F407 @ 168 MHz.
Final input and output buffers are int32_t.
Even though the F32 biquads need conversion to and from float while the Q31 biquads need none (only casting), F32 overall takes 20% fewer clock cycles (and therefore less time) than Q31.

I expected fixed point to be faster but the FPU hardware really helps.
Got to be pretty happy with that, thanks very much for the tips guys :)
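
On the precision side of the conversion, float32 carries a 24-bit significand, so 16 to 24 bit source samples survive the trip through float exactly, while a full 32-bit value gets rounded. Here is a small host-side sketch of that, using power-of-two scaling; the function name `roundtrip_q31_via_float` is made up for this example and it says nothing about rounding inside the filter itself.

```c
#include <stdint.h>

/* Hypothetical helper: push an int32 sample through float32 and back.
   Scaling by 2^31 is exact in binary floating point, so any error comes
   from the int32 -> float conversion itself (24-bit significand). */
int32_t roundtrip_q31_via_float(int32_t x)
{
    float f = (float)x / 2147483648.0f;   /* int32 -> [-1.0, 1.0] */
    float back = f * 2147483648.0f;       /* back to integer scale */

    /* (float)INT32_MAX rounds up to 2^31, which no longer fits in int32,
       so clamp before casting. */
    if (back >= 2147483648.0f) {
        return INT32_MAX;
    }
    return (int32_t)back;
}
```

A left-justified 24-bit sample such as 0x12345600 comes back unchanged, whereas 0x12345679 comes back as 0x12345680 because values of that magnitude are spaced 32 apart in float32.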
 

Offline bobaruni (Topic starter)

  • Regular Contributor
  • *
  • Posts: 156
  • Country: au
Re: ARM CMSIS DSP biquad Q31 vs F32 performance
« Reply #7 on: May 23, 2017, 03:19:45 pm »
Update!
By changing the loop that unpacks the interleaved int32 sample data so that it also converts to float and scales in the same pass, float is now overall more than 50% faster than Q31 with my biquads.
arm_q31_to_float, arm_scale_f32 and arm_float_to_q31 are no longer required.

Code: [Select]
// Unpack interleaved L and R samples, convert to float32 and scale by 0.5:

int32_t *AudBuffOut = AudBuffIn;   // keep the start of the buffer for the in-place write-back
uint32_t s;

for (s = 0; s < BLOCKSIZE; s++)
{
        // Dividing by 2^32 (rather than 2^31) maps int32 to [-0.5, 0.5)
        outputF32L[s] = (float32_t)*AudBuffIn++ / 4294967296.0f;
        outputF32R[s] = (float32_t)*AudBuffIn++ / 4294967296.0f;
}

//DSP code here

// Convert float32 to int32 and interleave samples so that they are formatted for the DMA to I2S transfer:
for (s = 0; s < BLOCKSIZE; s++)
{
        // Assumes the DSP output stays within [-1.0, 1.0)
        *AudBuffOut++ = (int32_t)(outputF32L[s] * 2147483648.0f);
        *AudBuffOut++ = (int32_t)(outputF32R[s] * 2147483648.0f);
}
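
One thing to watch in the write-back loop: if an EQ band boosts the signal past full scale, the float value no longer fits in int32 and the cast is undefined behaviour in C. A possible variant that clamps first is sketched below. The names `float_to_q31_sat` and `interleave_to_q31` are made up for this example, NaN input is not handled, and I have not profiled what the extra compares cost on the F4.

```c
#include <stdint.h>

/* Hypothetical helper: convert a float sample (nominally [-1.0, 1.0))
   to full-scale int32, clamping out-of-range values instead of relying
   on the behaviour of an out-of-range cast. NaN is not handled. */
int32_t float_to_q31_sat(float x)
{
    float scaled = x * 2147483648.0f;
    if (scaled >= 2147483648.0f) {
        return INT32_MAX;
    }
    if (scaled <= -2147483648.0f) {
        return INT32_MIN;
    }
    return (int32_t)scaled;
}

/* Interleave separate L and R float buffers into L,R,L,R... int32
   samples, the layout used above for the DMA to I2S transfer. */
void interleave_to_q31(const float *left, const float *right,
                       int32_t *out, uint32_t n)
{
    for (uint32_t s = 0; s < n; s++) {
        *out++ = float_to_q31_sat(left[s]);
        *out++ = float_to_q31_sat(right[s]);
    }
}
```

In-range samples convert exactly as before (0.5 becomes 0x40000000), while anything at or beyond +1.0 or -1.0 pins to INT32_MAX or INT32_MIN rather than wrapping.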
 

