Author Topic: FFT processing using uCPU.  (Read 6663 times)


Online MasterT (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 785
  • Country: ca
FFT processing using uCPU.
« on: May 02, 2018, 10:14:14 pm »
I did a few projects in the past that were based on the FFT algorithm.
It was pointed out in another thread that Keil has a real/complex FFT library available online.
https://www.keil.com/pack/doc/CMSIS/DSP/html/group__RealFFT.html
https://github.com/ARM-software/CMSIS_5/releases/tag/5.3.0
From a brief review, I could say that the code is based on radix-2, the simplest and slowest implementation of the FFT.
I was wondering if someone has this code running on a Cortex-M3 (Atmel's SAM3X would be perfect, or an STM32F103) and could share performance results, something similar to what ST does in their UM0585 app. note.
ST's approach is radix-4, so I'd expect it to be 1.5 to 2x faster than Keil's version.

 


« Last Edit: May 07, 2018, 03:22:56 pm by MasterT »
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: FFT processing using uCPU.
« Reply #1 on: May 02, 2018, 10:38:51 pm »
Quote
I could say that the code is based on radix-2, the simplest and slowest implementation of the FFT.

Just by looking at the source code file list you will see that what you say about "radix-2 only" is simply not true:

https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/DSP/Source/TransformFunctions

Not only does it use radix-4 and radix-8, but for radix-4 with 16-bit precision (arm_cfft_radix4_q15.c) it uses SIMD instructions on CPUs that support them.

[edit] Some CMSIS performance figures are in ST AN4841.
« Last Edit: May 02, 2018, 10:54:20 pm by ogden »
 

Online iMo

  • Super Contributor
  • ***
  • Posts: 4784
  • Country: pm
  • It's important to try new things..
Re: FFT processing using uCPU.
« Reply #2 on: May 02, 2018, 10:54:47 pm »
Maybe this helps: some time back I built the CMSIS float32 FFT example "arm_fft_bin_example_f32"
(1024-point float32 FFT, FPU on, STM32F407ZET, input/output in single-precision float):

-O3, -g, 4.8.3-2014q1, 168MHz

FFT.. elapsed 700 microsecs
FFT Magn.. elapsed 211 microsecs
FFT Bins.. elapsed 61 microsecs

-O3, -g, 4.8.3-2014q1, 240MHz

FFT.. elapsed 490 microsecs
FFT Magn.. elapsed 149 microsecs
FFT Bins.. elapsed 43 microsecs
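
As a rough cross-check of the two runs above, elapsed time can be converted to CPU cycles (microseconds times clock in MHz). A minimal sketch in C; the helper name `us_to_cycles` is made up for illustration and is not part of CMSIS:

```c
#include <stdint.h>

/* Convert an elapsed time in microseconds to CPU cycles.
 * cycles = time[us] * clock[MHz], because 1 MHz is one cycle per microsecond. */
uint64_t us_to_cycles(uint32_t elapsed_us, uint32_t clock_mhz)
{
    return (uint64_t)elapsed_us * clock_mhz;
}
```

With the FFT figures above, 700 us at 168 MHz and 490 us at 240 MHz both come out at 117,600 cycles, so the FFT cost in cycles looks the same at both clock settings.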

AFAIK the q31 FFT is ~1.5x slower than the FFT using the single-precision FPU; see the q15/q31 performance tables (EDIT: actually in the same AN4841 as above).

FFT 1024 points on STM32F429 (180MHz):

q15 457us
q31 855us
single precision float 547us

Here is my result with 3 tones (math generated tones, single precision FPU, STM32F407) - you may see the single precision noise background (no windowing):
« Last Edit: May 03, 2018, 05:22:42 am by imo »
 
The following users thanked this post: ogden, MasterT

Online MasterT (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 785
  • Country: ca
Re: FFT processing using uCPU.
« Reply #3 on: May 03, 2018, 12:29:20 am »
Quote
I could say that the code is based on radix-2, the simplest and slowest implementation of the FFT.

Just by looking at the source code file list you will see that what you say about "radix-2 only" is simply not true:

https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/DSP/Source/TransformFunctions

Not only does it use radix-4 and radix-8, but for radix-4 with 16-bit precision (arm_cfft_radix4_q15.c) it uses SIMD instructions on CPUs that support them.

[edit] Some CMSIS performance figures are in ST AN4841.

We discussed the real FFT in another thread and my mental lock was still there, so what I meant was "Keil's CMSIS real FFT has a radix-2 implementation". I did not say "only", and you are right: Keil's DSP library does include radix-4 and radix-8 for the complex FFT. There is no radix-4 or radix-8 for the real FFT.

Thanks for the app note. I'd highlight a couple of lines:

1. From UM0585, Table 15:
Complex radix-4, 16-bit FFT, coefficients in RAM
72 MHz 2 wait states
1024 points: 100 180 cycles / 4.174 ms; 102 057 cycles / 2.126 ms; 127 318 cycles / 1.768 ms

2. From AN4841, Table 5 (FFT performance):
STM32F103, 72 MHz, Cortex-M3
Q31, 1024 points: 214098 cycles, 2973 us
Q15, 1024 points: 248936 cycles, 3457 us
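
The cycle counts and times quoted above can be sanity-checked against each other: dividing cycles by microseconds gives the clock in MHz that a pair implies. A small sketch, assuming each pair of numbers is (cycle count, execution time); `implied_clock_mhz` is a hypothetical helper, not something from either ST document:

```c
#include <stdint.h>

/* Clock frequency in MHz implied by a (cycles, microseconds) pair,
 * rounded to the nearest integer. time_us must be non-zero. */
uint32_t implied_clock_mhz(uint32_t cycles, uint32_t time_us)
{
    return (cycles + time_us / 2u) / time_us;
}
```

Under that assumption, both AN4841 pairs (214098 cycles in 2973 us, 248936 cycles in 3457 us) imply 72 MHz, while the three UM0585 pairs imply 24, 48 and 72 MHz. If that reading is right, only the last UM0585 pair (127 318 cycles, 1.768 ms) is directly comparable with the 72 MHz AN4841 numbers.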

Here is some benchmark data for my SplitRadix-9 running on an Arduino Due (Cortex-M3, 84 MHz), compiled in the Arduino IDE:

 * fft.1024
 * Hamng 453   Revb 383   SplitRR 1589   GainR 234   Sqrt 3430   Sqrt2 204     
 
Update: the Arduino Due (Atmel SAM3X8E, Cortex-M3) library is attached.
Feedback and comments are welcome.
« Last Edit: May 07, 2018, 03:25:49 pm by MasterT »
 

Online iMo

  • Super Contributor
  • ***
  • Posts: 4784
  • Country: pm
  • It's important to try new things..
Re: FFT processing using uCPU.
« Reply #4 on: May 03, 2018, 10:18:17 am »
How would your SR-9 perform with floating point (single precision FPU)?
 

Offline Karel

  • Super Contributor
  • ***
  • Posts: 2217
  • Country: 00
Re: FFT processing using uCPU.
« Reply #5 on: May 03, 2018, 10:35:23 am »
You could try & benchmark this one: https://sourceforge.net/projects/kissfft/
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: FFT processing using uCPU.
« Reply #6 on: May 03, 2018, 10:37:01 am »
Quote
1. from UM0585:  Table 15.

Thank you for the valuable input. I am struggling to find the source code for this app note. Do you have any pointers?

Quote
I will search if I still have a copy of my SplitRadix-9 for stm32f103 72 MHz, but I pretty sure my results is better than any numbers posted above.

Yes, please. Is it 16-bit Q15 precision? Is it your own implementation, or did you borrow it? If borrowed, please say where it came from as well.

 

Online MasterT (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 785
  • Country: ca
Re: FFT processing using uCPU.
« Reply #7 on: May 03, 2018, 11:45:30 am »
Quote
How would your SR-9 perform with floating point (single precision FPU)?

The code is written and highly optimized for integer-only uCPUs. I think the right name is Q15. There are some short notes in the header file; I mention 12 bits since the Arduino Due's SAM3X has a 12-bit ADC, but anything from 8 to 24 bits (is that Q8 and Q24?) should run. There may be overflow errors with large FFT sizes and 24-bit variables.
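
On the overflow remark: for real input samples, no output bin of an unscaled N-point DFT can exceed N times the largest input magnitude, so the result needs up to log2(N) more bits than the input. The sketch below only does this bookkeeping. The function names are invented for illustration, and it ignores intermediate twiddle-factor products and any per-stage scaling that a real fixed-point FFT might apply.

```c
#include <stdint.h>

/* floor(log2(n)) for n >= 1; exact when n is a power of two. */
unsigned ilog2_u32(uint32_t n)
{
    unsigned bits = 0;
    while (n > 1u) {
        n >>= 1;
        ++bits;
    }
    return bits;
}

/* Worst-case signed word width of an output bin for an unscaled
 * n_points DFT of real samples that fit in input_bits (signed).
 * Bound used: |X[k]| <= N * max|x[n]|. */
int fft_output_bits(int input_bits, uint32_t n_points)
{
    return input_bits + (int)ilog2_u32(n_points);
}

/* Non-zero if that worst case still fits in a word of word_bits. */
int fft_fits(int input_bits, uint32_t n_points, int word_bits)
{
    return fft_output_bits(input_bits, n_points) <= word_bits;
}
```

For example, 12-bit ADC samples and 1024 points need up to 22 bits, which fits in a 32-bit word, while 24-bit samples and 1024 points would need up to 34 bits and would not fit without scaling.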

I do have the original algorithm in floating-point math, and it was posted here:
http://www.stm32duino.com/viewtopic.php?f=28&t=1872 but since neither the STM32F103 nor the SAM3X has an FPU, I have no idea how fast it would run with one.

Quote
1. from UM0585:  Table 15.

Thank you for the valuable input. I am struggling to find the source code for this app note. Do you have any pointers?

Quote
I will search if I still have a copy of my SplitRadix-9 for stm32f103 72 MHz, but I pretty sure my results is better than any numbers posted above.

Yes, please. Is it 16-bit Q15 precision? Is it your own implementation, or did you borrow it? If borrowed, please say where it came from as well.

Have you checked the ST web site? If they removed it, there is a stm32f10x_stdperiph_lib.zip file on my hard drive dated 2012 (22 MB). I can't post it here but could upload it to Google Drive if needed.

About borrowing: I did get some inspiration from the floating-point source here: http://www.jjj.de/fft/fftpage.html
First of all, I fixed a few bugs; if you run some kind of text-diff utility between the copy from jjj.de and the copy in my post http://www.stm32duino.com/viewtopic.php?f=28&t=1872 you will find out exactly where.

The integer DSP library has close to zero resemblance to the original source, though out of respect I included references in the library header file.
« Last Edit: May 07, 2018, 03:27:51 pm by MasterT »
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: FFT processing using uCPU.
« Reply #8 on: May 03, 2018, 12:36:54 pm »
Quote
What about STM web site, have you checked?

Their website is a pain in the back, to put it politely. It seems they dropped their own DSP library in favor of CMSIS.


Quote
If they removed it, there is
stm32f10x_stdperiph_lib.zip file on my hard drive dated to 2012  22 Mb, I can't post it here but could upload to google-drive if needed.

Are you sure the DSP library is/was part of stdperiph.lib? I am asking because the current f10x Peripheral Library does not include anything about DSP. In short: I would be happy to receive a PM with the Google Drive URL.

[edit] Just noticed that the f3x peripheral libraries have a DSP part. [edit2] It's CMSIS DSP. So frustrating  :-//
« Last Edit: May 03, 2018, 12:43:47 pm by ogden »
 

Online JPortici

  • Super Contributor
  • ***
  • Posts: 3461
  • Country: it
Re: FFT processing using uCPU.
« Reply #9 on: May 03, 2018, 02:37:20 pm »
About 4 years ago I had to compute an FFT of a signal for an assignment. In that course we were using STM32, and at that time they were already using CMSIS for the DSP library.
 

Online MasterT (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 785
  • Country: ca
Re: FFT processing using uCPU.
« Reply #10 on: May 03, 2018, 03:19:25 pm »
Quote
If they removed it, there is
stm32f10x_stdperiph_lib.zip file on my hard drive dated to 2012  22 Mb, I can't post it here but could upload to google-drive if needed.
Are you sure DSP library is/was part of stdperiph.lib? I am asking because current f10x Peripheral Library does not include anything about DSP. In short: I would be happy to receive PM with google drive URL.
No, it doesn't include the DSP lib. Deleted. A quick Google search turns up:
http://users.ece.utexas.edu/~valvano/EE345M/ (scroll down to the example code / design tools section), where there is

http://users.ece.utexas.edu/~valvano/EE345M/STM32F10x_DSP_Lib_V2.0.0_setup.exe
 
The following users thanked this post: ogden

Offline mark03

  • Frequent Contributor
  • **
  • Posts: 711
  • Country: us
Re: FFT processing using uCPU.
« Reply #11 on: May 04, 2018, 04:08:26 am »
I'm using the CMSIS DSP library quite heavily in a work project.  There are parts of it, e.g. the FIR filtering functions, which are obviously not optimized well at all.  I was very surprised to find that my own filtering routine, written quickly in C without any attempt at optimization, was just as fast as theirs.  Far and away the most popular MCU architecture in the world, and no one, AFAIK, has sat down to hand-tune the core DSP functions in assembly.  That tells you something about the state of the world!  Cheaper to speed up the chip than make the SW efficient.  It makes me sad, but you can't argue with the economics.  I fantasize about being marooned on a tropical island for a year with nothing but a dev board, reference manual, and assembler, tasked with achieving the absolute minimum cycle count.

Still, when someone benchmarks a DSP library, they home in on the FFT functions, and for that reason alone, I would be surprised if the CMSIS FFTs were not at least within a factor of two of the theoretical best (over a broad average of power-of-two sizes).  It would be nice to see some hard data.  My biggest annoyance with the CMSIS FFT is not speed, but the fact that the forward transform trashes its input buffer.  As long as I can't re-use the input, I would rather sacrifice a few more cycles for an in-place version.
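
If the input samples are still needed after the transform, one workaround is to hand the transform a scratch copy, at the cost of one extra buffer and a memcpy. A generic sketch, not tied to the CMSIS API: the function-pointer type and the demo transform below are stand-ins invented here so the example runs on a host.

```c
#include <stddef.h>
#include <string.h>

/* Any transform that is allowed to overwrite its input buffer. */
typedef void (*destructive_transform_fn)(float *in, float *out, size_t n);

/* Run fn on a scratch copy so the caller's input stays intact.
 * scratch and out must each hold at least n floats. */
void transform_keep_input(destructive_transform_fn fn, const float *in,
                          float *scratch, float *out, size_t n)
{
    memcpy(scratch, in, n * sizeof *in);
    fn(scratch, out, n);
}

/* Hypothetical stand-in: doubles each sample and wipes its input,
 * mimicking a routine that uses the input buffer as workspace. */
void demo_destructive_transform(float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = 2.0f * in[i];
        in[i] = 0.0f;
    }
}

/* Returns 1 if the original input survives the call. */
int demo_input_survives(void)
{
    float in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float scratch[4];
    float out[4];

    transform_keep_input(demo_destructive_transform, in, scratch, out, 4);
    return in[0] == 1.0f && in[3] == 4.0f && out[0] == 2.0f && out[3] == 8.0f;
}
```

The trade-off is the one described above: preserving the input this way costs extra RAM and cycles, which is why an in-place transform would be the more economical choice when the input is going to be lost anyway.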
 

Offline daslolo

  • Regular Contributor
  • *
  • Posts: 63
  • Country: fr
  • I saw wifi signal once
Re: FFT processing using uCPU.
« Reply #12 on: May 04, 2018, 04:33:31 am »
« Last Edit: May 04, 2018, 04:37:56 am by daslolo »
nine nine nein
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: FFT processing using uCPU.
« Reply #13 on: May 04, 2018, 08:31:41 am »
Quote
That tells you something about the state of the world!  Cheaper to speed up the chip than make the SW efficient.

There's nothing wrong with the state of the world. The CMSIS (DSP) library just proves (again) that there's no such thing as a free lunch. If you want high-performance, high-quality DSP code for ARM, then you either invest your own development resources or buy it:

https://developer.arm.com/technologies/dsp/arm-dsp-ecosystem-partners

Note that this is a list of "official" ARM partners; there are a lot of other DSP development houses around.
 

Offline ehughes

  • Frequent Contributor
  • **
  • Posts: 409
  • Country: us
Re: FFT processing using uCPU.
« Reply #14 on: May 04, 2018, 10:34:28 am »
I do a talk at ESC on DSP with the M4/M7. I have attached some of the raw performance data in a spreadsheet. It should be self-explanatory. Everything is measured and reported in cycles (independent of clock rate).

Some comparisons included:

-Optimization Levels in GCC
-M4 vs M7
-hand tuned vs CMSIS
-GCC vs Keil
-Flash execution vs RAM execution.


Quote
Far and away the most popular MCU architecture in the world, and no one, AFAIK, has sat down to hand-tune the core DSP functions in assembly.

I can guarantee you 100% this is not true  :)    I am looking at my library right now.



« Last Edit: May 04, 2018, 10:47:44 am by ehughes »
 
The following users thanked this post: mark03, ogden

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: FFT processing using uCPU.
« Reply #15 on: May 04, 2018, 01:04:13 pm »
Has anyone tried the Ne10 library on "big" ARM CPUs that have NEON SIMD?
 

Online MasterT (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 785
  • Country: ca
Re: FFT processing using uCPU.
« Reply #16 on: May 04, 2018, 01:17:12 pm »
To ehughes: thanks for sharing, quite interesting. The data looks "real", though I haven't done any FFT tuning for a few years.
I checked the new uCPU dev boards available, and it seems the new trend is dual-core (or more) CPUs, sometimes with some magic trigonometry/accelerator block   :-//  plus dual core (TI). The LPC54114 is no different, which makes me wonder how clock cycles can be translated back to microseconds!?
And one more question: the spreadsheet says RFFT-f32 and RFFT-q15. The comments in Keil's source file say:
/**
 * @brief Processing function for the floating-point RFFT/RIFFT.
 * @deprecated Do not use this function.  It has been superceded by \ref arm_rfft_fast_f32 and will be removed
 * in the future.
 */
Is my understanding correct that the data in the spreadsheet is not based on CMSIS 5.3.0, which I linked in the first post? If so, could you please describe at least what rfft_f32 is in "meta-form"? Asking for the complete code wouldn't be polite.
Meta-form, like:
Code: [Select]
void arm_rfft_f32(
  const arm_rfft_instance_f32 * S,
  float32_t * pSrc,
  float32_t * pDst)
{
  const arm_cfft_radix4_instance_f32 *S_CFFT = S->pCfft;


  /* Calculation of Real IFFT of input */
  if (S->ifftFlagR == 1U)
  {
    /*  Real IFFT core process */
    arm_split_rifft_f32(pSrc, S->fftLenBy2, S->pTwiddleAReal,
                        S->pTwiddleBReal, pDst, S->twidCoefRModifier);


    /* Complex radix-4 IFFT process */
    arm_radix4_butterfly_inverse_f32(pDst, S_CFFT->fftLen,
                                     S_CFFT->pTwiddle,
                                     S_CFFT->twidCoefModifier,
                                     S_CFFT->onebyfftLen);

    /* Bit reversal process */
    if (S->bitReverseFlagR == 1U)
    {
      arm_bitreversal_f32(pDst, S_CFFT->fftLen,
                          S_CFFT->bitRevFactor, S_CFFT->pBitRevTable);
    }
  }
  else
  {

    /* Calculation of RFFT of input */

    /* Complex radix-4 FFT process */
    arm_radix4_butterfly_f32(pSrc, S_CFFT->fftLen,
                             S_CFFT->pTwiddle, S_CFFT->twidCoefModifier);

    /* Bit reversal process */
    if (S->bitReverseFlagR == 1U)
    {
      arm_bitreversal_f32(pSrc, S_CFFT->fftLen,
                          S_CFFT->bitRevFactor, S_CFFT->pBitRevTable);
    }


    /*  Real FFT core process */
    arm_split_rfft_f32(pSrc, S->fftLenBy2, S->pTwiddleAReal,
                       S->pTwiddleBReal, pDst, S->twidCoefRModifier);
  }

}
 

Offline mark03

  • Frequent Contributor
  • **
  • Posts: 711
  • Country: us
Re: FFT processing using uCPU.
« Reply #17 on: May 04, 2018, 02:41:40 pm »
I do a talk at ESC on DSP with the M4/M7 .   I have attached some of the raw performance data on a spreadsheet.    Should be self explanatory.    Everything is measured/reported in cycles (not clock rate).
Thanks for sharing this.  I only saw one chart for CMSIS vs something else ("Eli"?), on IIR filters.  It would be interesting to see more comparisons of CMSIS vs alternatives, particularly for the FFT.  But all of the charts are pretty interesting.  In particular, I don't see evidence to back up the oft-repeated claim that $$$ compilers do a lot better than gcc.  Your results show that it's quite unpredictable.

I can guarantee you 100% this is not true  :)    I am looking at my library right now.
I stand corrected.  Well, I was being intentionally hyperbolic too, obviously somebody somewhere has done this, but the impression I had of the commercial DSP outfits was that they were principally offering higher-level blocks, things like MP3 and JPEG codecs, rather than general-purpose libraries to replace CMSIS-DSP.  In discussion forums you never see "should I use CMSIS-DSP or should I pay for xxxx instead?"  AFAICT nothing has gained comparable popularity in the ARM community to, e.g., Intel Performance Primitives in the x86 world.  You would think the purveyors of such libraries would be promoting their benchmarks against CMSIS-DSP like crazy.  That is what prompted the comment that, by and large, people aren't interested in, say, a 2x performance differential.
 

Offline ehughes

  • Frequent Contributor
  • **
  • Posts: 409
  • Country: us
Re: FFT processing using uCPU.
« Reply #18 on: May 04, 2018, 03:10:16 pm »
Sorry, the spreadsheet titles were not updated. arm_rfft_fast_f32 was the actual function used. At the time I did the exercise, I was using CMSIS 4.5 (not much has changed since then).

Here is a snippet from the test code (sorry if the formatting gets mangled). The cycle counter is calibrated, so I subtract out the time it takes to instrument the function.
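
The calibration idea mentioned above can be sketched as follows: read the counter twice back to back to estimate the instrumentation overhead, then subtract that from each measurement. This is not the code behind the macros in the snippet, just a host-runnable illustration. The counter read is passed in as a function pointer; on a Cortex-M3/M4 it would typically return the DWT cycle counter (DWT->CYCCNT), which is an assumption here since the post does not say which counter is used. `fake_now` and `fake_work` are stand-ins so the example can run off-target.

```c
#include <stdint.h>

typedef uint32_t (*read_cycles_fn)(void);
typedef void (*work_fn)(void);

/* Cost of the instrumentation itself: two back-to-back counter reads. */
uint32_t timer_overhead(read_cycles_fn now)
{
    uint32_t t0 = now();
    uint32_t t1 = now();
    return t1 - t0;            /* unsigned math survives counter wrap */
}

/* Cycles spent in work(), with the read overhead subtracted out. */
uint32_t measure_cycles(read_cycles_fn now, work_fn work)
{
    uint32_t overhead = timer_overhead(now);
    uint32_t t0 = now();
    work();
    uint32_t t1 = now();
    uint32_t elapsed = t1 - t0;
    return (elapsed > overhead) ? (elapsed - overhead) : 0u;
}

/* Host-side stand-ins: every counter read "costs" 3 cycles and the
 * workload "costs" exactly 1000 cycles. */
static uint32_t fake_cycle_count;

uint32_t fake_now(void)
{
    fake_cycle_count += 3u;
    return fake_cycle_count;
}

void fake_work(void)
{
    fake_cycle_count += 1000u;
}
```

With the stand-ins, the raw difference is 1003 and the calibrated result is 1000, which is the kind of correction that matters most for the small FFT sizes.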

The "RAM" execution gives a high-fidelity number for what any ARM M4 can do. There may be some small differences in flash execution speed between vendors (ST, NXP), but I would not expect much difference except when cache is involved (which only applies to a handful of parts). All of these tests used only M4 instructions (no peripheral accelerators). Simply multiply the cycle count by your clock period.

Some other takeaways I found when doing the experiment:

-Don't purchase Keil if you think the code will be faster. It is not. In fact, don't purchase it for any reason other than that it is the first thing to support new chips, or if you like the middleware. The scatter files are a bit more user-friendly than GCC linker command files. (I did not have a license for IAR, but I expect it to be the same as Keil: a shitty IDE that costs 10k.)
-GCC doesn't use the DSP instructions unless you get to -O2 or -O3, especially the fancy fixed-point ones like SMLAL.
-In some cases (like the per-sample PID code), a human could not do much better than the optimizer. Looking at the compiled results, trying to do better by hand would not make economic sense.
-In other cases, you can do a bit better than CMSIS. In my case, I needed a per-sample IIR and CMSIS uses block processing. Stripping away the block-handling calls for the single-sample case yields some good results.
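
To illustrate the per-sample point in the last item, here is what a single-sample biquad (second-order IIR section) can look like with no block loop at all. This is a generic transposed Direct Form II sketch, not the hand-tuned code referred to above; the type and function names are invented, and the coefficients follow the convention H(z) = (b0 + b1 z^-1 + b2 z^-2) / (1 + a1 z^-1 + a2 z^-2).

```c
/* Coefficients and state of one biquad section. */
typedef struct {
    float b0, b1, b2;   /* feed-forward coefficients */
    float a1, a2;       /* feedback coefficients (a0 normalized to 1) */
    float s1, s2;       /* state variables, start at 0 */
} biquad_t;

/* Process exactly one input sample and return one output sample. */
float biquad_step(biquad_t *f, float x)
{
    float y = f->b0 * x + f->s1;
    f->s1 = f->b1 * x - f->a1 * y + f->s2;
    f->s2 = f->b2 * x - f->a2 * y;
    return y;
}
```

Called once per ADC sample (for example from a timer or ADC interrupt), this avoids the loop setup and pointer bookkeeping that a block-oriented API has to do even when the block length is 1.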


Code: [Select]
#ifdef ENABLE_RFFT_NBR
CM_PRINTF("\r\n");
CM_PRINTF("RFFT-f32-NoBitReverse,");

CM_PRINTF("n/a"); COMMA;
/* One timed run per RFFT length, 32 to 4096 points (powers of two). */
for (uint16_t fftLen = 32; fftLen <= 4096; fftLen *= 2)
{
    /* Initialization is kept outside the timed region. */
    arm_rfft_fast_init_f32(&FFT_Inst.rfft_fast_f32, fftLen);
    START_CYCLE_TIMER;
    arm_rfft_fast_f32(&FFT_Inst.rfft_fast_f32, &InputData.f32[0], &OutputData.f32[0], 0);
    REPORT_CYCLE_TIMER; COMMA;
}
#endif /* ENABLE_RFFT_NBR */
« Last Edit: May 04, 2018, 03:11:51 pm by ehughes »
 

Offline daslolo

  • Regular Contributor
  • *
  • Posts: 63
  • Country: fr
  • I saw wifi signal once
Re: FFT processing using uCPU.
« Reply #20 on: May 07, 2018, 04:42:52 am »
I stumbled on this; maybe it's usable on an MCU:
http://www.fftw.org/
nine nine nein
 

Online MasterT (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 785
  • Country: ca
Re: FFT processing using uCPU.
« Reply #21 on: May 07, 2018, 11:12:53 pm »
Things have changed since 2014. When I wrote my library for the Arduino Due, there was not much of an alternative.
I did some testing with CMSIS arm_rfft_fast_f32, and to my surprise it runs astonishingly fast. A Nucleo STM32F303RE board (Cortex-M4, 72 MHz) completes a 1024-point FFT in 1.2 milliseconds, about twice as fast as my record timing.
Inspecting the code more thoroughly, I found that they optimize everything I could think of: sin/cos LUTs for every FFT size, LUTs for bit reversal, indexing by pointers, etc.
BTW, what is the benchmark score for M3/M4 trigonometry functions? I can't find any data online. I think I saw something a while ago on Keil's web site, but there is nothing like that now.
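
As an illustration of the bit-reversal LUT idea mentioned above, here is a minimal sketch that builds such a table once and then reorders a buffer with plain lookups. It is not the CMSIS table layout (CMSIS ships precomputed tables in its own format); the function names are made up, and the data is treated as one scalar per index to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill table[i] with the bit-reversed value of i over log2n bits.
 * table must hold 2^log2n entries; log2n must be at most 16. */
void build_bitrev_table(uint16_t *table, unsigned log2n)
{
    size_t n = (size_t)1 << log2n;
    for (size_t i = 0; i < n; ++i) {
        uint16_t rev = 0;
        for (unsigned bit = 0; bit < log2n; ++bit) {
            if (i & ((size_t)1 << bit)) {
                rev |= (uint16_t)(1u << (log2n - 1u - bit));
            }
        }
        table[i] = rev;
    }
}

/* Reorder data in place using the precomputed table.
 * Each pair (i, table[i]) is swapped exactly once. */
void bitrev_permute(float *data, const uint16_t *table, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        size_t j = table[i];
        if (j > i) {
            float tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

For 8 points the table is 0, 4, 2, 6, 1, 5, 3, 7. Building it costs time once at init; after that the reordering is just loads, compares and swaps, which is the kind of trade (flash or RAM for cycles) the LUT approach makes.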
 
The following users thanked this post: daslolo

