Sorry, the spreadsheet titles were not updated. arm_rfft_fast_f32 was the actual function used. At the time when I did the exercise, I was using CMSIS 4.5 (not much has changed since then).
Here is a snippet from the test code (sorry if the formatting gets mangled). The cycle counter is calibrated, so I subtract out the time it takes to instrument the function.
The "RAM" execution gives a high-fidelity number for what any ARM M4 can do. There may be some small difference in flash execution speeds between vendors (ST, NXP), but I would not expect much difference except when cache is involved (which only applies to a handful of parts). All of these tests used only M4 instructions (no peripheral accelerators). Simply multiply the cycle count by your clock period to get the execution time.
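As a minimal sketch of that last step, the conversion from cycles to time might look like the following. The function name and the numbers in the usage note are hypothetical, for illustration only; they are not measured results.

```c
#include <stdint.h>

/* Convert a measured cycle count to microseconds.
   time = cycles * clock_period = cycles / core_clock_hz */
static double cycles_to_us(uint32_t cycles, double core_clock_hz)
{
    return (double)cycles * 1.0e6 / core_clock_hz;
}
```

For example, a (made-up) count of 12000 cycles on a part clocked at 120 MHz would give cycles_to_us(12000u, 120.0e6), which is 100 microseconds.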
Some other takeaways I found when doing the experiment:
Don't purchase Keil if you think the code will be faster. It is not. In fact don't purchase it for any reason other than it is the 1st thing to support new chips and if you like the middleware. The scatter files are a bit more user friendly than GCC linker command files.
(I did not have a license for IAR, but I expect it to be the same as Keil. A shitty IDE that costs 10k)
GCC doesn't use the DSP instructions unless you get to -O2 or -O3, especially the fancy fixed-point ones like SMLAL.
In some cases (like the per-sample PID code), a human could not do much better than an optimizer. Looking at the recompiled results, trying to do better by hand would not make economic sense.
In other cases, you can do a bit better than CMSIS. In my case, I needed a per-sample IIR and CMSIS uses block processing. In these cases, stripping away the block handling calls in the case of a single sample yields some good results.
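To illustrate the per-sample idea, here is a minimal sketch of a single-sample IIR step, assuming a second-order (biquad) Direct Form I section. The type and function names are hypothetical and are not part of CMSIS. The feedback terms are added rather than subtracted, which matches the coefficient sign convention documented for the CMSIS biquad functions, so coefficients prepared for those should carry over.

```c
#include <stdint.h>

/* Hypothetical per-sample biquad state (Direct Form I); not a CMSIS type. */
typedef struct {
    float b0, b1, b2;   /* feed-forward coefficients */
    float a1, a2;       /* feedback coefficients (already negated, CMSIS-style) */
    float x1, x2;       /* previous two inputs  */
    float y1, y2;       /* previous two outputs */
} biquad_f32_t;

/* Process exactly one sample: no block loop, no block-size bookkeeping. */
static inline float biquad_step(biquad_f32_t *s, float x)
{
    float y = s->b0 * x + s->b1 * s->x1 + s->b2 * s->x2
            + s->a1 * s->y1 + s->a2 * s->y2;

    /* Shift the delay line. */
    s->x2 = s->x1;
    s->x1 = x;
    s->y2 = s->y1;
    s->y1 = y;
    return y;
}
```

Calling biquad_step() once per sample from the sample interrupt avoids the loop setup and pointer handling that a block-oriented routine performs even when the block size is 1.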
#ifdef ENABLE_RFFT_NBR
CM_PRINTF("\r\n");
CM_PRINTF("RFFT-f32-NoBitReverse,");
CM_PRINTF("n/a");COMMA;
arm_rfft_fast_init_f32(&FFT_Inst.rfft_fast_f32,32);START_CYCLE_TIMER;arm_rfft_fast_f32(&FFT_Inst.rfft_fast_f32,&InputData.f32[0], &OutputData.f32[0], 0);REPORT_CYCLE_TIMER;COMMA;
arm_rfft_fast_init_f32(&FFT_Inst.rfft_fast_f32,64);START_CYCLE_TIMER;arm_rfft_fast_f32(&FFT_Inst.rfft_fast_f32,&InputData.f32[0], &OutputData.f32[0], 0);REPORT_CYCLE_TIMER;COMMA;
arm_rfft_fast_init_f32(&FFT_Inst.rfft_fast_f32,128);START_CYCLE_TIMER;arm_rfft_fast_f32(&FFT_Inst.rfft_fast_f32,&InputData.f32[0], &OutputData.f32[0], 0);REPORT_CYCLE_TIMER;COMMA;
arm_rfft_fast_init_f32(&FFT_Inst.rfft_fast_f32,256);START_CYCLE_TIMER;arm_rfft_fast_f32(&FFT_Inst.rfft_fast_f32,&InputData.f32[0], &OutputData.f32[0], 0);REPORT_CYCLE_TIMER;COMMA;
arm_rfft_fast_init_f32(&FFT_Inst.rfft_fast_f32,512);START_CYCLE_TIMER;arm_rfft_fast_f32(&FFT_Inst.rfft_fast_f32,&InputData.f32[0], &OutputData.f32[0], 0);REPORT_CYCLE_TIMER;COMMA;
arm_rfft_fast_init_f32(&FFT_Inst.rfft_fast_f32,1024);START_CYCLE_TIMER;arm_rfft_fast_f32(&FFT_Inst.rfft_fast_f32,&InputData.f32[0], &OutputData.f32[0], 0);REPORT_CYCLE_TIMER;COMMA;
arm_rfft_fast_init_f32(&FFT_Inst.rfft_fast_f32,2048);START_CYCLE_TIMER;arm_rfft_fast_f32(&FFT_Inst.rfft_fast_f32,&InputData.f32[0], &OutputData.f32[0], 0);REPORT_CYCLE_TIMER;COMMA;
arm_rfft_fast_init_f32(&FFT_Inst.rfft_fast_f32,4096);START_CYCLE_TIMER;arm_rfft_fast_f32(&FFT_Inst.rfft_fast_f32,&InputData.f32[0], &OutputData.f32[0], 0);REPORT_CYCLE_TIMER;COMMA;
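
The definitions of START_CYCLE_TIMER and REPORT_CYCLE_TIMER are not shown in the snippet. As a rough sketch of the calibration idea only (measure an empty start/stop pair once, then subtract that overhead from every reading), something like the following could work. All names here are hypothetical. The counter is read through a function pointer so the sketch also builds on a host; on a Cortex-M4 the read function would typically return DWT->CYCCNT after the cycle counter has been enabled. The fake_read() function is only a host stand-in that pretends each read costs 3 cycles.

```c
#include <stdint.h>

/* Hypothetical hook returning the current cycle count. */
typedef uint32_t (*cycle_read_fn)(void);

typedef struct {
    cycle_read_fn read;
    uint32_t overhead;  /* cycles consumed by an empty start/stop pair */
    uint32_t start;     /* counter value captured at start */
} cycle_timer_t;

static inline void cycle_timer_start(cycle_timer_t *t)
{
    t->start = t->read();
}

/* Returns elapsed cycles with the instrumentation overhead removed. */
static inline uint32_t cycle_timer_stop(const cycle_timer_t *t)
{
    uint32_t elapsed = t->read() - t->start;  /* unsigned math handles wrap */
    return (elapsed > t->overhead) ? (elapsed - t->overhead) : 0u;
}

/* Measure a back-to-back start/stop: whatever it reports is pure overhead. */
static void cycle_timer_calibrate(cycle_timer_t *t, cycle_read_fn read)
{
    t->read = read;
    t->overhead = 0u;
    cycle_timer_start(t);
    t->overhead = cycle_timer_stop(t);
}

/* Host-only stand-in for a hardware counter: each read "costs" 3 cycles. */
static uint32_t fake_cycles;
static uint32_t fake_read(void)
{
    fake_cycles += 3u;
    return fake_cycles;
}
```

After calibration, wrapping a function call between cycle_timer_start() and cycle_timer_stop() reports only the cycles spent in the function itself, which is the number to multiply by the clock period.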