Author Topic: why FFT takes so long to run on a Cortex VS an FPGA  (Read 4584 times)

0 Members and 1 Guest are viewing this topic.

Offline ali_asadzadehTopic starter

  • Super Contributor
  • ***
  • Posts: 2134
  • Country: ca
why FFT takes so long to run on a Cortex VS an FPGA
« on: November 20, 2020, 01:42:19 pm »
Hi,
I have made some tests to compare the results with a Gowin and a Cortex M7 part
FFT size is 256 and it has 16bit data,

The code for the Cortex M is located in this tread, nothing special
https://www.eevblog.com/forum/microcontrollers/why-arm-cortex-m-fft-functions-for-q15-are-not-working!/

Also I have done it with Gowin FPGA and it has used only 2 multipliers and a few Block Ram's to hold the twiddle factors,

The Cortex M needs around 18K of clock pulses, and Gowin with almost no logic (around 1100 LUT), can do it around 1600 clock pulses! about 10X faster >:D

So what's the bottleneck in the Cortex, is there a way to make it faster?
ASiDesigner, Stands for Application specific intelligent devices
I'm a Digital Expert from 8-bits to 64-bits
 

Offline chris_leyson

  • Super Contributor
  • ***
  • Posts: 1556
  • Country: wales
Re: why FFT takes so long to run on a Cortex VS an FPGA
« Reply #1 on: November 20, 2020, 02:30:25 pm »
All of the additional instructions to move data into and out of ram accounts for a substantial overhead whereas in an FPGA the addressing can be done on the fly. DSP chips have dedicated address generators which take care of moving the data around but might only have a single ALU so there is quite a bit of overhead if you're doing a complex multiply or complex multiply accumulate. In an FPGA you can build a custom ALU to do complex MAC operations in a single clock cycle. Address generators and the parallelism of FPGAs really do speed things up. Draw out the signal flow graph for a complex butterfly and count up the number of data paths for both real and imaginary data. In an FPGA you can move all of that data around in a single clock cycle whereas a general purpose processor might only have a single data path and only one ALU. DSP chips will have two data paths and two ALUs. It's all about data flow.
 

Offline Doctorandus_P

  • Super Contributor
  • ***
  • Posts: 5076
  • Country: nl
Re: why FFT takes so long to run on a Cortex VS an FPGA
« Reply #2 on: November 20, 2020, 02:34:05 pm »
ARM Cortex is quite a big family with different instruction sets.
Maybe some compiler setting is wrong and  and not all capabilities of your uC are used.

If you want to check timings. A quite simple yet powerful method is to use any of your uC peripherals to send out some data from some parts of your program and then us a logic analyser to catch that data and measure the timing. You can also toggle an I/O pin each time a certain ISR is triggered. All kinds of variations are simple to do and give lots of insight in the program timing while have negligible impact on the overal speed of the uC, which can be very important for hard real-time control systems.
« Last Edit: November 20, 2020, 02:41:53 pm by Doctorandus_P »
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 17550
  • Country: fr
Re: why FFT takes so long to run on a Cortex VS an FPGA
« Reply #3 on: November 20, 2020, 03:08:40 pm »
The official source code is there, if you want to see what it does: https://github.com/ARM-software/CMSIS/tree/master/CMSIS/DSP_Lib/Source/TransformFunctions

A software-based FFT will almost never be as "fast" (cycle-wise) as a pure digital version - that's why we make accelerators. True DSPs are better architectured for this, but even so... you'll be hard-pressed to do better than a fully custom solution.

Now that's the general idea. But when dealing with FPGAs, which have their limitations, that's all a matter of compromise.
One of them is speed: the FPGA-based solution may take ten times fewer cycles - but what's the max frequency it can run at? Compare it to the frequency a typical Cortex M7 can run? Cortex M7 MCU can often run at 400 MHz or over. I highly doubt your FFT implementation on this GOWIN FPGA can be clocked at 400 MHz. So what's the max freq you can achieve? So it may still be faster in the end, but likely not 10 times.

Then, it also depends on the number format. For Q15, the FPGA winning may be a no-brainer, but consider you now have to implement floating point FFT. This would take considerable amounts of logic on a typical FPGA, and would likely have to be heavily pipelined in order to reach reasonable frequencies - thus a FP FFT is a lot less likely to be "faster" on FPGA (especially on the one you use, that may be different if you're using the latest Virtex stuff...) than the Cortex M7 (with embedded FPU) counterpart...

Now if your goal is to achieve faster FFT on a Cortex M7, this is likely possible. Looking at the CMSIS source code, I'd say it's definitely possible to write something more optimized for your particular case. You'll need to be familiar with FFT and fixed-point arithmetic.

Just a final thought: you're doing 256-point FFT, I have a hard time understanding how it can be completed in only 1600 cycles on FPGA? The number of iterations for the FFT algorithm is N*log(N), here that is 256*8 = 2048. I can't understand how you can execute a 256-point FFT in less than 2048 cycles? Have you paralleled iterations in the "inner" loop? (I still don't understand for sure the 1600 cycles...)
With  those figures in mind, 18k cycles for the MCU version doesn't look bad: that's about 9 cycles per iteration (including memory access, arithmetic ops, and looping!), so you'd probably need to be very good at hand-crafting assembly code to do better than this...


« Last Edit: November 20, 2020, 03:21:46 pm by SiliconWizard »
 

Offline ali_asadzadehTopic starter

  • Super Contributor
  • ***
  • Posts: 2134
  • Country: ca
Re: why FFT takes so long to run on a Cortex VS an FPGA
« Reply #4 on: November 20, 2020, 03:35:43 pm »
The Fmax on Gowin is around 250MHz, And yes you can do it around 1600 Cylcles on an FPGA,
Here is the radix-2 version on ISE and xilinx


And you can achieve even better with radix-4 (almost 2X the speed)

 

Need even more speed why note use Double Clock FFT
https://github.com/ZipCPU/dblclockfft
« Last Edit: November 20, 2020, 03:39:14 pm by ali_asadzadeh »
ASiDesigner, Stands for Application specific intelligent devices
I'm a Digital Expert from 8-bits to 64-bits
 

Offline chris_leyson

  • Super Contributor
  • ***
  • Posts: 1556
  • Country: wales
Re: why FFT takes so long to run on a Cortex VS an FPGA
« Reply #5 on: November 20, 2020, 04:15:20 pm »
Radix-2 Cooley-Tukey requires 0.5 N log2 (N) complex multiplications so that's 1024 complex multiplications for 256 point CFFT. A complex multiplication needs 4 real multiplications and 2 additions or 3 multiplications and 5 additions using Gauss' method. You could of course use the two for the price of one algorithm and compute an N point RFFT with an N/2 point CFFT. Using the two for the price of one algorithm you could use a 128 point CFFT so that's 0.5 * 128 log2(128) = 448 complex multiplications or 1792 real multiplications. Using 2 multipliers you could reduce the number of clock cycles. I need to work out the number of additions for the two for the price of one algorithm but 1600 clock cycles does not seem unreasonable.

http://www.robinscheibler.org/2013/02/13/real-fft.html

EDIT: Thanks ali_asadzadeh you beat me to it  :)
« Last Edit: November 20, 2020, 04:17:07 pm by chris_leyson »
 
The following users thanked this post: ali_asadzadeh

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 17550
  • Country: fr
Re: why FFT takes so long to run on a Cortex VS an FPGA
« Reply #6 on: November 20, 2020, 04:39:29 pm »
I had assumed you had access to the HDL code for FFT, but if it's IP stuff, I don't know. Maybe Xilinx generates readable HDL for this - would have to check, just to see what they do. (Looking in the 'ipcore_dir'...)

Can a Radix-2 , Q15 FFT really run at 250 MHz on this GOWIN FPGA? If so, I'm impressed as I didn't expect this on those. Still keep in mind what I said - that's still not as fast as a typical Cortex M7, so you can't compare cycles, except theoretically (but the theoretical explanation was given and is quite simple, there's significant overhead in sofftware compared to custom logic.)

To explain the ~1600 cycles, I had to re-read my FFT implementation I wrote a long time ago (and still use). N*Log(N) is the order of the number of iterations we all know, but to get the exact number of iterations, it's a bit more complex.

The required number of iterations for a 256, radix-2 FFT is actually exactly 1024. (Out of curiosity: it's 2304 cycles for a 512-FFT and 5120 for a 1024-FFT.) Of course for a radix-4 version, that's basically half the iterations.

The 1670 cycles are probably explained by the pipelining implemented for the FPGA version. And the 18k cycles for the MCU version, that's now ~18 cycles/iteration on average. Still not bad for what has to be done, but then it looks more "optimizable", especially if you use the DSP extension (normally available on this Cortex M7?) in hand-crafted assembly. Again, take a look at the source code (the CMSIS one or another, lots of FFT implementations out there) to understand what it takes for each iteration of FFT for a software implementation.

 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 17550
  • Country: fr
Re: why FFT takes so long to run on a Cortex VS an FPGA
« Reply #7 on: November 20, 2020, 05:24:06 pm »
Radix-2 Cooley-Tukey requires 0.5 N log2 (N) complex multiplications so that's 1024 complex multiplications for 256 point CFFT.

Yes. Difference between the big-O notation and the exact number of iterations. Had forgotten the 0.5 factor. As I said in the above post, I'm curious what explains exactly the 1670 cycles, although it seems obvious it's related to pipelining and/or initialization of some kind.

Now thinking of the bit reverse addressing: you can of course use a lookup table (which is what I did for a general-purpose software implementation), but some DSPs have a bit reverse addressing mode, which can further optimize things. Likewise, on FPGA, you may also directly "bit reverse" the index instead of using a lookup table. Depending on the number of bits of course, it will cost less than a table and may be faster (this point probably hugely depends on the number of bits!)
 

Offline chris_leyson

  • Super Contributor
  • ***
  • Posts: 1556
  • Country: wales
Re: why FFT takes so long to run on a Cortex VS an FPGA
« Reply #8 on: November 21, 2020, 07:45:47 am »
@SiliconWizard. I could well be wrong about the 0.5 N log2(N) complex multiplications, will have to check it's been a while.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 17550
  • Country: fr
Re: why FFT takes so long to run on a Cortex VS an FPGA
« Reply #9 on: November 21, 2020, 04:11:53 pm »
@SiliconWizard. I could well be wrong about the 0.5 N log2(N) complex multiplications, will have to check it's been a while.

I myself was only talking about the number of *iterations*, and not the number of arithmetic operations. And it's indeed 0.5.N.log2(N) *iterations* for the vanilla radix-2 FFT. As I said above, I had to check (because I also wasn't sure...)

Each radix-2 complex FFT iteration requires 4 multiplications. On FPGA, depending on how you pipeline the whole thing you can use fewer than 4 multipliers. In software, having SIMD and/or multiply-and-add instructions can significantly reduce execution time.

 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 8299
  • Country: fi
    • My home page and email address
Re: why FFT takes so long to run on a Cortex VS an FPGA
« Reply #10 on: November 22, 2020, 06:00:42 am »
2-point DFT is \$[ x_0 + x_1 , x_0 - x_1 ]\$, and 4-point is \$[ x_0 + x_1 + x_2 + x_3 , x_0 - i x_1 - x_2 + i x_3 , x_0 - x_1 + x_2 - x_3 , x_0 + i x_1 - x_2 - i x_3 ]\$.  All other lengths require multiplication by a nontrivial complex number.

It's been a long time since I last dealt with DFTs (since I haven't done any FPGA stuff yet).  In application-land, FFTW library is often used (it's licensed under GPL, but MIT also sells other licenses for it).  For specific length DFTs, one can use "wisdom" to find the most efficient method of calculating the DFT; it's slow to find out, but massively speeds up the actual DFT calculation.

However, in C, the radix-2 FFT of 256 real samples can be implemented as
Code: [Select]
const complex double w256[256]; /* w256[k] = cexp(-M_PI*I*k/128.0) */

static inline size_t bitreverse8(size_t); /* Reverse bits in an 8-bit value */

void fft256(complex double y[256], const double x[256])
{
    for (size_t k = 0; k < 256; k+=2) {
        const double  u = x[bitreverse8(k)];
        const double  t = x[bitreverse8(k+1)];
        y[k] = u + t;
        y[k+1] = u - t;
    }

    for (size_t s = 2; s < 8; s++) {
        const size_t  dk = 1 << s;
        const size_t  jn = dk >> 1;
        const size_t  dw = 256 >> s;
        for (size_t k = 0; k < 256; k += dk) {
            size_t  wi = 0;
            for (size_t j = 0; j < jn; j++) {
                const complex double  t = w256[wi] * y[j + k + jn];
                const complex double  u = y[j + k];
                wi = (wi + wd) & 255;
                y[k + j] = u + t;
                y[k + j + jn] = u - t;
            }
        }
    }
}

static inline size_t bitreverse8(size_t value)
{
    size_t  result = value;
    result = ((result >> 4) & 0x0F) | ((result & 0x0F) << 4);
    result = ((result >> 2) & 0x33) | ((result & 0x33) << 2);
    result = ((result >> 1) & 0x55) | ((result & 0x55) << 1);
    return result;
}
using the example in the Wikipedia page on Cooley-Tukey FFT algorithm.  This can be trivially adapted to any power-of-two FFT size, noting that 28 = 256.

For 256 points, it boils down to 768 complex multiplications (or 3072 real ones, since \$(a + I b)(c + I d) = (a c - b d) + I(a d + b c)\$).  The complex coefficients are essentially powers of complex roots of one; so, just a complex number of unit magnitude rotating around the complex plane.  The multiplication by w256[wi] really just rotates the multiplicand on the complex plane by 360×wi/256 degrees.

Among those 3072 real multiplications, there are only 129 unique real constants, if one ignores the sign, but includes 0.0 and 1.0.  So, technically, if you implement the 127 unique real scalers, you don't need a multiplication unit at all, only additions and subtractions.  (In binary form, those scalars have 1/3 to 2/3 of their bits set.)

One can do some of the k and/or j loop iterations in parallel, too.  (Note that the first, separate k loop is not just reordering data, it is actually doing the first DFT iteration as well.  The logical multiplication constant is w256[0] == 1.0 here, and is therefore omitted.)
« Last Edit: November 22, 2020, 06:06:03 am by Nominal Animal »
 
The following users thanked this post: ali_asadzadeh

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2860
  • Country: nz
Re: why FFT takes so long to run on a Cortex VS an FPGA
« Reply #11 on: November 22, 2020, 06:37:50 am »
I just spent the last few nights implementing FFT from just a general understanding of how it works.

The math and symmetry behind it is quite neat. Combining the results of two interleaved FFTs to double the frequency resolution. Also found a few gotcha that really aided my understanding.

 Would recommend others give it a go too.

Here is my code if anybody wants to point and laugh. https://github.com/hamsternz/my_fft

Oh, and I was quite chuffed to discover how you can use FFT with data sets that isn't a power of two.  (e.g. If you needed exactly 1Hz bins on 48k data)

« Last Edit: November 22, 2020, 06:48:43 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 
The following users thanked this post: GromBeestje

Offline ali_asadzadehTopic starter

  • Super Contributor
  • ***
  • Posts: 2134
  • Country: ca
Re: why FFT takes so long to run on a Cortex VS an FPGA
« Reply #12 on: November 22, 2020, 09:17:04 am »
Thanks for sharing! >:D >:D >:D
ASiDesigner, Stands for Application specific intelligent devices
I'm a Digital Expert from 8-bits to 64-bits
 


Share me

Digg  Facebook  SlashDot  Delicious  Technorati  Twitter  Google  Yahoo
Smf