Author Topic: why FFT takes so long to run on a Cortex VS an FPGA (Read 4584 times)

ali_asadzadeh · « **on:** November 20, 2020, 01:42:19 pm »

Hi,
I have made some tests to compare the results with a Gowin and a Cortex M7 part
FFT size is 256 and it has 16bit data,

The code for the Cortex M is located in this tread, nothing special
https://www.eevblog.com/forum/microcontrollers/why-arm-cortex-m-fft-functions-for-q15-are-not-working!/

Also I have done it with Gowin FPGA and it has used only 2 multipliers and a few Block Ram's to hold the twiddle factors,

The Cortex M needs around 18K of clock pulses, and Gowin with almost no logic (around 1100 LUT), can do it around 1600 clock pulses! about 10X faster

So what's the bottleneck in the Cortex, is there a way to make it faster?

chris_leyson · « **Reply #1 on:** November 20, 2020, 02:30:25 pm »

All of the additional instructions to move data into and out of ram accounts for a substantial overhead whereas in an FPGA the addressing can be done on the fly. DSP chips have dedicated address generators which take care of moving the data around but might only have a single ALU so there is quite a bit of overhead if you're doing a complex multiply or complex multiply accumulate. In an FPGA you can build a custom ALU to do complex MAC operations in a single clock cycle. Address generators and the parallelism of FPGAs really do speed things up. Draw out the signal flow graph for a complex butterfly and count up the number of data paths for both real and imaginary data. In an FPGA you can move all of that data around in a single clock cycle whereas a general purpose processor might only have a single data path and only one ALU. DSP chips will have two data paths and two ALUs. It's all about data flow.

Doctorandus_P · « **Reply #2 on:** November 20, 2020, 02:34:05 pm »

ARM Cortex is quite a big family with different instruction sets.
Maybe some compiler setting is wrong and and not all capabilities of your uC are used.

If you want to check timings. A quite simple yet powerful method is to use any of your uC peripherals to send out some data from some parts of your program and then us a logic analyser to catch that data and measure the timing. You can also toggle an I/O pin each time a certain ISR is triggered. All kinds of variations are simple to do and give lots of insight in the program timing while have negligible impact on the overal speed of the uC, which can be very important for hard real-time control systems.

SiliconWizard · « **Reply #3 on:** November 20, 2020, 03:08:40 pm »

The official source code is there, if you want to see what it does: https://github.com/ARM-software/CMSIS/tree/master/CMSIS/DSP_Lib/Source/TransformFunctions

A software-based FFT will almost never be as "fast" (cycle-wise) as a pure digital version - that's why we make accelerators. True DSPs are better architectured for this, but even so... you'll be hard-pressed to do better than a fully custom solution.

Now that's the general idea. But when dealing with FPGAs, which have their limitations, that's all a matter of compromise.
One of them is speed: the FPGA-based solution may take ten times fewer cycles - but what's the max frequency it can run at? Compare it to the frequency a typical Cortex M7 can run? Cortex M7 MCU can often run at 400 MHz or over. I highly doubt your FFT implementation on this GOWIN FPGA can be clocked at 400 MHz. So what's the max freq you can achieve? So it may still be faster in the end, but likely not 10 times.

Then, it also depends on the number format. For Q15, the FPGA winning may be a no-brainer, but consider you now have to implement floating point FFT. This would take considerable amounts of logic on a typical FPGA, and would likely have to be heavily pipelined in order to reach reasonable frequencies - thus a FP FFT is a lot less likely to be "faster" on FPGA (especially on the one you use, that may be different if you're using the latest Virtex stuff...) than the Cortex M7 (with embedded FPU) counterpart...

Now if your goal is to achieve faster FFT on a Cortex M7, this is likely possible. Looking at the CMSIS source code, I'd say it's definitely possible to write something more optimized for your particular case. You'll need to be familiar with FFT and fixed-point arithmetic.

Just a final thought: you're doing 256-point FFT, I have a hard time understanding how it can be completed in only 1600 cycles on FPGA? The number of iterations for the FFT algorithm is N*log(N), here that is 256*8 = 2048. I can't understand how you can execute a 256-point FFT in less than 2048 cycles? Have you paralleled iterations in the "inner" loop? (I still don't understand for sure the 1600 cycles...)
With those figures in mind, 18k cycles for the MCU version doesn't look bad: that's about 9 cycles per iteration (including memory access, arithmetic ops, and looping!), so you'd probably need to be very good at hand-crafting assembly code to do better than this...

ali_asadzadeh · « **Reply #4 on:** November 20, 2020, 03:35:43 pm »

The Fmax on Gowin is around 250MHz, And yes you can do it around 1600 Cylcles on an FPGA,
Here is the radix-2 version on ISE and xilinx

And you can achieve even better with radix-4 (almost 2X the speed)

Need even more speed why note use Double Clock FFT
https://github.com/ZipCPU/dblclockfft

chris_leyson · « **Reply #5 on:** November 20, 2020, 04:15:20 pm »

Radix-2 Cooley-Tukey requires 0.5 N log2 (N) complex multiplications so that's 1024 complex multiplications for 256 point CFFT. A complex multiplication needs 4 real multiplications and 2 additions or 3 multiplications and 5 additions using Gauss' method. You could of course use the two for the price of one algorithm and compute an N point RFFT with an N/2 point CFFT. Using the two for the price of one algorithm you could use a 128 point CFFT so that's 0.5 * 128 log2(128) = 448 complex multiplications or 1792 real multiplications. Using 2 multipliers you could reduce the number of clock cycles. I need to work out the number of additions for the two for the price of one algorithm but 1600 clock cycles does not seem unreasonable.

http://www.robinscheibler.org/2013/02/13/real-fft.html

EDIT: Thanks ali_asadzadeh you beat me to it

SiliconWizard · « **Reply #6 on:** November 20, 2020, 04:39:29 pm »

I had assumed you had access to the HDL code for FFT, but if it's IP stuff, I don't know. Maybe Xilinx generates readable HDL for this - would have to check, just to see what they do. (Looking in the 'ipcore_dir'...)

Can a Radix-2 , Q15 FFT really run at 250 MHz on this GOWIN FPGA? If so, I'm impressed as I didn't expect this on those. Still keep in mind what I said - that's still not as fast as a typical Cortex M7, so you can't compare cycles, except theoretically (but the theoretical explanation was given and is quite simple, there's significant overhead in sofftware compared to custom logic.)

To explain the ~1600 cycles, I had to re-read my FFT implementation I wrote a long time ago (and still use). N*Log(N) is the order of the number of iterations we all know, but to get the exact number of iterations, it's a bit more complex.

The required number of iterations for a 256, radix-2 FFT is actually exactly 1024. (Out of curiosity: it's 2304 cycles for a 512-FFT and 5120 for a 1024-FFT.) Of course for a radix-4 version, that's basically half the iterations.

The 1670 cycles are probably explained by the pipelining implemented for the FPGA version. And the 18k cycles for the MCU version, that's now ~18 cycles/iteration on average. Still not bad for what has to be done, but then it looks more "optimizable", especially if you use the DSP extension (normally available on this Cortex M7?) in hand-crafted assembly. Again, take a look at the source code (the CMSIS one or another, lots of FFT implementations out there) to understand what it takes for each iteration of FFT for a software implementation.

SiliconWizard · « **Reply #7 on:** November 20, 2020, 05:24:06 pm »

Quote from: chris_leyson on November 20, 2020, 04:15:20 pm

Radix-2 Cooley-Tukey requires 0.5 N log2 (N) complex multiplications so that's 1024 complex multiplications for 256 point CFFT.

Yes. Difference between the big-O notation and the exact number of iterations. Had forgotten the 0.5 factor. As I said in the above post, I'm curious what explains exactly the 1670 cycles, although it seems obvious it's related to pipelining and/or initialization of some kind.

Now thinking of the bit reverse addressing: you can of course use a lookup table (which is what I did for a general-purpose software implementation), but some DSPs have a bit reverse addressing mode, which can further optimize things. Likewise, on FPGA, you may also directly "bit reverse" the index instead of using a lookup table. Depending on the number of bits of course, it will cost less than a table and may be faster (this point probably hugely depends on the number of bits!)

chris_leyson · « **Reply #8 on:** November 21, 2020, 07:45:47 am »

@SiliconWizard. I could well be wrong about the 0.5 N log2(N) complex multiplications, will have to check it's been a while.

SiliconWizard · « **Reply #9 on:** November 21, 2020, 04:11:53 pm »

Quote from: chris_leyson on November 21, 2020, 07:45:47 am

@SiliconWizard. I could well be wrong about the 0.5 N log2(N) complex multiplications, will have to check it's been a while.

I myself was only talking about the number of *iterations*, and not the number of arithmetic operations. And it's indeed 0.5.N.log2(N) *iterations* for the vanilla radix-2 FFT. As I said above, I had to check (because I also wasn't sure...)

Each radix-2 complex FFT iteration requires 4 multiplications. On FPGA, depending on how you pipeline the whole thing you can use fewer than 4 multipliers. In software, having SIMD and/or multiply-and-add instructions can significantly reduce execution time.

Nominal Animal · « **Reply #10 on:** November 22, 2020, 06:00:42 am »

2-point DFT is \$[ x_0 + x_1 , x_0 - x_1 ]\$, and 4-point is \$[ x_0 + x_1 + x_2 + x_3 , x_0 - i x_1 - x_2 + i x_3 , x_0 - x_1 + x_2 - x_3 , x_0 + i x_1 - x_2 - i x_3 ]\$. All other lengths require multiplication by a nontrivial complex number.

It's been a long time since I last dealt with DFTs (since I haven't done any FPGA stuff yet). In application-land, FFTW library is often used (it's licensed under GPL, but MIT also sells other licenses for it). For specific length DFTs, one can use "wisdom" to find the most efficient method of calculating the DFT; it's slow to find out, but massively speeds up the actual DFT calculation.

However, in C, the radix-2 FFT of 256 real samples can be implemented as

Code: [Select]

const complex double w256[256]; /* w256[k] = cexp(-M_PI*I*k/128.0) */

static inline size_t bitreverse8(size_t); /* Reverse bits in an 8-bit value */

void fft256(complex double y[256], const double x[256])
{
    for (size_t k = 0; k < 256; k+=2) {
        const double  u = x[bitreverse8(k)];
        const double  t = x[bitreverse8(k+1)];
        y[k] = u + t;
        y[k+1] = u - t;
    }

    for (size_t s = 2; s < 8; s++) {
        const size_t  dk = 1 << s;
        const size_t  jn = dk >> 1;
        const size_t  dw = 256 >> s;
        for (size_t k = 0; k < 256; k += dk) {
            size_t  wi = 0;
            for (size_t j = 0; j < jn; j++) {
                const complex double  t = w256[wi] * y[j + k + jn];
                const complex double  u = y[j + k];
                wi = (wi + wd) & 255;
                y[k + j] = u + t;
                y[k + j + jn] = u - t;
            }
        }
    }
}

static inline size_t bitreverse8(size_t value)
{
    size_t  result = value;
    result = ((result >> 4) & 0x0F) | ((result & 0x0F) << 4);
    result = ((result >> 2) & 0x33) | ((result & 0x33) << 2);
    result = ((result >> 1) & 0x55) | ((result & 0x55) << 1);
    return result;
}

using the example in the Wikipedia page on Cooley-Tukey FFT algorithm. This can be trivially adapted to any power-of-two FFT size, noting that 2⁸ = 256.

For 256 points, it boils down to 768 complex multiplications (or 3072 real ones, since \$(a + I b)(c + I d) = (a c - b d) + I(a d + b c)\$). The complex coefficients are essentially powers of complex roots of one; so, just a complex number of unit magnitude rotating around the complex plane. The multiplication by w256[wi] really just rotates the multiplicand on the complex plane by 360×wi/256 degrees.

Among those 3072 real multiplications, there are only 129 unique real constants, if one ignores the sign, but includes 0.0 and 1.0. So, technically, if you implement the 127 unique real scalers, you don't need a multiplication unit at all, only additions and subtractions. (In binary form, those scalars have 1/3 to 2/3 of their bits set.)

One can do some of the k and/or j loop iterations in parallel, too. (Note that the first, separate k loop is not just reordering data, it is actually doing the first DFT iteration as well. The logical multiplication constant is w256[0] == 1.0 here, and is therefore omitted.)

hamster_nz · « **Reply #11 on:** November 22, 2020, 06:37:50 am »

I just spent the last few nights implementing FFT from just a general understanding of how it works.

The math and symmetry behind it is quite neat. Combining the results of two interleaved FFTs to double the frequency resolution. Also found a few gotcha that really aided my understanding.

Would recommend others give it a go too.

Here is my code if anybody wants to point and laugh. https://github.com/hamsternz/my_fft

Oh, and I was quite chuffed to discover how you can use FFT with data sets that isn't a power of two. (e.g. If you needed exactly 1Hz bins on 48k data)

ali_asadzadeh · « **Reply #12 on:** November 22, 2020, 09:17:04 am »

Thanks for sharing!


EEVblog® Main Site	EEVblog® on Youtube	EEVblog® on Twitter	EEVblog® on Facebook	EEVblog® on Odysee

EEVblog® Electronics Community Forum

Author Topic: why FFT takes so long to run on a Cortex VS an FPGA (Read 4584 times)

ali_asadzadeh

why FFT takes so long to run on a Cortex VS an FPGA

chris_leyson

Re: why FFT takes so long to run on a Cortex VS an FPGA

Doctorandus_P

Re: why FFT takes so long to run on a Cortex VS an FPGA

SiliconWizard

Re: why FFT takes so long to run on a Cortex VS an FPGA

ali_asadzadeh

Re: why FFT takes so long to run on a Cortex VS an FPGA

chris_leyson

Re: why FFT takes so long to run on a Cortex VS an FPGA

SiliconWizard

Re: why FFT takes so long to run on a Cortex VS an FPGA

SiliconWizard

Re: why FFT takes so long to run on a Cortex VS an FPGA

chris_leyson

Re: why FFT takes so long to run on a Cortex VS an FPGA

SiliconWizard

Re: why FFT takes so long to run on a Cortex VS an FPGA

Nominal Animal

Re: why FFT takes so long to run on a Cortex VS an FPGA

hamster_nz

Re: why FFT takes so long to run on a Cortex VS an FPGA

ali_asadzadeh

Re: why FFT takes so long to run on a Cortex VS an FPGA

Share me