2-point DFT is \$[ x_0 + x_1 , x_0 - x_1 ]\$, and 4-point is \$[ x_0 + x_1 + x_2 + x_3 , x_0 - i x_1 - x_2 + i x_3 , x_0 - x_1 + x_2 - x_3 , x_0 + i x_1 - x_2 - i x_3 ]\$. All other lengths require multiplication by a nontrivial complex number.
It's been a long time since I last dealt with DFTs (since I haven't done any FPGA stuff yet). In application-land,
FFTW library is often used (it's licensed under GPL, but MIT also sells other licenses for it). For specific length DFTs, one can use "wisdom" to find the most efficient method of calculating the DFT; it's slow to find out, but massively speeds up the actual DFT calculation.
However, in C, the radix-2 FFT of 256 real samples can be implemented as
const complex double w256[256]; /* w256[k] = cexp(-M_PI*I*k/128.0) */
static inline size_t bitreverse8(size_t); /* Reverse bits in an 8-bit value */
void fft256(complex double y[256], const double x[256])
{
for (size_t k = 0; k < 256; k+=2) {
const double u = x[bitreverse8(k)];
const double t = x[bitreverse8(k+1)];
y[k] = u + t;
y[k+1] = u - t;
}
for (size_t s = 2; s < 8; s++) {
const size_t dk = 1 << s;
const size_t jn = dk >> 1;
const size_t dw = 256 >> s;
for (size_t k = 0; k < 256; k += dk) {
size_t wi = 0;
for (size_t j = 0; j < jn; j++) {
const complex double t = w256[wi] * y[j + k + jn];
const complex double u = y[j + k];
wi = (wi + wd) & 255;
y[k + j] = u + t;
y[k + j + jn] = u - t;
}
}
}
}
static inline size_t bitreverse8(size_t value)
{
size_t result = value;
result = ((result >> 4) & 0x0F) | ((result & 0x0F) << 4);
result = ((result >> 2) & 0x33) | ((result & 0x33) << 2);
result = ((result >> 1) & 0x55) | ((result & 0x55) << 1);
return result;
}
using the
example in the Wikipedia page on Cooley-Tukey FFT algorithm. This can be trivially adapted to any power-of-two FFT size, noting that 2
8 = 256.
For 256 points, it boils down to 768 complex multiplications (or 3072 real ones, since \$(a + I b)(c + I d) = (a c - b d) + I(a d + b c)\$). The complex coefficients are essentially powers of complex roots of one; so, just a complex number of unit magnitude rotating around the complex plane. The multiplication by
w256[wi] really just rotates the multiplicand on the complex plane by 360×
wi/256 degrees.
Among those 3072 real multiplications, there are only 129 unique real constants, if one ignores the sign, but includes 0.0 and 1.0. So, technically, if you implement the 127 unique real scalers, you don't need a multiplication unit at all, only additions and subtractions. (In binary form, those scalars have 1/3 to 2/3 of their bits set.)
One can do some of the
k and/or
j loop iterations in parallel, too. (Note that the first, separate
k loop is not just reordering data, it is actually doing the first DFT iteration as well. The logical multiplication constant is w256[0] == 1.0 here, and is therefore omitted.)