Please explain how the FFT works for prime length series. Where is the computational savings?
First of all, you implied you're familiar with FFTW, but apparently haven't even read their
invited IEEE paper from 2005. If you had, you'd know FFTW implements three prime length algorithms.
Let me describe the theoretical background in a way that everyone can follow.
The mathematical basis for the discrete Fourier transform, equivalently the discrete time Fourier transform of a periodic signal with period \$N\$, is
$$X_k = \sum_{n=0}^{N-1} x_n e^{-2 \pi i \frac{n k}{N}}, \quad k = 0, \dots N-1$$
where \$x_n\$ are the samples in the time domain, and \$X_k\$ are the samples in the frequency domain. The similarities to both the Fourier series (a sum) and the Fourier transform (an integral) are no accident, but to understand exactly why, we would need to delve deeper into the math, including the sampling theorem.
Exploiting the fact that many of the coefficients \$e^{-2 \pi i \frac{n k}{N}}\$ repeat has nothing to do with the fast Fourier transform; it just means that such coefficients can be precalculated and stored efficiently if one were to evaluate the sums naïvely. (If you had, say, an FPGA or ASIC with many parallel multiply-add units, you could use the sum directly for a fixed-\$N\$ DFT with minimal latency.)
If you consider the amount of computation for all \$X_k\$, you'll see that it involves \$(N-1)^2\$ non-trivial complex multiplications and a comparable number of additions. This gives it a time complexity of \$O(N^2)\$. Time complexity is not about real-world speed for any given \$N\$; it is about how the cost scales as \$N\$ increases: implement the naïve algorithm, and you'll perform on the order of \$N^2\$ multiplications.
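To make the counting concrete, here is a minimal Python/NumPy sketch of the naïve evaluation (my own illustration; the coefficients are precomputed into a matrix exactly as described above, and the test length 13 is arbitrary):

```python
import numpy as np

def naive_dft(x):
    """Direct evaluation of the DFT sum; O(N^2) multiply-adds."""
    N = len(x)
    n = np.arange(N)
    # Precomputed N x N table of coefficients e^{-2 pi i n k / N}.
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)
    return W @ x                      # N^2 complex multiplications

# Sanity check against a library FFT, for a prime length:
x = np.random.randn(13) + 1j * np.random.randn(13)
assert np.allclose(naive_dft(x), np.fft.fft(x))
```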
What we call Fast Fourier Transform is a class of algorithms that can reach time complexity \$O(N \log N)\$.
Cooley-Tukey FFT works for composite \$N\$, and is most effective when \$N\$ is a product of small prime factors. At its core, it lets you calculate the DFT of \$N = N_1 N_2\$ samples by calculating \$N_1\$ DFTs of length \$N_2\$ and \$N_2\$ DFTs of length \$N_1\$, plus \$N\$ twiddle-factor multiplications in between. When either factor is itself composite, you can recurse, which gives the \$O(N \log N)\$ time complexity.
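The simplest case is radix-2 decimation in time, \$N = 2^m\$: split into even and odd samples, transform each half, and recombine with \$N/2\$ twiddle factors. A toy Python/NumPy version (not how any production FFT is actually written, but it shows the recursion):

```python
import numpy as np

def fft_radix2(x):
    """Toy radix-2 decimation-in-time Cooley-Tukey FFT; requires N = 2^m."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    assert N % 2 == 0, "this toy version only handles powers of two"
    even = fft_radix2(x[0::2])        # N/2-point DFT of the even samples
    odd  = fft_radix2(x[1::2])        # N/2-point DFT of the odd samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([even + twiddle * odd,
                           even - twiddle * odd])

x = np.random.randn(16)
assert np.allclose(fft_radix2(x), np.fft.fft(x))
```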
There are two widely used algorithms for prime \$N\$,
Rader's and
Bluestein's FFT algorithms, which both reach \$O(N \log N)\$ time complexity. Yes, they tend to be slower than Cooley-Tukey-based approaches, but they scale equally well: the ratio tends to stay roughly constant as \$N\$ increases. (There is also the prime-factor (Good–Thomas) FFT, which works for \$N = N_1 N_2\$ where \$N_1\$ and \$N_2\$ are relatively prime (their greatest common divisor is 1) and is distinct from Cooley–Tukey, but it isn't relevant here for prime \$N\$.)
Both Rader's and Bluestein's FFT algorithms are based on
convolution. It is important to realize that this rests on a fundamental property of the Fourier transform, the convolution theorem, and is not an approximation of any sort.
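Since everything below hinges on it, here is the convolution theorem checked numerically (a small NumPy illustration of my own; the prime length 7 is arbitrary):

```python
import numpy as np

def circular_convolution(a, b):
    """Direct O(N^2) cyclic convolution: (a * b)[n] = sum_m a[m] b[(n-m) mod N]."""
    N = len(a)
    return np.array([sum(a[m] * b[(n - m) % N] for m in range(N))
                     for n in range(N)])

a = np.random.randn(7)                      # prime length on purpose
b = np.random.randn(7)
direct   = circular_convolution(a, b)
via_ffts = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))   # convolution theorem
assert np.allclose(direct, via_ffts)        # exact up to rounding
```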
Rader's exploits the fact that we can express a DFT of prime size \$N\$ using only a DFT of size \$N-1\$, by re-indexing via \$n = g^q \mod N\$ and \$k = g^{-p} \mod N\$, where \$g\$ is a primitive root modulo \$N\$ (so the nonzero indices form a
cyclic group) and \$q, p = 0 \dots N-2\$, i.e.
$$\begin{aligned}
X_0 &= \sum_{n=0}^{N-1} x_n \\
X_{g^{-p}} &= x_0 + \sum_{q=0}^{N-2} x_{g^q} e^{-2 \pi i \frac{g^{q-p}}{N}}, \quad p = 0 \dots N-2 \\
\end{aligned}$$
The second sum, taken over all \$p\$, is a cyclic convolution of length \$N-1\$ of the re-indexed samples with the re-indexed exponentials.
Since \$N-1\$ is composite (for prime \$N > 3\$ it is even), that length-\$(N-1)\$ cyclic convolution can be evaluated efficiently with Cooley-Tukey FFT, prime-factor FFT, and so on. In practice, because it is a cyclic convolution, we can also zero-pad the inner transforms to a convenient power-of-two length (at least \$2(N-1)-1\$, with the exponential kernel extended periodically), which is trivially calculated via Cooley-Tukey in \$O(N \log N)\$ time. Note that this is
not the same as simply padding the original \$x_n\$ with zeroes to a power-of-two \$N\$, because that would change the frequency binning, and therefore the results. Here, by padding the inner transforms, we are exploiting the structure of the cyclic group generated by \$g\$, and the convolution theorem; we get exactly the same algebraic answer as the naïve DFT. This is crucial to understand, even if one is not well versed in group theory or the convolution theorem.
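To make this concrete, here is a rough Python/NumPy sketch of Rader's algorithm (my own toy illustration, not FFTW's implementation; the brute-force primitive-root search and the plain length-\$(N-1)\$ inner FFT are simply the easiest choices that work):

```python
import numpy as np

def primitive_root(N):
    """Brute-force search for a generator of the multiplicative group mod prime N."""
    for g in range(2, N):
        if len({pow(g, q, N) for q in range(N - 1)}) == N - 1:
            return g
    raise ValueError("no primitive root found (is N prime?)")

def rader_dft(x):
    """DFT of prime length N via a length-(N-1) cyclic convolution (Rader)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    g = primitive_root(N)
    g_inv = pow(g, N - 2, N)                              # g^{-1} mod N (Fermat)
    perm     = [pow(g,     q, N) for q in range(N - 1)]   # n = g^q    mod N
    perm_inv = [pow(g_inv, p, N) for p in range(N - 1)]   # k = g^{-p} mod N

    a = x[perm]                                           # a_q = x_{g^q}
    b = np.exp(-2j * np.pi * np.array(perm_inv) / N)      # b_q = e^{-2 pi i g^{-q}/N}

    # Length-(N-1) cyclic convolution via the convolution theorem; in practice
    # this can also be done with FFTs zero-padded to a power of two >= 2(N-1)-1
    # (periodically extending b), with identical results.
    conv = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))

    X = np.empty(N, dtype=complex)
    X[0] = x.sum()                                        # X_0 = sum of all samples
    X[perm_inv] = x[0] + conv                             # X_{g^{-p}} = x_0 + (a * b)[p]
    return X

x = np.random.randn(11) + 1j * np.random.randn(11)        # prime N = 11
assert np.allclose(rader_dft(x), np.fft.fft(x))
```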
Bluestein's FFT algorithm is based on describing the DFT as a convolution of two sequences \$a_n = x_n e^{-\pi i n^2 / N}\$ and \$b_n = e^{\pi i n^2 / N}\$ (using the identity \$n k = \frac{n^2 + k^2 - (k - n)^2}{2}\$),
$$X_k = \overline{b}_k \left( \sum_{n=0}^{N-1} a_n b_{k-n} \right), \quad k = 0 \dots N-1$$
where the convolution can be evaluated with zero-padded DFTs of a suitable power-of-two length \$M = 2^m \ge 2N - 1\$. Again, padding these DFTs with zeroes is
not the same as simply padding the original \$x_n\$; it is because of the convolution involved that we obtain the exact desired transform from DFTs of a larger, power-of-two size.
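The same thing as a rough Python/NumPy sketch (again my own toy illustration; the padded length \$M\$ and the chirp sequences are exactly the ones defined above):

```python
import numpy as np

def bluestein_dft(x):
    """DFT of arbitrary (e.g. prime) length N via chirp convolution (Bluestein)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    b = np.exp(1j * np.pi * n**2 / N)        # chirp: b_n = e^{+pi i n^2 / N}
    a = x * np.conj(b)                       # a_n = x_n e^{-pi i n^2 / N}

    # Smallest power of two M >= 2N - 1 for the zero-padded convolution.
    M = 1 << (2 * N - 2).bit_length()
    A = np.zeros(M, dtype=complex)
    B = np.zeros(M, dtype=complex)
    A[:N] = a
    B[:N] = b
    B[-(N - 1):] = b[1:][::-1]               # b is symmetric: b_{-m} = b_m

    conv = np.fft.ifft(np.fft.fft(A) * np.fft.fft(B))[:N]
    return np.conj(b) * conv                 # X_k = conj(b_k) sum_n a_n b_{k-n}

x = np.random.randn(17) + 1j * np.random.randn(17)        # prime N = 17
assert np.allclose(bluestein_dft(x), np.fft.fft(x))
```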
What you describe is the behavior of Dave Hale's FFT code.
Dave Hale is extremely well versed in the convolution theorem (see e.g.
https://doi.org/10.1190/1.1827164), which is not surprising: geophysics folks have dealt with FFTs more than almost anyone else, and even invented Auto-Tune! Although I haven't seen his FFT code, I'd bet your favourite beverage that he used either Rader's or Bluestein's algorithm before the FFTW folks included an implementation.
The convolution theorem is the key here. Padding the sample sequence with zeroes to get an easier DFT size does change the frequency binning, but you can circumvent that by using convolution and two (or more) DFTs of such extended sizes.
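If the distinction sounds abstract, here is a tiny NumPy demonstration (signal and lengths picked arbitrarily): a cosine that lands exactly on bin 3 of a length-13 DFT has two clean spectral lines, whereas the same samples zero-padded to length 16 smear across many bins, because the bin frequencies are now \$k/16\$ instead of \$k/13\$.

```python
import numpy as np

N = 13                                     # prime length
n = np.arange(N)
x = np.cos(2 * np.pi * 3 * n / N)          # exactly bin k = 3 of a length-13 DFT

exact  = np.fft.fft(x)                     # true length-13 DFT: peaks at k = 3 and 10
padded = np.fft.fft(x, 16)                 # DFT of x zero-padded to length 16

print(np.sum(np.abs(exact)  > 1e-9))       # 2 non-zero bins
print(np.sum(np.abs(padded) > 1e-9))       # many non-zero bins: different binning
```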
I spent most of my career in the oil industry fixing errors in complex numerical codes for which an FFT was just one of many operations and there were often many FFTs, as in millions of 3D FFTs in one code I worked on. In that environment you *really* care about every nuance of the DFT and fast algorithms for computing it.
Many, many people use tools effectively without understanding them fully. Correct understanding is not required for one to be effective.
I have used the Fourier transform, DFTs, DTFTs, and DCTs in multiple different fields myself, so I have had to learn several different facets of the transform.
I first encountered discrete cosine transforms with JPEG image compression, MPEG video compression, and MP3 audio compression. (These achieve compression primarily by quantizing the DCT coefficients and dropping the least important ones, depending on the coefficients themselves; MPEG video only applies that directly to I-frames, with P- and B-frames also referring to preceding, or preceding and following, frames via motion vectors for small image blocks.)
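As a toy illustration of that quantization idea (my own sketch using SciPy's DCT; the smooth 8×8 block and the single quantization step are made up for illustration, whereas real JPEG uses a full 8×8 quantization table plus zig-zag ordering and entropy coding on top):

```python
import numpy as np
from scipy.fft import dctn, idctn

# A smooth 8x8 block (a gradient) stands in for a patch of a real image.
i, j = np.meshgrid(np.arange(8), np.arange(8), indexing='ij')
block = 100.0 + 10.0 * i + 5.0 * j

coeffs = dctn(block, norm='ortho')         # 2D DCT-II of the block
step = 40.0                                # one crude quantization step
quantized = np.round(coeffs / step)        # most small coefficients become zero

print(np.count_nonzero(quantized), "of 64 coefficients survive")   # only a handful

reconstructed = idctn(quantized * step, norm='ortho')   # lossy reconstruction
print("max pixel error:", np.abs(block - reconstructed).max())
```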
I did some experimental work with low-energy X-ray image processing (when low-energy solid-state X-ray sensors became commercially viable in the '90s), and although I mostly did 2D kernel stuff, I spent quite some time understanding 3D tomography, including the practical limitations associated with the Fourier slice theorem.
My own field is computational molecular dynamics, specifically parallel and distributed simulation of large classical potential models ("force fields" in chemistry). While calculus and linear algebra are much more common tools here, Fourier transforms do pop up here and there; say, when you have successfully simulated a nanocluster or nanodot and want to know its phonon spectra. Ab initio (QM) simulations, however, are all about wave functions, and about numerically calculating (using e.g. density functional theory) the wave function solutions to the Schrödinger equation. There, a much deeper understanding of the different domains (spacetime and frequency) is needed, with the Fourier transform being just the introduction to the field.
For a brief period my job was tuning a 3D FFT algorithm on Intel i386 & i860 Hypercubes.
So? I've met highly paid "MPI experts" who claimed that using asynchronous I/O is inherently dangerous, just because they couldn't wrap their minds around it, and who could talk enough bullshit to silence anyone who tried to explain that doing I/O asynchronously would let simulations complete sooner, because communication and computation could then overlap instead of running sequentially.
Having a job fine-tuning a tool really has nothing to do with a fundamental understanding of said tool. I've seen "expert programmers employed by the best game developer companies" fine-tune a bit of code, insisting it is the best anyone can do, even when shown that a different algorithm can do it 10× faster, just because they couldn't grok the different algorithm.
I am by no means implying that you are bad at your job. I'm sure you did good work, and were/are worth every cent. But your contribution to this thread has had serious errors, especially the claim that the FFT does not work for prime sizes. It does, as shown above; these algorithms scale just as well as Cooley-Tukey does. Yes, depending on the algorithm used, the prime sizes tend to be a bit slower (by a constant factor) than, say, power-of-two sizes, but they still qualify as "fast Fourier transforms" because they scale equally well, \$O(N \log N)\$. In the literature, this time complexity is essentially the definition of an FFT.
Current versions of FFTW plan each transform (and can save the plans as "wisdom") to determine how to most efficiently calculate an FFT (or other transform) of a given size \$N\$. It has several ways of doing so, the slowest and most thorough simply brute-forcing the various methods and picking the fastest one on the current machine. The key point is that
they all produce exactly the same results, up to the inherent rounding error of limited-precision floating-point formats. This alone should tell you (generic, plural) that Cooley-Tukey-based FFTs are not the only ones, even though they tend to be the fastest currently known and are therefore somewhat preferable in high-performance computing. In particular, radix-2 (power-of-two) FFTs can be implemented very efficiently with the SIMD extensions of x86-64 (SSE, AVX, and so on).
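To drive the "identical results" point home, here is a quick check reusing the sketch functions defined earlier in this post (naive_dft, rader_dft, bluestein_dft; run those snippets first, the prime size 101 is arbitrary):

```python
import numpy as np

N = 101                                            # prime size
x = np.random.randn(N) + 1j * np.random.randn(N)

results = {
    "naive O(N^2) sum": naive_dft(x),
    "Rader sketch":     rader_dft(x),
    "Bluestein sketch": bluestein_dft(x),
    "library FFT":      np.fft.fft(x),
}
ref = results["library FFT"]
for name, X in results.items():
    # All agree to rounding error (differences on the order of 1e-13).
    print(name, np.max(np.abs(X - ref)))
```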
I hope this opens your eyes, or at least angers you enough that you'll check your understanding. What I find despicable is misleading others based on your own
belief, even when shown proof that you could verify for yourself, but won't, because it would hurt your ego. Despicable.