FFT window function calculation detail
rhb:
1024 saves 9 multiplies per sample, 1155 saves 3, and a prime length does them all.
If it only takes 1.7x the 1024 time then the CPU and compiler are doing an excellent job of overlapping instructions and optimizing execution order.
If you're doing serious work with the FFT you should calculate the number of FPU operations for multiply-add and add-multiply order.
I'd also recommend getting a copy of Bracewell or at least the pictorial dictionary of transform pairs in the back. With very modest effort and thought you can pretty much sketch the transform of anything on a napkin. Most things can be approximated by the sum of some simple transforms.
NB: The integer values for which the exponent is periodic are N/k, not k/N; the latter has no integer values. So my previous statement is not strictly correct. Conceptually it is, but I stated the condition incorrectly. I haven't done this stuff to any extent for over 10 years. For serious work I have always looked up the equations, but Internet posts don't get quite the same standard of care.
This is not something to slough off. It's a lot of work, but the dividends are large. You've probably done some of these things. My point is a systematic series of experiments is essential to really understanding what is going on and why.
Some test cases you should always run with any new software: a spike input, and all ones. You should be able to transform back and forth multiple times without significant errors developing. Same thing for boxcar, Bartlett and Gaussian windows. This will smoke out errors in indexing, normalization and a host of other pathologies. Calculate the average noise level at each iteration as you transform from time to frequency and back many times. Make sure you understand that a "spike" in a series is *never* a spike in fact. It's a sinc(x) function.
Transform a one followed by 1023 zeros, add 1024 zeros in the middle of the transform array, and transform back. Look closely at the plot: that is a rough approximation of reality. For a better view, use more zeros.
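A minimal numpy sketch of that experiment (illustrative only; the sizes are the ones above, and the plotting call is just to eyeball the result):

--- Code: ---
import numpy as np
import matplotlib.pyplot as plt

N = 1024
x = np.zeros(N)
x[0] = 1.0                      # "spike" input: a one followed by 1023 zeros

X = np.fft.fft(x)               # forward transform

# Insert zeros in the middle of the transform array (around the Nyquist bin);
# transforming back then gives an interpolated view of the time series.
pad = 1024
X_padded = np.concatenate((X[:N // 2], np.zeros(pad, dtype=complex), X[N // 2:]))

y = np.fft.ifft(X_padded) * (N + pad) / N   # rescale for the longer inverse transform

plt.plot(np.real(y))            # look closely: the "spike" is really a sinc
plt.show()
--- End code ---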
Nominal Animal:
--- Quote from: rhb on December 11, 2022, 10:29:35 pm ---Please explain how the FFT works for prime length series. Where is the computational savings?
--- End quote ---
First of all, you implied you're familiar with FFTW, but apparently haven't even read their invited IEEE paper from 2005. If you had, you'd know FFTW implements three prime length algorithms.
Let me describe the theoretical background in a way that everyone can follow.
The mathematical basis for the discrete Fourier transform, equivalently the discrete time Fourier transform of a periodic signal with period \$N\$, is
$$X_k = \sum_{n=0}^{N-1} x_n e^{-2 \pi i \frac{n k}{N}}, \quad k = 0, \dots N-1$$
where \$x_n\$ are the samples in the time domain, and \$X_k\$ are the samples in the frequency domain. The similarities to both Fourier series (sum) and the Fourier transform (integral) are intentional and not an accident, but to understand exactly why, we'd need to delve deeper into the math, including the sampling theorem.
Exploiting the fact that there are many identical coefficients \$\frac{n k}{N}\$ has nothing to do with the fast discrete Fourier transform; it just means that such coefficients can be precalculated and stored efficiently, if one were to evaluate the sums naïvely. (If you had, say, an FPGA or ASIC with many parallel multiply-add units, you could use the sum directly for a fixed-\$N\$ DFT with minimal latency.)
If you consider the amount of computation for all \$X_k\$, you'll see that it involves \$(N-1)^2\$ multiplications and somewhat fewer additions. This gives it a time complexity of \$O(N^2)\$. Time complexity is not about real-world speed per se for any given \$N\$; it is about how the algorithm scales as \$N\$ increases. If you implement the naïve algorithm, you'll have on the order of \$N^2\$ multiplications.
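For reference, here is a direct, naïve transcription of the sum above as a numpy sketch (this is the \$O(N^2)\$ evaluation, not an FFT):

--- Code: ---
import numpy as np

def naive_dft(x):
    """Direct evaluation of the DFT definition: O(N^2) multiplications."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    X = np.empty(N, dtype=complex)
    for k in range(N):
        X[k] = np.sum(x * np.exp(-2j * np.pi * k * n / N))
    return X
--- End code ---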
What we call Fast Fourier Transform is a class of algorithms that can reach time complexity \$O(N \log N)\$.
Cooley-Tukey FFT works for composite \$N\$, where \$N\$ is a product of small prime factors. At the core, it allows you to calculate the DFT of \$N = N_1 N_2\$ samples by calculating \$N_1\$ DFTs of length \$N_2\$, and \$N_2\$ DFTs of length \$N_1\$. When either one is composite, you can do that DFT recursively, giving you the \$O(N \log N)\$ time complexity.
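As a minimal illustration of that recursion for the simplest case, radix-2 (i.e. \$N_1 = 2\$, power-of-two \$N\$), a numpy sketch:

--- Code: ---
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return np.asarray(x, dtype=complex)
    even = fft_radix2(x[0::2])       # DFT of the even-indexed samples
    odd  = fft_radix2(x[1::2])       # DFT of the odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate((even + twiddle * odd,
                           even - twiddle * odd))
--- End code ---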
There are two widely used algorithms for prime \$N\$, Rader's and Bluestein's FFT algorithms, both of which reach \$O(N \log N)\$ time complexity. Yes, they tend to be slower than Cooley-Tukey, but they scale equally well; in other words, while they tend to be slower than Cooley-Tukey -based approaches, the ratio tends to stay constant as \$N\$ increases. (There is also the prime-factor FFT (Good–Thomas algorithm), which works for \$N = N_1 N_2\$ where \$N_1\$ and \$N_2\$ are relatively prime (their greatest common divisor is 1); it is unrelated to Cooley–Tukey, but not relevant here for prime \$N\$.)
Both Rader's and Bluestein's FFT algorithms are based on convolution. It is important to realize that this is a fundamental property of the Fourier transform, and not an approximation of some sort.
Rader's algorithm exploits the fact that for prime \$N\$ we can express the DFT of size \$N\$ using only a DFT of size \$N-1\$, via \$n = g^q \mod N\$ and \$k = g^{-p} \mod N\$ for a primitive root \$g\$ and \$q, p = 0 \dots N-2\$, the nonzero indices forming a cyclic group, i.e.
$$\begin{aligned}
X_0 &= \sum_{n=0}^{N-1} x_n \\
X_{g^{-p}} &= x_0 + \sum_{q=0}^{N-2} x_{g^q} e^{-2 \pi i \frac{g^{q-p}}{N}}, \quad p = 0 \dots N-2
\end{aligned}$$
Since \$N-1\$ is composite, the \$N-1\$ DFT can be efficiently calculated using Cooley-Tukey FFT, prime factor FFT, and so on. In practice, however, because this is a cyclic group, we can instead pad the inner DFT with zeroes to a power of two length, which is trivially calculated (via Cooley-Tukey) in \$O(N \log N)\$ time. Note that this is not the same as simply padding the original \$x_n\$ with zeroes to a power of two \$N\$, because that would affect the frequency binning, and therefore the results. Here, by padding the inner DFT, we are exploiting the properties of the cyclic group generated by integer \$g\$, and the convolution theorem; and we get the exact same algebraic answers as we would if we did the Fourier transform naïvely. This is crucial to understand, even if one is not well versed in algebraic group theory or the convolution theorem.
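For illustration, a minimal numpy sketch of Rader's algorithm (the helper names are mine; for simplicity the length \$N-1\$ cyclic convolution is done with numpy's own FFT rather than zero-padded to a power of two, but the structure is the same):

--- Code: ---
import numpy as np

def primitive_root(N):
    """Smallest primitive root modulo an odd prime N."""
    for g in range(2, N):
        if len({pow(g, q, N) for q in range(N - 1)}) == N - 1:
            return g
    raise ValueError("no primitive root found; is N an odd prime?")

def rader_dft(x):
    """Rader's algorithm: DFT of odd prime length N via a cyclic
    convolution of length N-1 (computed here with numpy's FFT)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    g = primitive_root(N)
    g_inv = pow(g, N - 2, N)                              # modular inverse of g

    q = np.arange(N - 1)
    perm     = np.array([pow(g,     int(m), N) for m in q])   # n = g^q  mod N
    perm_inv = np.array([pow(g_inv, int(m), N) for m in q])   # k = g^-p mod N

    a = x[perm]                                           # a_q = x_{g^q}
    b = np.exp(-2j * np.pi * perm_inv / N)                # b_m = exp(-2*pi*i*g^-m / N)

    # Cyclic convolution of a and b via the convolution theorem.
    conv = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))

    X = np.empty(N, dtype=complex)
    X[0] = x.sum()
    X[perm_inv] = x[0] + conv                             # X_{g^-p} = x_0 + (a*b)_p
    return X
--- End code ---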
Bluestein's FFT algorithm is based on re-expressing the DFT as a convolution of two sequences, \$a_n = x_n e^{-\pi i n^2 / N}\$ and \$b_n = e^{\pi i n^2 / N}\$:
$$X_k = \overline{b}_k \left( \sum_{n=0}^{N-1} a_n b_{k-n} \right), \quad k = 0 \dots N-1$$
where the convolution can be calculated using zero-padded DFTs of a suitable power-of-two length \$M = 2^m \ge 2N - 1\$. Again, padding these DFTs with zeroes is not the same as simply padding the original \$x_n\$; it is because of the convolution involved that we obtain the exact desired transform using DFTs of a larger, power-of-two size.
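A minimal numpy sketch of Bluestein's idea (illustrative only, not FFTW's implementation; the wrapped chirp storage is the standard construction for turning the linear convolution into a circular one of power-of-two length \$M\$):

--- Code: ---
import numpy as np

def bluestein_dft(x):
    """Bluestein's (chirp-z) algorithm: DFT of arbitrary length N >= 2 via a
    convolution evaluated with power-of-two-length FFTs."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)

    chirp = np.exp(1j * np.pi * n * n / N)         # b_n = exp(+pi*i*n^2/N)
    a = x * np.conj(chirp)                         # a_n = x_n * conj(b_n)

    M = 1 << (2 * N - 1).bit_length()              # power of two >= 2N-1

    # b must cover indices -(N-1) .. N-1; store them wrapped into length M,
    # using the symmetry b_{-n} = b_n.
    b = np.zeros(M, dtype=complex)
    b[:N] = chirp
    if N > 1:
        b[-(N - 1):] = chirp[1:][::-1]

    conv = np.fft.ifft(np.fft.fft(a, M) * np.fft.fft(b))
    return np.conj(chirp) * conv[:N]               # X_k = conj(b_k) * (a*b)_k
--- End code ---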
--- Quote from: rhb on December 11, 2022, 10:29:35 pm ---What you describe is the behavior of Dave Hale's FFT code.
--- End quote ---
Dave Hale is extremely well versed in the convolution theorem (see e.g. https://doi.org/10.1190/1.1827164) –– which is not surprising; these geology folks have dealt with the FFT more than anyone else, and even invented AutoTune! –– and although I haven't seen his FFT code, I'd bet your favourite beverage that he used either Rader's or Bluestein's before the FFTW folks included an implementation.
The convolution theorem is the key here. While padding the sample sequence with zeroes to get an easier DFT size does change the frequency binning, you can circumvent that by using convolution and two (or more) DFTs of such extended sizes \$N\$.
--- Quote from: rhb on December 11, 2022, 10:29:35 pm ---I spent most of my career in the oil industry fixing errors in complex numerical codes for which an FFT was just one of many operations and there were often many FFTs, as in millions of 3D FFTs in one code I worked on. In that environment you *really* care about every nuance of the DFT and fast algorithms for computing it.
--- End quote ---
Many, many people use tools effectively without understanding them fully. Correct understanding is not required for one to be effective.
I have used the Fourier transform, DFTs, DTFTs, and DCTs in multiple different fields myself, so I have had to learn several different facets of the transform.
I first encountered discrete cosine transforms with JPEG image compression, MPEG video compression, and MP3 audio compression. (They achieve compression primarily by quantizing the DCT coefficients and dropping the least important ones, depending on the coefficient values themselves; MPEG only uses that for I-frames, with P- and B-frames also using the preceding, or the preceding and following, frames via motion vectors for small image blocks.)
I did some experimental work with low-energy X-ray image processing (when low-energy solid-state X-ray sensors became commercially viable in the '90s), and although I mostly did 2D kernel stuff, I spent quite some time understanding 3D tomography, including the practical limitations associated with the Fourier slice theorem.
My own field is computational molecular dynamics, specifically parallel and distributed simulation of large classical potential models ("force fields" in chemistry). While calculus and linear algebra are much more common tools here, Fourier transforms do pop up here and there; say, when you have successfully simulated a nanocluster or nanodot and want to know its phonon spectrum. Ab initio (QM) simulations, however, are all about wave functions, and numerically calculating (using e.g. density functional theory) the wave function solutions to the Schrödinger equation. A much deeper understanding of the different domains (spacetime and frequency) is needed there, with the Fourier transform just the introduction to the field.
--- Quote from: rhb on December 11, 2022, 10:29:35 pm ---For a brief period my job was tuning a 3D FFT algorithm on Intel i386 & i860 Hypercubes.
--- End quote ---
So? I've met highly paid "MPI experts" who claimed that using asynchronous I/O is inherently dangerous, just because they couldn't wrap their mind around it, and could talk enough bullshit to silence anyone who tried to explain that doing I/O asynchronously would allow simulations to complete sooner, because that way communication and computation could be done simultaneously instead of sequentially.
Having a job fine-tuning a tool has really nothing to do with fundamental understanding of said tool. I've seen "expert programmers employed by the best game developer companies" fine-tune a bit of code, insisting it is the best anyone can do, even when shown that a different algorithm can do it 10× faster, just because they couldn't grok the different algorithm.
I am in no way implying that you are bad at your job. I'm sure you did good work, and were/are worth every cent. But your contribution to this thread has had serious errors, especially the claim that FFT does not work for prime sizes. It does, as shown above; these algorithms scale just as well as Cooley-Tukey does. Depending on the algorithm used, yes, the prime-size transforms tend to be a bit slower –– by a constant factor –– than say power-of-two sizes, but they still qualify as "fast Fourier transform" because they scale equally well, \$O(N \log N)\$. At least in the literature, this time complexity is the definition of FFT.
Current versions of FFTW use "wisdom" to determine how to most efficiently calculate an FFT (or other transform) of a given size \$N\$. It has several ways of doing so, with the slowest and most thorough simply brute-forcing the various methods and picking the fastest one for the current machine. The key is, they all produce the exact same results (except for inherent rounding errors, because we're using limited-precision floating-point number formats). This alone should tell you (generic, plural) that Cooley-Tukey-based FFTs are not the only ones, even though they are the fastest ones currently known, and therefore somewhat preferable in high-performance computing. In particular, radix-2 (power-of-two) FFTs can be efficiently implemented on SIMD architectures like x86-64 (with SSE, AVX, etc.).
I hope this opens your eyes, or at least angers you enough so you'll check your understanding. What I find despicable is misleading others based on your own belief, even when shown proof that you could verify for yourself but won't, because it would hurt your ego. Despicable.
Nominal Animal:
--- Quote from: electr_peter on December 11, 2022, 11:01:48 pm ---What is the point of having zeros in the window filter definition in the first place?
--- End quote ---
The point is to have either zero zeros, or exactly one zero at one end of the sequence. This lets you choose whether the window function's peak (value 1.0) is an odd or an even number of samples wide. This affects the frequency response of the filter, even though the effect goes to zero as \$N\$ goes to infinity.
I do believe I already wrote that whatever window function you use, you should examine the frequency response of your implementation, instead of relying on the theoretical response given a particular algebraic window function shape. If you do this, you'll find out in practice how large the aforementioned effect is with any given window size. Similarly, experimenting with the actual FFT window size (not window function size), between \$N\$ and \$2N\$, repeating the same contents for the larger one, should tell you a lot about how the generated spectrum depends on the FFT window size (\$N\$).
Theory is one thing, but we're using discrete math at limited precision, so real-world behaviour rules.
(What I am getting at is that the FFT size defines the range and interval between frequency bins, and the window function the frequency response. A slightly larger FFT is often faster to compute, but you still want a specific frequency response, so you can choose a different frequency binning but keep the same frequency response by padding your sample data with zeroes.)
The reason my suggested FFT sizes had two primes and two composites was exactly that "FFT" is not about the speed of the transform at a given size, but how the speed scales as the size increases. The naïve DFT requires on the order of \$N^2\$ operations, but anything that can do it in \$O(N \log N)\$ operations is an FFT.
For a better view, you'd compare say four different prime sizes to four different power-of-two sizes using FFTW (because it implements the aforementioned prime-size algorithms –– apparently, R only uses Cooley-Tukey), after brute-force gathering wisdom for the sizes, and see how they scale.
Say, 1024, 2048, 4096, and 8192; and the roughly corresponding primes 1021, 2053, 4093, and 8191. Because of cache effects and such there will be jumps, but the ratio between the power-of-two and prime times is bounded; it will approach a limit and stay there. This ratio is the relative speed of the different algorithms, and is the reason one might choose a power-of-two size instead of a more "natural" size. However, any Fourier transform that scales as \$O(N \log N)\$ –– note, ignoring any constant factors! –– is a Fast Fourier Transform by definition.
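A rough sketch of such a scaling comparison (here using scipy.fft, whose backend I believe also handles prime sizes in \$O(N \log N)\$ via Bluestein's algorithm, rather than FFTW with gathered wisdom; the sizes are the ones above):

--- Code: ---
import timeit
import numpy as np
from scipy.fft import fft

# Power-of-two sizes and the roughly corresponding primes.
pairs = [(1024, 1021), (2048, 2053), (4096, 4093), (8192, 8191)]

rng = np.random.default_rng(0)
for n_pow2, n_prime in pairs:
    x2 = rng.standard_normal(n_pow2) + 1j * rng.standard_normal(n_pow2)
    xp = rng.standard_normal(n_prime) + 1j * rng.standard_normal(n_prime)
    t2 = min(timeit.repeat(lambda: fft(x2), number=200, repeat=5))
    tp = min(timeit.repeat(lambda: fft(xp), number=200, repeat=5))
    print(f"N={n_pow2:5d}: {t2:.4f} s   N={n_prime:5d}: {tp:.4f} s   ratio {tp/t2:.2f}")
--- End code ---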
gf:
--- Quote from: Nominal Animal on December 12, 2022, 09:45:26 am ---I do believe I already wrote that whatever window function you use, you should examine the frequency response
--- End quote ---
And in order to do that, it is a good idea to pad the window function with a large number of zeros, so that the frequency response can be plotted at high resolution. Too coarse a sampling may give a wrong impression of the actual filter shape (for instance, if the frequency response of a Hann window happens to be sampled only at its peak and at its zeros, then you won't see the presence, height and shape of the side lobes).
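A minimal numpy sketch of that (the window length and padding factor are arbitrary illustrative choices; without the heavy padding you would sample the response only near the bin centres and miss the side lobes):

--- Code: ---
import numpy as np
import matplotlib.pyplot as plt

N = 64                                    # window length
pad = 64 * N                              # heavy zero padding for a smooth plot

w = np.hanning(N)                         # Hann window (numpy's convention)
W = np.fft.rfft(w, pad)                   # zero-padded transform of the window

freq = np.arange(len(W)) / pad            # frequency in cycles per sample
response_db = 20 * np.log10(np.abs(W) / np.abs(W[0]) + 1e-12)   # offset avoids log(0)

plt.plot(freq, response_db)
plt.ylim(-100, 5)
plt.xlabel("frequency (cycles/sample)")
plt.ylabel("response (dB)")
plt.show()
--- End code ---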
Nominal Animal:
--- Quote from: gf on December 12, 2022, 01:26:13 pm ---
--- Quote from: Nominal Animal on December 12, 2022, 09:45:26 am ---I do believe I already wrote that whatever window function you use, you should examine the frequency response
--- End quote ---
And in order to do that, it is a good idea to pad the window function with a large number of zeros, so that the frequency response can be plotted at high resolution. Too coarse a sampling may give a wrong impression of the actual filter shape (for instance, if the frequency response of a Hann window happens to be sampled only at its peak and at its zeros, then you won't see the presence, height and shape of the side lobes).
--- End quote ---
Exactly, to get high-resolution frequency bins. Like I added in parens, the size of the DFT determines the range and interval of the frequency bins, and the window function its frequency response (including phase effects). Because of how the frequency bins are distributed, the most useful DFT sizes are multiples of the actual window size you intend to use.
I guess I should amend my assertion, then, that the zeros are about having zero or exactly one zero at either end of the window to get the desired properties of the frequency response, because there are two completely separate things one may do with zeros:
* Give the desired frequency response via the window function having an odd or even number of nonzero coefficients (odd or even number of maximum coefficients)
* Pad the DFT window to a size that is easy to compute and gives nice high-resolution frequency bins
Edited: Removed the continuous Fourier transform, because I found some errors I'll correct in a later post.