Apparently one solution to this is to window the data.
To be technically correct, you are always "windowing" data, even when you do nothing. Simply grabbing a block of data means that you are effectively multiplying by a "rectangular" window. There is no such thing as a block of data with "no window".
Every window has its own DFT (rectangular, Hanning, etc.). Since every sampled block of data in the time domain is effectively multiplied by some window, the spectral leakage you see comes from the *convolution* of the window's DFT with the true signal's DFT in the frequency domain.
In the case of the rectangular window, its spectrum is a sinc function (strictly, for sampled data, a periodic sinc, also known as the Dirichlet kernel).
All windows have this effect. To see it, take the DFT of your window and you will get its profile. If you apply that window to, say, a sine wave, the DFT will show the window's spectrum *shifted* to the bin of your sine wave's frequency. (Convolution with a delta function [the sine wave] in the frequency domain can be shown mathematically to be a shift.)
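As a rough illustration of that shift, here is a minimal sketch in plain Python (standard library only). It uses a naive DFT, a periodic Hanning window and a complex exponential sitting exactly on a bin, so the shift comes out exact. The names `dft`, `N` and `K0` are just choices for this example.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) DFT; fine for the small block sizes used here."""
    size = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / size) for n in range(size))
        for k in range(size)
    ]


N = 32   # block length (arbitrary choice for the example)
K0 = 5   # the tone sits exactly on bin 5 (arbitrary choice)

# Periodic Hanning window and a complex tone at bin K0.
window = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
tone = [cmath.exp(2j * math.pi * K0 * n / N) for n in range(N)]

window_spectrum = dft(window)
windowed_tone_spectrum = dft([w * t for w, t in zip(window, tone)])

# The window's spectrum, circularly shifted so that its bin 0 lands on bin K0.
shifted_window_spectrum = [window_spectrum[(k - K0) % N] for k in range(N)]

max_error = max(
    abs(a - b) for a, b in zip(windowed_tone_spectrum, shifted_window_spectrum)
)
print("largest difference between the two spectra:", max_error)
```

With a real sine wave instead of a complex exponential you get two shifted copies of the window's spectrum, one at bin `K0` and a mirrored one at bin `N - K0`.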
For a complicated signal, you end up with a spectrum that is the convolution of the spectrum of the window and the spectrum of the *real* signal. The assumption is that the signal is periodic beyond the boundaries of the window.
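The convolution relationship can be checked numerically. The sketch below, again in plain Python, compares the DFT of a windowed block with the circular convolution of the block's DFT and the window's DFT (scaled by 1/N, which is how the DFT version of the theorem works out). The test signal and the helper names `dft` and `circular_convolve` are made up for this example.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) DFT; fine for the small block sizes used here."""
    size = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / size) for n in range(size))
        for k in range(size)
    ]


def circular_convolve(a, b):
    """Circular convolution of two equal-length sequences."""
    size = len(a)
    return [
        sum(a[m] * b[(k - m) % size] for m in range(size))
        for k in range(size)
    ]


N = 16
# Made-up test signal: one tone between bins (2.3) plus one tone on bin 5.
signal = [
    math.sin(2 * math.pi * 2.3 * n / N) + 0.5 * math.cos(2 * math.pi * 5 * n / N)
    for n in range(N)
]
window = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

# Left side: window in the time domain, then transform.
spectrum_of_product = dft([s * w for s, w in zip(signal, window)])

# Right side: transform each, then convolve the spectra and scale by 1/N.
convolved_spectra = [
    value / N for value in circular_convolve(dft(signal), dft(window))
]

max_error = max(abs(a - b) for a, b in zip(spectrum_of_product, convolved_spectra))
print("largest difference between the two sides:", max_error)
```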
I'd go a different route. IMHO the FFT isn't very useful in tone decoding / rhythm detection algorithms. Tones may fall between FFT bins, so the level gets lost, and you are creating a lot of results you don't need. I'd start with band filtering and go from there. A very effective way to detect frequencies is a frequency counter. This gives you frequency/level detectors with infinitely sharp band filtering and low latency. Band filtering can be done with (elliptic) IIR filters, which don't need much computational resources.
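One possible reading of "frequency counter" is a zero-crossing counter, sketched below in plain Python. This is only an illustration under that assumption: it expects the input to be already band-filtered down to a single tone (the filtering step is not shown), and `estimate_frequency` and the 440 Hz test tone are names and values invented for the example.

```python
import math


def estimate_frequency(samples, sample_rate):
    """Estimate frequency from rising zero crossings.

    Returns None when fewer than two rising crossings are found.
    """
    crossings = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev < 0.0 <= cur:
            # Linearly interpolate the crossing position between the two samples.
            fraction = -prev / (cur - prev)
            crossings.append(i - 1 + fraction)
    if len(crossings) < 2:
        return None
    periods = len(crossings) - 1
    span_in_samples = crossings[-1] - crossings[0]
    return periods * sample_rate / span_in_samples


SAMPLE_RATE = 8000
# Made-up test tone: 440 Hz, 0.1 s long, with an arbitrary phase offset.
tone = [
    math.sin(2 * math.pi * 440 * n / SAMPLE_RATE + 0.3)
    for n in range(800)
]
estimate = estimate_frequency(tone, SAMPLE_RATE)
print("estimated frequency in Hz:", estimate)
```

Note that the estimate is not tied to any bin spacing, so a tone "between bins" is not a problem here. On noisy input you would likely want some hysteresis around zero, which this sketch leaves out.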
If single tones are what you want, then you need Goertzel's Algorithm:
https://en.wikipedia.org/wiki/Goertzel_algorithm
It is the most compact way to approach the problem. You can pull out the magnitude and phase at the target frequency as needed. (It is essentially a complex-valued IIR filter.)
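For reference, a minimal sketch of the Goertzel recursion in plain Python. The function name `goertzel`, its parameters and the test tone are choices made for this example; it returns one complex DFT-style coefficient, from which amplitude and phase follow.

```python
import cmath
import math


def goertzel(samples, sample_rate, target_freq):
    """Return the complex DFT-style coefficient of `samples` at `target_freq`."""
    omega = 2 * math.pi * target_freq / sample_rate
    coeff = 2 * math.cos(omega)
    s1 = 0.0  # s[n-1]
    s2 = 0.0  # s[n-2]
    for x in samples:
        # Second-order real recursion: s[n] = x[n] + 2*cos(omega)*s[n-1] - s[n-2]
        s0 = x + coeff * s1 - s2
        s2 = s1
        s1 = s0
    # Complex output of the filter after the last sample ...
    y = s1 - cmath.exp(-1j * omega) * s2
    # ... rotated back so the phase is referenced to the first sample.
    return y * cmath.exp(-1j * omega * (len(samples) - 1))


SAMPLE_RATE = 8000
BLOCK = 200  # 440 Hz falls exactly on bin 11 for this block length
# Made-up test tone: amplitude 0.7, phase 0.4 rad.
tone = [
    0.7 * math.cos(2 * math.pi * 440 * n / SAMPLE_RATE + 0.4)
    for n in range(BLOCK)
]

coefficient = goertzel(tone, SAMPLE_RATE, 440)
amplitude = 2 * abs(coefficient) / BLOCK
phase = cmath.phase(coefficient)
print("amplitude:", amplitude, "phase in rad:", phase)
```

The block length here was picked so the tone lands exactly on a bin. For a tone that does not, the usual leakage applies and the amplitude and phase read-outs are only approximate.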