I guess the same question applies to the Goertzel detector as well? Since Goertzel has a very narrow bandwidth, its impulse response will be very long too (it is actually a resonator, since its pole lies on the unit circle), so its output will rise slowly until it reaches steady state. So far I haven't seen any information saying that a Goertzel detector cannot be used to detect signals shorter than its impulse response or rise time. For shorter signals, though, the Goertzel output power will be less than the maximum available after steady state. Please correct me if I am wrong here.
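To check my own claim, here is a quick pure-Python sketch (block size N = 4096 and bin K = 128 are arbitrary choices of mine, not anything from the measurement setup): run the Goertzel recursion over a zero-padded tone burst and see how the detected power depends on the burst length.

```python
import math

def goertzel_power(x, k, n):
    """Squared magnitude of DFT bin k of an n-point block, via the Goertzel recursion."""
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s1 = s2 = 0.0
    for v in x:
        s1, s2 = v + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

N, K = 4096, 128                       # block size and bin under test (arbitrary)
tone = [math.sin(2.0 * math.pi * K * i / N) for i in range(N)]

def burst_power(length):
    """Tone burst of `length` samples, zero-padded to the full block."""
    return goertzel_power(tone[:length] + [0.0] * (N - length), K, N)

p_quarter, p_half, p_full = (burst_power(n) for n in (N // 4, N // 2, N))
```

For an on-bin tone the detected power grows roughly as the square of the burst length, i.e. shorter signals are detectable, they just read lower - which matches the suspicion above.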
The idea is to use a bandpass IIR filter to process the sampled signal buffer (4096 samples) and compute the RMS (energy) of the filter output over all processed samples: a kind of RMS detector with a filter applied to its input signal. At least this seems to work in practice, but I do not know whether it works in theory.
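As a sanity check of that scheme, a small sketch (the RBJ-cookbook bandpass biquad and all parameters here are my own illustrative choices): an in-band tone keeps nearly its full RMS through the filter, an out-of-band tone is strongly suppressed.

```python
import math

def bandpass(x, f0, q, fs):
    """Constant-peak-gain bandpass biquad (RBJ cookbook), Direct Form I."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    a1, a2 = -2.0 * math.cos(w0), 1.0 - alpha
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for v in x:
        out = (alpha * v - alpha * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1, y2, y1 = x1, v, y1, out
        y.append(out)
    return y

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

N = 4096                                   # buffer size, normalized fs = 1
in_band  = [math.sin(2.0 * math.pi * 0.10 * i) for i in range(N)]   # at f0
out_band = [math.sin(2.0 * math.pi * 0.30 * i) for i in range(N)]   # 3x f0
rms_in  = rms(bandpass(in_band,  f0=0.10, q=30.0, fs=1.0))
rms_out = rms(bandpass(out_band, f0=0.10, q=30.0, fs=1.0))
```

With Q = 30 the ring-up time is on the order of 100 samples, small compared to the 4096-sample buffer, so the transient barely dents the in-band reading here.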
Edit: Changed wording.
Edit 2: I do understand that when the signal is shorter than the IIR filter's impulse response, the filter will not reach steady state. Now, if we change filter parameters (center frequency or bandwidth), the output energies of the two filters will differ due to their different rise times, and it becomes necessary to compute a correction/calibration factor for each filter used. Since we are working in the digital domain, computing these calibration factors is quite trivial, though.
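The calibration idea could look like this (again a sketch with made-up numbers; the burst length and Q values are only chosen so that the narrow filter clearly fails to ring up): measure a reference burst of known RMS through each filter, store the ratio, and apply it to later readings.

```python
import math

def bandpass(x, f0, q, fs):
    """Constant-peak-gain bandpass biquad (RBJ cookbook), Direct Form I."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    a1, a2 = -2.0 * math.cos(w0), 1.0 - alpha
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for v in x:
        out = (alpha * v - alpha * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1, y2, y1 = x1, v, y1, out
        y.append(out)
    return y

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

BURST = 2000                      # burst shorter than the narrow filter's ring-up

def measure(f0, q, amp):
    x = [amp * math.sin(2.0 * math.pi * f0 * i) for i in range(BURST)]
    return rms(bandpass(x, f0, q, fs=1.0))

# Uncalibrated: the narrower filter reads lower on the same short burst
m_wide   = measure(0.10,  20.0, 1.0)
m_narrow = measure(0.10, 200.0, 1.0)

# Calibration factor per filter, from a unit-amplitude reference burst
ref_rms = 1.0 / math.sqrt(2.0)
cal = {q: ref_rms / measure(0.10, q, 1.0) for q in (20.0, 200.0)}

# Corrected readings of a later burst at a different level agree again
readings = [cal[q] * measure(0.10, q, 0.25) for q in (20.0, 200.0)]
```

Since the filters are linear, one reference level per filter is enough; the same factor then holds for any signal amplitude (at that frequency and burst length).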
A regular FIR filter is based on
linear convolution, which is rather meant to be applied to a continuous, infinite stream of samples. If you instead apply linear convolution to a finite number of samples, steady state is reached only after the length of the filter's impulse response, and the leading filtered samples are "garbage". For IIR, basically the same applies, but the impulse response length is actually infinite, so an arbitrary end of the impulse response needs to be defined, at a point where it has returned "close enough" to zero.
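The start-up transient is easy to see with a toy FIR (an 8-tap moving average on a constant input, chosen only for illustration): the first N-1 outputs ramp up while the delay line fills, only afterwards is the output at steady state.

```python
def fir(x, h):
    """Linear convolution truncated to len(x) outputs, zero initial state."""
    return [sum(h[j] * (x[i - j] if i - j >= 0 else 0.0)
                for j in range(len(h)))
            for i in range(len(x))]

h = [1.0 / 8.0] * 8            # 8-tap moving average
y = fir([1.0] * 32, h)         # constant input, so steady state is exactly 1.0
transient, steady = y[:len(h) - 1], y[len(h) - 1:]
```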
Goertzel is under the hood a DFT, calculated for only a single frequency (or a snapshot of a STFT, calculated for a single chunk of samples at a particular point in time, and calculated only for a single frequency).
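That equivalence can be verified directly: the Goertzel recursion and a brute-force single-bin DFT give the same bin power, here for an arbitrary test signal (the 256-point size and bin 10 are just example values).

```python
import cmath
import math
import random

def goertzel_power(x, k, n):
    """Squared magnitude of DFT bin k, via the Goertzel recursion."""
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s1 = s2 = 0.0
    for v in x:
        s1, s2 = v + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def dft_bin(x, k):
    """DFT bin k computed by the definition (brute force)."""
    n = len(x)
    return sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x))

random.seed(42)
x = [random.uniform(-1.0, 1.0) for _ in range(256)]
K = 10
p_goertzel = goertzel_power(x, K, len(x))
p_dft = abs(dft_bin(x, K)) ** 2
```

The two results agree to floating-point precision; Goertzel just computes the same bin with one real multiply-accumulate per sample.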
DFT treats the samples as if they were
circular. The window function smooths the wrap-around discontinuity between end and start, reducing spectral leakage.
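A minimal demonstration of that leakage reduction (256-point block, tone at the half-integer frequency 10.5 so the wrap-around discontinuity is maximal, observation bin 40 picked arbitrarily far away): the Hann window drops the far-off leakage by orders of magnitude compared to no window (rectangular).

```python
import cmath
import math

def dft_bin(x, k):
    """DFT bin k computed by the definition."""
    n = len(x)
    return sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x))

N = 256
x = [math.sin(2.0 * math.pi * 10.5 * i / N) for i in range(N)]   # off-bin tone
hann = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / N) for i in range(N)]

leak_rect = abs(dft_bin(x, 40))                                  # no window
leak_hann = abs(dft_bin([v * w for v, w in zip(x, hann)], 40))   # Hann window
```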
But a DFT can be also interpreted as
filter bank. According to the filter bank interpretation, the chosen window function is the impulse response of a prototype low-pass filter, which is under the hood converted to a bandpass and applied to each DFT frequency bin (see previous link). The DFT window always has the same length as the number of samples; it cannot be longer. While various window functions are in common use (Hann, Hamming, Kaiser, ...), basically any FIR lowpass with N taps (where N is the DFT size, i.e. the number of samples) could be used as the window function in order to give the filter bank the desired frequency response (of course, if you need special properties like "constant overlap add", this may limit the choice of suitable window functions, but that is not an issue here).
A window function with 4000 "FIR taps" (for a 4000-point DFT) is already quite a large number, already enabling a pretty narrow bandwidth. But the minimum feasible bandwidth is eventually limited by the number of samples. And my feeling is that there is a trade-off between stop-band rejection (-> power of out-of-band frequencies leaking into the filter output) and ENBW (equivalent noise bandwidth -> i.e. noise power picked up inside the passband).
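That trade-off is easy to quantify for the ENBW side. The usual formula, in DFT bins, is N * sum(w^2) / (sum(w))^2; comparing rectangular against Hann (which buys ~18 dB lower sidelobes) shows the bandwidth price:

```python
import math

def enbw_bins(w):
    """Equivalent noise bandwidth of a window, in DFT bins."""
    n = len(w)
    return n * sum(v * v for v in w) / sum(w) ** 2

N = 4096
rect = [1.0] * N
hann = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / N) for i in range(N)]
enbw_rect = enbw_bins(rect)    # 1.0 bin, but only ~13 dB sidelobes
enbw_hann = enbw_bins(hann)    # 1.5 bins, ~31.5 dB sidelobes
```

So the better-rejecting window picks up 50% more noise bandwidth; windows with still deeper stopbands (Blackman, high-beta Kaiser) push the ENBW further up.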
Integrating the power of the band-pass filtered signal is certainly a valid procedure (provided that the band-pass filtering is valid in the first place). The advantage of Goertzel is the implied bandpass filter at low computational cost, and it collects both amplitude and phase, so that phase differences between subsequent readings can be measured. Phase measurements are more sensitive to noise than amplitude measurements, though. For a phase confidence of 1° (standard deviation), you need an effective SNR better than 30 dB.
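The phase-difference measurement can be sketched with the single-bin DFT (which is what Goertzel computes when the complex output is kept): the bin's argument tracks the tone's phase, so the difference between two readings recovers an injected phase offset (0.4 rad in this noise-free toy example; block size and bin are arbitrary).

```python
import cmath
import math

def dft_bin(x, k):
    """Complex DFT bin k -- amplitude and phase in one number."""
    n = len(x)
    return sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x))

N, K = 1024, 32

def block(phase):
    """One block of an on-bin cosine with the given starting phase."""
    return [math.cos(2.0 * math.pi * K * i / N + phase) for i in range(N)]

ph1 = cmath.phase(dft_bin(block(0.0), K))
ph2 = cmath.phase(dft_bin(block(0.4), K))
delta = ph2 - ph1              # recovered phase difference between readings
```

With noise added, the scatter of `delta` is what the 30 dB SNR figure above refers to.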
The effects of the filter's coefficient quantization and the available numeric range need to be considered carefully as well when aiming for very narrow filters. If the filter's center frequency or 3 dB point is very small relative to the sampling frequency (800+ kHz), it might be practical or even necessary to decimate prior to filtering in order to get the filter coefficients into a practical numeric range.
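To make that concrete, a sketch of the resonator coefficient 2*cos(w0) in 16-bit Q2.14 fixed point (the 800 kHz rate is from this thread; the 400 Hz center frequency and the 8x decimation factor are hypothetical example values): without decimation the coefficient saturates right at the format limit and the effective center frequency lands far from the intended one; after decimation the same format is accurate to a couple of percent.

```python
import math

def q14(value):
    """Round to Q2.14 in an int16 (range [-2, 2)), saturating at the limits."""
    raw = round(value * 16384)
    raw = max(-32768, min(32767, raw))
    return raw / 16384.0

fs = 800_000.0                       # sampling rate from the thread
f0 = 400.0                           # hypothetical very low center frequency

# Undecimated: 2*cos(w0) is barely below 2.0 and saturates in Q2.14
coeff = 2.0 * math.cos(2.0 * math.pi * f0 / fs)
f0_eff = math.acos(q14(coeff) / 2.0) * fs / (2.0 * math.pi)
rel_error = abs(f0_eff - f0) / f0

# After hypothetical 8x decimation: same format, much better accuracy
fs_dec = fs / 8.0
coeff_dec = 2.0 * math.cos(2.0 * math.pi * f0 / fs_dec)
f0_eff_dec = math.acos(q14(coeff_dec) / 2.0) * fs_dec / (2.0 * math.pi)
rel_error_dec = abs(f0_eff_dec - f0) / f0
```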
Sure, numeric ranges need to be planned carefully. I tried to check the effect of quantization. Quantizing the coefficients of a 4k-point Kaiser window to 16 bits seems to reduce the window's stop-band attenuation to ~120 dB. The question is whether this is enough, or whether more is required. 32-bit Q31 arithmetic should not be a problem for the Cortex-M3. The accumulator can also be 64 bits if necessary. For real-time processing there are fewer than 84 cycles per sample available, which rather rules out overly complex filtering - as you said yourself. Even a decimation filter with good stopband attenuation might already be too expensive. OTOH, Goertzel applied to the undecimated data is computationally inexpensive, so it seems well feasible.
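That kind of estimate can be reproduced in miniature (a 512-point Hann window here instead of the 4k Kaiser, purely to keep the sketch cheap): quantize the window to 16 bits, then look at the spectrum of the quantization error relative to the window's coherent (DC) gain. That error floor, not the ideal sidelobes, is what ultimately limits the achievable stopband; at this size it is guaranteed below -85 dB by the worst-case bound and typically sits much lower.

```python
import cmath
import math

N = 512
hann = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / N) for i in range(N)]
hann_q = [round(v * 32767) / 32767.0 for v in hann]       # 16-bit quantization
err = [a - b for a, b in zip(hann_q, hann)]               # quantization error

coherent_gain = sum(hann)                                 # DC gain of the window

def dft_mag(x, k):
    n = len(x)
    return abs(sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x)))

# Peak of the quantization-error spectrum, in dB relative to the coherent gain
floor_db = max(20.0 * math.log10(dft_mag(err, k) / coherent_gain + 1e-30)
               for k in range(1, N // 2))
```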
Btw, could you post the raw data from the previous test?
Edit: You may be interested in
this paper, too.