I currently use a 48kHz input sampling rate. Because the audio I am interested in (telephony) only spans about 300Hz to 3kHz, would it make sense to decimate by 4 to get a 12kHz output rate (or possibly by 6, down to 8kHz) to save computing time?
If I understand correctly, the decimating FIR filter is still designed for the higher input sampling rate, 48kHz in my example? That would lead to a very high filter order to get the desired bandwidth/selectivity in the receiver. Does it make sense to design the decimating FIR only to suppress aliasing and then apply separate, sharper filtering to the downsampled signal?
You can think of decimation as follows:
You have a sample rate of 48kHz, but your signal only occupies 300Hz - 3kHz. You could process that band at the 48kHz sample rate, but you would be wasting cycles, because the required sample rate is only about 6kHz (twice the highest signal component).
So, you want to decrease the sample rate. From sampling theory you know that a 6kHz sample rate is the absolute minimum for a 3kHz signal bandwidth. To make your life a bit easier, a 12kHz sample rate is a good compromise: the decimation filter can be more relaxed and you won't get aliasing as easily. As a result, you want to perform 4:1 decimation.
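Now, you need to filter out the frequency components above 3kHz, i.e. you want a 3kHz low-pass filter running at a sample rate of 48kHz. Because the output rate is 12kHz, only components above 9kHz fold back into the 0 - 3kHz band, which is what lets the filter's transition band be relaxed. A FIR filter is a good choice here: it suits the decimation process, it can have a linear phase response, it is always stable, etc.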
Here is the best part: yes, the 3kHz low-pass FIR is designed for the 48kHz rate, and in principle it produces an output at every sample instant. However, as part of the decimation process you use only every fourth output sample of the decimation filter. Because a FIR filter has no recursive structure (unlike an IIR filter), each output depends only on input samples, so you only need to compute the FIR at every fourth sample instant, the ones you actually keep. You don't need to evaluate the filter at the sample instants that you would discard anyway as a result of decimation.
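As a rough sketch of that idea (not a tuned implementation), the loop below steps through the input M samples at a time and runs the FIR sum only at those instants. The function name `fir_decimate`, the float sample type and the zero-padding before the first sample are my assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Decimate-by-m FIR: y[k] = sum over i of h[i] * x[k*m - i].
 * Only the outputs that survive decimation are computed.
 * Input samples before x[0] are treated as zero (assumption).
 * Returns the number of output samples written to y. */
size_t fir_decimate(const float *x, size_t n_in,
                    const float *h, size_t n_taps,
                    size_t m, float *y)
{
    assert(m > 0 && n_taps > 0);
    size_t n_out = 0;
    for (size_t n = 0; n < n_in; n += m) {      /* skip the discarded instants */
        float acc = 0.0f;
        for (size_t i = 0; i < n_taps && i <= n; ++i)
            acc += h[i] * x[n - i];             /* ordinary FIR sum at instant n */
        y[n_out++] = acc;
    }
    return n_out;
}
```

With m = 4 the inner loop runs at 12kHz instead of 48kHz, so the multiply-accumulate count drops by a factor of four while the filter itself is still the one designed for 48kHz.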
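Polyphase? Well, a polyphase structure would distribute the processing load more evenly: each incoming sample does a small share of the work, instead of the whole FIR being computed at every fourth sample.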
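To show what "more evenly" could look like, here is a minimal polyphase sketch. The struct `polyphase_dec`, the functions `polyphase_init` and `polyphase_push`, and the sizes `M` and `TAPS_PER_PHASE` are made-up names and values for illustration. The prototype filter `h` (length `M * TAPS_PER_PHASE`) is split into M branches, and every input sample runs only one short branch.

```c
#define M 4                 /* decimation factor (48kHz -> 12kHz) */
#define TAPS_PER_PHASE 8    /* prototype FIR length is M * TAPS_PER_PHASE */

typedef struct {
    float hist[M][TAPS_PER_PHASE];  /* one short delay line per branch, newest first */
    float acc;                      /* partial sum of the output being built */
    unsigned phase;                 /* branch that the next input sample feeds */
} polyphase_dec;

void polyphase_init(polyphase_dec *s)
{
    for (int p = 0; p < M; ++p)
        for (int j = 0; j < TAPS_PER_PHASE; ++j)
            s->hist[p][j] = 0.0f;
    s->acc = 0.0f;
    s->phase = 0;
}

/* Feed one input sample. Branch p uses the taps h[j*M + p].
 * Returns 1 and writes *y when an output sample is complete, else 0. */
int polyphase_push(polyphase_dec *s, const float *h, float x, float *y)
{
    unsigned p = s->phase;
    float *d = s->hist[p];

    for (int j = TAPS_PER_PHASE - 1; j > 0; --j)   /* shift this branch's delay line */
        d[j] = d[j - 1];
    d[0] = x;

    for (int j = 0; j < TAPS_PER_PHASE; ++j)       /* only TAPS_PER_PHASE MACs per input */
        s->acc += h[j * M + p] * d[j];

    s->phase = (p == 0) ? M - 1 : p - 1;           /* branches run in order 0, M-1, ..., 1 */
    if (p != 0)
        return 0;
    *y = s->acc;                                   /* branch 0 completes the output */
    s->acc = 0.0f;
    return 1;
}
```

The total work per output is the same as computing the full FIR at every fourth sample; the difference is that it is spread over all four input samples, which keeps the worst-case time per interrupt lower.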
About the software architecture: divide the software into two parts, "foreground" and "background".
The first part is the "foreground" process, i.e. the real-time signal processing, which is handled in the interrupt service routine at the 48kHz sample rate.
The second part, the "background", runs at the main() function level and handles the UI and other non-real-time work. The background is interrupted by the foreground at the 48kHz rate, so it gets whatever processing cycles the foreground leaves unused.
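The split could be sketched roughly like this. Everything here is hypothetical: `sample_isr` stands in for whatever your 48kHz interrupt handler is called, `background_step` for one pass of the loop in main(), and the shared counter is just one simple way to let the background see what the foreground has done. I am also assuming a 32-bit target where reading an aligned 32-bit variable is a single access.

```c
#include <stdint.h>

#define DECIM 4u                      /* 48kHz in, 12kHz out */

static volatile uint32_t g_outputs;   /* shared: written by the foreground only */
static uint32_t g_phase;              /* foreground-private sample counter */
static uint32_t g_seen;               /* background-private snapshot of g_outputs */

/* Foreground: hypothetical body of the 48kHz sample interrupt. */
void sample_isr(int16_t x)
{
    (void)x;  /* a real ISR would store x in the FIR delay line here */
    if (g_phase == 0u) {
        /* evaluate the decimating FIR and the 12kHz processing here */
        g_outputs++;
    }
    g_phase = (g_phase + 1u) % DECIM;
}

/* Background: one pass of the main() loop. Returns how many new output
 * samples the foreground produced since the previous pass. */
uint32_t background_step(void)
{
    uint32_t now = g_outputs;         /* one read of the shared counter */
    uint32_t fresh = now - g_seen;
    g_seen = now;
    /* update the UI and do other non-real-time work here */
    return fresh;
}
```

On the target, main() would just call `background_step()` in an endless loop, and the interrupt would preempt it whenever a new sample arrives.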