Well, 16x16 is not a big deal. I don't think it's worth trying to optimize it, given the results in the Microchip app note:

http://ww1.microchip.com/downloads/en/AppNotes/00852a.pdf

Their test results with 16 bit IIR and 8 bit FIR coefficients show that filtering 10 bit ADC data at 8kHz uses ~2.5 MIPS of the 10 MIPS available - which is 1/4. I'm assuming that they spent some effort optimizing it, even if it's not the best optimization.

That's probably fair for that platform. Their inner loop looks awfully long. I don't know if that's a PIC thing. Possibly; it's an ancient instruction set, and pretty sparse, most ponderously so on registers.

The pile of registers on the AVR lets it plow through DSP stuff like this -- set up all your coefficients and operands, and just chug chug chug away. If register values need to be protected, push/pop goes outside the loop, around the function, so they only cost constant time. The fewer memory sources you have to read/write during the loop, the better; and if they can be accessed with sequentially indexed addressing, all the more (e.g., reading samples or coefficients from a linear buffer).
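
As a rough sketch of what that inner loop looks like in C (the function name and the Q15-style 16 bit operands are my assumptions for illustration, not anything from the app note): both buffers are walked linearly, and the accumulator never leaves registers until the loop is done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply-accumulate over two linear buffers (hypothetical helper).
 * The caller must make sure the sum fits in 32 bits. */
int32_t fir_mac(const int16_t *samples, const int16_t *coeffs, size_t n)
{
    assert(samples != NULL && coeffs != NULL);

    int32_t acc = 0;   /* lives in registers for the whole loop */

    while (n--) {
        /* Post-increment reads map naturally onto indexed loads with
         * auto-increment. The casts force a 32 bit product even on
         * targets where int is only 16 bits wide. */
        acc += (int32_t)(*samples++) * (int32_t)(*coeffs++);
    }
    return acc;
}
```

Any saving and restoring of registers happens once, in the function prologue and epilogue, so its cost doesn't grow with the number of taps.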

Most any modern CPU should perform as well or better. At a glance, dsPIC and MSP430 (assuming hardware multiply is included) are probably a few times better than AVR at this task, and PIC32 (MIPS core) and ARM anything (Cortex M0 on up) even better still (wider operands, especially handy for higher precision; more addressing modes; higher clock rates, and in the even faster ones, pipelines and cache).

Or a DSP architecture, but that would be serious overkill for such simple filtering. Setting up the data flows might well be more overhead for such a simple problem on those.

So 10 bits by 16 bits will rarely spread to more than 24 bits. But what if I have to multiply 24 bit ADC data by 16 bit coefficients? If I use fixed point, that would give me a 48 bit result, which I can later round to 24 bits again. The MCU time for doing that would also be a lot more. Also, you said that IIR is very sensitive to rounding.

The total number of (possibly required) bits in the product is always the sum of the bits in the operands, so 10b x 16b can need 26 bits. With real signal data, you probably won't miss the two bits truncated to fit into 24 (but you still need to carry them through the multiply; dropping them beforehand would be equivalent to truncating one or the other operand directly, which wouldn't be nice).
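
To make the difference concrete, here is a minimal sketch, assuming an unsigned 10 bit ADC code and a signed 16 bit coefficient (both function names are made up for the example). It also assumes that >> on a negative int32_t is an arithmetic shift, which is true on avr-gcc and most other compilers but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stdint.h>

/* Full-width path: keep all 26 bits through the multiply, then round
 * away the 2 LSBs to land on a 24 bit result.
 * Assumes arithmetic right shift for negative values. */
int32_t scale_full(uint16_t adc10, int16_t coeff)
{
    assert(adc10 < 1024u);
    int32_t product = (int32_t)adc10 * (int32_t)coeff;
    return (product + 2) >> 2;   /* +2 rounds to nearest before the shift */
}

/* What not to do: dropping the 2 bits before the multiply is the same
 * as feeding in an 8 bit ADC. */
int32_t scale_truncated_operand(uint16_t adc10, int16_t coeff)
{
    assert(adc10 < 1024u);
    return (int32_t)(adc10 >> 2) * (int32_t)coeff;
}
```

With adc10 = 3 and coeff = 100, the full-width path returns 75, while the truncated-operand path returns 0: the small signal has vanished entirely.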

For a fixed size hardware multiplier (e.g., 8x8 or 16x16 or whatever), operands N words wide take O(N^2) multiply operations, so if you're doing 24 or 48 bit MACs, it can get very attractive to use a more powerful instruction set.
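
The quadratic growth is easy to see in a schoolbook multi-word multiply. The following is only an illustrative sketch (mul_words is a made-up name, and a 16x16 -> 32 multiplier is assumed); it returns the number of hardware-sized multiplies it used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Schoolbook multiply of two unsigned n-word numbers, stored least
 * significant word first. out must have room for 2*n words.
 * Returns how many 16x16 multiplies were performed (n*n). */
unsigned mul_words(const uint16_t *a, const uint16_t *b,
                   uint16_t *out, size_t n)
{
    assert(n > 0);
    unsigned multiplies = 0;

    for (size_t i = 0; i < 2 * n; i++)
        out[i] = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t carry = 0;
        for (size_t j = 0; j < n; j++) {
            /* One partial product plus the running sum and carry;
             * the worst case is exactly 0xFFFFFFFF, so it fits. */
            uint32_t t = (uint32_t)a[i] * b[j] + out[i + j] + carry;
            out[i + j] = (uint16_t)t;
            carry = t >> 16;
            multiplies++;
        }
        out[i + n] = (uint16_t)carry;
    }
    return multiplies;
}
```

Two words per operand (32 bit) costs 4 multiplies, three words (48 bit) costs 9, plus all the carry handling in between.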

IIR is very sensitive, but that's one place where simulation can help. Occasionally, you can play around with the bits, and round the coefficients to convenient numbers that happen to give a stable result, but at a slightly different cutoff frequency, or Q factor, or gain. This can be solved beforehand, and the compensating gain term can be calculated.
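
A minimal desktop-side sketch of that kind of check, assuming a single biquad with denominator 1 + a1*z^-1 + a2*z^-2 and a non-zero DC gain (i.e., a low-pass). The struct and function names are invented for the example. It rounds the coefficients to a given number of fractional bits, tests the poles against the usual stability triangle, and works out the gain correction at DC.

```c
#include <assert.h>

typedef struct { double b0, b1, b2, a1, a2; } biquad;

/* Round c to the nearest multiple of 2^-frac_bits. */
static double quantize(double c, unsigned frac_bits)
{
    assert(frac_bits < 31);
    double scale = (double)(1UL << frac_bits);
    double scaled = c * scale;
    long q = (long)(scaled + (scaled >= 0 ? 0.5 : -0.5));
    return (double)q / scale;
}

biquad biquad_quantize(biquad f, unsigned frac_bits)
{
    biquad q = { quantize(f.b0, frac_bits), quantize(f.b1, frac_bits),
                 quantize(f.b2, frac_bits), quantize(f.a1, frac_bits),
                 quantize(f.a2, frac_bits) };
    return q;
}

/* Both poles are inside the unit circle iff a2 < 1 and |a1| < 1 + a2. */
int biquad_is_stable(biquad f)
{
    double abs_a1 = f.a1 < 0 ? -f.a1 : f.a1;
    return (f.a2 < 1.0) && (abs_a1 < 1.0 + f.a2);
}

/* Gain at DC (z = 1). */
double biquad_dc_gain(biquad f)
{
    double den = 1.0 + f.a1 + f.a2;
    assert(den != 0.0);
    return (f.b0 + f.b1 + f.b2) / den;
}

/* Factor to multiply the quantized filter's output by so that its DC
 * gain matches the ideal design. */
double gain_compensation(biquad ideal, biquad quantized)
{
    return biquad_dc_gain(ideal) / biquad_dc_gain(quantized);
}
```

For example, a1 = -1.9, a2 = 0.95 is stable as designed, but rounded to 3 fractional bits a2 becomes exactly 1.0 and the poles land on the unit circle. At 14 fractional bits it stays stable, with a DC gain error of roughly 0.1% that the compensation factor takes out.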

Much the same as selecting rounded component values for an analog filter, but those are much more stable, so it's a lot easier (since the poles only have to lie in the left half of the complex plane -- instead of inside the unit circle, a much tighter margin).

Point being, if you calculate a filter that requires, say, 18 bit coefficients for stability, but you get lucky and find a set of coefficients that end in two zeroes, you can truncate those to 16 bits, and save a whole bunch of calculations on an 8/16 bit platform!
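
Checking for that lucky case is a one-liner. A small sketch, with a hypothetical helper name, that narrows a wider coefficient to 16 bits only when the bits being dropped are all zero (so nothing is lost):

```c
#include <assert.h>
#include <stdint.h>

/* If the low `drop` bits of coeff are all zero and the remainder fits
 * in 16 bits, store the narrowed value in *out and return 1.
 * Otherwise return 0 and leave *out untouched. */
int narrow_coeff(int32_t coeff, unsigned drop, int16_t *out)
{
    assert(drop < 16 && out != 0);

    uint32_t mask = ((uint32_t)1 << drop) - 1u;
    if (((uint32_t)coeff & mask) != 0)
        return 0;                      /* would lose precision */

    /* Division is exact here because the low bits are zero. */
    int32_t narrowed = coeff / ((int32_t)1 << drop);
    if (narrowed < INT16_MIN || narrowed > INT16_MAX)
        return 0;                      /* still too wide */

    *out = (int16_t)narrowed;
    return 1;
}
```

For instance, the 18 bit value 0x1A2B4 ends in two zero bits and narrows to 26797, while 0x1A2B5 is rejected.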

I could spend a lot of effort writing optimized assembly code for the filters, but that probably won't stabilize the ADC input. With a second order IIR I was able to get a stable 500.000 mV reading (LS digit playing +-2) at a cutoff frequency <1Hz (not sure about the exact numbers). The display wasn't updating fast enough. It took seconds to settle to within 1% of the reading.

Well... yeah, that's exactly what a filter should do. If you want an aggravatingly slow update, you can filter the piss out of it, and get one really stable number. It might not even be as accurate as it is stable (i.e., you're stretching it out to way more bits than the ADC is even worth), but that's certainly something that can be done.

If you want something nice and responsive, an IIR with rapid settling, or an FIR, or a combination of both, would do well. The IIR, of course, must be calculated in much the same way that you'd calculate an analog filter with rapid settling -- i.e., minimal overshoot and phase delay: a Bessel filter. (The FIR can simply be a "hump" of any suitable shape -- a windowed piece of sin(x), Gaussian, actually sampled Bessel impulse response, etc. But to get comparable performance, it will have to be long.)

There are still other possibilities. If you want a periodic update, where each frame evaluates exactly the data in the preceding frame duration, then you can just accumulate for a long time, then when it comes time to spit that out to the display, divide by the number of samples and correct for gain and whatnot. The output sample rate is therefore the frame rate, which is less than the input sample rate -- a decimation process. An impulse read during a frame will only show up after that same frame, so the impulse response is finite (= FIR). This is what almost all DMMs perform, whether through analog (multi-slope integration) or digital means.
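
The digital version of that accumulate-and-dump scheme might look something like this. It's a sketch only: the type and function names are made up, and gain correction is left out. With 16 bit samples and a 16 bit frame length, the 32 bit accumulator can't overflow.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int32_t  acc;        /* running sum for the current frame */
    uint16_t count;      /* samples accumulated so far */
    uint16_t frame_len;  /* samples per display update */
} frame_avg;

void frame_avg_init(frame_avg *f, uint16_t frame_len)
{
    assert(f != 0 && frame_len > 0);
    f->acc = 0;
    f->count = 0;
    f->frame_len = frame_len;
}

/* Feed one sample. Returns 1 and writes the frame average to *out
 * when a frame completes, otherwise returns 0. */
int frame_avg_push(frame_avg *f, int16_t sample, int32_t *out)
{
    f->acc += sample;
    if (++f->count < f->frame_len)
        return 0;

    *out = f->acc / (int32_t)f->frame_len;
    f->acc = 0;          /* start the next frame from scratch */
    f->count = 0;
    return 1;
}
```

The only per-sample work is one add and one compare; the divide runs once per frame, at the display rate.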

Tim