Indeed, if you have to saturate and the instruction set has no saturating instructions, that's a colossal waste of time, especially if you need it at multiple steps. But really, the algorithm should be structured so that saturation is needed only as the final step, once per output sample, if at all possible.
Inputs coming from real-world sources are bounded to begin with, and the coefficients are known constants, so all that is needed is to saturate the output (or, even better, to show by worst-case analysis that the output can never wrap around and needs no saturation at all).
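As a minimal sketch of that structure (assuming Q15 samples and coefficients, which is my assumption, not something stated above): keep a wide accumulator so every intermediate sum runs unsaturated, and clamp exactly once at the output.

```c
#include <stdint.h>

#define NTAPS 32

/* Q15 FIR sketch: all intermediate products and sums live in a 64-bit
 * accumulator, so nothing can wrap mid-filter. The single clamp
 * happens once per output sample, at the very end. */
int16_t fir_q15(const int16_t coeff[NTAPS], const int16_t state[NTAPS])
{
    int64_t acc = 0;
    for (int i = 0; i < NTAPS; i++)
        acc += (int32_t)coeff[i] * state[i];  /* 16x16 -> 32-bit products */

    acc >>= 15;                               /* scale back to Q15 */
    if (acc > INT16_MAX) acc = INT16_MAX;     /* saturate once, at the end */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

If the worst-case analysis shows the sum of |coefficients| is bounded, the clamp can even be proven dead code and removed entirely.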
That said, even if your processing is in floating point, the output often needs to be saturated or range-checked anyway (for instance, when converting back to the DAC or codec's integer format), so the same work has to be done either way.
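For illustration, a hypothetical float output stage still ends up doing the same one-time clamp:

```c
#include <stdint.h>
#include <math.h>

/* Float pipeline or not, the output stage clamps once when the result
 * goes back to a 16-bit integer format (names here are illustrative). */
static inline int16_t float_to_q15(float x)
{
    x *= 32768.0f;                   /* scale [-1.0, 1.0) to Q15 */
    if (x >  32767.0f) x =  32767.0f;
    if (x < -32768.0f) x = -32768.0f;
    return (int16_t)lrintf(x);       /* round to nearest integer */
}
```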
Floating point sometimes has a performance benefit over integer, because many cores (Cortex-M4F/M7 MCUs, for example) come with far more FPU registers than usable general-purpose integer registers (32 single-precision versus roughly a dozen), and the FPU registers are all free for data, not wasted on trivial things like loop counters and addresses.
For example, I hand-wrote a decimating FIR filter in assembly where all the accumulators lived in single-precision floating-point registers (32 registers is a massive amount of single-cycle storage!), eliminating a lot of loads and stores. With the loops unrolled, the code running from ITCM, and the coefficients loaded from DTCM as a dual-issued (Cortex-M7) parallel operation, the performance is basically that of a classical dedicated DSP, even on "just" a general-purpose MCU.
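A rough C rendition of the register-resident-accumulator idea (this is my sketch, not the original assembly): compute several outputs at once, so each coefficient is loaded from memory once and reused, while the accumulators stay in S registers for the whole loop.

```c
/* Four decimated FIR outputs at once; D is the decimation factor.
 * x points at the sample aligned with the first output, with ntaps-1
 * history samples before it and 3*D samples after it. A decent
 * compiler keeps a0..a3 in FPU registers across the loop. */
void fir_decim4(const float *coeff, int ntaps, int D,
                const float *x, float *y)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (int k = 0; k < ntaps; k++) {
        float c = coeff[k];          /* one coefficient load...        */
        a0 += c * x[0*D - k];        /* ...amortized over four outputs */
        a1 += c * x[1*D - k];
        a2 += c * x[2*D - k];
        a3 += c * x[3*D - k];
    }
    y[0] = a0; y[1] = a1; y[2] = a2; y[3] = a3;
}
```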
But sure, at some point you run out of those 16 (DP) / 32 (SP) registers and can't avoid memory loads and stores. At that point, if you can load four 16-bit values in one go over the 64-bit-wide AXI bus and then use integer SIMD instructions to do whatever you need, the fixed-point version will greatly outperform the floating-point one.
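For instance, on cores with the Armv7E-M DSP extension, the CMSIS `__SMLAD` intrinsic does two 16x16 multiply-accumulates per instruction on packed halfwords; a sketch of the idea (assuming even `n` and 32-bit-aligned buffers):

```c
#include <stdint.h>
#include "cmsis_compiler.h"   /* CMSIS-Core; defines __SMLAD on DSP-capable cores */

/* Packed-SIMD Q15 dot product: each 32-bit load brings in two samples,
 * and __SMLAD performs both multiply-accumulates in one instruction. */
int32_t dot_q15_smlad(const int16_t *x, const int16_t *c, int n)
{
    const uint32_t *xp = (const uint32_t *)x;  /* two Q15 values per word */
    const uint32_t *cp = (const uint32_t *)c;
    int32_t acc = 0;
    for (int i = 0; i < n / 2; i++)
        acc = (int32_t)__SMLAD(xp[i], cp[i], (uint32_t)acc);
    return acc;
}
```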
So the answer really is, it depends.