Some Cortex-M processors have a "lazy" FPU context push that only happens when the ISR hits its first FPU instruction.
I don't think this is the case. It would prevent other optimizations like tail-chaining.
It should work. I don't see why it should interfere with tail-chaining, either. Effectively, it changes the "internal return address" of the ISR to different microcode. Sort of CISC-y, but not too bad. (There are already effectively two different sets of ISR epilogue depending on whether FP save is enabled at all...)
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0298a/BCGHEEFD.html
(Back in the day, we sped up our ISR by not using floating point anywhere, and moving the main CPU registers into the FP registers instead of stacking them. That was on MIPS or PPC, I think. Probably without multiple interrupt priorities, or perhaps only for the ISR level that we considered most important.)
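For anyone who wants to poke at the lazy stacking discussed earlier in the thread, here is a rough C sketch of the two control bits involved and of the frame sizes at stake. The register address and the ASPEN/LSPEN bit positions are the Cortex-M4F ones as I understand them; the helper names `fpccr_set_lazy` and `exception_frame_bytes` are made up for illustration, and the code only computes values rather than touching the register, so it also builds on a host.

```c
#include <stdbool.h>
#include <stdint.h>

/* FPCCR (Floating-Point Context Control Register) on Cortex-M4F.
   Not dereferenced in this sketch, so it compiles off-target too. */
#define FPCCR_ADDR   0xE000EF34u
#define FPCCR_ASPEN  (1u << 31)  /* automatic FP state preservation */
#define FPCCR_LSPEN  (1u << 30)  /* lazy FP state preservation */

/* Hypothetical helper: returns the FPCCR value with automatic stacking on
   and lazy stacking switched on or off. Other bits are left unchanged. */
uint32_t fpccr_set_lazy(uint32_t fpccr, bool lazy)
{
    fpccr |= FPCCR_ASPEN;
    if (lazy) {
        fpccr |= FPCCR_LSPEN;
    } else {
        fpccr &= ~FPCCR_LSPEN;
    }
    return fpccr;
}

/* Hypothetical helper: bytes reserved on the stack at exception entry,
   ignoring any extra alignment padding word. */
uint32_t exception_frame_bytes(bool fp_context_active)
{
    const uint32_t basic_words = 8u;  /* r0-r3, r12, lr, pc, xPSR */
    const uint32_t fp_words = 18u;    /* s0-s15, FPSCR, one reserved word */
    return 4u * (basic_words + (fp_context_active ? fp_words : 0u));
}
```

On a real target you would write the result back with something like `*(volatile uint32_t *)FPCCR_ADDR = fpccr_set_lazy(*(volatile uint32_t *)FPCCR_ADDR, true);`. With lazy stacking enabled the larger frame is still reserved at entry, but the FP registers are only written into it if the ISR actually executes an FP instruction.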