It occurs to me that one could daisy-chain two 16-bit Compare modules in an MCU and create a single-chip, purely hardware solution, with one Compare module acting as a prescaler for the other.
Example: In the PIC18F family, Timers can be clocked by an external pin, and the associated CCP module can control an output pin. The first Timer takes in the original signal, and its CCP's output pin is hardwired to the clock input pin of the second Timer. The second CCP's output pin then becomes your divided output signal. You'd have up to 32 bits' worth of divisor (the overall ratio being the product of the two 16-bit stages), and by setting the final compare value to 50% of the desired count and letting the output toggle on each match, you'd get a 50% duty cycle output waveform implemented purely in hardware. If your use of the output signal is edge sensitive, you're good to go.
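To sanity-check the arithmetic, here is a rough software model of the two-stage chain. It is only a sketch, not PIC18F code: the function names (divide_stage, rising_edges, cascade, overall_ratio) are made up for illustration, and it assumes each stage counts rising edges, toggles its output on a compare match, and restarts its count at that point. Whether a particular CCP mode does both at once is something to confirm in the datasheet.

```python
def divide_stage(samples, match_count):
    """Model one Timer + Compare stage (assumed behaviour, not a real API).

    samples: list of input levels (0 or 1).
    The stage counts rising edges and toggles its output every
    match_count rising edges, then restarts the count.
    """
    out = 0
    count = 0
    prev = 0
    result = []
    for level in samples:
        if prev == 0 and level == 1:   # rising edge on the clock input
            count += 1
            if count == match_count:   # compare match
                out ^= 1               # toggle the output pin
                count = 0              # restart the count
        prev = level
        result.append(out)
    return result


def rising_edges(samples):
    """Count 0 -> 1 transitions, treating the level before the first sample as 0."""
    prev = 0
    edges = 0
    for level in samples:
        if prev == 0 and level == 1:
            edges += 1
        prev = level
    return edges


def cascade(n_input_cycles, match1, match2):
    """Feed a square wave through two stages wired output-to-input."""
    clock = [1, 0] * n_input_cycles        # two samples per input cycle
    stage1 = divide_stage(clock, match1)
    stage2 = divide_stage(stage1, match2)  # stage 1 output clocks stage 2
    return stage1, stage2


def overall_ratio(match1, match2):
    """Input cycles per output cycle: each toggling stage divides by 2 * match."""
    return 4 * match1 * match2
```

With match counts of 3 and 5, for instance, the model gives one output cycle per 60 input cycles, and the final output is high for exactly half the time. Note that in this model only ratios of the form 4 x match1 x match2 are reachable, so not every 32-bit divisor can be hit exactly.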
The firmware's only job would be to configure the Timer and CCP modules. Thereafter, the hardware would run autonomously with no firmware involvement at all. This completely eliminates the question of firmware latency, eases the fears of those concerned about Assembly language, etc. Just put the firmware in a tight loop that does nothing. It might even be possible to halt code execution to save power and reduce noise. Putting the core into a light sleep mode might do it, since many peripheral modules can be configured to continue hardware operation while the core is asleep.
The one hangup is the maximum Timer input frequency. A quick glance at the PIC18F family suggests that 10 MHz is a bit too fast. But the concept still holds, and other families (from other manufacturers?) may support the frequency in question. It looks like the PIC18F would handle inputs under 5 MHz, for example.
Just another way of looking at the "problem". An MCU can sometimes be used as a programmable logic block without direct ongoing firmware involvement.