The extra processing needed for a non-zero IF (ZF) is not really high:
1) Step the virtual IF phase with one addition and drop the overflowing highest bits
2) Get the sine and cosine values from a table
3) 2 multiplications (16-bit integers) for each ADC channel and 4 additions/subtractions to combine the 4 signals
This should not need much more than about 1% of the processor time, provided the sampling rate is not very high.
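The three steps above can be sketched roughly as follows. This is only an illustration, not the firmware from this project: the names `nco_t` and `nco_mix`, the 16-bit phase width and the deliberately coarse 16-entry table are all assumptions made to keep the example short. The products are kept in 32 bits so that the sum of two 16-bit products cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_BITS 4                       /* assumed: coarse 16-entry table */
#define TABLE_SIZE (1u << TABLE_BITS)

/* One full sine period in Q15 (32767 * sin(2*pi*k/16)). */
static const int16_t sine_table[TABLE_SIZE] = {
         0,  12539,  23170,  30273,  32767,  30273,  23170,  12539,
         0, -12539, -23170, -30273, -32767, -30273, -23170, -12539
};

typedef struct {
    uint16_t phase;  /* virtual IF phase; wraps at 2^16, dropping the top bits */
    uint16_t step;   /* increment per sample = f_IF / f_sample * 65536 */
} nco_t;

/* Rotate one I/Q sample pair by the current phase, then step the phase. */
static void nco_mix(nco_t *nco, int16_t i_in, int16_t q_in,
                    int32_t *i_out, int32_t *q_out)
{
    assert(nco && i_out && q_out);

    /* Step 2: table lookup; cosine is the sine a quarter period ahead. */
    uint32_t idx = (uint32_t)nco->phase >> (16 - TABLE_BITS);
    int32_t s = sine_table[idx];
    int32_t c = sine_table[(idx + TABLE_SIZE / 4) & (TABLE_SIZE - 1)];

    /* Step 3: two multiplications per channel, then combine the products. */
    *i_out = (i_in * c + q_in * s) / 32768;
    *q_out = (q_in * c - i_in * s) / 32768;

    /* Step 1: one addition; the cast discards the overflow. */
    nco->phase = (uint16_t)(nco->phase + nco->step);
}
```

With `step = 16384` the phase advances a quarter turn per sample, so a constant input on I moves to I, -Q, -I, +Q over four samples. A real table would have more entries (or interpolation) to keep spurious products low.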
I did think about this approach, but only after I had built the I/Q mixers on the prototype and tested them with a lab signal generator providing the I/Q LO signals. Next I implemented the CPLD clock generator, simply because I could do it that way.
This project "just happened" bit by bit and step by step, without a polished concept of how to do it at the very beginning. So, yes, some design decisions would have been different if this project had been properly planned.
Indeed, there's a routine in the firmware that does the vector rotation you describe, to remove the short-term LF residue of the LO / DCF77 carrier difference from the received signals. The (voltage-controlled) TCXO is controlled rather long-term, so the I/Q mixers only output "DC" after a significant stabilisation time. Until that stabilisation is achieved, the I/Q signals sit at a low frequency (0.1 Hz), which is removed in the firmware. This is not shown in the block diagram.
BTW, the link NivagSwerdna provided shows a nice example of "mixing down to DC" by sampling the signal at the carrier frequency or an exact multiple of it.
It also processes the samples in a loop that must execute within one period of fc, whereas my approach is rather "brute force": I store the samples in buffers and run loops across the data.
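To illustrate the "sample at an exact multiple of fc" idea combined with the buffered style, here is a minimal sketch. It is neither the code from the linked project nor from my firmware. It assumes the buffer was sampled at exactly 4 x fc (310 kHz for the 77.5 kHz DCF77 carrier), so that the LO sine and cosine reduce to the sequences 0, +1, 0, -1 and +1, 0, -1, 0 and no multiplications are needed. The names `iq_t` and `mix_to_dc_fs4` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t i;   /* in-phase sum */
    int32_t q;   /* quadrature sum */
} iq_t;

/* Mix a buffer sampled at exactly 4 * fc down to DC and accumulate it.
 * n must be a multiple of 4, i.e. a whole number of carrier periods. */
static iq_t mix_to_dc_fs4(const int16_t *buf, size_t n)
{
    assert(buf && n % 4 == 0);

    iq_t acc = { 0, 0 };
    for (size_t k = 0; k < n; k += 4) {
        acc.i += buf[k]     - buf[k + 2];   /* cos weights: +1, 0, -1, 0 */
        acc.q += buf[k + 1] - buf[k + 3];   /* sin weights:  0, +1, 0, -1 */
    }
    return acc;
}
```

Because each carrier period contributes one +1 and one -1 weight per channel, a constant ADC offset cancels out of both sums. The loop runs over the stored buffer after acquisition, so it does not have to finish within one period of fc.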