In general, it would have to be something like:
0. Who frikken cares? Trample the 2+ extra bits, and mask off the garbage later.
If the device's SPI doesn't reset between nCS assertions, but retains its cycle count, well, sucks to be you. No, seriously: you're going to miss a clock some day, and literally all your data will be trash after that. There are a surprising number of ICs that misbehave in this way, and a surprising number of those have no way to reset the comm state; you must send a global reset or power-cycle them when they fuck up. Brain-dead design, but it's totally real.
As a corollary, you probably can't use such a device on a shared SPI bus, either because it always receives clocks, or sometimes mistakenly receives them; or MISO isn't a tristate pin; or...
1. Sequence a DMA transaction with the SPI control register, to set a 6 or 7-bit mode for the last byte or two.
Probably the DMA isn't flexible enough to do this, though?
2. Trigger an interrupt on the penultimate byte, change the config register, go one (or two) more bytes, then do the usual processing.
More manual, requires CPU intervention -- but doesn't require interaction from the main() thread. Basically #1 but patching in the functionality with an interrupt. Pretty typical case I would guess.
3. Trigger an interrupt on the penultimate byte, and bit-bang out the remaining bits.
Time consuming. Perhaps more time than you can afford in an interrupt. May have to be pushed into a subroutine elsewhere, perhaps a lower priority interrupt, or triggering an event which is polled for in main().
If the bitrate is intentionally on the low side, a timer could be used to take up the space between clock edges, so the last few bits would be received by a timer interrupt. This would save the CPU cycles that a 100%-software bit-bang would need.
I suppose you could also make a hybrid, where you start one last byte (or word) transfer, but start a timer simultaneously, and stop the SPI module right in the middle of the transaction. Tricky to resolve timing, and may not be reliable (ooh, random bit errors!). May not be supported on a given platform (e.g. if the shift register contents are considered invalid (and unreadable) until a full byte/word is complete).
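For what it's worth, here is a rough C sketch of option 0 and of the bit-banged tail from option 3. Every name in it is made up for illustration: frame_from_bytes(), bitbang_tail(), the bitbang_pins_t hooks and the sim_* helpers are hypothetical, so swap in whatever your HAL gives you. It assumes MSB-first data, SPI mode 0 (sample on the rising edge), and frames of 32 bits or fewer. The sim_* part is only a fake slave for trying it on a PC, not something you'd ship.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Option 0: clock whole bytes, then throw away the surplus bits.
 * Assumes MSB first, so the garbage lands in the low bits of the last byte. */
uint32_t frame_from_bytes(const uint8_t *buf, size_t nbytes, unsigned nbits)
{
    assert(nbytes >= 1 && nbytes <= 4);
    assert(nbits >= 1 && nbits <= nbytes * 8u);

    uint32_t v = 0;
    for (size_t i = 0; i < nbytes; i++)
        v = (v << 8) | buf[i];
    return v >> (nbytes * 8u - nbits);   /* drop the extra clocks' worth */
}

/* Option 3: hypothetical pin hooks; replace with your own GPIO access. */
typedef struct {
    void (*sck_write)(int level);
    int  (*miso_read)(void);
    void (*half_period_delay)(void);
} bitbang_pins_t;

/* Clock in the last nbits by hand and append them to acc (MSB first, mode 0). */
uint32_t bitbang_tail(const bitbang_pins_t *pins, uint32_t acc, unsigned nbits)
{
    while (nbits--) {
        pins->sck_write(1);                                  /* rising edge  */
        acc = (acc << 1) | (uint32_t)(pins->miso_read() & 1); /* sample MISO  */
        pins->half_period_delay();
        pins->sck_write(0);                                  /* falling edge */
        pins->half_period_delay();
    }
    return acc;
}

/* Host-side fake slave: presents the next bit of sim_data on MISO and
 * advances on each falling clock edge. For desk-checking only. */
uint32_t sim_data;
unsigned sim_left;   /* number of bits of sim_data still to send */

void sim_sck_write(int level)
{
    if (level == 0 && sim_left > 0)
        sim_left--;
}

int sim_miso_read(void)
{
    return sim_left ? (int)((sim_data >> (sim_left - 1)) & 1u) : 0;
}

void sim_delay(void) { /* nothing to wait for on a PC */ }
```

So for a 22-bit frame read as three bytes 0xAB 0xCD 0xEF, frame_from_bytes(buf, 3, 22) gives 0x2AF37B. On real hardware, whether you sample just after the rising edge or just before it, and how long half_period_delay() needs to be, depends on the slave's timing; check the datasheet.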
If you need coherent clocking (i.e., the device is sensitive to timing as well as to the number of bits), consider a hardware solution. Discrete logic could do it, but an FPGA would be sensible. Upside: you can make any kind of bridge you like, to SPI, I2C, parallel, or async serial, to name a few. A fully [double-]buffered solution would incur a fair amount of delay (i.e., at least a full frame), but something more like a clock-domain-crossing bridge could be used for minimum latency.
Tim