Thinking about the SPI: Do all 12 instances need to run concurrently or can there be just one SPI channel and 12 CS* lines?
I think you should spend a good deal of time considering how many things really need to happen at once. A single shared SPI bus can still be fast, and some uCs have several SPI gadgets, but nothing like 12.
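To get a feel for whether one shared bus is fast enough, here is a rough back-of-envelope sketch. The clock rate, word size and CS* overhead are made-up numbers for illustration only, so plug in the values for your actual parts.

```python
# Rough estimate: time to service N devices one after another on ONE SPI bus.
# All three constants below are assumptions, not values from your hardware.
SPI_CLOCK_HZ = 10_000_000    # assumed SPI clock
BITS_PER_TRANSFER = 32       # assumed bits shifted to each device
CS_OVERHEAD_S = 1e-6         # assumed CS* assert/deassert time per device


def round_robin_time(n_devices: int) -> float:
    """Seconds needed to talk to every device once, sequentially."""
    per_device = BITS_PER_TRANSFER / SPI_CLOCK_HZ + CS_OVERHEAD_S
    return n_devices * per_device


total = round_robin_time(12)
print(f"12 devices: {total * 1e6:.1f} us per full pass")
```

If the result is comfortably shorter than the time window in which all 12 have to be updated, one SPI channel plus 12 CS* lines is probably enough.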
OK, how about MANY uCs? They all talk over, say, RS485 to a host like a Pi (or PC). Each has an address so the Pi can tell each uC what to do next and delay telling it to start until setup is complete. Then the Pi sends some weird packet (like with a broadcast address) and each uC starts doing as instructed. One advantage: If the uC is supposed to do what it did the last time (same bit pattern), that information doesn't need to be resent from the Pi.
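Here is a minimal sketch of that scheme, simulated in plain Python with no real serial port. The frame layout, the command codes, the broadcast address 0xFF and the `Node` class are all invented for illustration; a real RS485 link would also need line turnaround handling and a stronger check than an additive checksum.

```python
import struct

BROADCAST_ADDR = 0xFF   # hypothetical "everyone listens" address
CMD_LOAD = 0x01         # hypothetical: store a bit pattern for the next run
CMD_GO = 0x02           # hypothetical: start running the stored pattern


def make_packet(addr: int, cmd: int, payload: bytes = b"") -> bytes:
    """Frame: address, command, payload length, payload, 8-bit additive checksum."""
    body = struct.pack("BBB", addr, cmd, len(payload)) + payload
    return body + bytes([sum(body) & 0xFF])


class Node:
    """Stand-in for one uC hanging on the bus."""

    def __init__(self, addr: int):
        self.addr = addr
        self.pattern = b""   # last pattern received, kept between runs
        self.runs = []       # record of what was "executed" on each GO

    def receive(self, packet: bytes) -> None:
        body, checksum = packet[:-1], packet[-1]
        if sum(body) & 0xFF != checksum:
            return  # drop corrupted frames
        addr, cmd, length = struct.unpack("BBB", body[:3])
        if addr not in (self.addr, BROADCAST_ADDR):
            return  # not for us
        if cmd == CMD_LOAD:
            self.pattern = body[3:3 + length]
        elif cmd == CMD_GO:
            self.runs.append(self.pattern)


nodes = [Node(a) for a in (1, 2)]


def bus_send(packet: bytes) -> None:
    """Every node sees every packet, as on a shared RS485 pair."""
    for node in nodes:
        node.receive(packet)


# First cycle: set up each node, then trigger all of them at once.
bus_send(make_packet(1, CMD_LOAD, b"\xAA"))
bus_send(make_packet(2, CMD_LOAD, b"\x55"))
bus_send(make_packet(BROADCAST_ADDR, CMD_GO))

# Second cycle: only node 2 changes, so node 1 is not resent anything.
bus_send(make_packet(2, CMD_LOAD, b"\x0F"))
bus_send(make_packet(BROADCAST_ADDR, CMD_GO))
```

After the second broadcast, node 1 has run the same pattern twice without the host resending it, which is the bandwidth saving mentioned above.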
Maybe spend some time at the FTDI site and look over what they can do with USB.
USB HID is easy to implement and there are 512 bits in a packet. It's easy on the PC end as well. Look at
https://www.pjrc.com/teensy/rawhid.html
The new Teensy 4.0 is blazing fast!
So, you have one RawHid gadget taking care of all discrete IO, 12 FTDI gadgets taking care of USB->SPI and whatever else. It would be better if you could use fewer SPI gadgets and more CS* lines.
The downside is you can only guarantee 1000 packets per second. That might be too slow.
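To put a number on "too slow", here is the arithmetic using only the two figures above (512 bits per packet, 1000 guaranteed packets per second). The variable names are mine.

```python
PACKET_BITS = 512            # RawHID packet size quoted above
PACKETS_PER_SECOND = 1000    # guaranteed packet rate quoted above

packet_bytes = PACKET_BITS // 8
bytes_per_second = packet_bytes * PACKETS_PER_SECOND
bits_per_second = PACKET_BITS * PACKETS_PER_SECOND
packet_interval_ms = 1000 / PACKETS_PER_SECOND

print(f"{packet_bytes} bytes per packet")
print(f"{bytes_per_second} bytes/s ({bits_per_second / 1000:.0f} kbit/s) guaranteed")
print(f"one packet every {packet_interval_ms:.0f} ms")
```

So the question is whether everything you need to tell the discrete IO fits in 64 bytes per millisecond, and whether a 1 ms granularity on updates is acceptable.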