Forgive me if I get something wrong; I really haven't read the whole thread, just skimmed it (always dangerous). But if you simply wish to have lots of inputs and lots of outputs, then I see no need to tristate anything.
This is especially true if you are using a Teensy, which has more than one SPI port: you can dedicate one SPI port to the output and input shift registers (74HC595 for output and 74HC165 for input). I'm not sure why there is talk of buffers, tristating, and extra logic. Sure, you'll need an extra latch pin, but this doesn't have to be part of the SPI port; people often bit-bang the SS/CS line anyway. You'll also need an extra line for the 165s to latch their inputs into the register. I would daisy-chain the 595s on MOSI and the 165s on MISO. There are a few ways to do this, but having tristate buffers for each shift register, just so that you can read in from the 165s without writing/updating the output register, seems overkill.
I think perhaps you're concentrating too much on the SPI spec rather than just using SPI to drive shift registers (whose control lines can differ from those of actual SPI). The SPI port, in master mode, is just being used as a shift register and clock generator.
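To make that concrete, here is a rough host-side Python model (not firmware, and not something taken from the thread) of 595s daisy-chained on MOSI and 165s daisy-chained on MISO sharing one clock. The names `ShiftRegisterChain` and `refresh` are made up for illustration. It works at byte level only, so it ignores bit order and SPI mode, and it assumes the serial input of the last 165 is tied low.

```python
class ShiftRegisterChain:
    """Byte-level model: 74HC595s chained on MOSI, 74HC165s chained on MISO."""

    def __init__(self, n_out, n_in):
        self.shift_out = [0] * n_out  # 595 shift stages, index 0 = chip nearest MOSI
        self.outputs = [0] * n_out    # 595 output latches (what the pins drive)
        self.inputs = [0] * n_in      # levels on the 165 parallel pins, index 0 = chip nearest MISO
        self.shift_in = [0] * n_in    # 165 shift stages

    def load_inputs(self):
        """Pulse the 165 load line: copy the parallel pins into the shift stages."""
        self.shift_in = list(self.inputs)

    def transfer(self, tx_byte):
        """One 8-clock SPI transfer: a byte goes out on MOSI while one comes in on MISO."""
        rx_byte = self.shift_in[0]
        # Assumption: the serial input of the last 165 is tied low, so zeros shift in.
        self.shift_in = self.shift_in[1:] + [0]
        # The new byte enters the first 595; the last byte in the chain falls off the end.
        self.shift_out = [tx_byte] + self.shift_out[:-1]
        return rx_byte

    def latch_outputs(self):
        """Pulse the 595 latch line: copy the shift stages to the output pins."""
        self.outputs = list(self.shift_out)


def refresh(chain, out_bytes, latch=True):
    """Send out_bytes to the 595s and return the bytes read from the 165s."""
    n_in = len(chain.inputs)
    n = max(len(out_bytes), n_in)
    # The byte for the farthest 595 goes first; leading padding falls off the chain.
    tx = [0] * (n - len(out_bytes)) + list(reversed(out_bytes))
    chain.load_inputs()                    # extra line no. 1: 165 load
    rx = [chain.transfer(b) for b in tx]   # both chains are clocked together
    if latch:
        chain.latch_outputs()              # extra line no. 2: 595 latch
    return rx[:n_in]


chain = ShiftRegisterChain(n_out=2, n_in=3)
chain.inputs = [0x11, 0x22, 0x33]
read_back = refresh(chain, [0xAA, 0x55])
print([hex(b) for b in read_back], [hex(b) for b in chain.outputs])
```

Calling `refresh(chain, [0, 0], latch=False)` reads the 165s again while leaving `chain.outputs` alone. The 595 output pins only change when the latch line is pulsed, so a read-only pass needs no tristate buffers, only a skipped latch pulse.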
From what I have read (again, I could be wrong!) there is some confusion because lots of terms can be interchangeable depending on the application: /OE, /CS, /SS, latch, RCLK, clock enable, etc.
Diagrams are much better.