With cleverness, chip select and load can be combined to get the interface down to 4 pins for clock, data out, data in, and CS/Load. On the programming side, the interface is polled through an interrupt service routine slaved to a timer which is probably acting as the local CPU interrupt task timer anyway. The ISR handles debouncing since it already has a stable time interval to work with. The main program accesses the input and output state through shadow registers.
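A minimal sketch of the debouncing step, assuming the timer ISR hands over one raw byte of inputs per tick. The names `debounce_tick`, `debounced_inputs` and `STABLE_TICKS` are made up for illustration, and the four-sample threshold is an arbitrary assumption, not something from the post:

```c
#include <stdint.h>

#define STABLE_TICKS 4u  /* assumed: accept a value after 4 identical samples */

/* Shadow register: debounced input state, read by the main program. */
volatile uint8_t debounced_inputs;

/* Call once per timer tick with the raw byte just read from the inputs.
 * Note: the whole byte is debounced together, so a bounce on any one
 * input restarts the count for all eight. */
void debounce_tick(uint8_t raw)
{
    static uint8_t last_raw;      /* most recent raw sample */
    static uint8_t stable_count;  /* consecutive samples equal to last_raw */

    if (raw != last_raw) {
        last_raw = raw;           /* something changed: restart the count */
        stable_count = 1u;
    } else if (stable_count < STABLE_TICKS) {
        if (++stable_count == STABLE_TICKS) {
            debounced_inputs = raw;  /* stable long enough: publish it */
        }
    }
}
```

Because the tick interval is fixed, the debounce time is simply the tick period times `STABLE_TICKS`.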
I'm already doing that to light up the pushbuttons' LEDs, using a 74HC595.
CPU interrupt task timer - didn't know what to call it, thanks.
On these little microcontrollers, I always end up dedicating one timer to generating a periodic "upkeep" interrupt which counts down to trigger a slower "upkeep", etc. That allows various upkeep routines to be executed at different time intervals while only taking one timer peripheral.
Often the upkeep routines will continue to run even if the main program crashes so basic monitor functionality can be built in for diagnostic purposes.
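The cascaded countdown described above might look roughly like this. It is only a sketch: the function names, the counters and the divide-by-10 ratios are assumptions for illustration, and the routine bodies are placeholders so the structure can be checked off-target.

```c
#include <stdint.h>

/* Hypothetical run counters standing in for real upkeep work. */
volatile uint32_t fast_runs, medium_runs, slow_runs;

static void upkeep_fast(void)   { fast_runs++;   }  /* e.g. scan shift registers */
static void upkeep_medium(void) { medium_runs++; }  /* e.g. blink LEDs */
static void upkeep_slow(void)   { slow_runs++;   }  /* e.g. heartbeat, monitor */

#define MEDIUM_DIVIDER 10u  /* assumed: every 10th fast tick */
#define SLOW_DIVIDER   10u  /* assumed: every 10th medium tick */

/* Call this from the single periodic timer interrupt. */
void timer_tick_isr(void)
{
    static uint8_t medium_count = MEDIUM_DIVIDER;
    static uint8_t slow_count   = SLOW_DIVIDER;

    upkeep_fast();                     /* runs on every tick */

    if (--medium_count != 0u) return;  /* count down to the slower level */
    medium_count = MEDIUM_DIVIDER;
    upkeep_medium();

    if (--slow_count != 0u) return;    /* and again to the slowest level */
    slow_count = SLOW_DIVIDER;
    upkeep_slow();
}
```

With a 1 ms timer and these dividers, the three levels would run at 1 ms, 10 ms and 100 ms from one timer peripheral.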
The main program accesses the input and output state through shadow registers.
Or in other words the main program would be a state machine with the state set by the value of the inputs and/or the task timer value? That would be my approach. I want to be able to read pushbutton presses and changes in the encoder states while communicating with SPI & I2C peripherals and updating a UC1701 based display in parallel mode.
I do not know if they still use the term "shadow register" but it was common back when the 6502 was used. It is just a global memory location which is used to communicate between the main routine and any interrupt routines or hardware. A shadow register often mirrors an actual hardware register. On the Atari 800, there was a whole set of shadow registers which were copied out to hardware registers on every jiffy (1/60th of a second, linked to the video vertical refresh rate) by an ISR. Software could not read the output-only hardware registers but could read the shadow registers to find out what the currently programmed hardware state was. These days you can usually read the output state directly, so the shadow register is built in.
A set of shadow registers for instance could be copied in and out to hardware periodically by the ISR (or hardware via DMA) to update the display. For reading keyboards, I implemented a couple of state registers and shadow registers so on a detected keypress down, the bits in a shadow register are set by the ISR for the main program to read and reset. Shadow registers for counting up and down could be used for keypress controlled continuous value scrolling or whatever.
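For the keypress case, one way to sketch the state register plus shadow register idea is below. The names `key_state`, `key_pressed`, `keys_isr_update` and `keys_take_pressed` are hypothetical, and the input is assumed to be already debounced with 1 meaning "key down".

```c
#include <stdint.h>

volatile uint8_t key_state;    /* state register: current key levels, 1 = down */
volatile uint8_t key_pressed;  /* shadow register: latched "went down" events */

/* Called from the ISR with the current debounced key byte. */
void keys_isr_update(uint8_t debounced)
{
    /* Bits that are down now but were up on the previous call. */
    uint8_t went_down = (uint8_t)(debounced & (uint8_t)~key_state);

    key_pressed |= went_down;  /* set by the ISR, stays set until main clears it */
    key_state = debounced;
}

/* Called from the main program: read the selected events and reset them.
 * On a real target this read-modify-write should run with the timer
 * interrupt masked so an event cannot be lost in between. */
uint8_t keys_take_pressed(uint8_t mask)
{
    uint8_t hits = (uint8_t)(key_pressed & mask);

    key_pressed &= (uint8_t)~hits;
    return hits;
}
```

The main program never has to catch the key at the right moment; it just collects whatever presses accumulated since it last looked.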
As I already have a 74HC595 and cascading them doesn't require extra MCU pins I think I'll go with something similar to the solution presented by dannyf here.
The 74HC595 can also be cascaded with the 74HC165 so inputs and outputs are done simultaneously.
This timer/ISR/shadow register method of programming would probably not work well for reading encoders for performance reasons; they would have to be handled separately, although I might use shadow registers between the encoder routines and one level of the CPU housekeeping interrupt tasks.
I prefer using SSI type shift register logic for I/O expanders controlled with an SPI interface which can be bit-banged if necessary. With cleverness, chip select and load can be combined to get the interface down to 4 pins for clock, data out, data in, and CS/Load.
You can combine data in & out on one pin with a resistor. And possibly also share clock or latch with another function.
Or drive latch via an RC filter from clock to make it automatically latch after receiving a clock burst.
I almost wrote about combining the input and output using a resistor and tristate capability of the microcontroller I/O pin but this excludes the possibility of using the existing SPI peripheral for faster access.
I never bothered deriving the latch from the clock because I worried that the read latch would come too late. I always included 74165s and 74595s in series so CS falling latched the inputs and CS rising latched the outputs.
This way of doing things, however, also means every read is a write and every write is a read, which explains the need for shadow registers to hold output and input state.
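As a rough sketch of that exchange for one 74165 plus one 74595 on the same chain: the hardware is replaced here by small stand-in functions (`cs_low`, `cs_high`, `spi_transfer` and the `sim_*` variables are all hypothetical) so the flow can be followed and checked off-target. How CS is actually wired to the load and latch pins is not shown in the thread and is not modeled.

```c
#include <stdint.h>

volatile uint8_t outputs_shadow;  /* main program writes the desired outputs here */
volatile uint8_t inputs_shadow;   /* main program reads the last inputs here */

/* --- Stand-ins for the hardware (assumptions, not a real driver) --- */
uint8_t sim_input_pins;            /* levels on the input register's pins */
uint8_t sim_output_pins;           /* levels on the output register's pins */
static uint8_t sim_in_shift;       /* contents of the input shift register */
static uint8_t sim_out_shift;      /* contents of the output shift register */

static void cs_low(void)  { sim_in_shift = sim_input_pins; }   /* CS falling latches inputs */
static void cs_high(void) { sim_output_pins = sim_out_shift; } /* CS rising latches outputs */

static uint8_t spi_transfer(uint8_t tx)  /* one byte out, one byte in */
{
    sim_out_shift = tx;
    return sim_in_shift;
}

/* One bus cycle, e.g. called from the periodic timer ISR.
 * The same eight clocks both write the outputs and read the inputs. */
void io_exchange(void)
{
    cs_low();
    inputs_shadow = spi_transfer(outputs_shadow);
    cs_high();
}
```

Since every cycle rewrites the outputs, the main program changes an output by editing `outputs_shadow` and waiting for the next exchange, never by touching the bus itself.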