I am a hobbyist with no FPGA experience at all, but I would look at the Lattice ICE40LP384-SG32, which Mouser sells for 2€ in singles and has plenty in stock. I believe, but am not certain, that it is supported by the open-source IceStorm tools too.
I would use five pins between the microcontroller and the FPGA: SCLK, MISO, MOSI, /CS (optional), and a separate pin to reset the FPGA multiplexer if a SPI transfer is interrupted.
Each transfer would begin with a 32-bit routing word. The first 8 bits would control eight separate /CS pins, optionally also connected to the enable pins of TI TXU0304 voltage level shifters (allowing one to electrically disconnect a SPI sub-bus). The next two bits would select the SPI mode, the next four bits the word size (4 to 16 bits), and the last 18 bits the transfer size in words, for 8 + 2 + 4 + 18 = 32 bits in total. (The maximum single transfer would then be 2^18 = 262,144 words. Alternatively, use the last 22 bits, i.e. the word-size and transfer-size fields together, for the transfer size in bits.)
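To make the routing word concrete, here is a minimal Python sketch that packs and unpacks it. The bit order (most significant bit first, /CS mask in the top byte) and the "minus one" encodings of the word size and word count are my assumptions; storing the count minus one is what makes all 262,144 words reachable with 18 bits. `Route`, `pack_route` and `unpack_route` are hypothetical names, not part of any existing design.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed layout of the 32-bit routing word, most significant bit first:
#   bits 31..24  /CS mask (bit set = assert that sub-bus chip select)
#   bits 23..22  SPI mode 0..3
#   bits 21..18  word size minus one (4..16 bits -> 3..15)
#   bits 17..0   word count minus one (1..262144 words)
CS_SHIFT, MODE_SHIFT, SIZE_SHIFT = 24, 22, 18
COUNT_MASK = (1 << 18) - 1


@dataclass(frozen=True)
class Route:
    cs_mask: int    # which of the eight /CS lines to assert
    mode: int       # SPI mode 0..3
    word_bits: int  # word size, 4..16 bits
    words: int      # transfer length, 1..262144 words


def pack_route(r: Route) -> int:
    """Encode a Route as a 32-bit routing word."""
    if not 0 <= r.cs_mask <= 0xFF:
        raise ValueError("cs_mask must fit in 8 bits")
    if not 0 <= r.mode <= 3:
        raise ValueError("mode must be 0..3")
    if not 4 <= r.word_bits <= 16:
        raise ValueError("word_bits must be 4..16")
    if not 1 <= r.words <= COUNT_MASK + 1:
        raise ValueError("words must be 1..262144")
    return ((r.cs_mask << CS_SHIFT)
            | (r.mode << MODE_SHIFT)
            | ((r.word_bits - 1) << SIZE_SHIFT)
            | (r.words - 1))


def unpack_route(word: int) -> Optional[Route]:
    """Decode a routing word; returns None for the all-zero no-op word."""
    word &= 0xFFFF_FFFF
    if word == 0:
        return None
    word_bits = ((word >> SIZE_SHIFT) & 0xF) + 1
    if word_bits < 4:
        raise ValueError("word size field encodes fewer than 4 bits")
    return Route(cs_mask=word >> CS_SHIFT,
                 mode=(word >> MODE_SHIFT) & 0x3,
                 word_bits=word_bits,
                 words=(word & COUNT_MASK) + 1)
```

For example, asserting only the first /CS line for four 8-bit words in SPI mode 0 gives the routing word 0x011C0003. Because a valid word size always puts at least 3 into the size field, no real transfer can ever encode to the all-zero no-op word.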
The bits that follow would be passed through the FPGA to the selected sub-bus. After the last word, all /CS pins would be deasserted and the input word size reset (to 8 or 16 bits, it doesn't matter which) and, if possible, the bus-side SCLK, MISO, and MOSI tri-stated. For synchronization purposes, a routing word of all zeros should be a no-operation.
On the bus side there would be 11 pins: SCLK, MISO, MOSI, and eight /CS pins. Passing the clock and data signals through the FPGA ensures that the /CS pins stay in sync with them. Internally, the FPGA can implement a FIFO of up to 32 SPI clock cycles (meaning the output lags the input by 32 bits), clocked by the input SCLK, as long as either the microcontroller appends 32 extra SPI clock cycles after the last transfer, or the FPGA can synthesize those 32 extra clock cycles at the same frequency as the input SCLK. (Sending the all-zero routing word supplies exactly those 32 clock cycles, so it is not needed in the middle of a DMA transfer, only at the end.)
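The 32-bit delay can be modelled as a shift register clocked by the input SCLK. This hypothetical Python sketch (`DelayLine` and `pass_through` are my own names) shows why the extra 32 clock cycles are needed: without them, the last 32 bits never leave the pipeline. The flush bits here are zeros, i.e. the same bit pattern as the all-zero routing word.

```python
from collections import deque
from typing import List

DELAY = 32  # pipeline depth in SPI clock cycles


class DelayLine:
    """Bit-level model of a shift register clocked by the input SCLK.

    Each clock() shifts one bit in and returns the bit that entered
    `depth` clocks earlier; the stages start out as zeros.
    """

    def __init__(self, depth: int = DELAY) -> None:
        self.stages = deque([0] * depth, maxlen=depth)

    def clock(self, bit_in: int) -> int:
        bit_out = self.stages[0]        # oldest stage leaves the pipeline
        self.stages.append(bit_in & 1)  # maxlen drops the oldest stage
        return bit_out


def pass_through(bits: List[int], flush: int = DELAY) -> List[int]:
    """Clock `bits` through the delay line, followed by `flush` zero bits.

    The first DELAY output bits are the line's initial contents, so they
    are discarded; only bits that have fully emerged are returned.
    """
    line = DelayLine(DELAY)
    out = [line.clock(b) for b in list(bits) + [0] * flush]
    return out[DELAY:]
```

With the default flush, `pass_through(bits)` returns `bits` unchanged; with `flush=0` and an input longer than 32 bits, it returns only `bits[:-32]`, which is exactly the data that would be stranded inside the FPGA.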
So, it really should not be a complicated state machine at all.
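As a rough illustration of how small that state machine is, here is a hypothetical bit-level Python model of the control logic. It assumes the field widths described above, laid out MSB first as an 8-bit /CS mask, 2-bit mode, 4-bit word size minus one, and 18-bit word count minus one (the minus-one encodings are my assumption). It treats the all-zero word as a no-op and models the reset pin as a `reset()` call. The delay line and SPI-mode handling are left out, so this only sketches the /CS sequencing; a real implementation would of course be written in an HDL.

```python
class SpiRouter:
    """Hypothetical bit-level model of the multiplexer control logic.

    Assumed routing-word layout (MSB first): 8-bit /CS mask, 2-bit SPI mode,
    4-bit (word size - 1), 18-bit (word count - 1). Only /CS timing is modelled.
    """
    ROUTE_BITS = 32

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        """Dedicated reset pin: abandon any half-finished transfer."""
        self.shift = 0      # routing word being assembled, MSB first
        self.count = 0      # routing-word bits received so far
        self.remaining = 0  # payload bits left in the current transfer
        self.cs_mask = 0    # /CS lines asserted during the payload (1 = asserted)

    def clock(self, mosi: int) -> int:
        """Process one SCLK cycle; return the eight /CS levels (bit set = high)."""
        if self.remaining:                  # PASS: bit is forwarded to the sub-bus
            levels = ~self.cs_mask & 0xFF   # /CS is active low
            self.remaining -= 1
            if self.remaining == 0:         # transfer finished: release every /CS
                self.cs_mask = 0
            return levels
        # ROUTE: collect routing-word bits while all /CS lines stay high
        self.shift = ((self.shift << 1) | (mosi & 1)) & 0xFFFF_FFFF
        self.count += 1
        if self.count == self.ROUTE_BITS:
            word, self.shift, self.count = self.shift, 0, 0
            if word != 0:                   # the all-zero word is a no-op
                self.cs_mask = word >> 24
                word_bits = ((word >> 18) & 0xF) + 1
                words = (word & 0x3FFFF) + 1
                self.remaining = word_bits * words
        return 0xFF
```

Feeding it the routing word 0x011C0001 (first /CS, two 8-bit words) keeps every /CS high for 32 clocks, drives the first /CS low for the next 16 clocks, and then returns to collecting the next routing word.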
I did wonder whether one could use one of the cheap (<$1) ARM Cortex-M0/M0+ MCUs for this instead. The HK32F030M and CW32F003 both have only one SPI, but the STM32G030F6 (0.44€ in singles at LCSC, 1€ at Mouser) has two. The catch with this approach is that the bus-side SCLK frequency must be equal to or higher than the SCLK frequency from the main microcontroller, perhaps by clocking the bridge MCU from the main microcontroller's oscillator, or by somehow gating the main microcontroller's SCLK through as the bus SCLK. Otherwise, data piles up in the bus multiplexer: unlike I²C, SPI has no clock stretching to throttle the data.
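A quick back-of-the-envelope model of that pile-up, assuming a hypothetical bridge MCU that buffers bits in RAM between its two SPI peripherals: while the main MCU clocks bits in at f_in and the bus side drains them at f_out, the backlog grows by (1 - f_out/f_in) bits per input bit. The function names, buffer size and clock rates below are illustrative only.

```python
import math


def max_burst_bits(buffer_bytes: int, f_in_hz: float, f_out_hz: float) -> float:
    """Longest input burst (in bits) before a buffer of buffer_bytes overflows.

    Bits arrive at f_in_hz and leave at f_out_hz, so the backlog grows by
    (f_in - f_out) / f_in bits for every bit clocked in.
    """
    if f_out_hz >= f_in_hz:
        return math.inf  # the bus side keeps up, so nothing piles up
    return buffer_bytes * 8 * f_in_hz / (f_in_hz - f_out_hz)


def time_to_overflow(buffer_bytes: int, f_in_hz: float, f_out_hz: float) -> float:
    """Seconds of continuous input before the buffer overflows."""
    return max_burst_bits(buffer_bytes, f_in_hz, f_out_hz) / f_in_hz
```

For example, with 12 MHz in, 8 MHz out and a 1 KiB buffer, `max_burst_bits(1024, 12e6, 8e6)` gives 24576 bits (3 KiB), about 2 ms of continuous input; any longer DMA transfer would overflow, which is why the bus-side clock must not be slower.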