The Cirrus Logic CS8422 chip can receive S/PDIF and AES3 input and do sample rate conversion to a master clock (and even recover the original sample clock), so you'd need three of those. Add either a small FPGA, for example an iCE40 (I like the Olimex dev boards) running at 32×3.072 = 98.304 MHz, or a microcontroller running at a suitable multiple of 3.072 MHz so it can use e.g. its SPI MOSI pin with a 3.072 MHz clock to generate the outgoing S/PDIF data stream, and it ought to work.
Upside is that those CS8422 chips are quite capable, so things like sample depth (20 or 24 bit) or sample rate (44100 Hz or 48000 Hz) would be no problem.
Downside is those CS8422 chips cost about 18€ apiece at Mouser (but they are in stock). Mouser does also have TOTX1350(F) transmitters (11€ apiece) and TORX1355(F) receivers (14€ apiece) in stock. Just the main components would cost over 100 euros, and that's excluding the microcontroller or FPGA or any passives needed. Ouch.
Cliff manufactures (or used to manufacture) compatible Toslink transmitters and receivers; Conrad, for example, sells them for 3.70€ (FCR684208T transmitter) and 4.80€ (FC684208R receiver) apiece. TME has some OTJ-5 FCR684205T transmitters in stock for 1.71€ apiece (!) and some ORJ-3 FCR6842032R receivers for 3€ apiece, so those can be found at a more reasonable price. Micro-semiconductor.com claims they have CS8422 in 32QFN in stock for under $5 apiece, but for orders under $1000, shipping will cost $60-$110.
A DIY approach therefore looks doable at a reasonable budget, quite possibly with features not found even in the pro equipment. It would be quite an interesting project, but not a small undertaking.
With a fast FPGA clocked from a multiple of 3.072 MHz, and all S/PDIF sources using a 48 kHz sample rate, one ought to be able to do it all within the FPGA. By keeping track of the differential Manchester code (biphase mark code) cycle length, measured in FPGA clocks, one should be able to determine the clock drift.
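To make the drift idea concrete, here is a rough sketch of the arithmetic in Python (the FPGA would do the same thing with counters). It assumes the 98.304 MHz clock mentioned earlier, so one 48 kHz frame is nominally 2048 FPGA clocks; the function names and the idea of counting clocks over a whole number of frames are my own illustration, not a tested design.

```python
FPGA_HZ = 98_304_000        # 32 x 3.072 MHz, the FPGA clock suggested above
NOMINAL_FS = 48_000         # assumed nominal sample rate of every source
CLOCKS_PER_FRAME = FPGA_HZ // NOMINAL_FS   # FPGA clocks per sample period


def drift_ppm(measured_clocks, frames):
    """Source clock offset in ppm; positive means the source runs fast.

    measured_clocks: FPGA clocks counted while `frames` S/PDIF frames arrived.
    """
    expected = frames * CLOCKS_PER_FRAME
    return (expected - measured_clocks) / measured_clocks * 1e6


def slipped_samples(measured_clocks, frames):
    """Samples the source gained (+) or lost (-) relative to the local clock."""
    return frames - measured_clocks / CLOCKS_PER_FRAME


if __name__ == "__main__":
    # Hypothetical measurement: 10000 frames arrived in the time of 9999.
    clocks = CLOCKS_PER_FRAME * 9999
    print(drift_ppm(clocks, 10_000), slipped_samples(clocks, 10_000))
```

When slipped_samples() passes +1 or -1 for a source, that source's buffer is a full sample ahead of or behind the local clock.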
One could use a modest-sized FIFO queue of samples, and when the drift exceeds a full sample, do a fixed-size one-sample expansion or contraction resampling within the buffer. It isn't as perfect as continuous FFT-based resampling, but I bet the distortion caused can be low enough to be within the inherent noise levels. It's the 44.1kHz to 48kHz resampling that is annoying, because it's essentially a 147→160 sample conversion.
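A minimal sketch of what I mean by one-sample expansion or contraction, in Python for readability. It stretches a block of buffered samples to one sample more or fewer using plain linear interpolation; the function name, the block-based approach, and the choice of linear interpolation are assumptions made for illustration, and a real design might well use something better. The first line just confirms the 147:160 ratio.

```python
from fractions import Fraction

# 44100/48000 reduces to 147/160: 147 input samples per 160 output samples.
RATIO_44K1_TO_48K = Fraction(44_100, 48_000)


def stretch_block(block, new_len):
    """Linearly resample `block` to `new_len` samples, keeping both endpoints.

    Use new_len = len(block) - 1 when the source has run one sample ahead,
    and new_len = len(block) + 1 when it has fallen one sample behind.
    Both len(block) and new_len must be at least 2.
    """
    n = len(block)
    out = []
    for k in range(new_len):
        pos = k * (n - 1) / (new_len - 1)   # fractional read position
        i = min(int(pos), n - 2)            # left neighbour, clamped at the end
        frac = pos - i
        out.append(block[i] * (1 - frac) + block[i + 1] * frac)
    return out


if __name__ == "__main__":
    print(RATIO_44K1_TO_48K)                          # 147/160
    print(stretch_block([0.0, 1.0, 2.0, 3.0, 4.0], 4))  # one sample shorter
```

The longer the block, the smaller the pitch change each correction causes, at the cost of more buffering latency.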
An alternative would be to just do continuous resampling using cubic interpolation, so that each output sample is determined by four input samples, per channel. Using Hermite splines, the response is pretty good (with aliasing well below -40 dB), so the actual audio quality should be good to acceptable; just perhaps not "excellent". Sinc interpolation would give much better results (in theory, perfect), but the normalized sinc function needed is quite computationally heavy and extends far (it needs many samples), especially compared to cubic Hermite splines, which are just third-degree polynomials.
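For reference, here is a small Python sketch of four-point cubic Hermite interpolation, using the Catmull-Rom choice of tangents (that specific variant is my assumption; other tangent choices exist). The resample() helper and its step parameter are hypothetical names for illustration; step is the input sample rate divided by the output sample rate, e.g. 147/160 for 44.1 kHz to 48 kHz.

```python
def hermite4(frac, xm1, x0, x1, x2):
    """Catmull-Rom cubic Hermite value between x0 and x1, for 0 <= frac < 1.

    xm1, x0, x1, x2 are four consecutive input samples of one channel.
    """
    c1 = 0.5 * (x1 - xm1)
    c2 = xm1 - 2.5 * x0 + 2.0 * x1 - 0.5 * x2
    c3 = 0.5 * (x2 - xm1) + 1.5 * (x0 - x1)
    return ((c3 * frac + c2) * frac + c1) * frac + x0


def resample(samples, step):
    """Read `samples` at `step` input samples per output sample."""
    out = []
    pos = 1.0                       # start at 1 so samples[i - 1] exists
    while pos < len(samples) - 2:   # stop while samples[i + 2] still exists
        i = int(pos)
        out.append(hermite4(pos - i, samples[i - 1], samples[i],
                            samples[i + 1], samples[i + 2]))
        pos += step
    return out


if __name__ == "__main__":
    ramp = [float(n) for n in range(10)]
    print(resample(ramp, 147 / 160)[:4])
```

Each output sample costs only a handful of multiplies and adds per channel, which is the appeal compared to a long sinc kernel.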