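I've had success tying the output of an S/PDIF transmitter directly to an RS-485 driver.
The audio payload of S/PDIF is about 1.5 Mbit/s (the framed, biphase-mark coded signal on the wire is several times faster, so pick a driver rated for that speed), and Cat 5 is perfectly suited for long runs of wire with an RS-485 driver.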
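To put numbers on that bitrate, here is a rough sketch based on the standard S/PDIF framing (two 32-bit subframes per sample period, biphase-mark coded). The function name and the 16-bit stereo default are just my choices for illustration.

```python
# S/PDIF (IEC 60958) framing: 2 subframes per sample period, 32 bits each.
SUBFRAMES_PER_FRAME = 2
BITS_PER_SUBFRAME = 32


def spdif_rates(sample_rate_hz, audio_bits=16, channels=2):
    """Return (audio payload bit/s, framed bit/s, biphase-mark cells/s)."""
    payload = sample_rate_hz * audio_bits * channels
    framed = sample_rate_hz * SUBFRAMES_PER_FRAME * BITS_PER_SUBFRAME
    # Biphase-mark coding uses two cells per bit, so the driver has to
    # handle transitions at twice the framed bit rate.
    cells = 2 * framed
    return payload, framed, cells


for fs in (44_100, 48_000):
    payload, framed, cells = spdif_rates(fs)
    print(f"{fs} Hz: payload {payload / 1e6:.3f} Mbit/s, "
          f"framed {framed / 1e6:.3f} Mbit/s, "
          f"line cells {cells / 1e6:.3f} M/s")
```

So the roughly 1.5 Mbit/s figure is the 16-bit stereo payload; the RS-485 driver should be chosen for the cell rate, not the payload rate.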
On the other end of the wire, the signal is handled by two RS-485 transceivers.
One picks up the (possibly degraded) signal from the wire and outputs a logic signal. The other (the third in the chain, counting the driver at the source) sends the signal to the S/PDIF transformer in the remote equipment.
I actually use this with a distributed system, where the same S/PDIF data is sent to multiple remote nodes. RS-485 is perfectly suited for that. There are no delays added to the signal, except for the propagation delay along the cable run and through the drivers, which can all be safely neglected. (Grace Hopper said that a ns is about 30 cm long.) You can distribute high quality audio through a house with multiple taps and no chance of stuff ever getting out of sync.
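If you want to check for yourself that the delays are negligible, something like this gives a ballpark. The velocity factor and the per-transceiver delay are assumptions on my part (signals in cable travel slower than Hopper's 30 cm per ns in free space); take the real values from the cable and driver datasheets.

```python
C_M_PER_S = 299_792_458    # speed of light in vacuum
VELOCITY_FACTOR = 0.66     # ASSUMED for Cat 5; check the cable datasheet
DRIVER_DELAY_NS = 50       # ASSUMED per-transceiver delay; check the datasheet


def cable_delay_ns(length_m, velocity_factor=VELOCITY_FACTOR):
    """One-way propagation delay along the cable, in nanoseconds."""
    return length_m / (velocity_factor * C_M_PER_S) * 1e9


def chain_delay_ns(length_m, transceivers=3):
    """Cable delay plus the delay through each transceiver in the chain."""
    return cable_delay_ns(length_m) + transceivers * DRIVER_DELAY_NS


sample_period_ns = 1e9 / 48_000    # one sample period at 48 kHz
for length in (10, 100, 300):
    total = chain_delay_ns(length)
    print(f"{length:4d} m: {total:7.1f} ns "
          f"= {total / sample_period_ns:.3f} sample periods at 48 kHz")
```

With these assumed values, even a 300 m run stays well below a tenth of one sample period, so two nodes at different distances are still effectively in sync.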
This works very well, but it has to be designed properly. A wrong cable termination can completely demolish your signal.
But for you, standard Ethernet is probably better suited.
You can use microcontrollers with chips like the WIZnet W5500, but you have to calculate whether the whole chain can sustain the bitrate for the audio quality you need.
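The kind of calculation I mean looks roughly like this. It assumes uncompressed PCM sent in UDP/IPv4 packets, which is only one possible transport; the function names and the packet sizes tried are made up for the example.

```python
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble/SFD, MAC header, FCS, inter-frame gap (bytes)
IP_UDP_HEADERS = 20 + 8          # IPv4 without options + UDP (bytes); ASSUMED transport
MIN_ETH_PAYLOAD = 46             # shorter Ethernet payloads are padded to this size


def wire_bitrate(sample_rate_hz, bits_per_sample, channels, frames_per_packet):
    """Bit rate on the Ethernet wire for uncompressed PCM audio."""
    audio_bytes = frames_per_packet * channels * bits_per_sample // 8
    eth_payload = max(MIN_ETH_PAYLOAD, audio_bytes + IP_UDP_HEADERS)
    packets_per_s = sample_rate_hz / frames_per_packet
    return (eth_payload + ETH_OVERHEAD) * 8 * packets_per_s


def packet_time_ms(sample_rate_hz, frames_per_packet):
    """How much audio one packet holds; this much delay is added by buffering it."""
    return 1000 * frames_per_packet / sample_rate_hz


for frames in (12, 48, 240):
    rate = wire_bitrate(48_000, 16, 2, frames)
    print(f"{frames:3d} frames/packet: {rate / 1e6:.2f} Mbit/s on the wire, "
          f"{packet_time_ms(48_000, frames):.2f} ms of audio per packet")
```

Small packets cost bandwidth in headers, large packets cost latency. Whatever number comes out also has to fit through the slowest link in the chain, for example the SPI bus between the uC and the W5500, with margin to spare.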
Nowadays it is also common to have microcontrollers with built-in Ethernet, for example some of the STM32F4 series.
Ethernet does add some latency. This is in the order of ms, and unlikely to be an issue for you. But if you have a system in which audio and video are combined, you may lose lip sync if you send the audio over Ethernet. Talking mouths with no sound, or speech after the actor has shut his mouth, is annoying and destroys the movie experience.
Using a uC with Ethernet also gives you a lot of design flexibility. You can add compression / encryption if required (maybe your next customer requires it). You can also add other features, such as call buttons or blinking LEDs, or add a camera with motion detection and have it signal when someone starts walking towards the front door or when cars drive into a drive-in.
You also get the use of standard switches and routers to maintain signal integrity.
Part of Ethernet is signal tuning: the PHYs match driver and receiver and compensate for cable influences.
Ethernet also reduces the probability of a single point of failure. In the RS-485 system, if the cable is shorted anywhere, or even left open, you get impedance mismatches and the whole system collapses.
With Ethernet and uCs you can also easily add diagnostic data to your protocol. Plug in a cable and the box gives instant feedback on whether the remote audio node is connected and whether it works.
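As an illustration of diagnostic data in the protocol, here is a minimal sketch of a packet header with a sequence number and status flags in front of the audio. The whole layout (magic bytes, field sizes, flag names) is hypothetical, invented for this example and not taken from any existing protocol.

```python
import struct

# HYPOTHETICAL header: magic (2 bytes), sequence number (uint16),
# status flags (uint8), reserved (uint8), all in network byte order.
HEADER = struct.Struct("!2sHBB")
MAGIC = b"AU"
FLAG_AMP_OK = 0x01      # hypothetical: node reports its amplifier is healthy
FLAG_CLIPPING = 0x02    # hypothetical: node detected clipping


def build_packet(seq, flags, audio):
    """Prefix a block of audio bytes with the diagnostic header."""
    return HEADER.pack(MAGIC, seq & 0xFFFF, flags, 0) + audio


def parse_packet(packet):
    """Return (sequence number, flags, audio bytes) or raise ValueError."""
    magic, seq, flags, _reserved = HEADER.unpack_from(packet)
    if magic != MAGIC:
        raise ValueError("not an audio packet")
    return seq, flags, packet[HEADER.size:]


def lost_between(prev_seq, seq):
    """Packets missing between two received sequence numbers (wraps at 16 bits)."""
    return (seq - prev_seq - 1) & 0xFFFF


pkt = build_packet(7, FLAG_AMP_OK, b"\x00\x01\x02\x03")
seq, flags, audio = parse_packet(pkt)
print(seq, bool(flags & FLAG_AMP_OK), len(audio), lost_between(7, 10))
```

A receiver that counts gaps in the sequence numbers can report packet loss per node, and the flags give the "is it connected and does it work" feedback without any extra wiring.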
If you want to design a system with a total installed base of 3000 nodes, you want reliability. You do not want one of your customers calling you every week with problems. With 3000 nodes, development costs are also not so much of an issue. Lots of vendors of the bigger uCs have software stacks available for using the Ethernet controllers on their chips and for audio compression. Some are free, others have commercial licenses.
With Ethernet you can also add remote diagnostics. You can talk to the customer's hardware instead of to a frustrated customer over the phone.