For USB transfers, the usual solution is to do asynchronous mode transfers.
I know; I didn't even realize USB Audio uses isochronous transfers.
Similar issues do occur elsewhere, too. One is when you have more than one DAQ (because you are measuring many things at once) running at high sample rates, and you want to synchronize the data. I am not at all familiar with USB audio, but I do have experience on the software side with similar issues. OP mentioned jitter, and that made me think about how I would solve the clock discrepancies, based on my experience on the software side with sensor data.
For S/PDIF where the recovered clock might have more jitter than one might wish, the usual solution is to use asynchronous sample-rate conversion with the DAC side clocked by a low-jitter oscillator.
VLC (the video player) has a feature that allows one to resample audio to the achieved display frame rate. It is surprising how small a change in pitch is perceptible: in that case, the video frame rate was not stable, so the resampling ended up changing the audio pitch by accident. Very small changes are surprisingly easy to notice if the display frame rate is not rock-solid stable.
It should be noted that strictly speaking, the S/PDIF receiver generates a sample-rate clock. The DAC needs a modulator clock, at 128, 256, 512 times the sample rate, and that clock is generated by a PLL that uses the recovered sample-rate clock as a reference.
I would assume a separately controlled high-frequency clock, compared to the SPDIF clock, would yield a much better result from not-very-good-quality SPDIF sources.
Algorithmically, the idea is to decouple the generated clock from the original clock source, but control it to run stably at the average rate of the original source. This kind of synchronization issue is surprisingly common in high data rate networked computation. Mathematically, you can describe the follower as a damped oscillator (which is a pretty good description of a PLL based on the SPDIF clock), but unless you deliberately add some latency, maybe just a few dozen samples' worth, there will always be patterns that lead the follower to oscillate. In the algorithmic sense, it is a very interesting and surprisingly difficult problem; and from the benefits of that latency (or "temporal slop"), I know that a PLL cannot be the best option here.
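To make the "follow the average rate, not the instantaneous clock" idea concrete, here is a rough Python sketch of just the rate-estimation half. Everything in it is my own illustration, not anything from a real driver or DAC: the class name AverageRateFollower, the one-second window, the 48-sample blocks, and the jitter figures are all assumptions. The FIFO that would supply the latency is not modeled; the sketch only shows that weighting arrivals over a long window gives a rate estimate that mostly ignores per-block timing jitter.

```python
import random


class AverageRateFollower:
    """Estimate the long-term average sample rate of a jittery source.

    Hypothetical illustration: keeps exponentially decayed totals of
    samples received and time elapsed, and reports their ratio.
    """

    def __init__(self, nominal_rate, window_seconds=1.0):
        self.rate = float(nominal_rate)   # current estimate, samples/second
        self.window = float(window_seconds)
        self._samples = 0.0               # decayed sample count
        self._elapsed = 0.0               # decayed elapsed time
        self._last = None                 # previous arrival timestamp

    def update(self, arrival_time, block_samples):
        """Feed one block arrival; return the updated rate estimate."""
        if self._last is None:
            self._last = arrival_time
            return self.rate
        dt = arrival_time - self._last
        if dt <= 0.0:
            return self.rate              # ignore out-of-order timestamps
        # Forget old history gradually, with a time constant of `window`.
        keep = max(0.0, 1.0 - dt / self.window)
        self._samples = self._samples * keep + block_samples
        self._elapsed = self._elapsed * keep + dt
        self.rate = self._samples / self._elapsed
        self._last = arrival_time
        return self.rate


def simulate(true_rate=48024.0, nominal_rate=48000.0, block=48,
             jitter=20e-6, blocks=5000, seed=0):
    """Feed the follower blocks whose arrival times carry uniform jitter."""
    rng = random.Random(seed)
    follower = AverageRateFollower(nominal_rate)
    estimate = follower.rate
    for k in range(blocks):
        ideal = k * block / true_rate
        estimate = follower.update(ideal + rng.uniform(-jitter, jitter), block)
    return estimate


if __name__ == "__main__":
    print(simulate())
```

In a real system the estimate would steer a locally generated clock (or a resampling ratio), and the FIFO between source and consumer would absorb the short-term differences. A longer window gives a steadier estimate but needs a deeper FIFO, which is exactly the latency trade-off described above.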
Adjusting properties according to human perception, or psychoacoustic modeling in this case, is very interesting. Many people do not realize that even MP3 compression involves psychoacoustic modeling: the quantization noise is shaped to stay below the masking threshold estimated by the encoder's psychoacoustic model, which also takes the frequency-dependent threshold of hearing into account.