I've never done much with Audio on USB, but did a little search:
https://duckduckgo.com/html?q=USB+audio+class+2.0+
Among the first results that popped up was Cmedia, and I had a peek at the 16-page "datasheet" of the CM6610A.
They seem to think that high quality audio is just pushing more bits, and (as almost everybody seems to do nowadays) derive everything with a PLL from a cheap 12MHz crystal.
If you take audio seriously, then extremely stable and low jitter clocks are mandatory.
This seems to be a very often overlooked, or completely ignored fact.
If you want true 24-bit audio with a 192 kHz clock, then you need to keep long-term jitter in the DAC and ADC to within (in seconds):
>> 1/2^24/192000
ans = 3.1044e-13
Yep, that's 310 femtoseconds.
Edit:
Oops, error. The sample rate should probably not be part of this equation, but the maximum signal frequency is, so it probably should be:
>> 1/2^24/20000
ans = 2.9802e-12
Which is about 3 ps instead of a few hundred femtoseconds. It does not matter much though, as accurate calculations need actual signals instead of a simple sine, but the general idea still stands. For good audio you need very accurate clocks.
(Accuracy is not the main goal, stability is. But I'm afraid I'm getting a bit too anal here.)
It's very easy to deduce that you need this accuracy.
For a simple sine(), the derivative is a cosine(), which has the same maximum value of 1.
Which means that the amplitude resolution of the samples translates roughly one-to-one into the required time resolution.
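As a rough sketch of that sine argument in Python: the function names are made up for illustration, and the assumptions are mine, namely a full-scale sine spanning -1..+1, one LSB equal to 2/2^bits, and the 2*pi*f factor of the derivative kept in instead of normalized away. It is only meant to show that the order of magnitude holds up, not to be a proper jitter analysis.

```python
import math


def simple_jitter_bound(bits, f_max_hz):
    """Back-of-envelope estimate used above: 1 / 2^bits / f_max (seconds)."""
    return 1.0 / 2**bits / f_max_hz


def sine_slope_jitter_bound(bits, f_max_hz):
    """Timing error (seconds) that moves a full-scale sine by one LSB
    at its steepest point (the zero crossing).

    Assumptions (illustrative, not from a datasheet):
    - the sine spans -1..+1, so one LSB = 2 / 2^bits
    - the maximum slope of sin(2*pi*f*t) is 2*pi*f
    """
    lsb = 2.0 / 2**bits
    max_slope = 2.0 * math.pi * f_max_hz
    return lsb / max_slope


if __name__ == "__main__":
    for bits in (16, 24):
        simple = simple_jitter_bound(bits, 20_000)
        slope = sine_slope_jitter_bound(bits, 20_000)
        print(f"{bits} bit @ 20 kHz: simple {simple:.3e} s, sine slope {slope:.3e} s")
```

For 24 bits and 20 kHz the slope-based number comes out a bit under 1 ps, a factor of pi below the simple estimate, so it lands in the same ballpark and the conclusion does not change.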
Without the all-invading trend of hyper-compression, which demolishes most audio nowadays, audio regularly used to have a crest factor of 5 (or more), which makes it even worse.
Without acknowledging this, you will just design a dime-a-dozen mediocre audio gadget instead of a great-sounding device.