A 16 kHz signal passing through the interpolation coefficients, which ripple at the Nyquist rate, gets modulated and turns into 6 kHz + 38 kHz. The same modulation happens on the way out.
You must be doing something wrong. Filtering is convolution (in the time domain), not modulation. Convolution does not create new frequencies.
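To make the distinction concrete, here is a minimal sketch in plain Python. The 88.2 kHz simulation rate, the 3-tap low-pass and the 22 kHz carrier are my own assumptions, picked only so that the 6 kHz and 38 kHz figures from the quote fall on exact DFT bins. It filters a 16 kHz tone by convolution, then multiplies the same tone with a carrier, and lists which frequencies are present in each result.

```python
import cmath
import math

FS = 88_200   # simulation rate in Hz (assumed; high enough to represent 38 kHz)
N = 882       # number of samples, so one DFT bin is FS / N = 100 Hz

def tone(freq_hz):
    """N samples of a unit-amplitude cosine at freq_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / FS) for n in range(N)]

def present_freqs(x, rel_threshold=1e-6):
    """Frequencies (Hz) whose DFT magnitude exceeds rel_threshold * peak."""
    mags = []
    for k in range(N // 2 + 1):
        acc = sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        mags.append(abs(acc))
    peak = max(mags)
    return [k * FS // N for k, m in enumerate(mags) if m > rel_threshold * peak]

def fir_filter(x, taps):
    """Circular convolution of x with FIR taps (steady state, no edge transient)."""
    return [sum(h * x[(n - k) % N] for k, h in enumerate(taps)) for n in range(N)]

x = tone(16_000)

# Filtering = convolution: only the existing 16 kHz component is scaled and delayed.
filtered = fir_filter(x, [0.25, 0.5, 0.25])   # arbitrary low-pass taps (assumed)

# Modulation = sample-by-sample multiplication with a carrier: sum and difference appear.
carrier = tone(22_000)
modulated = [a * b for a, b in zip(x, carrier)]

print(present_freqs(filtered))    # [16000]
print(present_freqs(modulated))   # [6000, 38000]
```

Any other set of taps changes the amplitude and phase of the 16 kHz component, but never the list of frequencies.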
Transient-rich instruments such as percussion, piano, and strings sound different from the original when reproduced. We listeners don't notice it, but the musician does: to them it sounds like a different instrument.
The spectrum of many instruments does extend into the ultrasound region. So yes, then it cannot be reproduced exactly with 44.1 kSa/s digital audio. I never objected to that. I also pointed to an article which claimed that 192 kSa/s was preferred by listeners; the most prominent example in the article was a drum track, which definitely did contain frequencies >20 kHz.
The question of whether the reproduction still sounds "good enough" when the bandwidth is cut to 20 kHz is not an EE question, but a psychophysical one.
The EE question would be: Can a signal with a (say) 0...40 kHz occupied bandwidth be sampled at 44.1 kSa/s and then reconstructed almost perfectly? And the answer to this question is clearly "no".
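The reason for the "no" is aliasing: after sampling, any component above fs/2 = 22.05 kHz is indistinguishable from a component below it. A small sketch (the helper name `alias_frequency` and the 30 kHz test tone are just my illustration):

```python
import math

FS = 44_100  # CD sample rate in Sa/s

def alias_frequency(freq_hz, sample_rate_hz=FS):
    """Frequency in 0..fs/2 that freq_hz is indistinguishable from after sampling."""
    folded = freq_hz % sample_rate_hz
    return sample_rate_hz - folded if folded > sample_rate_hz / 2 else folded

print(alias_frequency(30_000))  # 14100
print(alias_frequency(40_000))  # 4100

# The sample values themselves are identical, so no reconstruction filter can tell them apart.
ultrasonic = [math.cos(2 * math.pi * 30_000 * n / FS) for n in range(100)]
audible = [math.cos(2 * math.pi * 14_100 * n / FS) for n in range(100)]
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(ultrasonic, audible))
print(same)  # True
```

This is why the recorder has to remove everything above Nyquist before sampling: once the 30 kHz component has landed on 14.1 kHz, it cannot be separated again.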
Here is an example. https://youtu.be/efKabAQQsPQ?si=8m-YEIMPdW0mmKsm
It's a very interesting demonstration, but IMO not a good example for the desired use case, as it mainly shows the impact of group delay in the context of mixing the filtered channel with other channels. This does not apply to a CD player, where both channels (L+R) are always filtered equally and therefore have the same group delay.
I want to remind you that we go to all this trouble to attenuate the image frequencies so that the speaker's nonlinearity cannot intermodulate them and generate audible frequencies. If the remedy generates other modulation products, we have only replaced one problem with another.
Again, a filter does not generate new frequencies.
At this stage, I am convinced that 44.1 kHz is good for 16 kHz at -3 dB.
For what purpose? For the player's reconstruction filter?
In practice, you of course need some headroom below Nyquist, so 22.05 kHz is definitely not feasible. But I consider 20 kHz (-0.1 dB) @ 44.1 kSa/s well feasible with today's technology and sufficient oversampling.
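As a rough feasibility check, here is a back-of-the-envelope sketch using fred harris' rule of thumb for FIR length, N ≈ (fs / transition width) × (attenuation in dB / 22). The 8x oversampling factor and the 100 dB image rejection are my assumptions, not a spec, and the rule ignores the passband ripple requirement, so read the result as an order of magnitude only.

```python
import math

def estimate_fir_taps(sample_rate_hz, passband_edge_hz, stopband_edge_hz, stopband_atten_db):
    """Rule-of-thumb FIR length: (fs / transition width) * (attenuation_dB / 22)."""
    transition_hz = stopband_edge_hz - passband_edge_hz
    return math.ceil(sample_rate_hz / transition_hz * stopband_atten_db / 22)

CD_RATE_HZ = 44_100
PASSBAND_EDGE_HZ = 20_000
# Lowest image of a 20 kHz component at the CD rate: 44.1 kHz - 20 kHz = 24.1 kHz.
STOPBAND_EDGE_HZ = CD_RATE_HZ - PASSBAND_EDGE_HZ
OVERSAMPLING = 8        # assumed oversampling factor of the digital filter
ATTENUATION_DB = 100    # assumed image rejection target

taps = estimate_fir_taps(OVERSAMPLING * CD_RATE_HZ, PASSBAND_EDGE_HZ,
                         STOPBAND_EDGE_HZ, ATTENUATION_DB)
print(taps)  # 392
```

A few hundred taps at 352.8 kSa/s is a modest load for current DAC chips or a small DSP. Pushing the passband edge toward 22.05 kHz shrinks the transition width toward zero and the estimate grows without bound, which is the headroom argument in numbers.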
Keep the following difference in mind:
The antialiasing filter on the recorder must modify the input signal if it happens to contain components >20 kHz. This modification can of course make an audible difference. And it may also make a difference how it alters the signal in order to limit the bandwidth.
But the player's reconstruction filter does not need to do that. It is never confronted with useful frequencies >20 kHz. They are not present on the CD, because the recorder has already filtered them out. So all it has to do is pass through 0...20 kHz unaltered and suppress the unwanted images sufficiently. It does not matter how the reconstruction filter at the player would hypothetically alter the original analog signal when it exceeds 20 kHz. This signal is never seen by the player.
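To put numbers on "the unwanted images": a sampled tone at frequency f repeats at k·fs − f and k·fs + f. A small sketch (the helper `image_frequencies` and the 100 kHz display limit are only for illustration):

```python
def image_frequencies(tone_hz, sample_rate_hz, max_hz):
    """Spectral images of a sampled tone: k*fs - f and k*fs + f, up to max_hz."""
    images = []
    k = 1
    while k * sample_rate_hz - tone_hz <= max_hz:
        for f in (k * sample_rate_hz - tone_hz, k * sample_rate_hz + tone_hz):
            if f <= max_hz:
                images.append(f)
        k += 1
    return images

print(image_frequencies(1_000, 44_100, 100_000))   # [43100, 45100, 87200, 89200]
print(image_frequencies(20_000, 44_100, 100_000))  # [24100, 64100, 68200]
```

With the content on the disc limited to 20 kHz, the lowest possible image sits at 24.1 kHz. So the player's filter only has to be flat up to 20 kHz and well down from 24.1 kHz onward; what it does in between is irrelevant, because nothing on the CD lands there.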