Let's expand on what TimFox described.
Conventional audio CDs contain metadata (including error correction) and uncompressed pulse-code modulated 16-bit stereo (two-channel) data sampled at 44100 samples per second. The maximum slew rate is therefore 2¹⁶ = 65536 quantization steps per 1/44100th of a second, or 65536 × 44100 = 2890137600 steps/second ≃ 2.89 quantization steps per nanosecond.
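As a sanity check on the arithmetic, the step-rate figures follow directly from the CD format parameters:

```python
# CD format parameters: 16-bit samples at 44100 samples/second.
QUANT_STEPS = 2 ** 16      # 65536 quantization steps across full scale
SAMPLE_RATE = 44100        # samples per second

# Worst case: traversing the full code range every sample period.
steps_per_second = QUANT_STEPS * SAMPLE_RATE
steps_per_ns = steps_per_second / 1e9

print(steps_per_second)        # 2890137600
print(round(steps_per_ns, 2))  # 2.89
```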
If the audio CD data contains an alternating sample sequence (−32768, +32767, −32768, +32767, ...), then per the Nyquist-Shannon theorem it should be reconstructed as a perfect 22050 Hz sine wave at maximum amplitude. This yields the \$2 \pi f V_{pk}\$ (here \$138544 \, V_{pk}\$ per second, or \$0.1385 \, V_{pk}\$ per microsecond) minimum slew rate required for both the input circuitry before the ADC and the output amplifier circuitry after the DAC.
Does the DAC have more stringent slew rate requirements? Slewing from rail to rail in a single sample period is \$2 \, V_{pk}\$ in 1/44100th of a second, or \$88200 \, V_{pk}\$ per second, which is less than the aforementioned reconstruction limit. This means that with a theoretically perfect brick-wall low-pass filter, the \$0.1385 \, V_{pk}\$ per microsecond slew rate suffices, but that slew rate is \$\pi/2 \approx 1.5708\$ times as fast as merely slewing from rail to rail in a single sample period. This affects our choice of DACs: being able to slew from rail to rail in a single sample period is not sufficient; the DAC needs to slew roughly 1.57 times the rail-to-rail range in a single sample period.
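The two slew-rate figures above, and their \$\pi/2\$ ratio, can be verified numerically:

```python
import math

SAMPLE_RATE = 44100
f_nyquist = SAMPLE_RATE / 2          # 22050 Hz

# Peak slew rate of a full-scale sine: d/dt [Vpk*sin(2*pi*f*t)] at t = 0.
sine_slew = 2 * math.pi * f_nyquist  # in units of Vpk per second, ~138544

# Rail-to-rail (2*Vpk) swing in one sample period.
rail_slew = 2 * SAMPLE_RATE          # in units of Vpk per second, 88200

print(round(sine_slew))              # 138544
print(sine_slew / rail_slew)         # pi/2, ~1.5708
```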
If we did a Fourier analysis of the error spectrum at different (higher than necessary) DAC slew rates, we'd find that a faster slew rate pushes the error noise somewhat higher in the output spectrum, which makes it easier to filter out this particular error using analog circuits. Note that we're still assuming a brick-wall low-pass filter at 22050 Hz for reconstructing the highest-frequency components in the audio signal, though.
For the ADC, there is no "slew rate" requirement as such, if we assume either instantaneous sampling in time or integration of the input signal over the duration of each sample. Existing ADCs differ, but their frequency response is known and can be compensated for using filters prior to the ADC. The main problem is that for perfect signal capture, we'd again need a brick-wall low-pass filter at 22050 Hz, plus AC coupling (i.e. rejecting the DC component). In practice, human hearing does not go below 10 Hz to 20 Hz, so signals below 10 Hz can be rejected as well, giving us a brick-wall 10 Hz to 22050 Hz band-pass input filter requirement.
Here we get into the realm that will forever weird out Audiophōles. If we replace the exact DAC with an oversampled, dithering DAC, we can push all quantization noise to higher frequencies, so that we don't need a brick-wall filter; a more realistic low-pass filter can reconstruct our original signal perfectly.
Looking up delta-sigma modulation shows how these are done in practice, noting that the intermediate steps (within the modulation scheme) do require much higher clock rates than the sample rate used. It is also useful to note that single-bit delta-sigma modulation is exactly pulse-density modulation.
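To illustrate that equivalence, here is a minimal first-order single-bit delta-sigma modulator in Python (a sketch for intuition, not production code): the density of +1 bits in its output tracks the input amplitude, which is exactly pulse-density modulation.

```python
def delta_sigma_1bit(samples):
    """First-order single-bit delta-sigma modulator (sketch).

    Input samples must lie in [-1.0, +1.0]; output is a stream of
    -1.0/+1.0 bits whose local average tracks the input.
    """
    integrator = 0.0
    feedback = 0.0
    bits = []
    for x in samples:
        integrator += x - feedback   # integrate the error vs. the fed-back bit
        feedback = 1.0 if integrator >= 0.0 else -1.0  # 1-bit quantizer
        bits.append(feedback)
    return bits

# A DC input of 0.5 yields ~75% +1 bits, so the bit average is ~0.5.
bits = delta_sigma_1bit([0.5] * 10000)
print(sum(bits) / len(bits))  # ~0.5
```

Note how the quantization error is fed back and accumulated rather than discarded; this is what shapes the noise toward high frequencies, where the analog reconstruction filter can remove it.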
It is also useful to understand what kinds of voltages are used to process and transfer audio signals. The most common standard is line level, which has \$V_{pk}\$ of 1.414 V at 0 dBV (decibels relative to 1 V RMS) in "consumer" devices, and 1.095 V at 0 dBu (decibels unloaded, relative to 0.775 V RMS) in "pro" devices, with signals clipped to somewhere between ±1.5 V and ±2.0 V. Thus, an initial assumption of \$V_{pk} \approx 1.4 \text{ V}\$ for consumer line-level audio that does not clip is sensible.
Combining all of the above with a rough estimate of \$f = 20 \text{ kHz}\$ for the highest frequency we humans care about, we can say that using line levels, the maximum slew rate needed is \$2 \pi f V_{pk} \approx 0.2 \text{ V/µs}\$.
To bracket the range of slew rates we should consider, assume superhuman hearing that can detect components up to \$f = 25 \text{ kHz}\$, and a fully clipping signal with, say, \$V_{pk} = 2 \text{ V}\$. The slew rate we get for this is \$\approx 0.31 \text{ V/µs}\$.
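Both estimates come from the same \$2 \pi f V_{pk}\$ formula; plugging in the numbers:

```python
import math

def max_slew_v_per_us(f_hz, v_pk):
    """Peak slew rate of a sine wave of frequency f_hz and peak v_pk, in V/us."""
    return 2 * math.pi * f_hz * v_pk / 1e6

# Consumer line level (Vpk ~ 1.414 V), 20 kHz audio bandwidth.
print(round(max_slew_v_per_us(20e3, 1.414), 3))  # ~0.178 V/us

# Pessimistic bracket: 25 kHz "superhuman" bandwidth, 2 V clipping level.
print(round(max_slew_v_per_us(25e3, 2.0), 3))    # ~0.314 V/us
```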
Because we are talking about stereo audio, however, we do need to consider one oddity of human hearing: time discrimination. Humans can detect audio signal time separation down to 10 µs, which corresponds to the period of a 100 kHz signal. That is, because of the exact mechanism of human hearing (which is very much a spectrum analyzer, rather than a time-domain sampler), humans can detect much smaller time delays than the maximum frequency they can hear would suggest. For engineers: the time-domain discrimination for changes in the spectrum detected in each ear is 10 µs. In turn, this means that even though a 20 Hz to 20 kHz bandwidth per ear suffices, we may need a much higher sample rate to properly represent 3D audio effects, because of our extreme time-domain discrimination ability!
This also explains why 192 kHz audio sampling, even when band-filtered to say 10 Hz to 20 kHz, can produce a superior stereo/3D audio experience. Of course, that only really applies when the speaker configuration matches the microphone configuration, preferably a human-head acoustic model with earlobes and all.
It also turns out that most 3D effects do not rely on fine time-domain discrimination at all, but rather on spectrum shaping: our earlobes and the shape of our head cause sound spectra to be filtered differently depending on direction, with the time-domain separation being just "fine tuning" on top of that. You can investigate and experiment with this further by looking into the open source OpenAL library.