Just read up (google) on how a WAV file is stored. You have a number of bits per sample (8, 16 and 24 bits per sample are the most common) and then you have the sample rate, i.e. the number of samples per second. Values like 11025, 22050, 44100 (CD music) and 48000 Hz are very common.
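If you'd rather poke at a file than read the spec, here's a rough sketch using Python's standard `wave` module to pull those numbers out of a WAV header. The helper names `describe_wav` and `make_silent_wav` are made up for this example, and it only handles plain uncompressed PCM files.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, bits_per_sample, sample_rate, seconds) for a PCM WAV."""
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8      # sample width is reported in bytes
        rate = wav.getframerate()          # samples per second, per channel
        seconds = wav.getnframes() / rate  # one frame = one sample per channel
    return channels, bits, rate, seconds


def make_silent_wav(channels, sample_width, rate, seconds):
    """Build an in-memory WAV of silence so the example needs no file on disk."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        # All-zero bytes are silence for 16-bit PCM.
        wav.writeframes(bytes(channels * sample_width * rate * seconds))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    cd_like = make_silent_wav(channels=2, sample_width=2, rate=44100, seconds=1)
    print(describe_wav(cd_like))
```

`describe_wav` also accepts a path, so `describe_wav("something.wav")` should work on a real file as long as it is uncompressed PCM.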
So a CD quality audio file will require 2 channels x 2 bytes (16 bits per sample) x 44100 (samples per second) = 176,400 bytes per second.
For just human voice, you could go down to mono (1 channel), 11025 Hz and 8 bits per sample, and that's 1 x 1 x 11025 = 11,025 bytes per second, basically 11 KB/s of uncompressed audio.
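The same arithmetic as a tiny Python sketch, in case you want to try other combinations. The function name `pcm_bytes_per_second` is just something made up for this example, and it assumes uncompressed PCM with a whole number of bytes per sample.

```python
def pcm_bytes_per_second(channels, bits_per_sample, sample_rate):
    """Uncompressed PCM data rate: channels x bytes per sample x samples per second."""
    bytes_per_sample = bits_per_sample // 8
    return channels * bytes_per_sample * sample_rate


if __name__ == "__main__":
    cd = pcm_bytes_per_second(channels=2, bits_per_sample=16, sample_rate=44100)
    voice = pcm_bytes_per_second(channels=1, bits_per_sample=8, sample_rate=11025)
    print("CD quality:", cd, "bytes/s")     # 176400
    print("Mono voice:", voice, "bytes/s")  # 11025
    print("One minute of CD audio:", cd * 60, "bytes")  # 10584000
```

This only counts the audio data; the WAV header adds a few dozen bytes on top, which doesn't matter for anything longer than a fraction of a second.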
If you compress it with an audio codec like Opus, for example, you can squeeze that into 1-2 KB/s (roughly 8-16 kbit/s) easily.