If you care about reproducing the voice faithfully, down to its minor variations, then this is not going to work at all.
Otherwise, the idea of exploiting a speaker's unique voice characteristics for compression is not new. At a low level, this is essentially what Linear Predictive Coding (LPC) does. If you have a specific speaker in mind, you can build and pre-train a better predictive model for that speaker. It will not be as efficient as sending plain text, but it will transfer the actual voice.
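To make the LPC idea concrete, here is a minimal sketch in Python (standard library only) of textbook LPC analysis: the autocorrelation method plus the Levinson-Durbin recursion. The encoder fits a few predictor coefficients, and what remains to be sent is the prediction residual, which for a predictable signal carries much less energy than the signal itself. The decoder runs the all-pole filter over the residual to rebuild the samples. The function names and the sine wave standing in for a speech frame are illustrative assumptions. A real codec also works on short windowed frames, quantizes the coefficients and the residual, and models pitch and excitation, none of which is shown here.

```python
import math


def autocorrelation(signal, max_lag):
    """Return r[0..max_lag], the (unnormalized) autocorrelation of the signal."""
    n = len(signal)
    return [sum(signal[i] * signal[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; return (coefficients, prediction error).

    The coefficients a satisfy x[n] ~= sum(a[k] * x[n - 1 - k] for k < order).
    """
    a = [0.0] * order
    error = r[0]
    for i in range(order):
        # Reflection coefficient for this stage of the recursion.
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / error
        new_a = a[:]
        new_a[i] = k
        # Update the lower-order coefficients from the previous stage.
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        error *= 1.0 - k * k
    return a, error


def residual(signal, a):
    """Encoder side: the prediction error, roughly what an LPC-style coder would transmit."""
    out = []
    for n, x in enumerate(signal):
        pred = sum(a[k] * signal[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        out.append(x - pred)
    return out


def synthesize(res, a):
    """Decoder side: rebuild the samples by feeding the residual through 1/A(z)."""
    out = []
    for n, e in enumerate(res):
        pred = sum(a[k] * out[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        out.append(e + pred)
    return out


if __name__ == "__main__":
    # Hypothetical stand-in for one frame of voiced speech.
    frame = [math.sin(0.3 * n) for n in range(400)]
    coeffs, _ = levinson_durbin(autocorrelation(frame, 2), 2)
    res = residual(frame, coeffs)
    ratio = sum(e * e for e in res) / sum(x * x for x in frame)
    print("coefficients:", coeffs, "residual/signal energy:", ratio)
```

A speaker-specific model, as suggested above, would replace this generic per-frame predictor with one trained on that speaker's recordings, ideally leaving an even smaller residual to encode.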