I know this is a moldy oldie of a topic by now, but I just saw this episode and thought I could shed some light on the matter. I worked for a speech recognition company around that time (not the same one), and I am 95% certain that they implemented the same sort of tech we were using back in the day.
It's much simpler than it sounds, gives workable results, and requires very little processing. Essentially, what it does is count the zero crossings in the speech recording and keep track of how much time has passed between them. It discards all other information.
The timings are threshold filtered and normalized (so that the absolute speed of speech doesn't matter, only the ratios of the zero crossing times), and the resulting timing set is compared to a table of stored patterns. The table entry it most resembles is the winner.
I am, of course, glossing over fiddly details, but this is how it works. All the math required is simple and can be done with 8 bit integers.
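For anyone who wants to play with the idea, here is a rough sketch of the steps above in Python. It is not the actual chip firmware or anything we shipped. The function names, the minimum-interval threshold, the scaling to 0..255, and the sum-of-absolute-differences comparison are all my own assumptions, picked to keep the math in small integers.

```python
def zero_crossing_intervals(samples):
    """Return the sample counts between successive sign changes."""
    intervals = []
    last_crossing = None
    prev_positive = None
    for i, s in enumerate(samples):
        if s == 0:
            continue  # assumption: an exact zero is not a sign change
        positive = s > 0
        if prev_positive is not None and positive != prev_positive:
            if last_crossing is not None:
                intervals.append(i - last_crossing)
            last_crossing = i
        prev_positive = positive
    return intervals


def threshold_filter(intervals, min_len=2):
    """Drop very short intervals (assumed to be noise chatter)."""
    return [t for t in intervals if t >= min_len]


def normalize(intervals, scale=255):
    """Scale intervals so only their ratios matter; results fit in 8 bits."""
    total = sum(intervals)
    if total == 0:
        return []
    return [(t * scale) // total for t in intervals]


def distance(a, b, scale=255):
    """Sum of absolute differences, with a penalty for unmatched entries."""
    d = sum(abs(x - y) for x, y in zip(a, b))
    return d + scale * abs(len(a) - len(b))


def classify(samples, table, min_len=2):
    """Return the name of the table entry closest to the input's timing set."""
    features = normalize(threshold_filter(zero_crossing_intervals(samples), min_len))
    return min(table, key=lambda name: distance(features, table[name]))
```

The table here is just a dict mapping a word to its stored, already-normalized timing list. Because of the normalization, the same pattern spoken at half speed produces the same feature list and matches the same entry.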
This chip reminded me of an article I read a while back in Circuit Cellar about voice recognition with an 8-bit microcontroller.
Fortunately, you can download it as a PDF: Low-Cost Voice Recognition, by Brad Stewart, 1998.
If you look at the schematic and figure 1, you will see some commonalities with the VCP200.
Yes, I worked with Brad Stewart (I still see him occasionally). That article describes an early incarnation of this method.