DSPs (or FPGAs) are going to be better at it, but micros have some capacity to do FFT work - higher-end ARM cores often have a set of DSP extensions that can speed up FFTs quite a bit. A big part of the problem, though, is memory: to get a meaningful frequency-domain representation of speech you'd need fairly small bin sizes, and if you're trying to span a few octaves, that can be a lot of bins to get detail at the low-frequency end.
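To put rough numbers on the memory point, here's a quick sketch of how FFT length (and buffer size) falls out of the bin width you want. The sample rate, bin width, and 32-bit-float assumption are all just illustrative, not anything specific to a particular chip:

```python
import math

def fft_params(sample_rate_hz, bin_width_hz, bytes_per_sample=4):
    """Smallest power-of-two FFT length giving at least the requested
    bin resolution, plus a rough buffer estimate (real input buffer
    plus complex half-spectrum, assuming 32-bit floats)."""
    n_min = sample_rate_hz / bin_width_hz          # N = fs / delta_f
    n = 1 << math.ceil(math.log2(n_min))           # round up to power of two
    mem = n * bytes_per_sample                     # input buffer
    mem += (n // 2 + 1) * 2 * bytes_per_sample     # complex output bins
    return n, mem

# e.g. 16 kHz audio with 20 Hz bins
n, mem = fft_params(16000, 20)
print(n, mem)   # 1024-point FFT, ~8 KB of float buffers
```

Halve the bin width and both numbers double, which is why low-frequency detail gets expensive fast on a small micro.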
That being said, I'm not 100% sure that the frequency domain is the primary component of speech recognition. It may do the job, and I don't have the expertise to tell you how it's usually done, but I think the articulation between words ends up being a big part of differentiating speech, and those transitions are both very short in duration and fairly wild in frequency content - though maybe that can still be characterized easily.
If I were to attempt something like this, I might see if a micro had the performance just to find out, but I'd focus on either a DSP or an FPGA, and I'd break the problem up into multiple FFTs - maybe a thousand bins per octave, but a separate FFT window for each octave, slightly overlapped. That way the bin size follows something closer to the logarithmic scale of the tones, so you don't need nearly the memory you would with a single window of the same resolution at the low end. To increase detection speed, I'd consider running two or three staggered FFT groups, with the windows overlapping in time, so you can catch narrower frequency transitions - though with a sufficient sample rate and a fairly small FFT resolution bandwidth, you may get fast enough updates from just a single set.
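A sketch of that per-octave plan, with two assumptions I'm adding on top of the idea above: each octave band gets decimated to about 4x its top frequency before its FFT (one common way to make the scheme work), and I've used 100 bins per octave rather than 1000 so the window lengths stay reasonable for speech. All the numbers are illustrative:

```python
import math

def octave_fft_plan(f_low_hz, n_octaves, bins_per_octave):
    """One decimated FFT per octave band [f, 2f). Decimating each band
    to ~4x its top frequency makes every octave's FFT the same length,
    while the window duration doubles per lower octave (roughly
    constant-Q), instead of one huge FFT sized for the lowest octave."""
    plan = []
    f = f_low_hz
    for _ in range(n_octaves):
        fs_band = 4 * (2 * f)                # band-limited sample rate
        bin_width = f / bins_per_octave      # the octave spans f Hz
        n_min = fs_band / bin_width          # = 8 * bins_per_octave
        n = 1 << math.ceil(math.log2(n_min))
        plan.append({"band": (f, 2 * f), "fs": fs_band,
                     "fft": n, "window_s": n / fs_band})
        f *= 2
    return plan

for p in octave_fft_plan(100, 6, 100):
    lo, hi = p["band"]
    print(f"{lo:>5}-{hi:<5} Hz: fs={p['fs']:>6} Hz, "
          f"{p['fft']}-point FFT, {p['window_s'] * 1000:.0f} ms window")
```

Note the trade-off it exposes: every band uses the same small FFT, but the lowest octave's window is the longest, so that's the band that limits how quickly you can track transitions - which is where the staggered, time-overlapped groups would help.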
I think the easiest way to get what you're describing would be to use a regular sound-card input and a small computer - something like a Raspberry Pi and some FFT software would be sufficient and easy to configure, even if dedicated DSP hardware would be much more efficient once built and programmed.
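On a Pi that whole pipeline is a few lines of NumPy. The sketch below synthesizes a test tone in place of real microphone capture (a real setup would read frames from the sound card instead), then takes overlapped, windowed FFTs - the sample rate and window sizes are just example values:

```python
import numpy as np

fs = 16000                                  # assumed sample rate
t = np.arange(fs) / fs                      # one second of audio
audio = np.sin(2 * np.pi * 440 * t)         # stand-in for mic input

n_fft, hop = 1024, 512                      # 50%-overlapped windows
window = np.hanning(n_fft)
frames = [audio[i:i + n_fft] * window
          for i in range(0, len(audio) - n_fft, hop)]
spectra = np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectrogram

freqs = np.fft.rfftfreq(n_fft, 1 / fs)
peak = freqs[spectra.mean(axis=0).argmax()]
print(f"dominant bin: {peak:.1f} Hz")       # lands near 440 Hz
```

With 1024-point windows at 16 kHz you get ~15.6 Hz bins and a fresh spectrum every 32 ms at this hop size, which gives a feel for what a single-window setup buys you before reaching for dedicated hardware.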