DSPs (or FPGAs) are going to be better at it, but micros have some capacity to do FFT work - higher-end ARM cores often have a set of DSP extensions that can speed up FFTs quite a bit. A big part of the problem, though, is memory: to get a meaningful frequency-domain representation of speech you'd need fairly small bin sizes, and if you're trying to span a few octaves, that can be a lot of bins to get detail at the low-frequency end.
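To put rough numbers on the memory point, here's a quick sketch of how FFT length (and buffer size) falls out of the bin width you want. The sample rate, bin width, and 32-bit-float assumption are all just illustrative, not anything specific to a particular chip:

```python
import math

def fft_params(sample_rate_hz, bin_width_hz, bytes_per_sample=4):
    """Smallest power-of-two FFT length giving at least the requested
    bin resolution, plus a rough buffer estimate (real input buffer
    plus complex half-spectrum, assuming 32-bit floats)."""
    n_min = sample_rate_hz / bin_width_hz          # N = fs / delta_f
    n = 1 << math.ceil(math.log2(n_min))           # round up to power of two
    mem = n * bytes_per_sample                     # input buffer
    mem += (n // 2 + 1) * 2 * bytes_per_sample     # complex output bins
    return n, mem

# e.g. 16 kHz audio with 20 Hz bins
n, mem = fft_params(16000, 20)
print(n, mem)   # 1024-point FFT, ~8 KB of float buffers
```

Halve the bin width and both numbers double, which is why low-frequency detail gets expensive fast on a small micro.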
That being said, I'm not 100% sure that the frequency domain is the primary component of speech recognition. It may do the job, and I don't have the expertise to tell you how it's usually done, but I think the articulation between words ends up being a big part of differentiating speech, and those transitions are both very short in duration and fairly wild in frequency content - though maybe that can still be characterized easily.
If I were to attempt something like this, I might see if a micro had the performance just to find out, but I'd focus on either a DSP or an FPGA, and I'd break the problem up into multiple FFTs - maybe a thousand bins per octave, but a separate FFT window for each octave, slightly overlapped. That way the bin size follows something closer to the logarithmic scale of the tones, so you don't need nearly the memory you would with a single window of the same resolution at the low end. To increase detection speed, I'd consider running two or three staggered FFT groups, with the windows overlapping in time, so you can catch narrower frequency transitions - though with a sufficient sample rate and a fairly small FFT resolution bandwidth, you may get fast enough updates from just a single set.
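A sketch of that per-octave plan, with two assumptions I'm adding on top of the idea above: each octave band gets decimated to about 4x its top frequency before its FFT (one common way to make the scheme work), and I've used 100 bins per octave rather than 1000 so the window lengths stay reasonable for speech. All the numbers are illustrative:

```python
import math

def octave_fft_plan(f_low_hz, n_octaves, bins_per_octave):
    """One decimated FFT per octave band [f, 2f). Decimating each band
    to ~4x its top frequency makes every octave's FFT the same length,
    while the window duration doubles per lower octave (roughly
    constant-Q), instead of one huge FFT sized for the lowest octave."""
    plan = []
    f = f_low_hz
    for _ in range(n_octaves):
        fs_band = 4 * (2 * f)                # band-limited sample rate
        bin_width = f / bins_per_octave      # the octave spans f Hz
        n_min = fs_band / bin_width          # = 8 * bins_per_octave
        n = 1 << math.ceil(math.log2(n_min))
        plan.append({"band": (f, 2 * f), "fs": fs_band,
                     "fft": n, "window_s": n / fs_band})
        f *= 2
    return plan

for p in octave_fft_plan(100, 6, 100):
    lo, hi = p["band"]
    print(f"{lo:>5}-{hi:<5} Hz: fs={p['fs']:>6} Hz, "
          f"{p['fft']}-point FFT, {p['window_s'] * 1000:.0f} ms window")
```

Note the trade-off it exposes: every band uses the same small FFT, but the lowest octave's window is the longest, so that's the band that limits how quickly you can track transitions - which is where the staggered, time-overlapped groups would help.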
I think the easiest way to get what you're describing would be to use a regular sound-card input and a small computer - something like a Raspberry Pi and some FFT software would be sufficient and easy to configure, even if dedicated DSP hardware would be much more efficient once built and programmed.
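On a Pi that whole pipeline is a few lines of NumPy. The sketch below synthesizes a test tone in place of real microphone capture (a real setup would read frames from the sound card instead), then takes overlapped, windowed FFTs - the sample rate and window sizes are just example values:

```python
import numpy as np

fs = 16000                                  # assumed sample rate
t = np.arange(fs) / fs                      # one second of audio
audio = np.sin(2 * np.pi * 440 * t)         # stand-in for mic input

n_fft, hop = 1024, 512                      # 50%-overlapped windows
window = np.hanning(n_fft)
frames = [audio[i:i + n_fft] * window
          for i in range(0, len(audio) - n_fft, hop)]
spectra = np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectrogram

freqs = np.fft.rfftfreq(n_fft, 1 / fs)
peak = freqs[spectra.mean(axis=0).argmax()]
print(f"dominant bin: {peak:.1f} Hz")       # lands near 440 Hz
```

With 1024-point windows at 16 kHz you get ~15.6 Hz bins and a fresh spectrum every 32 ms at this hop size, which gives a feel for what a single-window setup buys you before reaching for dedicated hardware.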