Speech Recognition in Hardware: For Use as a Novel Input Device

Nicholas Harrington
Tao B. Schardl

December 10, 2008

Abstract

Conventional computer input devices often restrict programs to exclusively utilize button presses and mouse movements as input, even when such an interface is not the most intuitive one for the user using that application. To address this restriction in the case where the more intuitive control interface is the user's voice, we created an isolated word speech recognition system in hardware and attached it to a conventional joystick interface. Users are able to train this system on a set of words, and this system subsequently translates the recognition of distinct words into distinct button presses at the joystick interface, allowing our device to communicate seamlessly with a computer as a joystick. This device's functionality has been verified both through controlled system testing and gameplay testing, in which it has been used both exclusively to control a game and in conjunction with another input device.

Contents

1 Overview
    1.0.1 MATLAB Implementation
2 Summary of Modules
  2.1 Audio Preprocessing/Vector Generation
    2.1.1 Discarding Significant Bits
    2.1.2 Run Time Parameter Control
    2.1.3 Low Pass Filter and Downsampler
    2.1.4 Pre-emphasis Filter
    2.1.5 Window Applier
    2.1.6 FFT Feeding Buffer and FFT
    2.1.7 Magnitude Finder
    2.1.8 Mel-scale Spectrum Calculator
    2.1.9 Energy Finder and Word Detector
    2.1.10 Cepstral Coefficient Generator
  2.2 Word Recognition
    2.2.1 Master
    2.2.2 DTW
    2.2.3 Judge
  2.3 System Output
    2.3.1 Bar Graph Display
    2.3.2 DTW Display
  2.4 Joystick Output
3 Testing
  3.1 Unit Testing
    3.1.1 Preprocessing
    3.1.2 Word Recognition
    3.1.3 Experimentation
    3.1.4 Gameplay Testing
4 Conclusion

List of Figures

1 Mel Scale and Conversion Windows
2 Example of DTW Algorithm
3 Modular Breakdown of Preprocessing and Vector Generation Component
4 Modular Breakdown Word Recognition Component
5 Breakdown of DTW Module

1 Overview

We implemented an isolated word speech recognition system in hardware. Every time a word is uttered into a microphone it is compared against a set of words stored in memory in order to determine if it matches one of them.
If a match is found, a signal is sent via a parallel port to a computer that interprets it as joystick input.

Matching two audio signals first entails compressing the audio data into a form meaningful for speech recognition. In our case, each 30 ms section of audio is converted into a vector of 16 numbers. Each stored word or incoming word is represented as an array of these vectors. We compare these arrays of vectors with an algorithm known as dynamic time warping (DTW). The vectors used to parameterize the audio signal are the Mel-scale …
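
As a rough software illustration of the comparison described above, the following Python sketch computes a DTW cost between two words, each given as a list of 16-number vectors, and then picks the closest stored word. The function names, the Euclidean local distance, and the three-way step pattern are assumptions made for this sketch. The hardware DTW module may differ in all of these, and the sketch does not model a rejection threshold for the case where no stored word matches.

```python
import math


def dtw_distance(a, b):
    """Return the DTW alignment cost between two sequences of vectors.

    a and b are lists of equal-length numeric vectors (16 numbers each in
    the system described above). Euclidean local distance is an assumption.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of: diagonal step, repeat b[j-1], repeat a[i-1].
            cost[i][j] = local + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    return cost[n][m]


def closest_word(utterance, templates):
    """Return (label, cost) for the stored template nearest to the utterance.

    templates is a hypothetical dict mapping a word label to its vector array.
    """
    scored = ((label, dtw_distance(utterance, t)) for label, t in templates.items())
    return min(scored, key=lambda pair: pair[1])


# Toy data: "up" spoken slowly repeats its first frame; DTW absorbs the stretch.
up = [[0.0] * 16, [1.0] * 16, [2.0] * 16]
up_slow = [[0.0] * 16, [0.0] * 16, [1.0] * 16, [2.0] * 16]
down = [[5.0] * 16, [5.0] * 16, [5.0] * 16]

label, cost = closest_word(up_slow, {"up": up, "down": down})
print(label, cost)
```

Because the warping path may repeat a frame of either sequence, the slow utterance aligns with the "up" template at zero cost even though the two arrays have different lengths, which is the property that makes DTW suitable for comparing words spoken at different speeds.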