
ROBUST SPEECH RECOGNITION IN THE AUTOMOBILE




Nobutoshi Hanai and Richard M. Stern
Department of Electrical and Computer Engineering and School of Computer Science
Carnegie Mellon University, Pittsburgh, PA 15213
Author note: Currently at Mitsubishi Heavy Industries, Ltd.

ABSTRACT

In this paper we discuss a number of the ways in which the recognition accuracy of automatic speech recognition systems is affected by ambient noise in the automobile, along with the extent to which various techniques for robust speech recognition can improve recognition under these conditions. We consider separately the effects of engine noise, interference by turbulent air outside the car, interference by sounds from the car's radio, and interference by the sounds of the car's windshield wipers. Recognition accuracy was compared using baseline processing, cepstral mean normalization (CMN), and codeword-dependent cepstral normalization (CDCN). The greatest degradation in recognition accuracy was produced by interference from AM radio talk shows. The use of CMN and especially CDCN was found to significantly improve recognition accuracy, except for the effects of interference from radio talk shows at low car speeds. This type of interference is effectively suppressed through the use of adaptive noise cancellation techniques.

1. INTRODUCTION

The need for robustness in speech recognition accuracy in real application environments such as long-distance telephone lines, automobiles, aircraft cockpits, offices, and factory floors is becoming increasingly important as speech recognition becomes more successful. This paper concerns speech recognition accuracy in the automobile, which is a critical factor in the development of hands-free cellular telephony. Major factors that impede recognition accuracy in the automobile include noise sources such as tire and wind noise while the vehicle is in motion, engine noise, and noise produced by the car radio, fan, windshield wipers, horn, turn signals, etc.

A number of researchers have previously considered the problem of robust recognition in the automobile. Their approaches include adaptive noise cancelling techniques (e.g. [1, 2]), spectral transformation [3], the use of microphone arrays (e.g. [4]), and multi-dimensional HMMs [5]. For the most part these studies dealt only with "running" noise sources such as tire, engine, and wind noise, and they did not consider "functional" noise caused by functional components such as the car radio, fan, and windshield wipers. In this paper we consider the effects of all of these sources of degradation, and we compare the extent to which these effects are ameliorated by the compensation techniques of cepstral mean normalization (CMN) [6] and codeword-dependent cepstral normalization (CDCN) [7].

2. DATABASES

The experimental results in this paper were obtained by training the CMU SPHINX-I system [8] on the previously described census database [7] and testing on a database of speech recorded in automobiles, which was recorded and provided by the Motorola Corporation. In this section we describe the Motorola automotive database, which was used to evaluate the effects of noise in the automobile on the SPHINX system. We also briefly review the contents of the census database.

2.1. The Motorola Automotive Database

The Motorola automotive database consists of 12 speakers (9 males and 3 females) in their 20s and 30s. Each speaker uttered six 7-digit strings at three driving speeds (0 with engine idling, 30, and 55 m.p.h.) and under the following six conditions in the vehicle: (1) baseline (windows up; fan, radio, and windshield wipers off); (2) driver's window down; (3) fan on; (4) FM radio playing music; (5) AM radio playing a talk show; (6) windshield wipers on (recorded at 0 m.p.h. only). The digit strings were read from a script with equal probabilities for all digits. The digit 0 had two pronunciations, "zero" and "oh".

Speech was recorded on a DAT recorder in various automobiles using 2 microphones located on the driver's visor. The microphone used for our data was a high-fidelity Sony ECM-959DT, which uses an electret element and has a flat bandpass response over 50 to 18,000 Hz. The data were lowpass filtered to about 6,720 Hz before sampling at 16 kHz using the line inputs of an Ariel Digital Microphone. Since the goal in collecting the database was to make it as realistic as possible, the recording conditions were somewhat variable and reflected what an untrained population of users might produce. Some of the files for various speakers were missing due to recording problems that were not noticed until the data were reviewed.

2.2. The Census Database

3. NOISE CHARACTERISTICS

As noted above, we distinguish between the "running" noise caused by window, engine, and tire noise, which depends primarily on vehicle speed, and "functional" noise, which depends on operator-controlled functions such as the heater fan, radio, and windshield wipers. Spectral analysis of the Motorola automotive database reveals a peak at very low frequencies in the idling condition. This peak becomes smaller and the spectrum of the noise broadens as vehicle speed is increased. With the windows down, wind noise becomes more intense and the distribution of the noise power becomes broader, as shown in Figure 1.

[Figure 1: noise spectrum plot. Axis labels: Amplitude (dB), Relative Amplitude (dB). Panel label: Car speed 55 m.p.h.]

4. EXPERIMENTAL RESULTS

We performed speech recognition experiments with the SPHINX recognition system, using the census database for training and the Motorola automotive database for testing. Speech was sampled at 16 kHz, and conventional Mel-frequency cepstral coefficients (MFCC) were used as the baseline parametric representation of speech frames [9]. We made use of previously trained phonetic models for SPHINX-I, which consisted of 400 generalized triphone models for a vocabulary size of 104. The language model and pronunciation dictionary were restricted to the 11 words ("one" through "nine", "oh", and "zero") that are present in the 7-digit strings in the Motorola automotive database.

No attempt was made to optimize the SPHINX-I system for the 11-digit Motorola database. For example, the census database has a larger vocabulary size than the automotive database, and performance could have been improved by recomputing the phonetic models for this far more constrained task.

We compared recognition accuracy obtained using three types of signal processing: the baseline MFCC representation, cepstral mean normalization (CMN, e.g. [6]), and codeword-dependent cepstral normalization (CDCN) [7]. The simple CMN method compensates primarily for differences in the frequency response of each channel, while CDCN
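
As a rough illustration of why cepstral mean normalization compensates for channel frequency response, the sketch below subtracts the per-utterance mean of each cepstral coefficient. A fixed linear channel appears as a constant additive offset in the cepstral domain, so removing the mean also removes that offset. This is a minimal sketch, not the SPHINX-I implementation: the function name, the toy frames, and the offset values are all made up for illustration.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean from every cepstral coefficient.

    frames: list of frames, each a list of cepstral coefficients (floats).
    Returns a new list of frames with the same shape.
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Long-term average of each coefficient over the whole utterance.
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    # Removing the mean removes any constant (channel-like) offset.
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Toy utterance: three frames of two coefficients (made-up numbers).
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
# Hypothetical constant bias standing in for a different channel response.
channel_offset = [0.5, -1.5]
filtered = [[c + b for c, b in zip(f, channel_offset)] for f in clean]

norm_clean = cepstral_mean_normalize(clean)
norm_filtered = cepstral_mean_normalize(filtered)
```

In this toy case the two normalized utterances coincide, because the only difference between them was a constant offset. Additive noise such as radio interference is not a constant cepstral offset, which is consistent with the text's remark that CMN compensates primarily for channel differences.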

