EFFICIENT JOINT COMPENSATION OF SPEECH FOR THE EFFECTS OF ADDITIVE NOISE AND LINEAR FILTERING

Fu-Hua Liu¹, Alejandro Acero², and Richard M. Stern¹

¹Department of Electrical and Computer Engineering, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
²Apple Computer, Inc., MS 76-7F, 20525 Mariani Avenue, Cupertino, CA 95014

ABSTRACT

As automatic speech recognition systems are finding their way into practical applications it is becoming increasingly clear that they must be able to accommodate a variety of acoustical environments. This paper describes two algorithms that provide robustness for automatic speech recognition systems in a fashion that is suitable for real-time environmental normalization for workstations of moderate size. The first algorithm is a modification of the previously-described SDCN and FCDCN algorithms, except that unlike these algorithms it provides computationally-efficient environmental normalization without prior knowledge of the acoustical characteristics of the environment in which the system will be operated. The second algorithm is a modification of the more complex CDCN algorithm that enables it to perform environmental compensation in better than real time. We compare the recognition accuracy, computational complexity, and amount of training data needed to adapt to new acoustical environments using these algorithms with several different types of headset-mounted and desktop microphones.

1. INTRODUCTION

Results of several studies have demonstrated that even automatic speech recognition systems that are designed to be speaker independent can perform very poorly when they are tested using a different type of microphone or acoustical environment from the one with which they were trained (e.g. [1, 2, 3]). For example, the recognition accuracy of the SPHINX speech recognition system on a speaker-independent alphanumeric task dropped from 85% correct to less than 20% correct when the close-talking Sennheiser HMD-414 microphone (CLSTLK) used in training was replaced by the omnidirectional Crown PZM6FS desktop microphone (PZM6FS) [1].

We have found that two major factors that degrade the performance of speech recognition systems using desktop microphones in normal office environments are additive noise and unknown linear filtering. We showed in [1] that simultaneous joint compensation for the effects of additive noise and linear filtering is needed to achieve maximal robustness with respect to acoustical differences between the training and testing environments of a speech recognition system.

We described in [1] two algorithms that can perform such joint compensation, based on additive corrections to the cepstral coefficients of the speech waveform. The first compensation algorithm, SNR-Dependent Cepstral Normalization (SDCN), applies an additive correction in the cepstral domain that depends exclusively on the instantaneous SNR of the signal. This correction vector equals the average difference in cepstra between simultaneous "stereo" recordings of speech samples from both the training and testing environments at each SNR of speech in the testing environment. At high SNRs, this correction vector primarily compensates for differences in spectral tilt between the training and testing environments (in a manner similar to the blind deconvolution procedure first proposed by Stockham et al. [4]), while at low SNRs the vector provides a form of noise subtraction (in a manner similar to the spectral subtraction algorithm first proposed by Boll [5]). The SDCN algorithm is simple and effective, but for every new acoustical environment encountered it must be calibrated with a new stereo database that contains samples of speech simultaneously recorded in the training and testing environments. In many situations such a database is impractical or unobtainable, and SDCN is clearly not able to model a non-stationary environment since only long-term averages are used.

The second compensation algorithm, Codeword-Dependent Cepstral Normalization (CDCN), uses EM techniques to compute ML estimates of the parameters characterizing the contributions of additive noise and linear filtering that when applied in inverse fashion to the cepstra of an incoming utterance produce an ensemble of cepstral coefficients that best match (in the ML sense) the cepstral coefficients of the incoming speech in the testing environment to the locations of VQ codewords in the training environment. Use of the CDCN algorithm improved the recognition accuracy obtained when training on the CLSTLK microphone and testing with the PZM6FS to the level observed when the system is both trained and tested on the PZM6FS. The CDCN algorithm has the advantage that it does not require a priori knowledge of the testing environment (in the form of stereo training data in the training and testing environments), but it is much more computationally demanding than the SDCN algorithm. Compared to the SDCN algorithm, the CDCN algorithm uses a greater amount of structural knowledge about the nature of the degradations to the speech signal in order to achieve good recognition accuracy. The SDCN algorithm, on the other hand, derives its compensation vectors entirely from empirical observations of differences between data obtained from the training and testing environments.

More recently we presented, along with several other algorithms, the fixed CDCN (FCDCN) algorithm [6]. FCDCN combines some of the more attractive features of the CDCN and SDCN algorithms: like SDCN, the correction factor equals the difference in cepstra between the training and testing environments, but like CDCN, the correction factor is different for different VQ codewords as well. This algorithm is also simple and efficient, and it can achieve a level of recognition accuracy comparable to that of CDCN. Unfortunately, FCDCN (like SDCN) also requires the use of a training database of simultaneously-recorded speech samples in the training and testing environments. Hence, the FCDCN algorithm also cannot adapt to unknown environments.

Table 1 compares the environmental specificity, computational complexity, and recognition accuracy of these algorithms when evaluated on the alphanumeric database described in [1]. Recognition accuracy is somewhat greater than the figures reported [...]

[...] series expansion of ln P_x(ω), ln P_n(ω), ln P_y(ω) and ln |H(ω)|², respectively, Eq. (1) can be rewritten as

    y = x + q + r(x, n, q)                                        (2)

where the correction vector r(x, n, q) is given by

    r(x, n, q) = IDFT{ ln(1 + e^(DFT[n − q − x])) }
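The degradation model of Eq. (2) can be illustrated numerically. The sketch below evaluates y = x + q + r(x, n, q) directly in the log-spectral domain, where the DFT/IDFT pair of the cepstral formulation reduces to the identity; the function names and this log-spectral simplification are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def correction_vector(x, n, q):
    """Correction vector r(x, n, q) of Eq. (2), evaluated in the
    log-spectral domain.  x, n, and q are log power spectra of the
    clean speech, the additive noise, and the linear channel,
    respectively."""
    return np.log1p(np.exp(n - q - x))

def degrade(x, n, q):
    """Observed log spectrum of noisy, filtered speech:
    y = x + q + r(x, n, q)."""
    return x + q + correction_vector(x, n, q)
```

At high SNR (x + q much larger than n) the correction vanishes and y ≈ x + q, pure linear filtering; at low SNR y ≈ n, the noise floor. These are the two limiting behaviors the paper associates with the SDCN correction.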
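The SDCN compensation described above, an additive cepstral correction indexed only by instantaneous SNR and estimated from stereo recordings, might be sketched as follows. All names, the integer-dB SNR binning, and the data layout are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def estimate_sdcn_corrections(train_cep, test_cep, snr_db, num_bins=20):
    """Estimate one correction vector per SNR bin as the mean cepstral
    difference between simultaneously recorded ("stereo") frames from
    the training and testing environments."""
    bins = np.clip(snr_db.astype(int), 0, num_bins - 1)
    w = np.zeros((num_bins, train_cep.shape[1]))
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            w[b] = (train_cep[mask] - test_cep[mask]).mean(axis=0)
    return w

def apply_sdcn(test_cep, snr_db, w):
    """Compensate testing-environment cepstra: z = y + w[SNR(y)]."""
    bins = np.clip(snr_db.astype(int), 0, len(w) - 1)
    return test_cep + w[bins]
```

Because the table w holds only long-term averages per SNR bin, the sketch also makes concrete why SDCN cannot track a non-stationary environment.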
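FCDCN's correction, indexed jointly by SNR bin and VQ codeword, could be applied as in the sketch below; the table w would be estimated from stereo data as in SDCN but separately per codeword. Names, shapes, and the Euclidean nearest-codeword search are illustrative assumptions.

```python
import numpy as np

def apply_fcdcn(test_cep, snr_db, codebook, w):
    """FCDCN-style compensation sketch.  The correction added to each
    frame depends on both its SNR bin and its nearest VQ codeword.
    w has shape (num_codewords, num_snr_bins, cepstral_dim)."""
    bins = np.clip(snr_db.astype(int), 0, w.shape[1] - 1)
    # Nearest codeword for each frame (Euclidean distance).
    d = np.linalg.norm(test_cep[:, None, :] - codebook[None, :, :], axis=2)
    k = d.argmin(axis=1)
    return test_cep + w[k, bins]
```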

