Frequency Warping for VTLN and Speaker Adaptation by Linear Transformation of Standard MFCC

Sankaran Panchapagesan∗, Abeer Alwan

Department of Electrical Engineering, The Henry Samueli School of Engineering and Applied Science, 66-147E Engr. IV, 405 Hilgard Avenue, Box 951594, University of California, Los Angeles, CA 90095-1594, USA

Abstract

Vocal Tract Length Normalization (VTLN) for standard filterbank-based Mel Frequency Cepstral Coefficient (MFCC) features is usually implemented by warping the center frequencies of the Mel filterbank, and the warping factor is estimated using the maximum likelihood score (MLS) criterion (Lee and Rose, 1998). A linear transform (LT) equivalent for frequency warping (FW) would enable more efficient MLS estimation (Umesh et al., 2005). We recently proposed a novel LT to perform FW for VTLN and model adaptation with standard MFCC features (Panchapagesan, 2006). In this paper, we present the mathematical derivation of the LT and give a compact formula to calculate it for any FW function. We also show that our LT is very closely related to previously proposed LTs for FW (McDonough, 2000; Pitz et al., 2001; Umesh et al., 2005), and these LTs for FW are all found to be numerically almost identical for the sine-log all-pass transform (SLAPT) warping functions. Our formula for the transformation matrix is, however, computationally simpler, and unlike other previous linear transform approaches to VTLN with MFCC features (Pitz and Ney, 2003; Umesh et al., 2005), no modification of the standard MFCC feature extraction scheme is required. In VTLN and Speaker Adaptive Modeling (Welling et al., 2002) experiments with the DARPA Resource Management (RM1) database, the performance of the new LT was comparable to that of regular VTLN implemented by warping the Mel filterbank, when the MLS criterion was used for FW estimation. This demonstrates that the approximations involved do not lead to any performance degradation.
Performance comparable to front-end VTLN was also obtained with LT adaptation of HMM means in the back end, combined with mean bias and variance adaptation according to the Maximum Likelihood Linear Regression (MLLR) framework. The FW methods performed significantly better than standard MLLR for very limited adaptation data (1 utterance), and were equally effective with unsupervised parameter estimation. We also performed Speaker Adaptive Training (SAT) with the feature-space LT, denoted CLTFW. Global CLTFW SAT gave results comparable to SAM and VTLN. By estimating multiple CLTFW transforms using a regression tree, and including an additive bias, we obtained significantly improved results compared to VTLN, with increasing adaptation data.

Preprint submitted to Elsevier, 26 June 2008

Key words: automatic speech recognition, speaker normalization, VTLN, frequency warping, linear transformation, speaker adaptation

1 Introduction

Vocal Tract Length Normalization (VTLN) is a speaker normalization technique widely used to improve the accuracy of speech recognition systems. In VTLN, spectral mismatch caused by variation in the vocal tract lengths of speakers is reduced by performing spectral frequency warping (FW) or its equivalent, typically during feature extraction. VTLN has proven to be particularly effective when only limited adaptation data from a test speaker is available, even in an unsupervised mode. The estimation and implementation of frequency warping have received much attention in recent years.

The parameters controlling the FW are commonly estimated by optimizing a maximum likelihood (ML) criterion over the adaptation data.
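As a concrete illustration of the warping step, a common single-parameter choice in the VTLN literature is a piecewise-linear map of the frequency axis. The sketch below assumes an HTK-style form; the knee position `f0` and the 8 kHz bandwidth are illustrative assumptions, not necessarily the exact function used in the paper.

```python
import numpy as np

def piecewise_linear_warp(f, alpha, f_max=8000.0, f0=0.875):
    """Single-parameter piecewise-linear VTLN warp (illustrative sketch).

    Frequencies below the knee f0*f_max are scaled by alpha; above the
    knee, a second linear segment maps f_max back to f_max so the warped
    axis stays within the analysis bandwidth.
    """
    knee = f0 * f_max
    f = np.asarray(f, dtype=float)
    lower = alpha * f
    # Second segment passes through (knee, alpha*knee) and (f_max, f_max).
    slope = (f_max - alpha * knee) / (f_max - knee)
    upper = alpha * knee + slope * (f - knee)
    return np.where(f <= knee, lower, upper)
```

In filterbank-based VTLN this map would be applied to the Mel filterbank center frequencies; for alpha = 1 it reduces to the identity.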
The ML criterion could be the ASR likelihood score of the recognizer over the adaptation data (Lee and Rose, 1998; Pitz et al., 2001; Pitz and Ney, 2003), the EM auxiliary function (Dempster et al., 1977; McDonough, 2000; Loof et al., 2006), or likelihoods of Gaussian mixture models (GMMs) trained specifically for FW parameter estimation (Wegmann et al., 1996; Lee and Rose, 1998). Another FW estimation method is by alignment of formants or formant-like spectral peaks between the test speaker and a reference speaker from the training set (Gouvea and Stern, 1997; Claes et al., 1998; Cui and Alwan, 2006).

Maximizing the likelihood score is commonly performed using a grid search over a set of warping factors, when the FW is described by a single parameter that controls the scaling of the frequency axis (Lee and Rose, 1998). More recently, optimization methods based on the gradient and higher-order derivatives of the objective function have been used to estimate the FW function. This allows efficient estimation of multiple-parameter FW functions like the All-Pass Transform (APT) FWs, which can give better recognition performance than single-parameter FWs (McDonough, 2000; Panchapagesan and Alwan, 2006).

Frequency warping of the spectrum has been shown to correspond to a linear transformation in the cepstral space (McDonough et al., 1998; Pitz et al., 2001). This relationship confers some important advantages for speech recognition systems that use cepstral features. Firstly, one can apply the linear transform to previously computed unwarped features and not have to recompute features with different warp factors during VTLN estimation. This results in significant computational savings (Umesh et al., 2005), which would be important in embedded and distributed speech recognition (DSR) applications, where resources are limited.

∗ Corresponding Author. Email addresses: [email protected] (Sankaran Panchapagesan), [email protected] (Abeer Alwan).
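The grid-search MLS estimation described above can be sketched as follows. Here `warp_fn` and `loglik_fn` are hypothetical stand-ins for, respectively, the feature warping (feature re-extraction or a cepstral linear transform) and the recognizer's likelihood score over the adaptation data; the alpha grid is a typical range, not a value taken from the paper.

```python
import numpy as np

def estimate_warp_factor(cepstra, warp_fn, loglik_fn,
                         alphas=np.arange(0.88, 1.121, 0.02)):
    """Grid-search ML estimation of a single warp factor (sketch).

    For each candidate alpha, warp the adaptation features and score
    them; return the alpha with the highest likelihood score.
    """
    best_alpha, best_score = 1.0, -np.inf
    for a in alphas:
        score = loglik_fn(warp_fn(cepstra, a))
        if score > best_score:
            best_alpha, best_score = a, score
    return best_alpha, best_score
```

With a linear-transform implementation of `warp_fn`, the same loop runs over precomputed features, which is the source of the efficiency gain discussed above.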
Given the recognition alignment of an utterance obtained with baseline models without VTLN, it can be shown by a rough calculation that parameter estimation for regular VTLN is about 2.5 times as expensive as for LT VTLN, when the fixed alignment is used for VTLN estimation with the MLS criterion, with single-Gaussian-mixture HMMs and a grid search. The linear transform approach also has the advantage that one need not have access to any of the intermediate stages in the feature extraction during VTLN estimation. This aspect would have definite advantages in DSR, where feature extraction is performed at the client and recognition is performed at the server. During VTLN estimation using a grid search over warping factors, since it would be impractical for the client to recompute and transmit features for each warping factor, warped features would have to be computed at the server. With a linear transform, only the cepstral transformation matrices
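A minimal sketch of the transform-side view described above, assuming the warped cepstra are obtained as c → A c (+ b) for a per-warp-factor matrix A; the matrix here is a placeholder, whereas the paper derives A from the FW function.

```python
import numpy as np

def apply_cepstral_lt(cepstra, A, b=None):
    """Apply a cepstral-domain linear transform for VTLN (sketch).

    cepstra: (T, n) array of unwarped MFCC frames.
    A:       (n, n) transform matrix for one candidate warp factor.
    b:       optional (n,) additive bias.

    Because warping reduces to a matrix product on stored cepstra, a
    DSR server needs only the client's unwarped features plus one small
    matrix per warp factor, instead of re-running feature extraction.
    """
    out = cepstra @ A.T
    if b is not None:
        out = out + b
    return out
```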