COMPARISON OF LARGE MARGIN TRAINING TO OTHER DISCRIMINATIVE METHODS FOR PHONETIC RECOGNITION BY HIDDEN MARKOV MODELS

Fei Sha*
Computer Science Division
University of California
527 Soda Hall
Berkeley, CA 94720-1776

Lawrence K. Saul†
Department of Computer Science and Engineering
University of California (San Diego)
9500 Gilman Drive, Mail Code 0404
La Jolla, CA 92093-0404

* Part of the work was performed at the University of Pennsylvania.
† This work is supported by the National Science Foundation under Grant Number 0238323 and the UCSD FWGrid Project, NSF Research Infrastructure Grant Number NSF EIA-0303622.

ABSTRACT

In this paper we compare three frameworks for discriminative training of continuous-density hidden Markov models (CD-HMMs). Specifically, we compare two popular frameworks, based on conditional maximum likelihood (CML) and minimum classification error (MCE), to a new framework based on margin maximization. Unlike CML and MCE, our formulation of large margin training explicitly penalizes incorrect decodings by an amount proportional to the number of mislabeled hidden states. It also leads to a convex optimization over the parameter space of CD-HMMs, thus avoiding the problem of spurious local minima. We used discriminatively trained CD-HMMs from all three frameworks to build phonetic recognizers on the TIMIT speech corpus. The different recognizers employed exactly the same acoustic front end and hidden state space, thus enabling us to isolate the effect of different cost functions, parameterizations, and numerical optimizations. Experimentally, we find that our framework for large margin training yields significantly lower error rates than both CML and MCE training.

Index Terms— speech recognition, discriminative training, MMI, MCE, large margin, phoneme recognition

1. INTRODUCTION

Most modern speech recognizers are built from continuous-density hidden Markov models (CD-HMMs). The hidden states in these CD-HMMs model different phonemes or sub-phonetic elements, while the observations model cepstral feature vectors. Distributions of cepstral feature vectors are most often represented by Gaussian mixture models (GMMs). The accuracy of the recognizer depends critically on the careful estimation of GMM parameters.

In this paper, we present a systematic comparison of several leading frameworks for parameter estimation in CD-HMMs. These frameworks include a recently proposed scheme based on the goal of margin maximization [1, 2], an idea that has been widely applied in the field of machine learning. We compare the objective function and learning algorithm in this framework for large margin training to those of other traditional approaches for parameter estimation in CD-HMMs. The most basic of these traditional approaches involves maximum likelihood (ML) estimation. Mainly, however, we focus on competing discriminative methods in which parameters are estimated directly to maximize the conditional likelihood [3, 4] or minimize the classification error rate [5]. Though not as straightforward to implement as ML estimation, discriminative methods yield much lower error rates on most tasks in automatic speech recognition (ASR).

We investigate salient differences between CML, MCE, and large margin training through carefully designed experiments on the TIMIT speech corpus [6]. Though much smaller than typical corpora used for large vocabulary ASR, the TIMIT corpus provides an apt benchmark for evaluating the intrinsic merits of different frameworks for discriminative training.
We compare the phonetic error rates on the TIMIT corpus from multiple systems trained with different parameterizations, initial conditions, and learning algorithms. All other aspects of these systems, however, were held fixed. In particular, the different systems employed exactly the same acoustic front end and model architectures (e.g., monophone CD-HMMs with full Gaussian covariance matrices). From the results of these experiments, we are able to tease apart the significant factors that differentiate competing methods for discriminative training.

The paper is organized as follows. In section 2, we review CD-HMMs as well as several different methods for parameter estimation, including our own recent formulation of large margin training [1, 2]. In section 3, we compare the performance of phonetic recognizers trained in all these different frameworks. Finally, in section 4, we conclude with a brief discussion of future directions for research.

2. PARAMETER ESTIMATION IN HMMS

CD-HMMs define a joint probability distribution over a hidden state sequence S = {s_1, s_2, ..., s_T} and an observed output sequence X = {x_1, x_2, ..., x_T}, given by

    \log p(X, S) = \sum_t \left[ \log p(s_t \mid s_{t-1}) + \log p(x_t \mid s_t) \right].    (1)

For ASR, the hidden states s_t and observed outputs x_t denote phonetic labels and acoustic feature vectors, respectively, and the distributions p(x_t | s_t) are typically modeled by multivariate GMMs:

    p(x_t \mid s_t = j) = \sum_{m=1}^{M} \omega_{jm} \, \mathcal{N}(x_t; \mu_{jm}, \Sigma_{jm}).    (2)

In eq. (2), we have used N(x; μ, Σ) to denote the Gaussian distribution with mean vector μ and covariance matrix Σ, while the constant M denotes the number of mixture components per GMM. The mixture weights ω_jm in eq. (2) are constrained to be nonnegative and normalized: \sum_m \omega_{jm} = 1 for all states j.

Let θ denote all the model parameters including transition probabilities, mixture weights, mean vectors, and covariance matrices. The goal of parameter estimation in CD-HMMs is to compute the optimal θ* (with respect to a particular measure of optimality), given N pairs of observation and target label sequences {X_n, Y_n}, n = 1, ..., N.

In what follows, we review the optimizations for well-known frameworks based on maximum likelihood (ML), conditional maximum likelihood (CML), and minimum classification error (MCE). We also review our most recently proposed framework for large margin training [2].

2.1. Maximum likelihood estimation

The simplest approach to parameter estimation in CD-HMMs maximizes the joint likelihood of output and label sequences. The corresponding estimator is given by

    \theta_{\mathrm{ML}} = \arg\max_\theta \sum_n \log p(X_n, Y_n).    (3)

For transition probabilities, ML estimates in this setting are obtained from simple counts (assuming the training corpus provides phonetic label sequences). For GMM parameters, the EM algorithm provides iterative update rules that converge monotonically to local stationary points of the likelihood. The main attraction of the EM algorithm is that no free parameters need to be tuned for its convergence.
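As a rough illustration of eqs. (1) and (2), the Python sketch below scores one labeled sequence under a CD-HMM with GMM emissions. It is not part of the paper: the function and variable names (joint_log_prob, log_trans, gmms) are our own placeholders, SciPy is assumed for the Gaussian density, and the initial-state term at t = 1 is omitted for brevity.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(x, weights, means, covs):
    """log p(x | s = j) for one state's M-component GMM, as in eq. (2)."""
    log_terms = [np.log(w) + multivariate_normal.logpdf(x, mean=mu, cov=sigma)
                 for w, mu, sigma in zip(weights, means, covs)]
    return logsumexp(log_terms)

def joint_log_prob(X, S, log_trans, gmms):
    """log p(X, S) as in eq. (1): transition terms plus emission terms.

    X         -- (T, d) array of acoustic feature vectors x_t
    S         -- length-T list of hidden state indices s_t
    log_trans -- (num_states, num_states) array of log p(s_t | s_{t-1})
    gmms      -- dict mapping state j to a (weights, means, covs) tuple
    """
    total = 0.0
    for t in range(len(S)):
        if t > 0:                                  # initial-state term omitted
            total += log_trans[S[t - 1], S[t]]
        total += gmm_log_likelihood(X[t], *gmms[S[t]])
    return total

Working entirely in the log domain (via logsumexp) avoids the numerical underflow that would occur if the Gaussian densities of long sequences were multiplied directly.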
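For the transition probabilities, section 2.1 notes that ML estimates follow from simple counts when the phonetic label sequences are given. A minimal sketch of that counting step, with the function name and optional smoothing constant added as our own assumptions, could read:

import numpy as np

def ml_transition_probs(state_sequences, num_states, smoothing=0.0):
    """Estimate p(s_t = j | s_{t-1} = i) by normalized bigram counts."""
    counts = np.full((num_states, num_states), smoothing, dtype=float)
    for seq in state_sequences:
        for prev, curr in zip(seq[:-1], seq[1:]):
            counts[prev, curr] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums > 0.0, row_sums, 1.0)  # guard empty rows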
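For the GMM parameters, the EM updates mentioned above can be sketched for a single state, assuming (as in this ML setting) that the frames X belonging to that state are known from the phonetic labels. This is only an illustrative simplification, not the paper's implementation; min_var is an added regularizer for numerical stability.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs, min_var=1e-6):
    """One E/M update for a single state's GMM on its assigned frames X."""
    T, d = X.shape
    M = len(weights)

    # E-step: posterior responsibility of each mixture component per frame.
    log_resp = np.array([np.log(weights[m])
                         + multivariate_normal.logpdf(X, mean=means[m], cov=covs[m])
                         for m in range(M)])                  # shape (M, T)
    log_resp -= logsumexp(log_resp, axis=0, keepdims=True)
    resp = np.exp(log_resp)

    # M-step: re-estimate mixture weights, means, and covariances.
    Nm = resp.sum(axis=1)                                     # effective counts
    new_weights = Nm / T
    new_means = (resp @ X) / Nm[:, None]
    new_covs = []
    for m in range(M):
        diff = X - new_means[m]
        cov = (resp[m][:, None] * diff).T @ diff / Nm[m]
        new_covs.append(cov + min_var * np.eye(d))            # keep covariances PD
    return new_weights, new_means, new_covs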
2.2. Conditional maximum likelihood