TAMU CSCE 689 - kuhn2000speakerAdaptationEigenvoiceSLIDES

Heeyoul (Henry) Choi
Dept. of Computer Science, Texas A&M University
hchoi@cs.tamu.edu

Outline
- Introduction
- Speaker adaptation
- Eigenvoice
- Comparison with other methods: MAP, MLLR, EMAP, RMP, CAT, RSW
- Experiments
- Future work
- Summary

Speaker Adaptation
- Speaker-dependent (SD) system vs. speaker-independent (SI) system.
- Speaker adaptation: finding an SD system for a new speaker from a small amount of data.
- This paper makes adaptation faster using the eigenvoice approach.

Model-Based Algorithms
- Adapt to a new speaker by modifying the parameters of the system's speaker model.
- Examples: maximum a posteriori (MAP) and maximum likelihood linear regression (MLLR).
- They require significant amounts of adaptation data from the new speaker.

Speaker-Space Algorithms
- Constrain the adapted model to be a linear combination of a small number of basis vectors derived from the reference speakers.
- Faster and more robust.
- Related to speaker clustering in that both reduce the dimension of the parameter space to be searched.
- Resemble extended MAP (EMAP) in that both use a priori information from the reference speakers; here, the prior information is used to shrink the parameter space.
- Eigenvoice is one of these algorithms.

Eigenvoice
- Finds basis vectors that are orthogonal to each other and efficient in the sense of captured variation; it has all the properties of principal component analysis (PCA), applied here to the parameter space.
- Eigenvoice is the analogue of eigenface for face images: a face is a weighted sum of eigenfaces, which are the eigenvectors of a set of face images.
- PCA: components are ordered by eigenvalue, and truncation guarantees the minimum mean-squared reconstruction error.
- The analogy carries over from face recognition to speaker recognition.

How?
- Speaker recognition vs. speech recognition: an efficient representation of speakers gives a good SD model for the new speaker, which in turn gives speech recognition for the new speaker.
- Tools: HMMs, EM, PCA.

Supervector
- The model parameters, i.e., the means of the HMM output Gaussians, concatenated into one long vector.
- It is not the voice itself that is decomposed; strictly, these are "eigen model parameters."
- Instead of PCA one could use independent component analysis (ICA), giving a "factorialvoice" (as in factorialface), or linear discriminant analysis (LDA), giving a generalized eigenvoice (a correlation matrix is used instead of a covariance matrix).

Shared States from the SI Model
- The hidden states of every SD model are taken from the SI model: the data is insufficient for SD training but enough for SI training, so SI plus a small amount of data suffices to build an SD model.
- The hidden states act, loosely, as speaker-invariant features; each speaker's characteristics come from the mixture of Gaussians in each state after adaptation.
- It therefore makes sense to build the new speaker's model with the same states and transition probabilities as the SI model.

Adaptation as Finding K Parameters
- The adaptation procedure is redefined as finding K parameters: only the weights of the eigenvoices are estimated instead of all model parameters, which is computationally cheap.
- It requires only a small amount of adaptation data (think of the curse of dimensionality).
- It is also robust against noise: the discarded eigenvectors, whose eigenvalues are too small, are likely to represent noise.
- The weights are initialized, and the other parameters (variances and transition probabilities) are taken from the SI model.
- The mean supervector for the new speaker is

    \hat{\mu} = \sum_{j=1}^{K} w_j \, e(j),

  where the e(j) are the eigenvoices. The problem is to estimate the weights w_j from the data.

Maximum Likelihood Eigen-Decomposition (MLED)
- Gaussian mean adaptation in a continuous-density HMM (CDHMM).
- Likelihood of the adaptation data O = (o_1, ..., o_T), where \lambda denotes the Gaussian means:

    P(O \mid \lambda) = \sum_{s, m} P(O, s, m \mid \lambda),

  with s ranging over state sequences and m over mixture components.
- EM auxiliary function:

    Q(\lambda, \hat{\lambda}) = \sum_{s, m} P(O, s, m \mid \lambda) \, \log P(O, s, m \mid \hat{\lambda}),

  where each P(o_t \mid s, m) is a Gaussian density.
- Setting \partial Q / \partial w_j = 0 for j = 1, ..., K yields K linear equations in the K weights; solving them gives the ML estimates of the Gaussian means within the eigenspace (a sketch follows below).
- The parameters are thus reduced from the D-dimensional means to the K-dimensional weights (K << D).
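The slides compress the MLED math, so here is a minimal NumPy sketch of the two steps: PCA over the reference speakers' supervectors, then solving the K x K normal equations for the weights. The diagonal-covariance assumption, the centering on the mean supervector ("eigenvoice 0"), and all names are illustrative choices, not taken from the slides.

    import numpy as np

    def train_eigenvoices(sd_supervectors, K):
        """PCA over R reference-speaker supervectors (R x D).
        Returns the mean supervector and the top-K eigenvoices (K x D)."""
        mean_voice = sd_supervectors.mean(axis=0)
        X = sd_supervectors - mean_voice
        # Rows of Vt are the principal directions, ordered by singular value
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        return mean_voice, Vt[:K]

    def mled_weights(E, mu0, inv_var, gamma, obs):
        """Solve the K x K MLED normal equations for one new speaker,
        assuming diagonal Gaussian covariances.
        E       : (K, M, d) eigenvoices split into per-Gaussian subvectors
        mu0     : (M, d)    mean-voice means, one row per Gaussian
        inv_var : (M, d)    diagonal inverse variances of the M Gaussians
        gamma   : (T, M)    occupation probabilities from forward-backward
        obs     : (T, d)    adaptation observations
        """
        occ = gamma.sum(axis=0)                 # total occupancy per Gaussian
        first = gamma.T @ obs                   # sum_t gamma_m(t) o_t
        resid = first - occ[:, None] * mu0      # first-order stats around the mean voice
        Ew = E * inv_var[None]                  # eigenvoices scaled by inverse variances
        # A[j,k] = sum_m occ_m e_j(m)^T Sigma_m^{-1} e_k(m)
        # b[j]   = sum_m e_j(m)^T Sigma_m^{-1} resid_m
        A = np.einsum('m,jmd,kmd->jk', occ, Ew, E)
        b = np.einsum('jmd,md->j', Ew, resid)
        return np.linalg.solve(A, b)            # (K,) eigenvoice weights w

    # Adapted means for the new speaker: mu_hat = mu0 + sum_j w_j e_j
    # mu_hat = mu0 + np.tensordot(w, E, axes=1)

In a full system the gamma statistics come from an E-step against the current model, so MLED is iterated (the experiments below use 2 iterations).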
Comparison: Model-Based Algorithms

Maximum a Posteriori (MAP)
- Uses prior information about the parameters via Bayes' rule.
- Updates only the parameters of Gaussians that actually receive observations o_t.
- The number of free parameters is large.

Maximum Likelihood Linear Regression (MLLR)
- Updates all the parameters through a linear regression (a shared transform of the means).
- Much less constrained by prior knowledge; only a little comes in through the SI model.
- The number of free parameters is small: the transformation matrix.

Eigenvoice
- Puts a heavy emphasis on prior knowledge through the eigenvectors, and updates all the parameters through the weights.
- Each eigenvector is D-dimensional, yet the number of free parameters (K weights) is smaller than MLLR's.

Extended Maximum a Posteriori (EMAP)
- Converges faster by exploiting the correlations between observations: one observation updates all correlated parameters, in proportion to how strongly they are related.

Regression-Based Model Prediction (RMP)
- EMAP carried over to CDHMMs; still faster than MAP, with better performance than MAP.
- In effect, a mixture of MLLR and MAP.

Speaker Clustering
- Hard speaker clustering: clusters of reference speakers give SI models (cf. codewords); when the new speaker's data is available, one model is chosen, and then MLLR can be used.
- Soft speaker clustering (cluster adaptive training, CAT): the new speaker's model is a linear combination of the reference speakers' models.
- Clustering + MLLR: clustering acts as a prior for MLLR, which by itself has little prior information.

Reference Speaker Weighting (RSW)
- The new speaker's model is a linear combination of the reference models; the rest is the same as the eigenvoice method.
- Works well with medium- or large-vocabulary systems.
- For class p and speaker r, c(p, r) is the per-speaker centroid (speaker-dependent), and v is speaker-independent; the new speaker's vector m^S is obtained by maximum likelihood, which is equivalent to finding the weights.
- As the number R of reference speakers grows, the method becomes more expensive in memory and computation.

Experiments: Vowel Classification
- PCA is applied to the parameters of a vowel classifier (a mixture of Gaussians).
- Database: R = 120 reference speakers and 30 test speakers.
- D = 2808 parameter dimensions: 26 characters x 6 states (single Gaussian each) x 18 features = 2808.
- PCA yields eigenvoices 0 through K; MLED is run for 2 iterations.
- In the figure, MLED(5) denotes MLED with K = 5, and MLED(10)+MAP denotes MAP with the MLED(10) model as the prior (otherwise the SI model is the prior).

Interpreting the Eigenvoices
- Such readings are sensitive to changes in the reference speaker models; as I. T. Jolliffe cautions, they are "by no means universally true."
- 1st eigenvector: sex (strong); 2nd: amplitude (strong); 3rd: second formant (maybe); 4th: changes in pitch (maybe).

Extensions of the Eigenvoice Approach
- Hybrids, such as MLED followed by MAP in Fig. 2 (done). How about allowing K to rise as the data increases? MLED followed by MLLR.
- Discriminative training of the reference SD models; environment adaptation.
- LDA rather than PCA; learning the basis vectors by ML rather than by PCA.
- Eigenvoice adaptation of the state transition probabilities and Gaussian standard deviations. (Presenter's comment: I don't think so, though there are definitely some correlations between the states and the Gaussians.)

Training Models for Large-Vocabulary Systems
- There will be insufficient data per reference speaker, and the computational and storage requirements of this naïve extension of the small-vocabulary methodology would be onerous.
- Principles: inter-speaker variability lives in the K-dimensional space; intra-speaker variability by the ... [the preview ends here]
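The comparison above counts MLLR's free parameters as "the transformation matrix." For concreteness alongside the MLED sketch, here is a minimal sketch of estimating one global MLLR mean transform, using the standard row-wise closed form for diagonal covariances (after Leggetter and Woodland); the single-regression-class setup and all names are illustrative assumptions, not from the slides.

    import numpy as np

    def mllr_mean_transform(means, inv_var, gamma, obs):
        """Estimate a global MLLR transform W = [b A] for the Gaussian means.
        means   : (M, d) SI Gaussian means
        inv_var : (M, d) diagonal inverse variances
        gamma   : (T, M) occupation probabilities
        obs     : (T, d) adaptation observations
        Returns W of shape (d, d+1); the adapted mean of Gaussian m is W @ xi[m].
        """
        M, d = means.shape
        xi = np.hstack([np.ones((M, 1)), means])   # extended means [1, mu]
        occ = gamma.sum(axis=0)
        first = gamma.T @ obs                      # sum_t gamma_m(t) o_t
        W = np.zeros((d, d + 1))
        for i in range(d):                         # rows decouple under diagonal covariance
            g = inv_var[:, i]
            G = (xi * (g * occ)[:, None]).T @ xi   # (d+1, d+1) accumulator
            k = xi.T @ (g * first[:, i])
            W[i] = np.linalg.solve(G, k)
        return W

Note that W has d(d+1) free parameters regardless of how much data arrives, whereas MLED estimates only K weights; this is the parameter-count contrast the comparison slides draw.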

