TAMU CSCE 689 - kuhn2000speakerAdaptationEigenvoice - D708390

School: Texas A&M University

Course: CSCE 689 - Special Topics

Pages: 13

IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 8, NO. 6, NOVEMBER 2000, p. 695

Rapid Speaker Adaptation in Eigenvoice Space

Roland Kuhn, Jean-Claude Junqua, Member, IEEE, Patrick Nguyen, and Nancy Niedzielski

Abstract: This paper describes a new model-based speaker adaptation algorithm called the eigenvoice approach. The approach constrains the adapted model to be a linear combination of a small number of basis vectors obtained offline from a set of reference speakers, and thus greatly reduces the number of free parameters to be estimated from adaptation data. These "eigenvoice" basis vectors are orthogonal to each other and guaranteed to represent the most important components of variation between the reference speakers. Experimental results for a small-vocabulary task (letter recognition) given in the paper show that the approach yields major improvements in performance for tiny amounts of adaptation data. For instance, we obtained 16% relative improvement in error rate with one letter of supervised adaptation data, and 26% relative improvement with four letters of supervised adaptation data. After a comparison of the eigenvoice approach with other speaker adaptation algorithms, the paper concludes with a discussion of future work.

Index Terms: Eigenvoice approach, principal component analysis, speaker adaptation, speaker clustering.

Manuscript received May 17, 1999; revised April 14, 2000. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Rafid A. Sukkar. The authors are with the Panasonic Speech Technology Laboratory, Panasonic Technologies, Inc., Santa Barbara, CA 93105 USA (e-mail: kuhn@stl.research.panasonic.com). Publisher Item Identifier S 1063-6676(00)09262-2. 1063-6676/00$10.00 © 2000 IEEE

I. INTRODUCTION

When a speaker-dependent (SD) system trained on speech data from a given speaker is tested on other speech from that speaker, the error rate may be as low as a half to a third that of a similar speaker-independent (SI) speech recognition system tested on the same data [18], [29]. The goal of research on speaker adaptation is to achieve performance on each new speaker approaching that of an SD system for that speaker, while avoiding the need for unacceptably large amounts of adaptation data for each new speaker.

The meaning of "unacceptably large" depends on the application. Requiring the purchaser of a dictation system to train the system for 30 to 40 min may be acceptable, since he or she is planning to use the system for years to come. On the other hand, in many commercially attractive applications, such as ordering items over the telephone, one can only count on a few seconds of unsupervised speech.

This paper addresses the latter case, describing a new model-based rapid speaker adaptation algorithm called the eigenvoice approach. Model-based algorithms differ from other adaptation algorithms, such as speaker normalization [39], in that they adapt to a new speaker by modifying the parameters of the system's speaker model. Standard model-based algorithms, such as maximum a posteriori (MAP) adaptation [14], [15], [42] and maximum likelihood linear regression (MLLR) adaptation [12], [31], [32], require significant amounts of adaptation data from the new speaker in order to perform better than a similar SI system.

However, in the last two years, model-based algorithms that achieve rapid speaker adaptation have been devised [13], [17], [18], [25], [27]. These "speaker space" algorithms constrain the adapted model to be a linear combination of a small number of basis vectors obtained offline from a set of reference speakers, and thus greatly reduce the number of free parameters to be estimated from adaptation data. These algorithms are related to an older approach, speaker clustering [11], [24]. In their application of a priori constraints derived from reference speakers, they also resemble extended MAP (EMAP) [28], [38], which employs precomputed correlations between acoustic units to estimate unseen distributions, though the details are quite different. All these model-based speaker adaptation algorithms, and some others, are discussed in the paper.

Unlike other speaker space algorithms, the eigenvoice approach finds basis vectors that are orthogonal to each other and guaranteed to represent the most important components of variation between the reference speakers. Furthermore, the approach allows the number of basis vectors employed during recognition (i.e., the number of degrees of freedom for the adapted model) to vary dynamically.

We give experimental results for the eigenvoice approach on a small-vocabulary task (letter recognition), showing that eigenvoice-based Gaussian mean adaptation can produce major improvements in performance for tiny amounts of adaptation data. For instance, we obtained 16% relative improvement in error rate with one letter of supervised adaptation data, and 26% relative improvement with four letters of supervised adaptation data. We look at what the eigenvoices tell us about inter-speaker variation. Finally, we outline future work, discussing how the approach could be extended to estimate other HMM parameters besides the Gaussian means, and how it could be modified for application in large-vocabulary systems.

II. EIGENVOICE APPROACH

A. Eigenfaces

"There are many examples of families of patterns for which it is possible to obtain a useful systematic characterization. Often, the initial motivation might be no more than the intuitive notion that the family is low dimensional, that is, in some sense, any given member might be represented by a small number of parameters. Possible candidates for such families of patterns are abundant both in nature and in the literature. Such examples include turbulent flows, human speech, and the subject of this correspondence, human faces." [23, pg. 103]

Our work on eigenvoices was inspired by current research on face recognition. As the quotation above suggests, there are hidden similarities between the study of faces and the study of voices; techniques applied in one area may be helpful in the other. Face recognition is the problem of trying to match a given two-dimensional (2-D) face image to a set of face images in a database. Initially, researchers applied general-purpose image processing techniques to this problem. However, building on the work of Kirby and Sirovich [23], they soon realized that the dimensionality of "face space" (the space of variation between photographs of human faces with the same orientation and scale, lit in the same way) is much smaller than the
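
The speaker-space constraint described in the Introduction (an adapted model restricted to a linear combination of a few orthogonal basis vectors derived from reference speakers) can be illustrated with a small sketch. The Python below is a toy under stated assumptions, not the authors' implementation. It treats each reference speaker as a hypothetical 4-dimensional vector of model means, finds K = 2 orthonormal basis vectors by principal component analysis (done here with power iteration, chosen only to keep the example free of dependencies), and adapts to a new speaker by projecting a noisy estimate onto that basis. The projection step is a stand-in: this excerpt does not say how the paper estimates the weights. All function names and numbers are made up for illustration.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def top_eigenvectors(centered, k, iters=200, seed=0):
    """Top-k principal directions of mean-centered rows.

    Power iteration with deflation on the scatter matrix X^T X
    (never formed explicitly). Assumes the data has rank >= k.
    """
    rng = random.Random(seed)
    dim = len(centered[0])
    basis = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            # w = X^T (X v): one multiplication by the scatter matrix.
            proj = [dot(row, v) for row in centered]
            w = [sum(p * row[j] for p, row in zip(proj, centered))
                 for j in range(dim)]
            # Deflate: remove components along directions already found.
            for b in basis:
                c = dot(w, b)
                w = [x - c * y for x, y in zip(w, b)]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        basis.append(v)
    return basis


def adapt(estimate, mean, basis):
    """Project an estimate onto mean + span(basis); return (weights, adapted)."""
    offset = [x - m for x, m in zip(estimate, mean)]
    # A plain dot product gives each weight because the basis is orthonormal.
    weights = [dot(offset, b) for b in basis]
    adapted = list(mean)
    for w, b in zip(weights, basis):
        adapted = [a + w * y for a, y in zip(adapted, b)]
    return weights, adapted


# Toy reference speakers (made-up numbers): variation lies in a 2-D subspace.
base = [1.0, 2.0, 3.0, 4.0]
d1 = [1.0, 1.0, 0.0, 0.0]
d2 = [0.0, 0.0, 1.0, -1.0]
coeffs = [(2.0, 0.5), (-2.0, -0.5), (1.0, -1.0), (-1.0, 1.0), (0.0, 0.0)]
reference = [[m + a * x + b * y for m, x, y in zip(base, d1, d2)]
             for a, b in coeffs]

mean = mean_vector(reference)
centered = [[x - m for x, m in zip(row, mean)] for row in reference]
eigenvoices = top_eigenvectors(centered, k=2)

# New speaker: a "true" model plus estimation noise. The noise is chosen to be
# orthogonal to the eigenspace purely so that the effect is easy to see.
true_new = [2.5, 3.5, 2.3, 4.7]
noisy = [t + n for t, n in zip(true_new, [0.3, -0.3, 0.2, 0.2])]
weights, adapted = adapt(noisy, mean, eigenvoices)
print("weights:", [round(w, 3) for w in weights])
print("adapted:", [round(a, 3) for a in adapted])
```

In this toy case only two weights are estimated instead of four free mean components, and the projection discards the off-subspace noise, so `adapted` lands close to `true_new`. Real adaptation noise will not be so conveniently aligned, and the signs of the weights depend on the arbitrary sign of each basis vector.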
