An Overview of Statistical Pattern Recognition Techniques for Speaker Verification

Amin Fazel, Student Member, IEEE, and Shantanu Chakrabartty, Senior Member, IEEE
Michigan State University

Abstract

Even though speaker verification has been investigated for several decades, numerous challenges and new opportunities in robust recognition techniques are still being explored. In this overview paper we first provide a brief introduction to statistical pattern recognition techniques that are commonly used in speaker verification. This includes an overview of some of the basic functional units: (a) speech acquisition and speech feature extraction; (b) speaker modeling; and (c) classification. The second part of the paper presents some of the commonly used techniques that make real-world speaker verification systems robust to degradation caused by ambient noise, channel variations, aging effects, and the limited availability of training samples. The paper concludes with a discussion of some classical and emerging applications of speaker verification systems, along with future trends and research opportunities.

1. Introduction

Speaker verification is a popular biometric identification technique [1] used for authenticating and monitoring human subjects using their speech signal. The method is attractive for two main reasons: (a) it does not require direct contact with the individual, thus avoiding the hurdle of “perceived invasiveness” inherent in many biometric systems such as iris and fingerprint recognition systems; (b) it does not require deployment of specialized signal transducers, as microphones are now ubiquitous on most portable devices (cellular phones, PDAs and laptops).
According to “The Biometric Industry Report, Forecasts and Analysis, 2006”, the market for speaker verification systems is expected to grow to approximately $100 million [2]. This growth is also evident from the large number of business ventures that are actively supporting product development in speaker verification/recognition [2]. The key applications driving this demand for speaker verification/identification technology are tele-commerce and forensics [3], where the objective is to automatically authenticate speakers of interest using their conversations over a voice channel (telephone or wireless phone). Also, with the ever-increasing popularity of multimedia web portals (such as Facebook and YouTube), large repositories of archived spoken documents such as TV broadcasts, teleconference meetings, and personal video clips can be accessed through the internet. Searching these multimedia documents for metadata such as the topic of discussion or the names and genders of participants would require automated technology like speaker verification and recognition.

Traditionally, speaker verification systems have been classified into two categories based on the constraints imposed on the authentication process: (a) text-dependent speaker verification systems, where the users are assumed to be “cooperative” and use an identical pass-phrase during the training and testing phases; and (b) text-independent speaker verification systems, where no vocabulary constraints are imposed on the training and testing phases. Text-independent speaker verification systems can be further categorized into either vocabulary-constrained or unconstrained text-independent speaker verification systems. Unlike a vocabulary-constrained system, an unconstrained text-independent speaker verification system does not assume any prior knowledge about the spoken text. While the constraints imposed in the text-dependent case greatly improve accuracy in the presence of channel and background noise, the lack of constraints makes the text-independent case more challenging. For instance, since there are no constraints on the words that the speakers are allowed to use, the reference utterances (what is spoken in training) and the test utterances (what is uttered in actual use) may have completely different content, and hence the verification system has to take phonetic mismatch into account.

Figure 1. An example illustrating one source of mismatch in a speaker verification system. Reference data could be acquired from the internet to train speaker verification models, which are then stored on an authentication server. However, during testing, the data is acquired through a mobile interface, which could be prone to background noise and wireless channel distortion. The central authentication system has to handle the mismatch between the training and testing conditions to be effective in real-life deployment.

Other sources of challenges applicable to both text-independent and text-dependent speaker verification include compensating for changes in the acoustic environment, such as transducer or channel variations, and compensating for “within-speaker” variations such as changes in state of health, mood, or aging. An example scenario is shown in Figure 1, where the enrollment data used in developing recognition models is acquired over the internet, whereas during verification or testing a mobile interface is used for acquiring the speech data. Compensating for such mismatch between training and test conditions has been, and still remains, the most challenging problem [4,5] in the design of speaker verification systems. Over the last four decades, several speaker verification techniques have been reported to address this challenge [6,7,8,9,10,11], which makes a thorough review of the field challenging.
Therefore, in this paper we will occasionally refer the reader to several excellent survey papers [12,13,14], books and monographs [15,16,17] for details of some of the discussed techniques. The focus of this paper will be to survey some of the state-of-the-art statistical pattern recognition techniques that have been employed for designing speaker verification systems. Our emphasis will be to cover topics in noise-robustness and speaker

Figure 2. Fundamentals of speech biometrics: (a) Magnetic resonance image showing the anatomy of the speech production apparatus. The biometric property of the speech signal is determined by the shape of the vocal tract and the orientation of the mouth, teeth and nasal passages. (b) Spectrograms corresponding to a sample utterance “fifty-six thirty-five seventy-two” for a male and a female speaker. The horizontal axis represents the time, the vertical axis corresponds to the

