Columbia CS 4706 - Building an ASR using HTK - D3011657


Course: CS 4706 - Spoken Language Processing

Pages: 40


Building an ASR using HTK
CS4706
Fadi Biadsy
Mar 23rd, 2011

Outline
  Speech Recognition
  Feature Extraction
  Modeling Speech
  Hidden Markov Models (HMM): 3 basic problems
  HMM Toolkit (HTK)
  Steps for building an ASR using HTK

Automatic Speech Recognition (ASR)
  Speech signal to text: Speech signal -> ASR -> Transcript

It's hard to recognize speech
  "It's hard to recognize speech" vs. "It's hard to wreck a nice beach"
  Contextual effects
    Speech sounds vary within contexts
    "How do you do?"
    "Half and half"
    /t/ in "butter" vs. "bat"
  Within-speaker variability
    Speaking rate, intensity, F0 contour
    Voice quality
    Speaking style: formal vs. spontaneous register
    Speaker state: emotion, sleepy, drunk
  Between-speaker variability
    Gender and age
    Accents, dialects, native vs. non-native
    Scottish vs. American /r/ in some contexts
  Environment variability
    Background noise
    Microphone type

Feature Extraction
  Waveform -> Spectrogram
  We need a representation of the speech signal that is robust to acoustic variation but sensitive to linguistic content.

Feature Extraction
  Extract features from short frames (frame period 10 ms, frame size 25 ms), giving a sequence of features.

Feature Extraction: MFCC
  Mel scale
    Approximates the unequal sensitivity of human hearing at different frequencies
    Based on pitch perception

Feature Extraction: MFCC
  MFCC: mel-frequency cepstral coefficient
  Widely used in speech recognition
  1. Take the Fourier transform of the signal (spectrum)
  2. Map the powers of the spectrum to the mel scale and take the log
  3. Take the discrete cosine transform of the mel log-amplitudes
  4. The MFCCs are the amplitudes of the resulting spectrum

Feature Extraction: MFCC
  Extract a feature vector from each frame:
    12 MFCC coefficients + 1 normalized energy = 13 features
    Delta MFCC = 13
    Delta-Delta MFCC = 13
    Total: 39 features
  [Figure: inverted MFCCs; 39-D feature vector]
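The frame and feature arithmetic from the Feature Extraction slides can be checked with a small sketch. This is only an illustration under stated assumptions: the 16 kHz sample rate is not given in the slides, the function names are hypothetical, and the deltas use a plain central difference rather than the regression formula a real front end such as HTK applies.

```python
def num_frames(n_samples, sample_rate, frame_size_ms=25.0, frame_period_ms=10.0):
    """Number of full frames when sliding a 25 ms window every 10 ms."""
    frame_len = int(round(sample_rate * frame_size_ms / 1000))
    frame_shift = int(round(sample_rate * frame_period_ms / 1000))
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // frame_shift


def deltas(frames):
    """Central-difference deltas, (next - previous) / 2, with edge frames repeated.

    Simplification for illustration; not the HTK regression formula.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_deltas(static):
    """Append delta and delta-delta features to each static feature vector."""
    d = deltas(static)
    dd = deltas(d)
    return [s + x + y for s, x, y in zip(static, d, dd)]


# One second of audio at an assumed 16 kHz sample rate.
print(num_frames(16000, 16000))  # 98

# Five dummy frames of 13 static features (12 MFCCs + energy).
static = [[float(t)] * 13 for t in range(5)]
features = add_deltas(static)
print(len(features[0]))  # 39
```

Each 13-dimensional static vector grows to 13 + 13 + 13 = 39 dimensions, matching the total on the slide.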
Markov Chain
  Weighted finite-state acceptor
  The future is independent of the past given the present

Hidden Markov Model (HMM)
  An HMM is a Markov chain + an emission probability function for each state
  HMM $M = (A, B, \pi)$
    $A$ = transition matrix
    $B$ = observation distributions
    $\pi$ = initial state probabilities

HMM Example
  d aa n aa

Speech Recognition: General Expression
  $\hat{S} = \arg\max_S P(S \mid O, M)$, where $S = W_1, W_2, \ldots, W_N$ and $O = o_1, o_2, \ldots, o_T$
  $\hat{S} = \arg\max_S P(O \mid S, M)\, P(S \mid M)$

HMM: 3 Basic Problems
  I. Evaluation
  II. Decoding
  III. Training

HMM: I. Evaluation
  Given an observation sequence $O$ and a model $M$, how can we efficiently compute $P(O \mid M)$, the likelihood of $O$ given the model?

HMM: II. Decoding
  Given an observation sequence $O$ and a model $M$, how can we obtain the most likely state sequence $Q = \{q_1, q_2, \ldots, q_t\}$?

Viterbi Algorithm
  Efficient algorithm for decoding: $O(TN^2)$
  [Figure: trellis from Start to End over the phones d, aa, n, aa, uw; the best path d aa n aa gives "dana"]

HMM: III. Training
  How do we estimate the model parameters $M = (A, B, \pi)$ to maximize $P(O \mid M)$?
  Baum-Welch algorithm
  Estimate:
  1. Transition matrix $A$
  2. Emission probability distribution
  [Figure: "dana" -> d aa n aa]

Hidden Markov Model Toolkit (HTK)
  HTK is a research toolkit for building and manipulating HMMs
  Primarily designed for building HMM-based ASR systems
  Tools, for example:
    Extracting MFCC features
    HMM algorithms
    Grammar networks
    Speaker adaptation

Steps for building ASR: voice-operated interface for phone dialing
  Examples:
    Dial three three two six five four
    Phone Woodland
    Call Steve Young
  Grammar:
    $digit = ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE | OH | ZERO;
    $name = [ JOOP ] JANSEN | [ JULIAN ] ODELL | [ DAVE ] OLLASON | [ PHIL ] WOODLAND;
    ( SENT-START ( DIAL <$digit> | (PHONE | CALL) $name ) SENT-END )

Convert Grammar to Network (HParse)

Training the system
  Lab files:
    S0001  ONE VALIDATED ACTS OF SCHOOL DISTRICTS
    S0002  TWO OTHER CASES ALSO WERE UNDER ADVISEMENT
    S0003  BOTH FIGURES WOULD GO HIGHER IN LATER YEARS
  Wave files

Words to Phones (using HLEd)
  The HTK scripting language is used to generate phonetic transcriptions for all training data.
  Word-to-phone entries:
    A      ah sp
    A      ax sp
    A      ey sp
    CALL   k ao l sp
    DIAL   d ay ax l sp
    EIGHT  ey t sp
    PHONE  f ow n sp

Extracting MFCC (using HCopy)
  For each wave file, extract MFCC features: .wav -> .mfc files

Specifying Monophone HMM Topology
  5 states: 3 emitting states
  [Figure: HMM for the phone r with states S1 S2 S3 S4 S5]
  Flat start: mean and variance are initialized as the global mean and variance of all the data

Training (HERest)
  For each training pair of files (mfc + lab):
  1. Concatenate the corresponding monophone HMMs
  2. Use the Baum-Welch algorithm to train the HMMs given the MFC features
  [Figure: "One validated acts of school districts"; the 5-state HMMs (S1 to S5) for w, ah, n are concatenated for the word ONE]

Training
  So far, we have all monophone models trained
  Train the short pause (sp) model

Handling Multiple Pronunciations (HVite)
  The dictionary contains multiple pronunciations for some words
  Forced alignment: run Viterbi to get the best pronunciation that matches the acoustics
  [Figure: d ey t ax, with alternative phones ae and dx]

Retrain
  After getting the best pronunciations, train again with the Baum-Welch algorithm using the best pronunciations

Creating Triphone Models (using HLEd)
  Phones may be realized differently in some contexts
  Build context-dependent acoustic models (HMMs)
  Triphones: one preceding and one succeeding phone
  Make triphones from monophones
  Generate a list of all the triphones for which there is at least one example in the training data
  Examples: s-l+ow, b-l+aa, p-l+aa, jh-oy+s, f-iy+t

Tie Triphones (HDMan)
  Clustering by growing decision trees
  All states in the same leaf will be tied
  [Figure: decision tree over triphone states. Questions at the nodes: L = Class-Stop?, L = Nasal?, R = Glide?, R = Nasal?, each with y/n branches; the leaves group triphones such as those of t, ao-r, sh-n, ch-ih and ay-oh.]

After Tying
  Train the acoustic models again using the Baum-Welch algorithm (HERest)
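For the evaluation and decoding problems in the HMM slides, a minimal sketch of the forward and Viterbi recursions may help. It assumes a toy two-state model with discrete observation symbols; the states, symbols and probabilities are invented for illustration, whereas an HTK acoustic model uses continuous emission densities over MFCC vectors. Both functions loop over T observations and N x N state pairs, which is where the O(TN^2) cost comes from.

```python
def forward_likelihood(obs, states, pi, A, B):
    """Evaluation: P(O | M) computed with the forward algorithm."""
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            j: sum(alpha[i] * A[i][j] for i in states) * B[j][o]
            for j in states
        }
    return sum(alpha.values())


def viterbi(obs, states, pi, A, B):
    """Decoding: most likely state sequence and its joint probability."""
    delta = {s: pi[s] * B[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        new_delta, ptr = {}, {}
        for j in states:
            # Best predecessor of state j at this time step.
            best_i = max(states, key=lambda i: delta[i] * A[i][j])
            new_delta[j] = delta[best_i] * A[best_i][j] * B[j][o]
            ptr[j] = best_i
        delta = new_delta
        backpointers.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical toy model: two phone-like states, three discrete symbols.
states = ["d", "aa"]
pi = {"d": 0.6, "aa": 0.4}
A = {"d": {"d": 0.7, "aa": 0.3}, "aa": {"d": 0.4, "aa": 0.6}}
B = {
    "d": {"a": 0.5, "b": 0.4, "c": 0.1},
    "aa": {"a": 0.1, "b": 0.3, "c": 0.6},
}
obs = ["a", "b", "c"]

best_path, best_prob = viterbi(obs, states, pi, A, B)
print(best_path)  # ['d', 'd', 'aa']
print(best_prob)
print(forward_likelihood(obs, states, pi, A, B))
```

A practical decoder would work with log probabilities to avoid underflow on long utterances; plain products are kept here so the recursion stays easy to read.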
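The phone-dialing grammar from the slides can be illustrated by generating random sentences from it. This is a sketch, not HTK code: it assumes that first names are optional (as the example "Phone Woodland" suggests), that DIAL takes one or more digits, and it caps the digit string at seven words, which is an arbitrary choice. The function name is hypothetical.

```python
import random

DIGITS = ["ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX",
          "SEVEN", "EIGHT", "NINE", "OH", "ZERO"]
NAMES = [("JOOP", "JANSEN"), ("JULIAN", "ODELL"),
         ("DAVE", "OLLASON"), ("PHIL", "WOODLAND")]


def generate_sentence(rng, max_digits=7):
    """Return one random word sequence allowed by the dialing grammar."""
    if rng.random() < 0.5:
        # DIAL followed by one or more digits (length cap is an assumption).
        n = rng.randint(1, max_digits)
        body = ["DIAL"] + [rng.choice(DIGITS) for _ in range(n)]
    else:
        # PHONE or CALL followed by a name; the first name is optional.
        first, last = rng.choice(NAMES)
        body = [rng.choice(["PHONE", "CALL"])]
        if rng.random() < 0.5:
            body.append(first)
        body.append(last)
    return ["SENT-START"] + body + ["SENT-END"]


rng = random.Random(0)
for _ in range(3):
    print(" ".join(generate_sentence(rng)))
```

Sentences produced this way are useful as prompts for recording training or test utterances that the recognizer's network is guaranteed to accept.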
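The step of making triphones from monophones and listing every triphone seen in the training data can be sketched as follows. The labels use the left-phone+right notation from the examples above; the function names are hypothetical, and this simplified version ignores word boundaries and any special treatment of sp or sil that a real HLEd script would specify.

```python
def to_triphones(phones):
    """Rewrite a monophone sequence as context-dependent labels (l-p+r).

    The first and last phones have only one neighbour, so they become
    biphones (p+r and l-p); a single phone is left unchanged.
    """
    labels = []
    for i, p in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""
        right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
        labels.append(left + p + right)
    return labels


def seen_triphones(transcriptions):
    """Sorted list of every context-dependent label seen in the data."""
    seen = set()
    for phones in transcriptions:
        seen.update(to_triphones(phones))
    return sorted(seen)


print(to_triphones(["s", "l", "ow"]))  # ['s+l', 's-l+ow', 'l-ow']
print(seen_triphones([["b", "l", "aa"], ["p", "l", "aa"]]))
```

Only labels that occur at least once end up in the list, which is why unseen contexts later need tying or clustering to obtain a model.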
