Columbia COMS W4705 - Automatic Speech Recognition

Slide outline
• What is speech recognition?
• "It's hard to ... recognize speech / wreck a nice beach"
• Again, the Noisy Channel Model
• What do we need to build/use an ASR system?
• Training and Test Corpora
• Representing the Signal
• Pronunciation Model
• Acoustic Models
• Language Model
• Search/Decoding
• Varieties of Speech Recognition
• Challenges for Transcription
• Challenges for Understanding
• An Unsuccessful Dialogue
• Summary
• Disfluencies and Self-Repairs

CS 4705: Automatic Speech Recognition
• Opportunity to participate in a new user study for Newsblaster and get $25-$30 for 2.5-3 hours of time, respectively.
• http://www1.cs.columbia.edu/~delson/study.html
• More opportunities will be coming....

What is speech recognition?
• Transcribing words?
• Understanding meaning?
• Today:
  – Overview of ASR issues
  – Building an ASR system
  – Using an ASR system
  – Future research

"It's hard to ... recognize speech / wreck a nice beach"
• Speaker variability: within and across speakers
• The recording environment varies with respect to noise
• The transcription task must handle all of this and produce a transcript of what was said, from limited, noisy information in the speech signal
  – Success: a low word error rate (WER)
  – WER = (S + I + D) / N * 100
  – Example: "This is a test" recognized as "Thesis test" is 75% WER (1 substitution + 2 deletions over 4 reference words); a small WER sketch in Python appears below, after the Representing the Signal slide
• The understanding task must do more: get from words to meaning
  – Measure the concept accuracy (CA) of a string: how accurately the domain concepts mentioned in the string, and their values, are recognized
  – Example: "I want to go from Boston to Baltimore on September 29"
      Domain concept    Value
      source city       Boston
      target city       Baltimore
      travel date       September 29
  – The recognized string "Go from Boston to Washington on December 29" scores 1/3 = 33% CA
  – "Go to Boston from Baltimore on September 29" swaps the source and target cities, so only the travel date is correct: also 1/3 = 33% CA

Again, the Noisy Channel Model
• Input to the channel: a spoken sentence s
• Output from the channel: an observation O
• Decoding task: find s' = argmax_s P(s|O)
• Using Bayes' Rule: s' = argmax_s P(O|s) P(s) / P(O)
• Since P(O) doesn't change for any hypothesized s': s' = argmax_s P(O|s) P(s)
• P(O|s) is the observation likelihood, or Acoustic Model, and P(s) is the prior, or Language Model
• Pipeline: Source -> Noisy Channel -> Decoder

What do we need to build/use an ASR system?
• Corpora for training and testing of components
• A feature extraction component
• A Pronunciation Model
• An Acoustic Model
• A Language Model
• Algorithms to search the hypothesis space efficiently

Training and Test Corpora
• Collect corpora appropriate for the recognition task at hand
  – A small speech corpus + phonetic transcription, to associate sounds with symbols (Acoustic Model)
  – A large (>= 60 hrs) speech corpus + orthographic transcription, to associate words with sounds (Acoustic Model)
  – A very large text corpus, to estimate unigram and bigram probabilities (Language Model)

Representing the Signal
• What parameters (features) of the speech input
  – Can be extracted automatically
  – Will preserve phonetic identity and distinguish one phone from another
  – Will be independent of speaker variability and channel conditions
  – Will not take up too much space
• Speech representations (for [ae] in "had"):
  – Waveform: change in sound pressure over time
  – LPC spectrum: component frequencies of a waveform
  – Spectrogram: overall view of how frequencies change from phone to phone
• Speech is captured by a microphone and sampled (digitized) -- this may not capture all vital information
• The signal is divided into frames
• A power spectrum is computed to represent the energy in different bands of the signal (a toy framing-and-spectrum sketch also appears below)
  – LPC spectrum, cepstra, PLP
  – Each frame's spectral features are represented by a small set of numbers
• Frames are clustered into 'phone-like' groups (phones in context) -- Gaussian or other models
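The WER formula above can be made concrete with a short, illustrative Python function (not course code) that computes WER via word-level edit distance; the test strings are the slide's own example:

# Minimal word error rate (WER) sketch for the formula on the slide:
# WER = (S + I + D) / N * 100, computed via word-level edit distance.
# This is an illustrative implementation, not code from the course.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (Levenshtein distance).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# The slide's example: "This is a test" recognized as "Thesis test"
print(wer("this is a test", "thesis test"))  # 75.0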
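The front end on the Representing the Signal slide (sample the waveform, split it into frames, compute a per-frame power spectrum) can be sketched roughly as follows. The 25 ms frame length, 10 ms shift, and Hamming window are common choices assumed here, not values given in the slides, and a real system would go on to derive cepstral (MFCC/PLP) features from this spectrum:

# Rough sketch of the front end described above: cut the sampled waveform
# into short overlapping frames and compute a log power spectrum per frame.
# Frame length/shift and the Hamming window are typical assumed values.
import numpy as np

def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append(samples[start:start + frame_len])
    return np.array(frames)

def log_power_spectrum(frames):
    windowed = frames * np.hamming(frames.shape[1])   # taper frame edges
    spectrum = np.abs(np.fft.rfft(windowed, axis=1)) ** 2
    return np.log(spectrum + 1e-10)  # each frame -> a small vector of numbers

# Example: one second of a synthetic 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
features = log_power_spectrum(frame_signal(np.sin(2 * np.pi * 440 * t), sr))
print(features.shape)  # (number of frames, spectral bins per frame)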
• Why does this work?
  – Different phonemes have different spectral characteristics
• Why doesn't it work?
  – Phonemes can have different properties in different acoustic contexts, spoken by different people ...
  – "Nice white rice"

Pronunciation Model
• Models the likelihood of a word given a network of candidate phone hypotheses (a weighted phone lattice)
• Allophones: "butter" vs. "but"
• Multiple pronunciations for each word
• The lexicon may be a weighted automaton or a simple dictionary
• Words come from all corpora; pronunciations come from a pronouncing dictionary or a TTS system

Acoustic Models
• Model the likelihood of phones or subphones given spectral features and prior context
• Use pronunciation models
• Usually represented as an HMM
  – A set of states representing phones or other subword units
  – Transition probabilities on states: how likely is it to see one phone after seeing another?
  – Observation/output likelihoods: how likely is a spectral feature vector to be observed from phone state i, given phone state i-1?
• Initial estimates for
  – Transition probabilities between phone states
  – Observation probabilities associating phone states with acoustic examples
• Re-estimate both probabilities by feeding the HMM the transcribed speech training corpus (forced alignment)
  – I.e., we tell the HMM the 'right' answers -- which words to associate with which sequences of sounds
  – Iteratively retrain the transition and observation probabilities by running the training data through the model and scoring the output until there is no improvement

Language Model
• Models the likelihood of a word given the prior word, and of the entire sentence
• N-gram models:
  – Build the LM by calculating bigram or trigram probabilities from a text training corpus (a toy bigram sketch appears at the end of the document)
  – Smoothing issues are very important for real systems
• Grammars
  – Finite-state grammar, Context-Free Grammar (CFG), or semantic grammar
• The Out-of-Vocabulary (OOV) problem
• Entropy H(X): the amount of information in an LM or grammar
  – How many bits will it take, on average, to encode a choice or a piece of information?
  – More likely things take fewer bits to encode
• Perplexity 2^H: a measure of the weighted mean number of choice points in, e.g., a language model

Search/Decoding
• Find the best hypothesis argmax_s P(O|s) P(s) given
  – A lattice of subword units (Acoustic Model)
  – Segmentation of all paths into possible words (Pronunciation Model)
  – Probabilities of word sequences (Language Model)
• This produces a huge search space -- how can it be reduced?
  – Lattice minimization and determinization
  – Forward algorithm: sum of all paths leading to a state
  – Viterbi algorithm: max of all paths leading to a state (a minimal Viterbi sketch also appears at the end of the document)
  – Forward-backward (Baum-Welch, Expectation-Maximization) algorithm: computes the probability of a sequence at any state in the search space
  – Beam search: prune the lattice

Varieties of Speech Recognition
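As a toy illustration of the Language Model slide, the sketch below estimates bigram probabilities from a tiny text corpus and reports perplexity as 2^H on a test sentence. Add-one (Laplace) smoothing is used only to keep the sketch self-contained; the slide notes that smoothing matters but does not name a method, and the example sentences are invented:

# Toy sketch of the Language Model slide: bigram probabilities estimated
# from a small text corpus, with perplexity (2^H) on a test sentence.
# Add-one smoothing is an illustrative choice, not one named in the slides.
import math
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words[:-1])                  # context counts
        bigrams.update(zip(words[:-1], words[1:]))   # adjacent word pairs
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return unigrams, bigrams, vocab

def bigram_prob(w_prev, w, unigrams, bigrams, vocab):
    # P(w | w_prev) with add-one smoothing over the vocabulary
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(vocab))

def perplexity(sentence, model):
    unigrams, bigrams, vocab = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(math.log2(bigram_prob(p, w, unigrams, bigrams, vocab))
                   for p, w in zip(words[:-1], words[1:]))
    return 2 ** (-log_prob / (len(words) - 1))  # 2^H, H = avg. neg. log2 prob

model = train_bigram(["i want to go from boston to baltimore",
                      "i want to go to boston"])
print(perplexity("i want to go to baltimore", model))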
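The Viterbi algorithm named on the Search/Decoding slide (keep the max-scoring path into each state, in contrast to the Forward algorithm's sum over paths) can be sketched minimally as follows; the states, transition probabilities, and emission probabilities are hypothetical toy values standing in for phone-state HMM scores, not part of any real recognizer:

# Minimal Viterbi sketch for the Search/Decoding slide: for each state,
# keep only the best-scoring path reaching it (max over paths).
# All probabilities below are toy, hypothetical values.
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[s] = (log probability of the best path ending in s, that path)
    best = {s: (math.log(start_p[s] * emit_p[s][observations[0]]), [s])
            for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # pick the predecessor that maximizes path score + transition score
            scores = {p: best[p][0] + math.log(trans_p[p][s]) for p in states}
            prev = max(scores, key=scores.get)
            new_best[s] = (scores[prev] + math.log(emit_p[s][obs]),
                           best[prev][1] + [s])
        best = new_best
    # return the single best-scoring path over all final states
    return max(best.values(), key=lambda pair: pair[0])

# Toy two-state example (hypothetical phone states and scores)
states = ["ih", "iy"]
start = {"ih": 0.6, "iy": 0.4}
trans = {"ih": {"ih": 0.7, "iy": 0.3}, "iy": {"ih": 0.4, "iy": 0.6}}
emit = {"ih": {"lo": 0.8, "hi": 0.2}, "iy": {"lo": 0.3, "hi": 0.7}}
score, path = viterbi(["lo", "lo", "hi"], states, start, trans, emit)
print(path, score)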

