Automatic Speech Recognition: An Overview
Julia Hirschberg
CS 4706
(special thanks to Roberto Pieraccini)

Recreating the Speech Chain
[Figure: the speech chain, with labels DIALOG, SEMANTICS, SYNTAX, LEXICON, MORPHOLOGY, PHONETICS, INNER EAR, ACOUSTIC NERVE, VOCAL TRACT, ARTICULATORS, and the corresponding system components SPOKEN LANGUAGE UNDERSTANDING, SPEECH RECOGNITION, SPEECH SYNTHESIS, DIALOG MANAGEMENT]

Speech Recognition: the Early Years
- 1952: Automatic Digit Recognition (AUDREY)
- Davis, Biddulph, Balashek (Bell Laboratories)

1960's: Speech Processing and Digital Computers
- AD/DA converters and digital computers start appearing in the labs
- James Flanagan, Bell Laboratories

The Illusion of Segmentation... or... Why Speech Recognition is so Difficult
[Figure: an utterance (the words MY, NUMBER, IS and a string of spoken digits) shown with its phonetic transcription, its syntactic parse (NP, VP), and its semantic frame: user = Roberto, attribute = telephone-num, value = 7360474]
Sources of difficulty annotated on the figure:
- Ellipses and anaphors
- Limited vocabulary
- Multiple interpretations
- Speaker dependency
- Word variations
- Word confusability
- Context dependency
- Coarticulation
- Noise, reverberation
- Intra-speaker variability

1969: Whither Speech Recognition?
"General purpose speech recognition seems far away. Special-purpose speech recognition is severely limited. It would seem appropriate for people to ask themselves why they are working in the field and what they can expect to accomplish. It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. That is a necessary but not sufficient condition. We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn't attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamour. Most recognizers behave not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets it into his head that he can solve 'the problem.' The basis for this is either individual inspiration (the 'mad inventor' source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach)."
J. R. Pierce, Executive Director, Bell Laboratories. The Journal of the Acoustical Society of America, June 1969.

1971-1976: The ARPA SUR project
- Despite the anti-speech-recognition campaign led by the Pierce Commission, ARPA launches a 5-year Speech Understanding Research program
- Goal: 1000-word vocabulary, 90% understanding rate, near real time on a 100 MIPS machine
- 4 systems built by the end of the program:
  - SDC (24%)
  - BBN's HWIM (44%)
  - CMU's Hearsay II (74%)
  - CMU's HARPY (95%, but 80 times real time)
- Rule-based systems, except for Harpy
  - Harpy's engineering approach: search a network of all the possible utterances (Raj Reddy, CMU)
- Lessons learned:
  - Hand-built knowledge does not scale up
  - Need for a global "optimization" criterion
- Lack of clear evaluation criteria
  - ARPA felt the systems had failed
  - Project not extended
- Speech understanding: too early for its time
- Need for a standard evaluation method

1970's: Dynamic Time Warping. The Brute Force of the Engineering Approach
- T. K. Vyntsyuk (1968); H. Sakoe, S. Chiba (1970)
- [Figure: alignment grid matching a stored TEMPLATE (WORD 7) against an UNKNOWN WORD]
- Isolated words, connected words, sub-word units
- Speaker dependent, speaker independent

1980s: The Statistical Approach
- Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton, in the late 1960s
- Purely statistical approach pursued by Fred Jelinek and Jim Baker, IBM T. J. Watson Research
- Foundations of modern speech recognition engines:
  $\hat{W} = \arg\max_{W} P(A \mid W)\, P(W)$
- Acoustic HMMs [Figure: three-state left-to-right HMM with states S1, S2, S3 and transition probabilities a11, a12, a22, a23, a33]
- Word tri-grams: $P(w_t \mid w_{t-1}, w_{t-2})$
- "No Data Like More Data"
- "Whenever I fire a linguist, our system performance improves" (1988)
- "Some of my best friends are linguists" (2004)

1980-1990: Statistical approach becomes ubiquitous
- Lawrence Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol. 77, No. 2, February 1989

1980s-1990s: The Power of Evaluation
- [Figure: timeline, 1995 to 2004, of the spoken dialog industry, with labels MIT, SPEECHWORKS, SRI, NUANCE, TECHNOLOGY VENDORS, PLATFORM INTEGRATORS, APPLICATION DEVELOPERS, HOSTING, TOOLS, STANDARDS]
- Pros and cons of DARPA programs:
  - Pro: continuous incremental improvement
  - Con: loss of "bio-diversity"

Today's State of the Art
- Low noise conditions
- Large vocabulary: 20,000-60,000 words or more
- Speaker independent (vs. speaker dependent)
- Continuous speech (vs. isolated word)
- Multilingual, conversational
- World's best research systems:
  - Human-human speech: 13-20% Word Error Rate (WER)
  - Human-machine or monologue speech: 3-5% WER

Building an ASR System
- Build a statistical model of the speech-to-words process
  - Collect lots of speech and transcribe all the words
  - Train the model on the labeled speech
- Paradigm: Supervised Machine Learning + Search
- The Noisy Channel Model

The Noisy Channel Model
- Search through the space of all possible sentences
- Pick the one that is most probable given the waveform

The Noisy Channel Model (II)
- What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
- Treat the acoustic input O as a sequence of individual acoustic observations:
  $O = o_1, o_2, o_3, \ldots, o_t$
- Define a sentence as a sequence of words:
  $W = w_1, w_2, w_3, \ldots, w_n$

Noisy Channel Model (III)
- Probabilistic implication: pick the most probable sequence:
  $\hat{W} = \arg\max_{W \in L} P(W \mid O)$
- We can use Bayes' rule to rewrite this:
  $\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}$
- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
  $\hat{W} = \arg\max_{W \in L} P(O \mid W)\, P(W)$

Speech Recognition Meets Noisy Channel: Acoustic Likelihoods and LM Priors

Components of an ASR System
- Corpora for training and testing of components
- Representation for input and method of extracting
- Pronunciation Model
- Acoustic Model
- Language Model
- Feature extraction component
- Algorithms to search the hypothesis space efficiently

Training and Test Corpora
- Collect corpora appropriate for the recognition task at hand
  - Small: speech + phonetic transcription, to associate sounds with symbols (Acoustic Model)
  - Large (60 hrs): speech + orthographic transcription, to associate words with sounds (Acoustic Model)
  - Very
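
As an illustration of the noisy channel decision rule, which picks the sentence W that maximizes P(O | W) P(W), here is a minimal Python sketch. It assumes the acoustic model and language model have already scored a small list of candidate sentences; the function name `decode`, the candidate sentences, and all probabilities are made up for illustration and do not come from the slides. Scores are added in the log domain, which is equivalent to multiplying the probabilities.

```python
import math


def decode(candidates, acoustic_loglik, lm_logprob):
    """Return the candidate W that maximizes log P(O|W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_loglik[w] + lm_logprob[w])


# Hypothetical scores for one acoustic input O (illustrative numbers only).
# Acoustic likelihoods P(O|W): the two word strings sound almost alike.
acoustic_loglik = {
    "recognize speech": math.log(0.20),
    "wreck a nice beach": math.log(0.25),
}
# Language model priors P(W): the first sentence is far more plausible.
lm_logprob = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.001),
}

best = decode(list(acoustic_loglik), acoustic_loglik, lm_logprob)
print(best)
```

In this toy case the acoustic likelihood slightly prefers the second candidate, but the language model prior outweighs it. A real recognizer cannot enumerate all sentences, which is why the component list includes algorithms to search the hypothesis space efficiently.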

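The template matching idea behind Dynamic Time Warping can be sketched as follows. This is a minimal illustration rather than the historical formulation: it assumes each word is a sequence of single numbers (real systems compare acoustic feature vectors), uses the absolute difference as the local distance, and applies no slope constraints. The names `dtw_distance` and `recognize` and the toy templates are hypothetical.

```python
def dtw_distance(template, unknown):
    """Cost of the best time alignment between two 1-D feature sequences."""
    n, m = len(template), len(unknown)
    inf = float("inf")
    # cost[i][j] = best cost of aligning template[:i] with unknown[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - unknown[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j],      # template frame advances alone
                cost[i][j - 1],      # unknown frame advances alone
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def recognize(templates, unknown):
    """Return the word whose stored template is closest to the unknown input."""
    return min(templates, key=lambda word: dtw_distance(templates[word], unknown))


# Toy templates (made-up numbers standing in for acoustic features).
templates = {"seven": [1, 2, 3, 2, 1], "two": [3, 3, 0, 0]}
# The same contour as "seven", spoken more slowly.
unknown = [1, 1, 2, 3, 3, 2, 1]
print(recognize(templates, unknown))
```

Because the alignment may stretch or compress time, the slower utterance still matches its template at zero cost. Comparing the input against every stored template is the "brute force" aspect of the approach.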

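Word Error Rate, the metric quoted for state-of-the-art systems, is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The sketch below assumes whitespace tokenization and equal cost for all three error types; the function name `word_error_rate` and the example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)] / len(ref)


# One deleted word ("zero") and one inserted word ("four"): 2 errors over 4 words.
print(word_error_rate("seven zero nine two", "seven nine two four"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.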
Columbia CS 4706 - Automatic Speech Recognition
