Language Modeling
Julia Hirschberg
CS 4706

Approaches to Language Modeling
  Context-Free Grammars
    Use in HTK
  N-gram Models

Context-Free Grammars
  Defined in formal language theory
    Terminals, e.g. cat
    Non-terminal symbols, e.g. NP, VP
    Start symbol, e.g. S
    Rewriting rules, e.g. S -> NP VP
  Start with the start symbol and rewrite using the rules; done when only terminals are left

A Fragment of English
  S -> NP VP
  VP -> V PP
  NP -> DetP N
  N -> cat | mat
  V -> is
  PP -> Prep NP
  Prep -> on
  DetP -> the
  Input: the cat is on the mat

Derivations in a CFG
  Using the grammar above, the input is derived from S as follows:
    S
    NP VP
    DetP N VP
    the cat VP
    the cat V PP
    the cat is Prep NP
    the cat is on DetP N
    the cat is on the mat
  Resulting parse tree:
    [S [NP [DetP the] [N cat]] [VP [V is] [PP [Prep on] [NP [DetP the] [N mat]]]]]

A More Complicated Fragment of English
  S -> NP VP
  S -> VP
  VP -> V PP
  VP -> V NP
  VP -> V
  NP -> DetP NP
  NP -> N NP
  NP -> N
  PP -> Prep NP
  N -> cat | mat | food | bowl | Mary
  V -> is | likes | sits
  Prep -> on | in | under
  DetP -> the | a
  Example: Mary likes the cat bowl

Using CFGs in Simple ASR Applications
  LHS of rules is a semantic category:
    LIST -> show me | I want | can I see
    DEPARTTIME -> (after | around | before) HOUR | morning | afternoon | evening
    HOUR -> one | two | three | twelve (am | pm)
    FLIGHTS -> (a) flight | flights
    ORIGIN -> from CITY
    DESTINATION -> to CITY
    CITY -> Boston | San Francisco | Denver | Washington

HTK Grammar Format
  Variables start with $, e.g. $city
  Terminals must be in capital letters, e.g. FRIDAY, TICKET
  X Y is concatenation, e.g. I WANT
  (X | Y) means X or Y, e.g. (WANT | NEED)
  [X] means X is optional, e.g. [ON FRIDAY]
  <X> means repetition of X (Kleene closure; HTK's <X> is one or more repetitions, {X} is zero or more), e.g. <$digit>

Examples
  $city = BOSTON | NEWYORK | WASHINGTON | BALTIMORE;
  $time = MORNING | EVENING;
  $day = FRIDAY | MONDAY;
  (SENT-START
    ((WHAT TRAINS LEAVE) | (WHAT TIME CAN I TRAVEL) | (IS THERE A TRAIN))
    (FROM | TO) $city (FROM | TO) $city ON $day $time
  SENT-END)

Problems for Larger Vocabulary Applications
  CFGs are complicated to build and hard to modify to accommodate new data:
    Add capability to make a reservation
    Add capability to ask for help
    Add ability to understand greetings
  Parsing input with large CFGs is slow for real-time applications
  So for large applications we use n-gram models

Next Word Prediction
  The air traffic control supervisor who admitted falling asleep while on duty at Reagan National Airport has been suspended, and the head of the Federal Aviation Administration on Friday ordered new rules to ensure a similar incident doesn't take place. FAA chief Randy Babbitt said he has directed controllers at regional radar facilities to contact the towers of airports where there is only one controller on duty at night before sending planes on for landings. Babbitt also said regional controllers have been told that if no controller can be raised at the airport, they must offer pilots the option of diverting to another airport. Two commercial jets were unable to contact the control tower early Wednesday and had to land without gaining clearance.

Word Prediction
  How do we know which words occur together?
    Domain knowledge
    Syntactic knowledge
    Lexical knowledge
  Can we model this knowledge computationally?
    Simple statistical techniques do a good job when trained appropriately
    Most common way of constraining ASR predictions to conform to probabilities of word sequences in the language: Language Modeling via N-grams

N-Gram Models of Language
  Use the previous N-1 words in a sequence to predict the next word
  Language Model (LM): unigrams, bigrams, trigrams, ...
  How do we train these models to discover co-occurrence probabilities?

Finding Corpora
  Corpora are online collections of text and speech
    Brown Corpus
    Wall Street Journal, AP newswire, web
    DARPA/NIST text/speech corpora (Call Home, Call Friend, ATIS, Switchboard, Broadcast News, TDT, Communicator)

Tokenization: Counting Words in Corpora
  What is a word?
    e.g., are cat and cats the same word? Cat and cat?
    September and Sept?
    zero and oh?
    Is "uh" a word?
    Should we count parts of self-repairs? (go to fr- france)
    How many words are there in don't? Gonna?
    Any token separated by white space from another?
    In Japanese, Thai, Chinese text, how do we identify a word?

Terminology
  Sentence: unit of written language (SLU)
  Utterance: unit of spoken language (prosodic phrase)
  Wordform: inflected form as it actually appears in the corpus
  Lemma: an abstract form shared by word forms having the same stem, part of speech, and word sense; stands for the class of words with stem X
  Types: number of distinct words in a corpus (vocabulary size)
  Tokens: total number of words

Simple Word Probability
  Assume a language has T word types and N tokens. How likely is word y to follow word x?
    Simplest model: 1/T
      But is every word equally likely?
    Alternative 1: estimate the likelihood of y occurring in new text from its general frequency of occurrence in a corpus (unigram probability): ct(y) / N
      But is every word equally likely in every context?
    Alternative 2: condition the likelihood of y occurring on the context of previous words: ct(x, y) / ct(x)

Computing Word Sequence (Sentence) Probabilities
  Compute the probability of a word given a preceding sequence:
    P(the mythical unicorn ...) = P(the | <start>) P(mythical | <start> the) P(unicorn | <start> the mythical) ...
  Joint probability:
    P(w_{n-1}, w_n) = P(w_n | w_{n-1}) P(w_{n-1})
  Chain Rule: decompose the joint probability, e.g. P(w_1, w_2, w_3), as
    P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1 ... w_{n-1})
  But the longer the sequence, the less likely we are to find it in a training corpus:
    P(Most biologists and folklore specialists believe
…
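The derivation in the "Fragment of English" slides can be reproduced mechanically. The block below is a minimal sketch, not the course's or HTK's implementation: the dictionary layout and the names GRAMMAR, parse, recognize, and bracket are assumptions made for illustration. It does a top-down search with backtracking, which only terminates for grammars without left recursion (true of this fragment).

```python
# Grammar from the "A Fragment of English" slide.
# Representation (dict of right-hand sides) is an assumption of this sketch.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "PP"]],
    "NP": [["DetP", "N"]],
    "N": [["cat"], ["mat"]],
    "V": [["is"]],
    "PP": [["Prep", "NP"]],
    "Prep": [["on"]],
    "DetP": [["the"]],
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way `symbol` derives tokens[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # Terminal: it must match the current input token exactly.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Extend partial analyses one right-hand-side symbol at a time.
        partial = [([], pos)]
        for child in rhs:
            partial = [
                (kids + [tree], nxt)
                for kids, cur in partial
                for tree, nxt in parse(child, tokens, cur)
            ]
        for kids, end in partial:
            yield (symbol, kids), end


def recognize(sentence):
    """Return the first parse tree that covers the whole sentence, or None."""
    tokens = sentence.split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


def bracket(tree):
    """Render a tree as a labelled bracketing."""
    if isinstance(tree, str):
        return tree
    label, kids = tree
    return "[" + label + " " + " ".join(bracket(k) for k in kids) + "]"


print(bracket(recognize("the cat is on the mat")))
# [S [NP [DetP the] [N cat]] [VP [V is] [PP [Prep on] [NP [DetP the] [N mat]]]]]
```

A string the grammar cannot derive, such as "the cat on mat", makes recognize return None.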
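The three estimates on the "Simple Word Probability" slide (1/T, ct(y)/N, and ct(x, y)/ct(x)) can be computed directly from counts. The following is a small sketch on a made-up toy corpus; the corpus and the function names are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter

# Toy corpus (hypothetical); whitespace tokenization is a simplification.
corpus = "the cat is on the mat the cat likes the food".split()

unigram_ct = Counter(corpus)                  # ct(y)
bigram_ct = Counter(zip(corpus, corpus[1:]))  # ct(x, y) for adjacent pairs
num_tokens = len(corpus)                      # N: total number of words
num_types = len(unigram_ct)                   # T: number of distinct words


def p_uniform():
    """Simplest model: every word type is equally likely, 1/T."""
    return 1 / num_types


def p_unigram(y):
    """Alternative 1: relative frequency of y, ct(y)/N."""
    return unigram_ct[y] / num_tokens


def p_bigram(y, x):
    """Alternative 2: likelihood of y given the previous word x, ct(x, y)/ct(x)."""
    if unigram_ct[x] == 0:
        return 0.0  # x never seen: no estimate without smoothing
    return bigram_ct[(x, y)] / unigram_ct[x]


print(num_types, num_tokens)   # 7 11
print(p_bigram("cat", "the"))  # 0.5
```

Here "the" occurs 4 times and is followed by "cat" twice, so the bigram estimate for "cat" after "the" is 0.5, well above its unigram estimate of 2/11.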
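The chain-rule decomposition, and the sparsity problem it runs into, can be seen on a tiny example. This sketch estimates each factor P(w_i | <start> w_1 ... w_{i-1}) by counting sentence prefixes in a hypothetical three-sentence corpus; the corpus, prefix_ct, and prefix_probability are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

# Hypothetical training sentences.
sentences = [
    "the mythical unicorn sleeps",
    "the mythical unicorn runs",
    "the cat sleeps",
]

# Count every sentence-initial prefix, including the bare <start> marker.
prefix_ct = Counter()
for s in sentences:
    words = ["<start>"] + s.split()
    for i in range(1, len(words) + 1):
        prefix_ct[tuple(words[:i])] += 1


def prefix_probability(text):
    """Chain rule: multiply P(w_i | <start> w_1 ... w_{i-1}) over the words of text."""
    words = ["<start>"] + text.split()
    prob = 1.0
    for i in range(1, len(words)):
        history = tuple(words[:i])
        extended = tuple(words[: i + 1])
        if prefix_ct[history] == 0:
            return 0.0  # history never observed in training
        prob *= prefix_ct[extended] / prefix_ct[history]
    return prob


print(prefix_probability("the mythical unicorn"))  # 3/3 * 2/3 * 2/2
print(prefix_probability("the mythical cat"))      # unseen prefix
```

The first call multiplies 3/3, 2/3, and 2/2; the second hits a prefix that never occurs in training and returns 0.0, which is the problem with conditioning on the full history.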