Speech Processing 15-492/18-492
Speech Recognition: Language Modeling

But not just acoustics
• Not all phones are equi-probable
• Find the word sequence W that maximizes P(W | O)
• Using Bayes' Law: P(W | O) = P(O | W) P(W) / P(O)
• Combine models
  – Use HMMs to provide the acoustic model P(O | W)
  – Use the language model to provide P(W)

Language Predictions
• What are the most likely words?
  – "the" is more common than "loom"
• Different domains, different distributions
  – Bus, timetable, 4:15, late
  – LCD, storage card, USB
• Context helps prediction
  – Carnegie …
  – President …
  – As quiet as a …

Markov Modeling
• Look at n-gram models
  – Unigram: P(W_n)
  – Bigram: P(W_n | W_n-1)
  – Trigram: P(W_n | W_n-1, W_n-2)
  – N-gram: P(W_n | W_n-1, …)
• But need lots of data to train

What is the word distribution?
• Wall Street Journal (1995)
  – Total 22.5M word tokens
  – Total 508K different word types
  – 15K types appear more than 100 times
  – 45% of types appear only once
• Top: the, of, to, a, in, and, that, for, is, on
• said (16), Mr (17), million (24), company (39)

New tokens per day

Need lots of data to train
• As we increase the N in the N-gram, we need much more data
• A vocabulary of 50K words gives 125T possible trigrams
• At least 40T words of training data (if equi-probable)
• About 5000 years of WSJ

Simplifying Assumptions
• Limit vocabulary
  – < 64K
  – Make all words UPPER CASE
• Remove punctuation
  – People don't say punctuation
  – Maybe split into phrases at punctuation
• Have an "unknown word" token
  – Replace all low-frequency words with UNK
• Collapse similar words
  – All numbers to NUM
  – All cities to CITY, …

Still not enough data
• Backoff:
  – If no trigram data, use bigram data
  – If no bigram data, use unigram data
• Smoothing:
  – Assume there is at least 1 occurrence
  – Allow non-integer frequencies
  – "Good-Turing" smoothing
  – If (Numof(n-1gram) < threshold): F(ngram) = Numof(n-1gram) * P(n-1gram)

How good is a model?
• You build a language model; how good is it?
  – Test it in the ASR system (takes time)
  – Or use an abstract measure

Entropy and Perplexity
• Entropy
  – Related to predictability
  – Q is the number of words
  – N is the order of the n-gram
  – For sufficiently large Q
• Perplexity
  – Larger number, harder problem
  – Sort of an average branching factor
  – If 20, about 20 choices per word
  – If 300, about 300 choices per word
  – 20 is typically an "easy" task
  – 300 is typically a "hard" task
• Sometimes it's only sometimes hard
  – "I want to go to X." — most of the phrase is predictable, but the slot X can be almost any place name
• Lower perplexity gives better recognition
  – Not strictly true, but there is a correlation
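A standard way to write the entropy and perplexity referred to here (not necessarily the exact notation on the original slide), with Q the number of words and N the n-gram order:

  H = -\frac{1}{Q} \sum_{i=1}^{Q} \log_2 P(W_i \mid W_{i-1}, \ldots, W_{i-N+1})   (per-word entropy, for sufficiently large Q)

  \text{Perplexity} = 2^{H}   (the "average branching factor")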
But surely we can do better
• Just using the last two words?
• Syntax, semantics …
• Writing grammars is hard
  – Beyond simple tasks
• Training grammars is even harder
• Semantics is even harder than that

Some LM improvements
• Looking at more than the previous two words
• Replace words with types
  – "I want to go from CITY to CITY"
• Trigger-based models
  – If you see a word, you'll likely see related ones
  – "president" triggers "vice-president"

Model Combination
• Use a background model
  – General (for the domain)
• Use a specific model to adapt
• Combination by
  – Simple linear weights
  – Maximum Entropy
  – CART

Context dependent models
• Switch LMs within a dialog system
• Build separate models for different states
  – State 1: Where do you want to go to?
  – State 2: When do you want to leave?
  – State 3: When do you want to arrive?

What about OOVs?
• OOV: "out of vocabulary"
  – Words not in the lexicon
• Ignore them
  – They might be irrelevant
• Try to recognize them
  – They might be names
• Avoid them
  – Design your system so there aren't any important ones

Summary
• Language Models
• Bayes equation
• N-grams
• Smoothing, backoff, adaptation
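To make the counting, backoff, and perplexity ideas above concrete, here is a small illustrative sketch (not from the original course materials). It uses a simplified "stupid backoff" from bigrams to an add-one-smoothed unigram rather than the Good-Turing scheme mentioned on the slides, and the names BackoffBigramLM, prob, and perplexity are invented for this example.

# Minimal sketch of a bigram language model with backoff to unigrams,
# add-one smoothing, and perplexity evaluation. Illustrative only; the
# slides describe Good-Turing smoothing, which is not implemented here.
import math
from collections import Counter

class BackoffBigramLM:
    def __init__(self, sentences, backoff_weight=0.4):
        # Count unigrams and bigrams over the training sentences.
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.backoff_weight = backoff_weight
        for sent in sentences:
            words = ["<s>"] + sent.split() + ["</s>"]
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.total = sum(self.unigrams.values())
        self.vocab_size = len(self.unigrams)

    def prob(self, word, prev):
        # Use the bigram relative frequency if the bigram was seen;
        # otherwise back off to a weighted, add-one-smoothed unigram.
        if self.bigrams[(prev, word)] > 0:
            return self.bigrams[(prev, word)] / self.unigrams[prev]
        unigram = (self.unigrams[word] + 1) / (self.total + self.vocab_size)
        return self.backoff_weight * unigram

    def perplexity(self, sentences):
        # Perplexity = 2 ** (per-word cross-entropy) on held-out text.
        log_prob, n_words = 0.0, 0
        for sent in sentences:
            words = ["<s>"] + sent.split() + ["</s>"]
            for prev, word in zip(words, words[1:]):
                log_prob += math.log2(self.prob(word, prev))
                n_words += 1
        return 2 ** (-log_prob / n_words)

# Example usage with a tiny toy corpus.
train = ["i want to go to boston", "i want to leave on tuesday"]
lm = BackoffBigramLM(train)
print(lm.perplexity(["i want to go to tuesday"]))

Real systems use higher-order n-grams with proper discounting (Good-Turing, Katz backoff, or similar), but the overall structure is the same: count, smooth, back off, and evaluate by perplexity on held-out text.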