The Expectation Maximization (EM) Algorithm … continued!
600.465 - Intro to NLP - J. Eisner

Contents: The Expectation Maximization (EM) Algorithm; General Idea; For Hidden Markov Models; Grammar Reestimation; EM by Dynamic Programming: Two Versions; Examples of EM; Our old friend PCFG; Viterbi reestimation for parsing; True EM for parsing; Analogies to α, β in PCFG?; "Inside Probabilities"; Compute Bottom-Up by CKY; Compute probs bottom-up (CKY); Inside & Outside Probabilities; Compute probs bottom-up (gradually build up larger blue "inside" regions); Compute probs top-down (uses β probs as well); Details: Compute probs bottom-up; Details: Compute probs bottom-up (CKY); Details: Compute probs top-down (reverse CKY); What Inside-Outside is Good For.

Slide 2: General Idea
- Start by devising a noisy channel: any model that predicts the corpus observations via some hidden structure (tags, parses, …).
- Initially guess the parameters of the model! An educated guess is best, but random can work.
- Expectation step: use the current parameters (and the observations) to reconstruct the hidden structure.
- Maximization step: use that hidden structure (and the observations) to reestimate the parameters.
- Repeat until convergence!

Slide 3: General Idea (diagram)
- The EM cycle: start from an initial guess of the unknown parameters (probabilities); the E step combines those parameters with the observed structure (words, ice cream) to produce a guess of the unknown hidden structure (tags, parses, weather); the M step uses that hidden structure, together with the observations, to reestimate the parameters.

Slides 4-6: For Hidden Markov Models (diagram, built up over three slides)
- The same EM cycle, instantiated for HMMs: the observed structure is the words (or ice cream), the hidden structure is the tag (or weather) sequence, and the parameters are the HMM probabilities.

Slide 7: Grammar Reestimation (diagram)
- A PARSER equipped with a Grammar parses test sentences, which are cheap, plentiful, and appropriate; a scorer compares its output against the correct test trees to measure accuracy. A LEARNER retrains the Grammar from training trees, which are expensive and/or from the wrong sublanguage. Parsing plays the role of the E step; learning plays the role of the M step.

Slide 8: EM by Dynamic Programming: Two Versions
- The Viterbi approximation
  - Expectation: pick the best parse of each sentence.
  - Maximization: retrain on this best-parsed corpus.
  - Advantage: speed!
- Real EM (why slower?)
  - Expectation: find all parses of each sentence.
  - Maximization: retrain on all parses in proportion to their probability (as if we had observed fractional counts).
  - Advantage: p(training corpus) is guaranteed to increase.
  - There are exponentially many parses, so don't extract them from the chart; we need some kind of clever counting.
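To make the contrast on Slide 8 concrete, here is a minimal Python sketch added to these notes; it is not part of the original deck. It assumes hypothetical helpers best_parse, parses_with_posteriors, and rules_in that a chart parser would supply, and it represents each rule as a (lefthand side, righthand side) pair.

from collections import defaultdict

def viterbi_e_step(sentences, grammar):
    """Viterbi approximation: count rules from the single best parse of each sentence."""
    counts = defaultdict(float)
    for sent in sentences:
        tree = best_parse(sent, grammar)              # hypothetical helper: argmax_t p(t, sent)
        for rule in rules_in(tree):                   # e.g. ('S', ('NP', 'VP'))
            counts[rule] += 1.0
    return counts

def true_e_step(sentences, grammar):
    """Real EM: count rules from every parse, weighted by p(parse | sentence)."""
    counts = defaultdict(float)
    for sent in sentences:
        for tree, posterior in parses_with_posteriors(sent, grammar):   # hypothetical helper
            for rule in rules_in(tree):
                counts[rule] += posterior             # fractional counts
    return counts

def m_step(counts):
    """Reestimate p(rule | lefthand side) by dividing each count by its lefthand-side total."""
    totals = defaultdict(float)
    for (lhs, rhs), c in counts.items():
        totals[lhs] += c                              # e.g. c(S) collects all S expansions
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}

Note that true_e_step enumerates parses only for clarity; the point of the inside-outside algorithm on the later slides is to obtain the same expected counts directly from the parse chart, without ever listing the exponentially many parses.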
Slide 9: Examples of EM
- Finite-state case: Hidden Markov Models
  - The "forward-backward" or "Baum-Welch" algorithm.
  - Applications (compose these?):
    - explain ice cream in terms of an underlying weather sequence
    - explain words in terms of an underlying tag sequence
    - explain a phoneme sequence in terms of an underlying word
    - explain a sound sequence in terms of an underlying phoneme
- Context-free case: Probabilistic CFGs
  - The "inside-outside" algorithm: unsupervised grammar learning!
  - Explain raw text in terms of an underlying context-free parse.
  - In practice, the local-maximum problem gets in the way, but raw text can still improve a good starting grammar.
- Clustering case: explain points via clusters.

Slide 10: Our old friend PCFG
- (Figure: the parse tree of "time flies like an arrow": S → NP VP, NP → time, VP → V PP, V → flies, PP → P NP, P → like, NP → Det N, Det → an, N → arrow.)
- p(tree | S) = p(S → NP VP | S) * p(NP → time | NP) * p(VP → V PP | VP) * p(V → flies | V) * …

Slide 11: Viterbi reestimation for parsing
- Start with a "pretty good" grammar. E.g., it was trained on supervised data (a treebank) that is small, imperfectly annotated, or has sentences in a different style from what you want to parse.
- Parse a corpus of unparsed sentences. (Figure: the corpus contains 12 copies of "Today stocks were up", whose best parse is S → AdvP S, AdvP → Today, S → NP VP, NP → stocks, VP → V PRT, V → were, PRT → up.)
- Reestimate:
  - Collect counts: …; c(S → NP VP) += 12; c(S) += 2*12; …
  - Divide: p(S → NP VP | S) = c(S → NP VP) / c(S).
  - May be wise to smooth.

Slide 12: True EM for parsing
- Similar, but now we consider all parses of each sentence.
- Parse our corpus of unparsed sentences. (Figure: the 12 copies of "Today stocks were up" are now shared between two parses: the parse above with fractional count 10.8, and a second parse with count 1.2 in which Today and stocks are grouped into one NP, so that parse contains only a single S node.)
- Collect counts fractionally:
  - …; c(S → NP VP) += 10.8; c(S) += 2*10.8; …
  - …; c(S → NP VP) += 1.2; c(S) += 1*1.2; …
- But there may be exponentially many parses of a length-n sentence! How can we stay fast? Similar to taggings…

Slide 13: Analogies to α, β in PCFG?
- (Figure: the dynamic programming computation of α on the ice-cream trellis; β is similar but works back from Stop.)
- Day 1 (2 cones): α_C(1) = p(C | Start) * p(2 | C) = 0.5 * 0.2 = 0.1; α_H(1) = p(H | Start) * p(2 | H) = 0.5 * 0.2 = 0.1.
- Day 2 (3 cones): the arcs carry p(C|C)*p(3|C) = 0.8*0.1 = 0.08, p(H|C)*p(3|H) = 0.1*0.7 = 0.07, p(H|H)*p(3|H) = 0.8*0.7 = 0.56, and p(C|H)*p(3|C) = 0.1*0.1 = 0.01. So α_C(2) = 0.1*0.08 + 0.1*0.01 = 0.009 and α_H(2) = 0.1*0.07 + 0.1*0.56 = 0.063.
- Day 3 (3 cones): α_C(3) = 0.009*0.08 + 0.063*0.01 = 0.00135 and α_H(3) = 0.009*0.07 + 0.063*0.56 = 0.03591.
- Call these α_H(2) and β_H(2), α_H(3) and β_H(3).

Slide 14: "Inside Probabilities"
- Sum over all VP parses of "flies like an arrow": β_VP(1,5) = p(flies like an arrow | VP).
- Sum over all S parses of "time flies like an arrow": β_S(0,5) = p(time flies like an arrow | S).
- (Same example tree and formula as Slide 10: p(tree | S) = p(S → NP VP | S) * p(NP → time | NP) * p(VP → V PP | VP) * p(V → flies | V) * ….)

Slide 15: Compute Bottom-Up by CKY
- β_VP(1,5) = p(flies like an arrow | VP) = …
- β_S(0,5) = p(time flies like an arrow | S) = β_NP(0,1) * β_VP(1,5) * p(S → NP VP | S) + …
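Slide 13's trellis is just the forward (α) recurrence on the ice-cream HMM. The short sketch below is an added illustration, not part of the deck; it restates the arc probabilities printed on the slide and reproduces its α values.

trans = {('Start', 'C'): 0.5, ('Start', 'H'): 0.5,
         ('C', 'C'): 0.8, ('C', 'H'): 0.1,
         ('H', 'H'): 0.8, ('H', 'C'): 0.1}           # p(next state | current state)
emit = {('C', 2): 0.2, ('H', 2): 0.2,
        ('C', 3): 0.1, ('H', 3): 0.7}                # p(number of cones | state)
cones = [2, 3, 3]                                    # Day 1: 2 cones; Days 2-3: 3 cones

alpha = [{'Start': 1.0}]                             # alpha[0]: we begin at Start
for day, obs in enumerate(cones, start=1):
    alpha.append({})
    for state in ('C', 'H'):
        # alpha_state(day) = sum over previous states of
        #   alpha_prev(day-1) * p(state | prev) * p(obs | state)
        alpha[day][state] = sum(alpha[day - 1][prev]
                                * trans[(prev, state)] * emit[(state, obs)]
                                for prev in alpha[day - 1]
                                if (prev, state) in trans)

print(alpha[1])   # {'C': 0.1, 'H': 0.1}
print(alpha[2])   # approx. {'C': 0.009, 'H': 0.063}      = alpha_C(2), alpha_H(2)
print(alpha[3])   # approx. {'C': 0.00135, 'H': 0.03591}  = alpha_C(3), alpha_H(3)

The β values mentioned on the slide come from the mirror-image recurrence that works back from Stop; together, α and β give the expected counts that Baum-Welch reestimation needs.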
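The inside probabilities of Slides 14-15 are computed by the same kind of dynamic programming, but over spans instead of days. The sketch below is another added illustration: the toy PCFG for "time flies like an arrow" is hypothetical, with invented rule probabilities, so only the shape of the CKY recurrence, not the numbers, comes from the slides.

from collections import defaultdict

binary = {('S', 'NP', 'VP'): 1.0,                    # p(X -> Y Z | X); grammar assumed to be in CNF
          ('VP', 'V', 'PP'): 0.6, ('VP', 'V', 'NP'): 0.4,
          ('PP', 'P', 'NP'): 1.0,
          ('NP', 'Det', 'N'): 0.3}
lexical = {('NP', 'time'): 0.1, ('V', 'flies'): 0.2, # p(X -> word | X)
           ('P', 'like'): 0.5, ('Det', 'an'): 0.4, ('N', 'arrow'): 0.1}

def inside(words):
    """beta[(i, j)][X] = p(words[i:j] | X), summed over all parses of the span."""
    beta = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):                    # width-1 spans: lexical rules
        for (X, word), p in lexical.items():
            if word == w:
                beta[(i, i + 1)][X] += p
    n = len(words)
    for width in range(2, n + 1):                    # wider spans, bottom-up as in CKY
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):                # split point
                for (X, Y, Z), p in binary.items():
                    beta[(i, j)][X] += p * beta[(i, k)][Y] * beta[(k, j)][Z]
    return beta

beta = inside("time flies like an arrow".split())
print(beta[(1, 5)]['VP'])   # beta_VP(1,5) = p(flies like an arrow | VP)
print(beta[(0, 5)]['S'])    # beta_S(0,5); includes beta_NP(0,1) * beta_VP(1,5) * p(S -> NP VP | S)

Each chart cell sums over split points and rules rather than over whole trees, so β_S(0,5) accounts for all of the exponentially many parses in O(n³) time: this is the "clever counting" promised on Slide 8.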