Johns Hopkins EN 600.465 - The Expectation Maximization (EM) Algorithm

Contents (slide titles): The Expectation Maximization (EM) Algorithm; General Idea; For Hidden Markov Models; Grammar Reestimation; EM by Dynamic Programming: Two Versions; Examples of EM; Our old friend PCFG; Viterbi reestimation for parsing; True EM for parsing; Analogies to α, β in PCFG?; "Inside Probabilities"; Compute β Bottom-Up by CKY; Compute β probs bottom-up (CKY); Inside & Outside Probabilities; Compute β probs bottom-up (gradually build up larger blue "inside" regions); Compute α probs top-down (uses β probs as well); Details: Compute β probs bottom-up; Details: Compute β probs bottom-up (CKY); Details: Compute α probs top-down (reverse CKY); What Inside-Outside is Good For.

Slide 1: The Expectation Maximization (EM) Algorithm ... continued!  (600.465 - Intro to NLP - J. Eisner)

Slide 2: General Idea
- Start by devising a noisy channel: any model that predicts the corpus observations via some hidden structure (tags, parses, ...).
- Initially guess the parameters of the model! An educated guess is best, but random can work.
- Expectation step: use the current parameters (and the observations) to reconstruct the hidden structure.
- Maximization step: use that hidden structure (and the observations) to reestimate the parameters.
- Repeat until convergence!

Slide 3: General Idea (diagram)
- The E step combines the current guess of the unknown parameters (probabilities) with the observed structure (words, ice cream) to produce a guess of the unknown hidden structure (tags, parses, weather).
- The M step combines that guessed hidden structure with the observations to produce a new guess of the parameters.
- An initial guess of the parameters starts the loop.

Slides 4-6: For Hidden Markov Models
- The same E-step/M-step loop, read as the HMM case: the observed structure is the words (or ice cream), and the hidden structure is the tags (or weather).

Slide 7: Grammar Reestimation (diagram)
- A PARSER applies the current grammar to test sentences, which are cheap, plentiful and appropriate; a scorer compares its output to the correct test trees to measure accuracy.
- A LEARNER estimates the grammar from training trees, which are expensive and/or from the wrong sublanguage.
- In the EM loop, parsing plays the role of the E step and grammar reestimation plays the role of the M step.

Slide 8: EM by Dynamic Programming: Two Versions
- The Viterbi approximation. Expectation: pick the best parse of each sentence. Maximization: retrain on this best-parsed corpus. Advantage: speed!
- Real EM (why slower?). Expectation: find all parses of each sentence. Maximization: retrain on all parses in proportion to their probability (as if we had observed fractional counts). Advantage: p(training corpus) is guaranteed to increase.
- There are exponentially many parses, so don't extract them from the chart; some kind of clever counting is needed. (A sketch of the two counting styles follows.)
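The two E-step styles on Slide 8 differ only in how parses are turned into rule counts. The sketch below is not from the lecture; it assumes a hypothetical toy representation in which each candidate parse of a sentence is given as a pair (rule counts, probability), say as produced by some parser. The Viterbi approximation credits only the best parse, while real EM spreads the sentence's count over all parses in proportion to their probabilities; in a real system those fractional counts come out of a dynamic-programming chart rather than an explicit list of parses, which is what the inside-outside slides later in the deck are about.

```python
from collections import Counter

def collect_counts_viterbi(parses, num_copies, counts):
    """Viterbi approximation: credit only the single best parse of the sentence.

    `parses` is a toy representation: a list of (rule_counts, prob) pairs, where
    rule_counts maps a rule string like "S -> NP VP" to the number of times that
    rule is used in the parse.  `counts` is a Counter being accumulated.
    """
    best_rules, _ = max(parses, key=lambda p: p[1])
    for rule, k in best_rules.items():
        counts[rule] += k * num_copies

def collect_counts_true_em(parses, num_copies, counts):
    """Real EM: credit every parse in proportion to its posterior probability."""
    z = sum(prob for _, prob in parses)          # normalize over this sentence's parses
    for rules, prob in parses:
        weight = num_copies * prob / z           # fractional count of this parse
        for rule, k in rules.items():
            counts[rule] += k * weight

def reestimate(counts):
    """M step: relative-frequency estimates p(lhs -> rhs | lhs) = c(lhs -> rhs) / c(lhs)."""
    lhs_totals = Counter()
    for rule, c in counts.items():
        lhs_totals[rule.split(" -> ")[0]] += c   # c(lhs) = total count of rules rewriting lhs
    return {rule: c / lhs_totals[rule.split(" -> ")[0]] for rule, c in counts.items()}
```

With 12 copies of a sentence and two parses whose posterior probabilities are 0.9 and 0.1, collect_counts_true_em adds 10.8 and 1.2 to the counts of each parse's rules (the fractional counts that appear on the True EM slide below), while collect_counts_viterbi adds the full 12 to the best parse's rules only.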
Slide 9: Examples of EM
- Finite-state case: Hidden Markov Models, via the "forward-backward" or "Baum-Welch" algorithm. Applications: explain ice cream in terms of an underlying weather sequence; explain words in terms of an underlying tag sequence; explain a phoneme sequence in terms of an underlying word; explain a sound sequence in terms of an underlying phoneme. (Compose these?)
- Context-free case: Probabilistic CFGs, via the "inside-outside" algorithm: unsupervised grammar learning! Explain raw text in terms of an underlying context-free parse. In practice the local-maximum problem gets in the way, but we can improve a good starting grammar via raw text.
- Clustering case: explain points via clusters.

Slide 10: Our old friend PCFG
- Example parse of "time flies like an arrow": [S [NP time] [VP [V flies] [PP [P like] [NP [Det an] [N arrow]]]]].
- p(tree | S) = p(S → NP VP | S) * p(NP → time | NP) * p(VP → V PP | VP) * p(V → flies | V) * ...

Slide 11: Viterbi reestimation for parsing
- Start with a "pretty good" grammar. E.g., it was trained on supervised data (a treebank) that is small, imperfectly annotated, or has sentences in a different style from what you want to parse.
- Parse a corpus of unparsed sentences. Example: 12 copies of "Today stocks were up" (12 = # copies of this sentence in the corpus), each assigned its best parse [S [AdvP Today] [S [NP stocks] [VP [V were] [PRT up]]]].
- Reestimate. Collect counts: ...; c(S → NP VP) += 12; c(S) += 2*12; ...
- Divide: p(S → NP VP | S) = c(S → NP VP) / c(S).
- May be wise to smooth.

Slide 12: True EM for parsing
- Similar, but now we consider all parses of each sentence.
- Parse our corpus of unparsed sentences: the 12 copies of "Today stocks were up" are now shared among the sentence's parses. The parse above receives fractional count 10.8, and an alternative parse that analyzes "Today stocks" as an NP, [S [NP [NP Today] [NP stocks]] [VP [V were] [PRT up]]], receives fractional count 1.2.
- Collect counts fractionally: ...; c(S → NP VP) += 10.8; c(S) += 2*10.8; ...; c(S → NP VP) += 1.2; c(S) += 1*1.2; ...
- But there may be exponentially many parses of a length-n sentence! How can we stay fast? Similar to taggings...

Slide 13: Analogies to α, β in PCFG?
- The ice-cream trellis: on Day 1 we observe 2 cones, on Days 2 and 3 we observe 3 cones; the hidden states are C and H.
- Day 1 (2 cones): α_C(1) = p(C|Start) * p(2|C) = 0.5 * 0.2 = 0.1, and α_H(1) = p(H|Start) * p(2|H) = 0.5 * 0.2 = 0.1.
- Arc weights for a 3-cone day: p(C|C)*p(3|C) = 0.8*0.1 = 0.08; p(C|H)*p(3|C) = 0.1*0.1 = 0.01; p(H|C)*p(3|H) = 0.1*0.7 = 0.07; p(H|H)*p(3|H) = 0.8*0.7 = 0.56.
- Day 2 (3 cones): α_C(2) = 0.1*0.08 + 0.1*0.01 = 0.009, and α_H(2) = 0.1*0.07 + 0.1*0.56 = 0.063.
- Day 3 (3 cones): α_C(3) = 0.009*0.08 + 0.063*0.01 = 0.00135, and α_H(3) = 0.009*0.07 + 0.063*0.56 = 0.03591.
- This is the dynamic programming computation of α. (β is similar but works back from Stop.) The quantities at state H are called α_H(2) and β_H(2) on Day 2, α_H(3) and β_H(3) on Day 3. (These α values are reproduced in the code sketch below.)

Slide 14: "Inside Probabilities"
- Sum over all VP parses of "flies like an arrow": β_VP(1,5) = p(flies like an arrow | VP).
- Sum over all S parses of "time flies like an arrow": β_S(0,5) = p(time flies like an arrow | S).
- (Same example tree and product as Slide 10: p(tree | S) = p(S → NP VP | S) * p(NP → time | NP) * p(VP → V PP | VP) * p(V → flies | V) * ...)

Slide 15: Compute β Bottom-Up by CKY
- β_VP(1,5) = p(flies like an arrow | VP) = ...
- β_S(0,5) = p(time flies like an arrow | S) = β_NP(0,1) * β_VP(1,5) * p(S → NP VP | S) + ...
- [Figure: CKY chart over the words, indexed "time 1 flies ..."; the rest is cut off in this preview. A minimal sketch of the bottom-up β computation follows.]
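The α values on the "Analogies to α, β in PCFG?" slide can be reproduced in a few lines. This is a minimal sketch, not the lecture's spreadsheet: it assumes the probabilities printed on that slide (p(C|Start) = p(H|Start) = 0.5; p(C|C) = p(H|H) = 0.8; p(H|C) = p(C|H) = 0.1; p(2|C) = p(2|H) = 0.2; p(3|C) = 0.1; p(3|H) = 0.7) and the standard forward recurrence α_s(t) = Σ_s' α_s'(t-1) * p(s|s') * p(obs_t|s).

```python
# Forward (alpha) probabilities for the ice-cream HMM, using the probabilities
# printed on the "Analogies to alpha, beta in PCFG?" slide.  The leftover 0.1 of
# transition mass from C and from H presumably goes to the Stop state, which only
# the backward (beta) pass mentioned on that slide would need.
TRANS = {("Start", "C"): 0.5, ("Start", "H"): 0.5,
         ("C", "C"): 0.8, ("C", "H"): 0.1,
         ("H", "C"): 0.1, ("H", "H"): 0.8}   # (from, to) -> p(to | from)
EMIT = {("C", 2): 0.2, ("H", 2): 0.2, ("C", 3): 0.1, ("H", 3): 0.7}  # (state, cones) -> prob

def forward(observations):
    """Return [alpha(1), alpha(2), ...], where alpha(t)[s] = p(first t observations, state s on day t)."""
    alphas = [{"Start": 1.0}]
    for obs in observations:
        prev = alphas[-1]
        alphas.append({s: sum(prev[r] * TRANS[(r, s)] * EMIT[(s, obs)]
                              for r in prev if (r, s) in TRANS)
                       for s in ("C", "H")})
    return alphas[1:]

for day, alpha in enumerate(forward([2, 3, 3]), start=1):
    print(day, alpha)
# Day 1: C = 0.1,     H = 0.1
# Day 2: C = 0.009,   H = 0.063    (matches the slide)
# Day 3: C = 0.00135, H = 0.03591  (up to floating-point rounding)
```

Running β in the other direction (back from Stop) and multiplying α_s(t) * β_s(t) gives the quantity the E step actually needs: the probability of the whole observation sequence together with being in state s on day t.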
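Slides 14-15 define the inside probability β_X(i,j) = p(words i..j | X) and compute it bottom-up, CKY style, summing over every way of building X over the span rather than keeping only the best one. The sketch below is a minimal illustration under stated assumptions: the grammar is a tiny hand-written fragment in Chomsky normal form that happens to generate the "time flies like an arrow" tree of Slide 10, and its rule probabilities are invented purely for the example; the point is only the recurrence β_X(i,j) += p(X → Y Z) * β_Y(i,k) * β_Z(k,j), plus the lexical base case.

```python
from collections import defaultdict

# Toy grammar (probabilities invented for illustration); in EM these would come
# from the previous M step's reestimation.
BINARY = [               # (lhs, left child, right child, probability)
    ("S", "NP", "VP", 1.0),
    ("VP", "V", "PP", 1.0),
    ("PP", "P", "NP", 1.0),
    ("NP", "Det", "N", 0.5),
]
LEXICAL = {              # (lhs, word) -> probability
    ("NP", "time"): 0.5, ("V", "flies"): 1.0, ("P", "like"): 1.0,
    ("Det", "an"): 1.0, ("N", "arrow"): 1.0,
}

def inside(words):
    """beta[(X, i, j)] = p(words[i:j] | X), summed over ALL parses of the span."""
    n = len(words)
    beta = defaultdict(float)
    for i, w in enumerate(words):                 # width-1 spans: lexical rules
        for (x, word), p in LEXICAL.items():
            if word == w:
                beta[(x, i, i + 1)] += p
    for width in range(2, n + 1):                 # wider spans, narrowest first
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):             # split point
                for x, y, z, p in BINARY:
                    beta[(x, i, j)] += p * beta[(y, i, k)] * beta[(z, k, j)]
    return beta

beta = inside("time flies like an arrow".split())
print(beta[("S", 0, 5)])    # p(time flies like an arrow | S) = 0.25 under this toy grammar
print(beta[("VP", 1, 5)])   # p(flies like an arrow | VP)     = 0.5
```

Because each chart cell accumulates a sum, β_S(0,n) totals the probability over all parses without enumerating them, which is what makes the fractional counts of true EM (Slide 12) affordable; the outside probabilities α of the later slides are then computed top-down ("reverse CKY"), reusing these β values.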

