600.465 - Intro to NLP - J. Eisner

The Expectation Maximization (EM) Algorithm ... continued!

General Idea
Repeat until convergence!
- Start by devising a noisy channel: any model that predicts the corpus observations via some hidden structure (tags, parses, ...).
- Initially guess the parameters of the model! An educated guess is best, but random can work.
- Expectation step: use the current parameters (and the observations) to reconstruct the hidden structure.
- Maximization step: use that hidden structure (and the observations) to reestimate the parameters.

General Idea (as a picture)
[Diagram: an initial guess of the unknown parameters (probabilities) feeds the E step, which combines it with the observed structure (words, ice cream) to produce a guess of the unknown hidden structure (tags, parses, weather); the M step then uses that hidden structure, plus the observations, to produce a new guess of the parameters, and the loop repeats.]

For Hidden Markov Models
[The same E step / M step loop: the observed structure is the word or ice cream sequence, and the hidden structure is the tag or weather sequence.]

Grammar Reestimation
[Diagram: a LEARNER turns training trees into a grammar (the M step); a PARSER uses that grammar to parse test sentences (the E step); a scorer compares the parser's output against the correct test trees to measure accuracy. Training trees are expensive and/or from the wrong sublanguage; raw test sentences are cheap, plentiful, and appropriate.]

EM by Dynamic Programming: Two Versions
- The Viterbi approximation
  - Expectation: pick the best parse of each sentence.
  - Maximization: retrain on this best-parsed corpus.
  - Advantage: speed!
- Real EM (why slower?)
  - Expectation: find all parses of each sentence.
  - Maximization: retrain on all parses in proportion to their probability (as if we had observed fractional counts).
  - Advantage: p(training corpus) is guaranteed to increase.
  - There are exponentially many parses, so don't extract them from the chart - we need some kind of clever counting.

Examples of EM
- Finite-state case: Hidden Markov Models ("forward-backward" or "Baum-Welch" algorithm). Applications:
  - explain ice cream in terms of an underlying weather sequence
  - explain words in terms of an underlying tag sequence
  - explain a phoneme sequence in terms of an underlying word
  - explain a sound sequence in terms of an underlying phoneme
- Context-free case: Probabilistic CFGs ("inside-outside" algorithm: unsupervised grammar learning!)
  - Explain raw text in terms of an underlying context-free parse.
  - In practice, the local-maximum problem gets in the way, but EM can still improve a good starting grammar using raw text.
- Clustering case: explain points via clusters.
- Compose these?

Our old friend PCFG
[The usual parse of "time flies like an arrow": S → NP VP, NP → time, VP → V PP, V → flies, PP → P NP, P → like, NP → Det N, Det → an, N → arrow.]
p(tree | S) = p(S → NP VP | S) * p(NP → time | NP) * p(VP → V PP | VP) * p(V → flies | V) * ...

Viterbi reestimation for parsing
- Start with a "pretty good" grammar: e.g., one trained on supervised data (a treebank) that is small, imperfectly annotated, or has sentences in a different style from what you want to parse.
- Parse a corpus of unparsed sentences.
- Reestimate:
  - Collect counts: ...; c(S → NP VP) += 12; c(S) += 2*12; ...
  - Divide: p(S → NP VP | S) = c(S → NP VP) / c(S)
  - It may be wise to smooth.
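The collect-and-divide step is just relative-frequency estimation over (possibly fractional) rule counts. Below is a minimal sketch in Python, assuming each parse is represented as a weight (the number of corpus copies under Viterbi reestimation, or a fractional posterior count under true EM) together with the list of rule occurrences it contains; the function name and data layout are illustrative, not the lecture's.

    from collections import defaultdict

    def reestimate(weighted_parses, smooth=0.0):
        """weighted_parses: iterable of (weight, rules), where rules lists every
        rule occurrence in the parse as an (lhs, rhs) pair, e.g. ("S", ("NP", "VP")).
        Returns relative-frequency estimates p(lhs -> rhs | lhs)."""
        rule_count = defaultdict(float)            # c(lhs -> rhs)
        lhs_count = defaultdict(float)             # c(lhs): one increment per lhs node
        expansions = defaultdict(set)              # distinct expansions seen per lhs
        for weight, rules in weighted_parses:
            for lhs, rhs in rules:
                rule_count[(lhs, rhs)] += weight   # e.g. c(S -> NP VP) += 12
                lhs_count[lhs] += weight           # a parse with two S nodes adds 2*12 to c(S)
                expansions[lhs].add(rhs)
        # Divide, with optional add-lambda smoothing over the expansions we saw.
        return {(lhs, rhs): (c + smooth) / (lhs_count[lhs] + smooth * len(expansions[lhs]))
                for (lhs, rhs), c in rule_count.items()}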
[Figure: the corpus contains 12 copies of the sentence "Today stocks were up", and its best parse: S → AdvP S, with AdvP → Today, S → NP VP, NP → stocks, VP → V PRT, V → were, PRT → up. Each rule occurrence in that parse has its count incremented by 12.]

True EM for parsing
Similar, but now we consider all parses of each sentence.
- Parse our corpus of unparsed sentences.
- Collect counts fractionally:
  ...; c(S → NP VP) += 10.8; c(S) += 2*10.8; ...
  ...; c(S → NP VP) += 1.2; c(S) += 1*1.2; ...
- But there may be exponentially many parses of a length-n sentence! How can we stay fast? Similar to taggings...
[Figure: the same 12 copies of "Today stocks were up", now with two parses: the AdvP analysis above, weighted 10.8, and an analysis whose subject is NP → NP NP ("Today stocks"), weighted 1.2.]

Analogies to α, β in PCFG?
[The ice cream trellis from the forward-backward lecture, shown as the dynamic programming computation of α. (β is similar, but works back from Stop.)]
- Day 1 (2 cones): αC(1) = p(C|Start)*p(2|C) = 0.5*0.2 = 0.1; αH(1) = p(H|Start)*p(2|H) = 0.5*0.2 = 0.1.
- Day 2 (3 cones): αC(2) = αC(1)*p(C|C)*p(3|C) + αH(1)*p(C|H)*p(3|C) = 0.1*0.08 + 0.1*0.01 = 0.009; αH(2) = αC(1)*p(H|C)*p(3|H) + αH(1)*p(H|H)*p(3|H) = 0.1*0.07 + 0.1*0.56 = 0.063.
- Day 3 (3 cones): αC(3) = 0.009*0.08 + 0.063*0.01 = 0.00135; αH(3) = 0.009*0.07 + 0.063*0.56 = 0.03591.
Call these αH(2) and βH(2), αH(3) and βH(3).

"Inside Probabilities"
- Sum over all VP parses of "flies like an arrow": βVP(1,5) = p(flies like an arrow | VP).
- Sum over all S parses of "time flies like an arrow": βS(0,5) = p(time flies like an arrow | S).
[The example tree is the same parse of "time flies like an arrow" as before, with p(tree | S) = p(S → NP VP | S) * p(NP → time | NP) * p(VP → V PP | VP) * p(V → flies | V) * ...]

Compute Bottom-Up by CKY
- βVP(1,5) = p(flies like an arrow | VP) = ...
- βS(0,5) = p(time flies like an arrow | S) = βNP(0,1) * βVP(1,5) * p(S → NP VP | S) + ...
[The CKY chart for "time flies like an arrow" from the earlier parsing lecture, filled in with the weighted grammar: 1 S → NP VP, 6 S → Vst NP, 2 S → S PP, 1 VP → V NP, 2 VP → VP PP, 1 NP → Det N, 2 NP → NP PP, 3 NP → NP NP, 0 PP → P NP. Nonempty cells include (0,1): NP 3, Vst 3; (0,2): NP 10, S 8, S 13; (0,5): NP 24, NP 24, S 22, S 27, S 27; (1,2): NP 4, VP 4; (1,5): NP 18, S 21, VP 18; (2,3): P 2, V 5; (2,5): PP 12, VP 16; (3,4): Det 1; (3,5): NP 10.]
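A minimal sketch of that bottom-up computation, assuming a PCFG in Chomsky normal form; the chart layout (a dict of cells keyed by span) and the names `binary` and `lexical` are assumptions for illustration, not the lecture's notation. Where Viterbi CKY would take a max over split points, the inside computation sums.

    from collections import defaultdict

    def inside(words, binary, lexical):
        """chart[(i, k)][X] = beta_X(i, k) = p(words[i:k] | X), summed over all parses.
        binary:  maps (Y, Z) -> list of (X, p(X -> Y Z | X))
        lexical: maps word   -> list of (X, p(X -> word | X))"""
        n = len(words)
        chart = defaultdict(lambda: defaultdict(float))
        for i, w in enumerate(words):                     # width-1 spans: X -> word
            for X, p in lexical.get(w, []):
                chart[(i, i + 1)][X] += p
        for width in range(2, n + 1):                     # build wider spans bottom-up
            for i in range(n - width + 1):
                k = i + width
                cell = chart[(i, k)]
                for j in range(i + 1, k):                 # every split point
                    for Y, pY in chart[(i, j)].items():
                        for Z, pZ in chart[(j, k)].items():
                            for X, p_rule in binary.get((Y, Z), []):
                                # beta_X(i,k) += p(X -> Y Z | X) * beta_Y(i,j) * beta_Z(j,k)
                                cell[X] += p_rule * pY * pZ
        return chart

    # chart[(0, 5)]["S"] would then be beta_S(0,5) = p(time flies like an arrow | S).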
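For comparison with the trellis on the "Analogies to α, β" slide, here is a sketch of the finite-state analogue: computing the forward probabilities α for the three ice cream days. Only the parameters visible on that slide are used; the dictionary names and code organization are illustrative.

    # Ice cream HMM parameters as shown on the trellis slide.
    start = {"C": 0.5, "H": 0.5}                    # p(state | Start)
    trans = {("C", "C"): 0.8, ("C", "H"): 0.1,      # p(next state | previous state)
             ("H", "H"): 0.8, ("H", "C"): 0.1}
    emit = {("C", 2): 0.2, ("H", 2): 0.2,           # p(# cones | weather)
            ("C", 3): 0.1, ("H", 3): 0.7}

    def forward(observations):
        """alphas[t][s] = p(first t+1 observations, state s on day t+1)."""
        alphas = []
        for t, obs in enumerate(observations):
            alpha = {}
            for s in ("C", "H"):
                if t == 0:
                    alpha[s] = start[s] * emit[(s, obs)]
                else:
                    alpha[s] = sum(alphas[-1][prev] * trans[(prev, s)] * emit[(s, obs)]
                                   for prev in ("C", "H"))
            alphas.append(alpha)
        return alphas

    # Matches the trellis (up to rounding): day 1: alphaC=0.1, alphaH=0.1;
    # day 2: alphaC=0.009, alphaH=0.063; day 3: alphaC=0.00135, alphaH=0.03591.
    print(forward([2, 3, 3]))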