MIT 6 863J - The Red Pill or the Blue Pill - D129194

School name Massachusetts Institute of Technology

Course 6 863j- Natural Language and the Computer Representation of Knowledge

Pages 41

This preview shows page 1-2-3-19-20-39-40-41 out of 41 pages.


6.863J Natural Language Processing
Lecture 7: The Red Pill or the Blue Pill, Episode 2: part-of-speech tagging
Instructor: Robert C. [email protected]
6.863J/9.611J SP04 Lecture 7

The Menu Bar
• Administrivia:
  • Schedule alert: Lab1b due today
  • Lab 2b released, this Weds (later today)
• Agenda: Red vs. Blue:
  • Part of speech ‘tagging’ via statistical models
  • Part of speech tagging via rules
  • Ch. 6 & 8 in Jurafsky

The Great Divide in NLP: the red pill or the blue pill?
• “Knowledge Engineering” approach: rules built by hand w/ K of Language; “Text understanding”
• “Trainable Statistical” approach: rules inferred from lots of data (“corpora”); “Information retrieval”

Two approaches
1. Statistical model
2. Deterministic baseline tagger composed with a cascade of fixup transducers
These two approaches are the guts of Lab 2 (lots of other methods: decision trees, …)

The problem
• In unseen data, we wish to find the part of speech tags
• The set of part of speech tags is decided by experts

Noishy Chunnel Muddle (statistical)
• noisy channel X → Y
• real language X
• yucky language Y
• want to recover X from Y
• part-of-speech tags → insert words → text

A picture: the statistical, noisy channel view
[Figure: speech recognition, x (speech) and y (text), with candidates "Wreck a nice beach?", "Reckon eyes peach?", "Recognize speech?"; Acoustic Model P(x|y), Language Model P(y). Tagging, x (words) and y (tags); Word model P(w|T), Bigram Tag model P(T).]

Formulation, in general
$\mathrm{Label} = \arg\max_{\mathrm{Label}} \Pr(\mathrm{Label} \mid \mathrm{Data})$

General probabilistic decision problem
• E.g.: data = bunch of text
  • label = language
  • label = topic
  • label = author
• E.g. 2: (sequential prediction)
  • label = translation or summary of entire text
  • label = part of speech of current word
  • label = identity of current word (ASR) or character (OCR)

Language models – statistical view
• Application to speech recognition (and parsing, generally)
• x = input
(speech/words)
• y = output (text/Tags)
• We want to find max P(y|x)
• Problem: we don’t know the tags – that is what we want to find!
• Solution: We have an estimate of P(y) [the language model] and P(x|y) [the prob. of some sound/words given text/Tags = an acoustic model or word model]

Finding inner form given outside:
• From Bayes’ law, we have
  $\arg\max_y P(y \mid x) = \arg\max_y \frac{P(x \mid y)\,P(y)}{P(x)} = \arg\max_y P(x \mid y)\,P(y)$
  i.e., acoustic model x lang model (P(x) is held fixed, so the maximizing y is the same with or without the division by P(x))
• So, in tagging case, we have a word model instead, so we find
  $\arg\max_{\mathrm{tags}} P(\mathrm{tags} \mid \mathrm{words}) = \arg\max_{\mathrm{tags}} P(\mathrm{words} \mid \mathrm{tags})\,P(\mathrm{tags})$

HMM for POS tagging
• In a Hidden Markov model, it is hypothesized that there is an underlying finite state machine (not directly observable, hence hidden) that changes state with each input element
• For us, the hidden states are the tags, and the input elements are the words

Hidden Markov Model tagging for POS
• Prob(Tag sequence) – based on n-grams: train on marked up, tagged text
• Prob(W|T) – unigram prob, based on tagged text
• Prob(T, w) computed from Viterbi trellis computation: max over all possible tag sequence paths, and ‘emission’ probabilities of word, tag combination
• Unseen tag sequence

Cartoon form Review
• Tag sequence bigrams: P(T)
• Unigram: p(W | T)
• P(T) * p(W | T) = p(T, w): score candidate tag seqs on their joint probability with observed words; we should pick best path
[Figure: tag automaton over states Start, Det, Adj, Noun, Verb, Prep, Stop with arc weights Det 0.8, Adj 0.3, Noun 0.7, Adj 0.4, Noun 0.5, ε 0.1, ε 0.2; word arcs Det:the/0.4, Det:a/0.6, Adj:cool/0.003, Adj:directed/0.0005, Adj:cortege/0.000001, …, Noun:Bill/0.002, Noun:autos/0.001, …, Noun:cortege/0.000001; observed words: the cool directed autos]

HMM construction
• Hidden state transition model governs observed word sequences
• Transitions probabilistic
• Estimate transition probabilities from an annotated corpus
• state ‘s’ = tag state
• $P(s_j \mid s_{j-1}, w_j)$
• Based just on prior state and current word seen (hence Markovian
assumption)
• At runtime, find maximum likelihood path through the network, using a dynamic-programming best-path algorithm (Viterbi)

Cartoon form Review
• Tag sequence bigrams: P(T)
• Unigram: p(W | T)
• p(w | W)
• p(T, w): transducer scores candidate tag seqs on their joint probability with obs words; we should pick best path
[Figure: the tag automaton and word arcs of the first cartoon (Start, Det, Adj, Noun, Verb, Prep, Stop; Det 0.8, Adj 0.3, Noun 0.7, Adj 0.4, Noun 0.5, ε 0.1, ε 0.2), composed with the observed words: the cool directed autos]

P(T) - Tag bigram picture
[Figure: states BOS, Det, Adj, Noun, EOS with arcs Det 0.8, Adj 0.3, Adj 0.4, Noun 0.5, ε 0.2]
p(tag seq): BOS Det Adj Adj Noun EOS = 0.8 * 0.3 * 0.4 * 0.5 * 0.2

Unigram replacement model
P(word | tag), with each tag's arcs summing to 1:
• Noun:Bill/0.002, Noun:autos/0.001, …, Noun:cortege/0.000001
• Adj:cool/0.003, Adj:directed/0.0005, Adj:cortege/0.000001, …
• Det:the/0.4, Det:a/0.6

Compose with actual word seq
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
[Figure: composed machine over BOS, Det, Adj, Noun, EOS with arcs Det:a 0.48, Det:the 0.32; Adj:cool 0.0009, Adj:directed 0.00015, Adj:cortege 0.000003; Adj:cool 0.0012, Adj:directed 0.00020, Adj:cortege 0.000004; N:cortege, N:autos 0.00002; # 0.2]
Path for "the cool directed autos": 0.32 (D:the) x .0009 (A:cool) x .0002 (A:directed) x .00002 (N:autos) x .2 (#) ≈ .3 10-6 total path prob, done!
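The composition p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq) can be sketched in a few lines of Python. This is only an illustration, not the Lab 2 code: the dictionary and function names are made up here, and the numbers are the ones printed on the "P(T)" and "Unigram replacement model" slides. With those tables the Adj-to-Noun arc for "autos" comes out as 0.5 * 0.001, not the 0.00002 drawn on the composed picture, so the sketch does not try to reproduce the printed path total.

```python
# Tag-bigram probabilities P(next tag | previous tag), as read off the
# "P(T) - Tag bigram picture" slide. Missing pairs are treated as 0.
TAG_BIGRAM = {
    ("BOS", "Det"): 0.8,
    ("Det", "Adj"): 0.3,
    ("Adj", "Adj"): 0.4,
    ("Adj", "Noun"): 0.5,
    ("Noun", "EOS"): 0.2,
}

# Word-given-tag probabilities P(word | tag), from the
# "Unigram replacement model" slide. Missing pairs are treated as 0.
WORD_GIVEN_TAG = {
    ("Det", "the"): 0.4,
    ("Det", "a"): 0.6,
    ("Adj", "cool"): 0.003,
    ("Adj", "directed"): 0.0005,
    ("Noun", "autos"): 0.001,
}


def tag_sequence_probability(tags):
    """p(tag seq): product of bigram probabilities from BOS through EOS."""
    prob = 1.0
    prev = "BOS"
    for tag in list(tags) + ["EOS"]:
        prob *= TAG_BIGRAM.get((prev, tag), 0.0)
        prev = tag
    return prob


def joint_probability(words, tags):
    """p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)."""
    if len(words) != len(tags):
        raise ValueError("need exactly one tag per word")
    prob = tag_sequence_probability(tags)
    for word, tag in zip(words, tags):
        prob *= WORD_GIVEN_TAG.get((tag, word), 0.0)
    return prob


if __name__ == "__main__":
    words = ["the", "cool", "directed", "autos"]
    tags = ["Det", "Adj", "Adj", "Noun"]
    print(tag_sequence_probability(tags))  # 0.8 * 0.3 * 0.4 * 0.5 * 0.2
    print(joint_probability(words, tags))
```

Scoring one candidate path this way is cheap; the hard part is that the number of candidate tag sequences grows exponentially with sentence length, which is why the trellis and dynamic programming are needed.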
Unroll the fsa - All paths together form ‘trellis’
[Figure: trellis of p(word seq, tag seq) with one column of states Det, Adj, Noun per word between BOS and Stop; arcs include Det:the 0.32, Adj:cool 0.0009, Noun:cool 0.007, Adj:directed…, Noun:autos…, ε 0.2; observed words: the cool directed autos]
• The best path: BOS Det Adj Adj Noun EOS = 0.32 * 0.0009 …
• Adj:cool 0.0009, Noun:cool 0.007. WHY?

Cross-product construction forms trellis
[Figure: two automata with states 0 1 2 3 4 (one with ε arcs), combined by * to give (=) a trellis with product states 0,0; 1,1 2,1 3,1; 1,2 2,2 3,2; 1,3 2,3 3,3; 1,4 2,4 3,4; 4,4]
• So all paths here must have 5 words on output side
• All paths here are 5 words

Finding the best path from start to stop
• Use dynamic programming
• What is best path from Start to each node?
• Work from left to right
• Each node stores its best path from Start (as probability plus one backpointer)
• Special acyclic case of Dijkstra’s shortest-path algorithm
• Faster if some arcs/states are absent
[Figure: trellis fragment showing Start, Det, and the arc Det:the 0.32]
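The left-to-right procedure in the bullets above (each trellis node keeps its best probability from Start plus one backpointer) might look like the following Python sketch. It is a minimal illustration under stated assumptions, not the course's implementation: the function name `viterbi` and the table layout are invented here; Det→Noun 0.7 and Adj→EOS 0.1 are one possible reading of the "Noun 0.7" and "ε 0.1" arcs in the cartoon; and P(cool | Noun) = 0.01 is a hypothetical value chosen so that the composed arc Noun:cool is 0.7 * 0.01 = 0.007.

```python
def viterbi(words, tags, trans, emit, start="BOS", stop="EOS"):
    """Return (probability, tag list) of the best path through the trellis.

    trans[(prev_tag, tag)] and emit[(tag, word)] are probabilities;
    missing entries count as 0 (an absent arc).
    """
    if not words:
        return trans.get((start, stop), 0.0), []

    # best[i][tag] = (prob of best path from start to this node, backpointer)
    best = [{} for _ in words]
    for i, word in enumerate(words):
        for tag in tags:
            e = emit.get((tag, word), 0.0)
            if i == 0:
                best[i][tag] = (trans.get((start, tag), 0.0) * e, None)
            else:
                # Extend the best path into each possible previous node.
                best[i][tag] = max(
                    ((best[i - 1][p][0] * trans.get((p, tag), 0.0) * e, p)
                     for p in tags),
                    key=lambda pair: pair[0],
                )

    # Close off every path with the transition into the stop state.
    prob, last = max(
        ((best[-1][t][0] * trans.get((t, stop), 0.0), t) for t in tags),
        key=lambda pair: pair[0],
    )
    if prob == 0.0:
        return 0.0, []

    # Follow backpointers from right to left to recover the tag sequence.
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return prob, path


TAGS = ["Det", "Adj", "Noun"]
TRANS = {
    ("BOS", "Det"): 0.8, ("Det", "Adj"): 0.3, ("Adj", "Adj"): 0.4,
    ("Adj", "Noun"): 0.5, ("Noun", "EOS"): 0.2,
    ("Det", "Noun"): 0.7,  # assumed reading of the "Noun 0.7" arc
    ("Adj", "EOS"): 0.1,   # assumed reading of the "ε 0.1" arc
}
EMIT = {
    ("Det", "the"): 0.4, ("Det", "a"): 0.6, ("Adj", "cool"): 0.003,
    ("Adj", "directed"): 0.0005, ("Noun", "autos"): 0.001,
    ("Noun", "cool"): 0.01,  # hypothetical, not from the slides
}

if __name__ == "__main__":
    print(viterbi(["the", "cool", "directed", "autos"], TAGS, TRANS, EMIT))
```

In this toy table the arc Noun:cool (0.007) outweighs Adj:cool (0.0009) locally, yet the best complete path still tags "cool" as Adj, because a path through Noun has no continuation for "directed". The work is proportional to (number of words) x (number of tags)^2, and absent arcs simply contribute zero.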
