Johns Hopkins EN 600 465 - LECTURE 34 Structured Prediction with Perceptrons and CRFs


Slide titles (outline): Structured Prediction with Perceptrons and CRFs; The general problem; Remember Weighted CKY … (find the minimum-weight parse); We used weighted CKY to implement probabilistic CKY for PCFGs; Can set weights to log probs; Probability is Useful; Probability is Flexible; An Alternative Tradition; Scoring by Linear Models; Linear model notation; Finding the best y given x; When can you efficiently choose best y?; Nuthin' but adding weights; What if our weights were arbitrary real numbers?; Probabilists Rally Behind Paradigm; Probabilists Regret Being Bound by Principle; News Flash! Hope arrives …; Gradient-based training; Why Bother?; News Flash! More hope …; Maximum Entropy; Generalizing to More Features; What we just did; Overfitting; Solutions to Overfitting

Slide 1: Structured Prediction with Perceptrons and CRFs
[Figure: several competing parse trees of "Time flies like an arrow", each built on a different tag sequence ("flies" as verb or noun, "like" as preposition or verb, "time" as noun or verb). Which one is best??]

Slide 2: Structured Prediction with Perceptrons and CRFs
But now, model structures! Back to conditional log-linear modeling …

Slides 3-5: each input email is shown with its two candidate outputs, good mail vs. spam:
- "Reply today to claim your …" (good mail / spam)
- "Wanna get pizza tonight?" (good mail / spam)
- "Thx; consider enlarging the …" (good mail / spam)
- "Enlarge your hidden …" (good mail / spam)

Slides 6-7: [Figure: when the input x is a nonterminal, the candidate outputs y are its possible rewrites, e.g. S → NP VP, S → NP[+wh] V S/V/NP, S → VP NP, S → PP P, S → N VP, S → Det N, S → …; likewise NP → NP VP, NP → NP CP/NP, NP → VP NP, NP → NP PP, NP → N VP, NP → Det N, NP → …]

Slides 8-9: [Figure: the same sentence "Time flies like an arrow" shown repeatedly, paired with several candidate taggings (slide 8) and several candidate parse trees (slide 9). Structured prediction.]

Slide 10: The general problem
- Given some input x
  - Occasionally empty, e.g., no input needed for a generative n-gram or model of strings (randsent)
- Consider a set of candidate outputs y
  - Classifications for x (small number: often just 2)
  - Taggings of x (exponentially many)
  - Parses of x (exponential, even infinite)
  - Translations of x (exponential, even infinite)
  - …
- Want to find the "best" y, given x
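To make the general problem concrete, here is a minimal Python sketch (mine, not from the lecture; the names predict, mail_candidates, and mail_score are hypothetical): structured prediction as a brute-force argmax over an explicit candidate set. This is only feasible when the candidates can be enumerated, as in the good-mail/spam example above; for taggings or parses the candidate set is exponentially large, which is exactly why the later slides care about scoring functions that still permit efficient search.

def predict(x, candidates, score):
    """Return the highest-scoring candidate output y for input x."""
    return max(candidates(x), key=lambda y: score(x, y))

# Toy instantiation: classification, where each input has just 2 candidates.
def mail_candidates(x):
    return ["good mail", "spam"]

def mail_score(x, y):
    # Hand-set bonus in the spirit of the "ad hoc scoring" tradition
    # discussed later in the lecture; purely illustrative.
    return 1.0 if y == "spam" and "enlarge" in x.lower() else 0.0

print(predict("Enlarge your hidden ...", mail_candidates, mail_score))   # -> spam
print(predict("Wanna get pizza tonight?", mail_candidates, mail_score))  # -> good mail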
Slide 11: Remember Weighted CKY … (find the minimum-weight parse)
Weighted grammar:
1  S  → NP VP
6  S  → Vst NP
2  S  → S PP
1  VP → V NP
2  VP → VP PP
1  NP → Det N
2  NP → NP PP
3  NP → NP NP
0  PP → P NP
[Chart for "time flies like an arrow", listing each constituent with its total weight:
  time: NP 3, Vst 3;  time flies: NP 10, S 8;  time flies like an arrow: NP 24, S 22
  flies: NP 4, VP 4;  flies like an arrow: NP 18, S 21, VP 18
  like: P 2, V 5;  like an arrow: PP 12, VP 16
  an: Det 1;  an arrow: NP 10
  arrow: N 8]

Slide 12: We used weighted CKY to implement probabilistic CKY for PCFGs
Same grammar and chart, with each weight read as a negative log2 probability: weights add where probabilities multiply. For example, the weights 8, 12, and 2 that build the sentence-spanning S of weight 22 correspond to probabilities 2^-8, 2^-12, and 2^-2, which multiply to give 2^-22.
But is weighted CKY good for anything else??

Slide 13: Can set weights to log probs
[Figure: the parse tree (S (NP time) (VP (VP flies) (PP (P like) (NP (Det an) (N arrow)))))]
w(tree | S) = w(S → NP VP) + w(NP → time) + w(VP → VP PP) + w(VP → flies) + …
Just let w(X → Y Z) = -log p(X → Y Z | X).
Then the lightest tree has the highest probability.
But is weighted CKY good for anything else? Do the weights have to be probabilities?

Slide 14: Probability is Useful
We love probability distributions! We've learned how to define & use p(…) functions.
- Pick best output text T from a set of candidates
  - speech recognition (HW2); machine translation; OCR; spell correction …
  - maximize p1(T) for some appropriate distribution p1
- Pick best annotation T for a fixed input I
  - text categorization; parsing; part-of-speech tagging …
  - maximize p(T | I); equivalently maximize joint probability p(I, T)
  - often define p(I, T) by noisy channel: p(I, T) = p(T) * p(I | T)
  - speech recognition & other tasks above are cases of this too: we're maximizing an appropriate p1(T) defined by p(T | I)
- Pick best probability distribution (a meta-problem!)
  - really, pick best parameters θ: train HMM, PCFG, n-grams, clusters …
  - maximum likelihood; smoothing; EM if unsupervised (incomplete data)
  - Bayesian smoothing: max p(θ | data) = max p(θ, data) = p(θ) p(data | θ)
[summary of half of the course (statistics)]

Slide 15: Probability is Flexible
We love probability distributions! We've learned how to define & use p(…) functions.
- We want p(…) to define probability of linguistic objects
  - Trees of (non)terminals (PCFGs; CKY, Earley, pruning, inside-outside)
  - Sequences of words, tags, morphemes, phonemes (n-grams, FSAs, FSTs; regex compilation, best-paths, forward-backward, collocations)
  - Vectors (decision lists, Gaussians, naïve Bayes; Yarowsky, clustering/k-NN)
- We've also seen some not-so-probabilistic stuff
  - Syntactic features, semantics, morph., Gold. Could be stochasticized?
  - Methods can be quantitative & data-driven but not fully probabilistic: transformation-based learning, bottom-up clustering, LSA, competitive linking
  - But probabilities have wormed their way into most things
- p(…) has to capture our intuitions about the ling. data
[summary of other half of the course (linguistics)]

Slide 16: An Alternative Tradition
Old AI hacking technique:
- Possible parses (or whatever) have scores.
- Pick the one with the best score.
- How do you define the score?
  - Completely ad hoc!
  - Throw anything you want into the stew
  - Add a bonus for this, a penalty for that, etc.
- "Learns" over time, as you adjust bonuses and penalties by hand to improve performance.
- Total kludge, but totally flexible too …
  - Can throw in any intuitions you might have

Scoring by Linear Models
- Given some input x
- Consider a set of candidate outputs y
- Define a scoring function score(x, y)
  - Linear function: a sum of feature weights (you pick the features!)
- Choose the y that maximizes score(x, y)
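As a companion to the "sum of feature weights" idea that this excerpt ends on, here is a small sketch (my illustration, not the lecture's code; the feature names and weight values are invented) of a linear scoring model: score(x, y) is the dot product of a weight vector with a vector of feature counts f(x, y), and prediction picks the candidate y with the highest score. The log-prob trick from the weighted-CKY slides is the special case where each feature counts one grammar rule and its weight is log p(rule); then the best-scoring parse is also the most probable one.

from collections import Counter

def linear_score(x, y, weights, features):
    """score(x, y) = sum over features k of weights[k] * f_k(x, y)."""
    return sum(weights.get(name, 0.0) * count
               for name, count in features(x, y).items())

def best_output(x, candidates, weights, features):
    """Choose the candidate y that maximizes the linear score."""
    return max(candidates, key=lambda y: linear_score(x, y, weights, features))

# Toy example: score two candidate taggings of "time flies like an arrow".
# The features (word/tag pairs and tag bigrams) and the weights below are
# hypothetical, chosen only to show the mechanics.
def tagging_features(x, y):
    f = Counter()
    for word, tag in zip(x, y):
        f[f"tag:{word}/{tag}"] += 1       # word/tag pair features
    for prev, cur in zip(y, y[1:]):
        f[f"bigram:{prev}_{cur}"] += 1    # tag bigram features
    return f

weights = {"tag:flies/V": 2.0, "tag:flies/N": 0.5, "bigram:N_V": 1.0}
sentence = "time flies like an arrow".split()
candidates = [["N", "V", "P", "Det", "N"],   # "flies" as verb
              ["N", "N", "V", "Det", "N"]]   # "time flies" as noun compound
print(best_output(sentence, candidates, weights, tagging_features))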

