(Conditional) Log-Linear Modeling
600.465 - Intro to NLP - J. Eisner

Outline: (Conditional) Log-Linear Modeling; Probability is Useful; Probability is Flexible; An Alternative Tradition; Nuthin' but adding weights; What if our weights were arbitrary real numbers?; The general problem; Finding the best y given x; Linear model notation; Probabilists Rally Behind Paradigm; Probabilists Regret Being Bound by Principle; Revolution Corrupted by Bourgeois Values; Right way to convert score to a probability (score is a log-prob, up to a constant); Training; Gradient-based training; Renormalize by 1/Z to get a Log-Linear Model; Why Bother?; Attempt to Cancel out Z; So: Modify Setup a Bit; Now we can cancel out Z; Maximum Entropy; Generalizing to More Features; What we just did; Overfitting; Solutions to Overfitting

Slide 2: Probability is Useful
We love probability distributions! We've learned how to define & use p(…) functions.
- Pick best output text T from a set of candidates
  - speech recognition (HW2); machine translation; OCR; spell correction …
  - maximize p1(T) for some appropriate distribution p1
- Pick best annotation T for a fixed input I
  - text categorization; parsing; part-of-speech tagging …
  - maximize p(T | I); equivalently, maximize the joint probability p(I, T)
  - often define p(I, T) by a noisy channel: p(I, T) = p(T) * p(I | T)
  - speech recognition & the other tasks above are cases of this too: we're maximizing an appropriate p1(T) defined by p(T | I)
- Pick best probability distribution (a meta-problem!)
  - really, pick best parameters θ: train an HMM, PCFG, n-grams, clusters …
  - maximum likelihood; smoothing; EM if unsupervised (incomplete data)
  - Bayesian smoothing: max_θ p(θ | data) = max_θ p(θ, data) = max_θ p(θ) p(data | θ)
(Summary of half of the course: statistics.)

Slide 3: Probability is Flexible
We love probability distributions! We've learned how to define & use p(…) functions.
- We want p(…) to define the probability of linguistic objects
  - Trees of (non)terminals (PCFGs; CKY, Earley, pruning, inside-outside)
  - Sequences of words, tags, morphemes, phonemes (n-grams, FSAs, FSTs; regex compilation, best paths, forward-backward, collocations)
  - Vectors (decision lists, Gaussians, naïve Bayes; Yarowsky, clustering/k-NN)
- We've also seen some not-so-probabilistic stuff
  - Syntactic features, semantics, morphology, Gold. Could they be stochasticized?
  - Methods can be quantitative & data-driven but not fully probabilistic: transformation-based learning, bottom-up clustering, LSA, competitive linking
  - But probabilities have wormed their way into most things
- p(…) has to capture our intuitions about the linguistic data
(Summary of the other half of the course: linguistics.)

Slide 4: An Alternative Tradition
Old AI hacking technique:
- Possible parses (or whatever) have scores.
- Pick the one with the best score.
- How do you define the score? Completely ad hoc!
  - Throw anything you want into the stew
  - Add a bonus for this, a penalty for that, etc.
- "Learns" over time – as you adjust bonuses and penalties by hand to improve performance.
- Total kludge, but totally flexible too …
  - Can throw in any intuitions you might have
(Really so alternative?)
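As a minimal sketch of the "pick best output text T" recipe from Slide 2 (not taken from the slides): each candidate is scored by the noisy-channel decomposition log p(I, T) = log p(T) + log p(I | T) and the argmax wins. The candidate strings, the probabilities, and the helper names (LOG_P_T, log_p_I_given_T, best_output) are all made up for illustration.

```python
import math

# Toy noisy-channel decoder: score each candidate T by
#     log p(I, T) = log p(T) + log p(I | T)
# and return the argmax. All numbers below are invented.

LOG_P_T = {                       # prior over outputs, e.g. a language model
    "their cat": math.log(0.6),
    "there cat": math.log(0.4),
}

def log_p_I_given_T(I, T):        # channel model: how likely T was corrupted into I
    table = {"their cat": 0.3, "there cat": 0.2}   # made-up channel probabilities
    return math.log(table[T])

def best_output(I, candidates):
    # Maximizing p(T | I) is the same as maximizing the joint p(I, T),
    # which in log space is nothing but adding two weights.
    return max(candidates, key=lambda T: LOG_P_T[T] + log_p_I_given_T(I, T))

print(best_output("thair cat", ["their cat", "there cat"]))   # -> their cat
```

Note that the decision rule never needs p(I) itself: since the input I is fixed, maximizing p(T | I) and maximizing p(I, T) select the same T, and the log-space score is just a sum of weights.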
Slide 5: Exposé at 9
"Probabilistic Revolution Not Really a Revolution, Critics Say. Log-probabilities no more than scores in disguise. 'We're just adding stuff up like the old corrupt regime did,' admits spokesperson."

Slide 6: Nuthin' but adding weights
- n-grams: … + log p(w7 | w5, w6) + log p(w8 | w6, w7) + …
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + …
- HMM tagging: … + log p(t7 | t5, t6) + log p(w7 | t7) + …
- Noisy channel: [log p(source)] + [log p(data | source)]
- Cascade of FSTs: [log p(A)] + [log p(B | A)] + [log p(C | B)] + …
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + …
Note: today we'll use +logprob, not -logprob; i.e., bigger weights are better.

Slide 7: Nuthin' but adding weights (continued)
- Can describe any linguistic object as a collection of "features" (here, a tree's "features" are all of its component rules; a different meaning of "features" from singular/plural/etc.)
- Weight of the object = total weight of its features
- Our weights have always been conditional log-probs (≤ 0); but what if we changed that?
- Change log p(this | that) to λ(this ; that)

Slides 8-9: What if our weights were arbitrary real numbers?
Change each log p(this | that) to an arbitrary real weight λ(this ; that):
- n-grams: … + λ(w7 ; w5, w6) + λ(w8 ; w6, w7) + …
- PCFG: λ(NP VP ; S) + λ(Papa ; NP) + λ(VP PP ; VP) + …
- HMM tagging: … + λ(t7 ; t5, t6) + λ(w7 ; t7) + …
- Noisy channel: [λ(source)] + [λ(data ; source)]
- Cascade of FSTs: [λ(A)] + [λ(B ; A)] + [λ(C ; B)] + …
- Naïve Bayes: λ(Class) + λ(feature1 ; Class) + λ(feature2 ; Class) + …
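Below is a toy sketch, again not from the slides, of the scoring scheme just described: an object is reduced to a bag of features, and its score is the total weight of those features. The feature names, weights, and the score helper are hypothetical. With conditional log-probability weights the sum reproduces the familiar model scores above; with arbitrary real weights it becomes the linear score that the later slides ("Linear model notation") build on.

```python
from collections import Counter

# "Weight of the object = total weight of its features."
# If every weight is a conditional log-prob (<= 0), this sum is exactly an
# HMM / PCFG / naive-Bayes log-probability; nothing in the arithmetic stops
# us from using arbitrary real-valued weights lambda(this ; that) instead.
# All feature names and numbers below are invented.

def score(features, weights):
    """Total weight of an object's features (unknown features weigh 0)."""
    return sum(count * weights.get(f, 0.0) for f, count in features.items())

# Features of one tiny tagged sentence, written in "(this ; that)" notation.
features = Counter({
    "Det ; <s>": 1,     # tag-bigram features
    "N ; Det": 1,
    "the ; Det": 1,     # word-given-tag features
    "cat ; N": 1,
})

# Arbitrary real-valued weights: no longer required to be <= 0, and no longer
# required to normalize within any conditioning context.
weights = {"Det ; <s>": 0.5, "N ; Det": 2.3, "the ; Det": 1.1, "cat ; N": -0.7}

print(score(features, weights))   # 0.5 + 2.3 + 1.1 - 0.7 = 3.2
```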