(Conditional) Log-Linear Modeling
600.465 - Intro to NLP - J. Eisner

Outline: (Conditional) Log-Linear Modeling; Probability is Useful; Probability is Flexible; An Alternative Tradition; Nuthin' but adding weights; What if our weights were arbitrary real numbers?; The general problem; Finding the best y given x; Linear model notation; Probabilists Rally Behind Paradigm; Probabilists Regret Being Bound by Principle; Revolution Corrupted by Bourgeois Values; Right way to convert score to a probability (score is a log-prob, up to a constant); Training; Gradient-based training; Renormalize by 1/Z to get a Log-Linear Model; Why Bother?; Attempt to Cancel out Z; So: Modify Setup a Bit; Now we can cancel out Z; Maximum Entropy; Generalizing to More Features; What we just did; Overfitting; Solutions to Overfitting

Slide 2: Probability is Useful
We love probability distributions! We've learned how to define & use p(…) functions.
- Pick best output text T from a set of candidates
  - speech recognition (HW2); machine translation; OCR; spell correction …
  - maximize p1(T) for some appropriate distribution p1
- Pick best annotation T for a fixed input I
  - text categorization; parsing; part-of-speech tagging …
  - maximize p(T | I); equivalently, maximize the joint probability p(I, T)
  - often define p(I, T) by a noisy channel: p(I, T) = p(T) * p(I | T)
  - speech recognition & the other tasks above are cases of this too: we're maximizing an appropriate p1(T) defined by p(T | I)
- Pick best probability distribution (a meta-problem!)
  - really, pick best parameters θ: train an HMM, PCFG, n-grams, clusters …
  - maximum likelihood; smoothing; EM if unsupervised (incomplete data)
  - Bayesian smoothing: max_θ p(θ | data) = max_θ p(θ, data) = max_θ p(θ) p(data | θ)
(Summary of half of the course: statistics.)

Slide 3: Probability is Flexible
We love probability distributions! We've learned how to define & use p(…) functions.
- We want p(…) to define the probability of linguistic objects
  - Trees of (non)terminals (PCFGs; CKY, Earley, pruning, inside-outside)
  - Sequences of words, tags, morphemes, phonemes (n-grams, FSAs, FSTs; regex compilation, best paths, forward-backward, collocations)
  - Vectors (decision lists, Gaussians, naïve Bayes; Yarowsky, clustering/k-NN)
- We've also seen some not-so-probabilistic stuff
  - Syntactic features, semantics, morphology, Gold. Could they be stochasticized?
  - Methods can be quantitative & data-driven but not fully probabilistic: transformation-based learning, bottom-up clustering, LSA, competitive linking
  - But probabilities have wormed their way into most things
- p(…) has to capture our intuitions about the linguistic data
(Summary of the other half of the course: linguistics.)

Slide 4: An Alternative Tradition
Old AI hacking technique:
- Possible parses (or whatever) have scores.
- Pick the one with the best score.
- How do you define the score? Completely ad hoc!
  - Throw anything you want into the stew
  - Add a bonus for this, a penalty for that, etc.
- "Learns" over time – as you adjust bonuses and penalties by hand to improve performance.
- Total kludge, but totally flexible too …
  - Can throw in any intuitions you might have
(Really so alternative?)
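As a minimal sketch of the "pick best output text T" recipe from Slide 2 (not taken from the slides): each candidate is scored by the noisy-channel decomposition log p(I, T) = log p(T) + log p(I | T) and the argmax wins. The candidate strings, the probabilities, and the helper names (LOG_P_T, log_p_I_given_T, best_output) are all made up for illustration.

```python
import math

# Toy noisy-channel decoder: score each candidate T by
#     log p(I, T) = log p(T) + log p(I | T)
# and return the argmax. All numbers below are invented.

LOG_P_T = {                       # prior over outputs, e.g. a language model
    "their cat": math.log(0.6),
    "there cat": math.log(0.4),
}

def log_p_I_given_T(I, T):        # channel model: how likely T was corrupted into I
    table = {"their cat": 0.3, "there cat": 0.2}   # made-up channel probabilities
    return math.log(table[T])

def best_output(I, candidates):
    # Maximizing p(T | I) is the same as maximizing the joint p(I, T),
    # which in log space is nothing but adding two weights.
    return max(candidates, key=lambda T: LOG_P_T[T] + log_p_I_given_T(I, T))

print(best_output("thair cat", ["their cat", "there cat"]))   # -> their cat
```

Note that the decision rule never needs p(I) itself: since the input I is fixed, maximizing p(T | I) and maximizing p(I, T) select the same T, and the log-space score is just a sum of weights.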
Slide 5: Exposé at 9
"Probabilistic Revolution Not Really a Revolution, Critics Say. Log-probabilities no more than scores in disguise. 'We're just adding stuff up like the old corrupt regime did,' admits spokesperson."

Slide 6: Nuthin' but adding weights
- n-grams: … + log p(w7 | w5, w6) + log p(w8 | w6, w7) + …
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + …
- HMM tagging: … + log p(t7 | t5, t6) + log p(w7 | t7) + …
- Noisy channel: [log p(source)] + [log p(data | source)]
- Cascade of FSTs: [log p(A)] + [log p(B | A)] + [log p(C | B)] + …
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + …
Note: today we'll use +logprob, not -logprob; i.e., bigger weights are better.

Slide 7: Nuthin' but adding weights (continued)
- Can describe any linguistic object as a collection of "features" (here, a tree's "features" are all of its component rules; a different meaning of "features" from singular/plural/etc.)
- Weight of the object = total weight of its features
- Our weights have always been conditional log-probs (≤ 0); but what if we changed that?
- Change log p(this | that) to λ(this ; that)

Slides 8-9: What if our weights were arbitrary real numbers?
Change each log p(this | that) to an arbitrary real weight λ(this ; that):
- n-grams: … + λ(w7 ; w5, w6) + λ(w8 ; w6, w7) + …
- PCFG: λ(NP VP ; S) + λ(Papa ; NP) + λ(VP PP ; VP) + …
- HMM tagging: … + λ(t7 ; t5, t6) + λ(w7 ; t7) + …
- Noisy channel: [λ(source)] + [λ(data ; source)]
- Cascade of FSTs: [λ(A)] + [λ(B ; A)] + [λ(C ; B)] + …
- Naïve Bayes: λ(Class) + λ(feature1 ; Class) + λ(feature2 ; Class) + …
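Below is a toy sketch, again not from the slides, of the scoring scheme just described: an object is reduced to a bag of features, and its score is the total weight of those features. The feature names, weights, and the score helper are hypothetical. With conditional log-probability weights the sum reproduces the familiar model scores above; with arbitrary real weights it becomes the linear score that the later slides ("Linear model notation") build on.

```python
from collections import Counter

# "Weight of the object = total weight of its features."
# If every weight is a conditional log-prob (<= 0), this sum is exactly an
# HMM / PCFG / naive-Bayes log-probability; nothing in the arithmetic stops
# us from using arbitrary real-valued weights lambda(this ; that) instead.
# All feature names and numbers below are invented.

def score(features, weights):
    """Total weight of an object's features (unknown features weigh 0)."""
    return sum(count * weights.get(f, 0.0) for f, count in features.items())

# Features of one tiny tagged sentence, written in "(this ; that)" notation.
features = Counter({
    "Det ; <s>": 1,     # tag-bigram features
    "N ; Det": 1,
    "the ; Det": 1,     # word-given-tag features
    "cat ; N": 1,
})

# Arbitrary real-valued weights: no longer required to be <= 0, and no longer
# required to normalize within any conditioning context.
weights = {"Det ; <s>": 0.5, "N ; Det": 2.3, "the ; Det": 1.1, "cat ; N": -0.7}

print(score(features, weights))   # 0.5 + 2.3 + 1.1 - 0.7 = 3.2
```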