Maxent Models and Discriminative Estimation
Generative vs. Discriminative models
Christopher Manning

Introduction
• So far we've looked at "generative models"
  • Language models, Naive Bayes
• But there is now much use of conditional or discriminative probabilistic models in NLP, Speech, IR (and ML generally)
• Because:
  • They give high accuracy performance
  • They make it easy to incorporate lots of linguistically important features
  • They allow automatic building of language-independent, retargetable NLP modules

Joint vs. Conditional Models
• We have some data {(d, c)} of paired observations d and hidden classes c.
• Joint (generative) models place probabilities over both the observed data and the hidden stuff (generate the observed data from the hidden stuff): P(c,d)
• All the classic StatNLP models:
  • n-gram models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars, IBM machine translation alignment models
• Discriminative (conditional) models take the data as given, and put a probability over hidden structure given the data: P(c|d)
  • Logistic regression, conditional loglinear or maximum entropy models, conditional random fields
  • Also, SVMs, (averaged) perceptron, etc. are discriminative classifiers (but not directly probabilistic)

Bayes Net/Graphical Models
• Bayes net diagrams draw circles for random variables, and lines for direct dependencies
• Some variables are observed; some are hidden
• Each node is a little classifier (conditional probability table) based on incoming arcs
[Figure: two graphs over class c and observations d1, d2, d3: Naive Bayes (Generative) vs. Logistic Regression (Discriminative)]

Conditional vs. Joint Likelihood
• A joint model gives probabilities P(d,c) and tries to maximize this joint likelihood.
  • It turns out to be trivial to choose weights: just relative frequencies.
• A conditional model gives probabilities P(c|d). It takes the data as given and models only the conditional probability of the class.
  • We seek to maximize conditional likelihood.
  • Harder to do (as we'll see…)
  • More closely related to classification error.

Conditional models work well: Word Sense Disambiguation
• Even with exactly the same features, changing from joint to conditional estimation increases performance
• That is, we use the same smoothing and the same word-class features; we just change the numbers (parameters)

  Objective     Training Set Accuracy   Test Set Accuracy
  Joint Like.   86.8                    73.6
  Cond. Like.   98.5                    76.1

(Klein and Manning 2002, using Senseval-1 Data)
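The joint-versus-conditional contrast can be made concrete in code. Below is a minimal sketch (not from the slides), assuming scikit-learn is available: the same bag-of-words features are fed to a Naive Bayes model, whose parameters are essentially smoothed relative frequencies maximizing joint likelihood, and to a logistic regression model, whose parameters are fit iteratively to maximize (regularized) conditional likelihood. The toy texts, labels, and test sentence are invented purely for illustration.

```python
# Minimal sketch (assumes scikit-learn): identical features, two estimation objectives.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB         # joint/generative model of P(c, d)
from sklearn.linear_model import LogisticRegression   # conditional model of P(c | d)

# Toy labelled documents (illustrative only).
texts = [
    "stocks hit a yearly low",
    "bank agrees to restructure its debt",
    "team wins the final match",
    "striker scores twice in overtime",
]
labels = ["BUSINESS", "BUSINESS", "SPORTS", "SPORTS"]

vec = CountVectorizer()
X = vec.fit_transform(texts)           # the same bag-of-words features for both models

nb = MultinomialNB().fit(X, labels)    # parameters are (smoothed) relative frequencies
lr = LogisticRegression(max_iter=1000).fit(X, labels)  # parameters fit to conditional likelihood

X_new = vec.transform(["stocks slump as bank restructures debt"])
print(nb.classes_, nb.predict_proba(X_new))  # P(c | d) derived from the joint model via Bayes' rule
print(lr.classes_, lr.predict_proba(X_new))  # P(c | d) modeled directly
```

Both classifiers end up exposing a posterior over the same classes; only the estimation objective differs, which is the contrast the Klein and Manning numbers above illustrate.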
Discriminative Model Features
Making features from text for discriminative NLP models
Christopher Manning

Features
• In these slides and most maxent work: features f are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict
• A feature is a function with a bounded real value

Example features
• f1(c, d) ≡ [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
• f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
• f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]
• Models will assign to each feature a weight:
  • A positive weight votes that this configuration is likely correct
  • A negative weight votes that this configuration is likely incorrect
• Example contexts: LOCATION (in Québec), PERSON (saw Sue), DRUG (taking Zantac), LOCATION (in Arcadia)

Feature Expectations
• We will crucially make use of two expectations: actual or predicted counts of a feature firing
• Empirical count (expectation) of a feature:
  E_empirical(f_i) = Σ_{(c,d) ∈ observed(C,D)} f_i(c,d)
• Model expectation of a feature:
  E(f_i) = Σ_{(c,d) ∈ (C,D)} P(c,d) f_i(c,d)

Features
• In NLP uses, usually a feature specifies
  1. an indicator function – a yes/no boolean matching function – of properties of the input, and
  2. a particular class
  f_i(c, d) ≡ [Φ(d) ∧ c = c_j]   [Value is 0 or 1]
• Each feature picks out a data subset and suggests a label for it

Feature-Based Models
• The decision about a data point is based only on the features active at that point.
• Text Categorization: Data "BUSINESS: Stocks hit a yearly low …"; Features {…, stocks, hit, a, yearly, low, …}; Label: BUSINESS
• Word-Sense Disambiguation: Data "… to restructure bank:MONEY debt."; Features {…, w-1=restructure, w+1=debt, L=12, …}; Label: MONEY
• POS Tagging: Data "DT JJ NN … The previous fall …"; Features {w=fall, t-1=JJ, w-1=previous}; Label: NN

Example: Text Categorization (Zhang and Oles 2001)
• Features are presence of each word in a document and the document class (they do feature selection to use reliable indicator words)
• Tests on classic Reuters data set (and others)
  • Naïve Bayes: 77.0% F1
  • Linear regression: 86.0%
  • Logistic regression: 86.4%
  • Support vector machine: 86.5%
• Paper emphasizes the importance of regularization (smoothing) for successful use of discriminative methods (not used in much early NLP/IR work)

Other Maxent Classifier Examples
• You can use a maxent classifier whenever you want to assign data points to one of a number of classes:
  • Sentence boundary detection (Mikheev 2000): Is a period end of sentence or abbreviation?
  • Sentiment analysis (Pang and Lee 2002): Word unigrams, bigrams, POS counts, …
  • PP attachment (Ratnaparkhi 1998): Attach to verb or noun? Features of head noun, preposition, etc.
  • Parsing decisions in general (Ratnaparkhi 1997; Johnson et al. 1999, etc.)

Feature-based Linear Classifiers
How to put
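As a concrete illustration of the feature machinery above, here is a minimal Python sketch (not from the slides; the feature definition, toy data, and stand-in model distribution are illustrative assumptions). It implements an indicator feature in the style of f1 and computes its empirical count over observed (d, c) pairs and a model expectation. For simplicity it uses the conditional form, summing P(c|d) over the observed d's, with a uniform distribution standing in for a trained model.

```python
# Minimal sketch: an indicator feature and its empirical vs. model expectation.
# The feature, the toy (d, c) pairs, and the stand-in P(c | d) are all
# illustrative assumptions, not the slides' data.

CLASSES = ["LOCATION", "PERSON", "DRUG"]

def f1(c, d):
    """f1(c, d) = 1 iff c = LOCATION, the previous word is "in", and w is capitalized."""
    return 1 if (c == "LOCATION" and d["w_prev"] == "in" and d["w"][:1].isupper()) else 0

# Observed (d, c) pairs; each d is a small dict describing a token in context.
observed = [
    ({"w": "Quebec",  "w_prev": "in"},     "LOCATION"),
    ({"w": "Arcadia", "w_prev": "in"},     "LOCATION"),
    ({"w": "Zantac",  "w_prev": "taking"}, "DRUG"),
]

def p_model(c, d):
    """Stand-in for the model's P(c | d); a trained maxent model would go here."""
    return 1.0 / len(CLASSES)

# Empirical expectation: how often the feature actually fires in the observed data.
empirical_E = sum(f1(c, d) for d, c in observed)

# Model expectation: feature firings the model predicts, summing P(c | d) * f1(c, d)
# over the observed d's and all candidate classes.
model_E = sum(p_model(c, d) * f1(c, d) for d, _ in observed for c in CLASSES)

print(empirical_E)  # 2  (fires on "in Quebec" and "in Arcadia")
print(model_E)      # about 0.67 under the uniform stand-in
```

Maxent training adjusts the feature weights until the model expectation matches the empirical expectation for every feature, which is why these two quantities are singled out on the Feature Expectations slide.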