MSU CSE 842 - Lecture 19: Maximum Entropy Model

CSE 842: Natural Language Processing
Lecture 19: Maximum Entropy Model

Regression versus Classification
• Mapping input features to some output value:
– Regression: the output is real-valued
– Classification: the output is one of a discrete set of classes

Classification
• Given a set of classes C = {c_1, c_2, …, c_n} and an observation x, the task is to identify which element of C the observation x belongs to.
• Examples:
– End-of-sentence boundary detection
– Email spam recognition
– Sentiment analysis
– Word sense disambiguation
– Text classification

Supervised Learning
• Given a set of pairs (x, y), where y is a label (or class) and x is an observation, discover a function that assigns the correct label to each x.
• The function could take many forms:
– Rules
– Decision trees
– Probabilistic models
– Etc.
• What we have encountered so far:
– Decision lists
– Naïve Bayes
– Hidden Markov models (a sequence model)

A Probabilistic Classifier
• Predicts a probability distribution over all classes for a given input pattern.
• General problem:
– An input domain X and a finite class domain Y
– The goal is to provide a conditional probability P(y | x) for any x ∈ X and y ∈ Y
• Today: a brief introduction to logistic regression and the maximum entropy model.

Linear Regression
• Multiple linear regression, e.g. predicting a house price from several features:
Price = c + w_1 · Num_Adjectives + w_2 · Mortgage_Rate + w_3 · Num_Unsold_Houses
• Learning in linear regression means minimizing the sum-squared error of the predictions:
$$y_{\text{pred}} = \sum_{j=0}^{M} w_j \times f_j, \qquad \text{cost}(W) = \sum_{i=1}^{N} \left(y^{(i)}_{\text{pred}} - y^{(i)}_{\text{obs}}\right)^2$$
• It has a closed-form solution (a code sketch appears at the end of this section):
$$W = (X^{T} X)^{-1} X^{T} \vec{y}$$

Logistic Regression Model
• Model the log-ratio of the positive class to the negative class as a linear function of the input:
$$\log \frac{p(y=1 \mid \vec{x})}{p(y=-1 \mid \vec{x})} = \vec{x} \cdot \vec{w} + c$$
• Results:
$$\frac{p(y=1 \mid \vec{x})}{p(y=-1 \mid \vec{x})} = \exp(\vec{x} \cdot \vec{w} + c), \qquad p(y=1 \mid \vec{x}) + p(y=-1 \mid \vec{x}) = 1$$
$$\Rightarrow \; p(y=1 \mid \vec{x}) = \frac{1}{1 + \exp(-\vec{x} \cdot \vec{w} - c)}, \quad p(y=-1 \mid \vec{x}) = \frac{1}{1 + \exp(\vec{x} \cdot \vec{w} + c)}, \quad p(y \mid \vec{x}) = \frac{1}{1 + \exp\left[-y(\vec{x} \cdot \vec{w} + c)\right]}$$

Logistic Regression Model: Parameter Learning
• Assume the inputs and outputs are related through the log-linear function
$$p(y \mid \vec{x}; \theta) = \frac{1}{1 + \exp\left[-y(\vec{x} \cdot \vec{w} + c)\right]}, \qquad \theta = \{w_1, w_2, \ldots, w_d, c\}$$
• Estimate the weights by the MLE approach:
$$\max_{\vec{w}, c} \; l(D_{\text{train}}) = \max_{\vec{w}, c} \sum_{i=1}^{n} \log p(y_i \mid \vec{x}_i; \theta) = \max_{\vec{w}, c} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left[-y_i(\vec{x}_i \cdot \vec{w} + c)\right]}$$
• This is a convex optimization problem.

Example: Heart Disease
• Input feature x: an age-group id
1: 25-29, 2: 30-34, 3: 35-39, 4: 40-44, 5: 45-49, 6: 50-54, 7: 55-59, 8: 60-64
• Output y: having heart disease (+1) or not (-1)
[Figure: bar chart of the number of people with and without heart disease in each age group]
• Logistic regression model:
$$p(y \mid x; \theta) = \frac{1}{1 + \exp\left[-y(xw + c)\right]}, \qquad \theta = \{w, c\}$$
• Learning w and c by the MLE approach means maximizing
$$l(D_{\text{train}}) = \sum_{i=1}^{8} \left[n_+(i) \log p(+ \mid i) + n_-(i) \log p(- \mid i)\right] = \sum_{i=1}^{8} \left[n_+(i) \log \frac{1}{1 + \exp(-iw - c)} + n_-(i) \log \frac{1}{1 + \exp(iw + c)}\right]$$
• Numerical optimization gives w = 0.58, c = -3.34.
• Since
$$p(+ \mid x; \theta) = \frac{1}{1 + \exp(-xw - c)}, \qquad p(- \mid x; \theta) = \frac{1}{1 + \exp(xw + c)},$$
the sign of xw + c determines the prediction:
– xw + c < 0 → p(+|x) < p(-|x)
– xw + c > 0 → p(+|x) > p(-|x)
– xw + c = 0 → the decision boundary
• Here the boundary falls at x* = 5.78, i.e. roughly 53 years old.
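As referenced in the linear-regression slides above, the closed-form solution W = (XᵀX)⁻¹Xᵀy can be written out directly. Below is a minimal NumPy sketch; the feature values and prices are made-up placeholders, not data from the lecture.

```python
import numpy as np

# Illustrative design matrix: column 0 is a constant-1 feature, so its weight
# plays the role of the intercept c in the slide's Price formula.
X = np.array([
    [1.0, 3, 6.5, 120],
    [1.0, 5, 6.1,  95],
    [1.0, 2, 7.0, 140],
    [1.0, 4, 6.8, 110],
    [1.0, 6, 5.9,  90],
])  # columns: bias, Num_Adjectives, Mortgage_Rate, Num_Unsold_Houses (placeholder numbers)
y = np.array([280.0, 305.0, 250.0, 270.0, 310.0])  # observed prices (placeholders)

# Closed-form least-squares solution W = (X^T X)^{-1} X^T y.
# (np.linalg.lstsq(X, y, rcond=None) is the numerically stabler equivalent.)
W = np.linalg.inv(X.T @ X) @ X.T @ y
print("weights:", W)
print("sum-squared error:", np.sum((X @ W - y) ** 2))
```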
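The heart-disease fit is also easy to reproduce in code. The sketch below runs plain gradient ascent on the log-likelihood from the slides; since the slide's per-group counts appear only in its bar chart, the counts here are hypothetical, so the learned w and c will differ from the slide's w = 0.58, c = -3.34 even though the procedure is the same concave maximization.

```python
import numpy as np

x = np.arange(1, 9, dtype=float)             # age-group ids 1..8
n_pos = np.array([0., 1, 1, 2, 4, 5, 6, 7])  # hypothetical counts of y = +1 per group
n_neg = np.array([8., 7, 6, 5, 4, 3, 2, 1])  # hypothetical counts of y = -1 per group

w, c, lr = 0.0, 0.0, 1e-3
for _ in range(50000):                       # plain gradient ascent on l(D_train)
    p = 1.0 / (1.0 + np.exp(-(x * w + c)))   # p(+ | x) for every age group
    # dl/dw = sum_i (n_pos_i - N_i p_i) x_i,  dl/dc = sum_i (n_pos_i - N_i p_i)
    resid = n_pos - (n_pos + n_neg) * p
    w += lr * np.sum(resid * x)
    c += lr * np.sum(resid)

# The decision boundary xw + c = 0 sits at x* = -c/w, as on the slide.
print(f"w = {w:.2f}, c = {c:.2f}, decision boundary x* = {-c / w:.2f}")
```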
Regularization
• Addresses the over-fitting problem.
• Regularized log-likelihood:
$$l_{\text{reg}}(D_{\text{train}}) = l(D_{\text{train}}) - s\|\vec{w}\|^2 = \sum_{i=1}^{N_+} \log p(+ \mid d_i^+) + \sum_{i=1}^{N_-} \log p(- \mid d_i^-) - s \sum_{i=1}^{m} w_i^2$$
• The term s||w||² is called the regularizer:
– Favors small weights
– Prevents weights from becoming too large

How to Extend the Logistic Regression Model to Multiple Classes?
• The model above handles y ∈ {+1, -1}. How do we extend it to y ∈ {1, 2, …, C}?

Conditional Exponential Model
• Introduce a different set of parameters for each class:
$$p(y \mid \vec{x}; \theta) \propto \exp(c_y + \vec{x} \cdot \vec{w}_y), \qquad \theta = \{c_y, \vec{w}_y\}$$
• Normalize to ensure the probabilities sum to 1 (a code sketch appears at the end of this section):
$$p(y \mid \vec{x}; \theta) = \frac{\exp(c_y + \vec{x} \cdot \vec{w}_y)}{Z(\vec{x})}, \qquad Z(\vec{x}) = \sum_{y'} \exp(c_{y'} + \vec{x} \cdot \vec{w}_{y'})$$

MaxEnt: A Simple Example
• Consider a translation example: English 'in' → French {dans, en, à, au-cours-de, pendant}
• Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant)
• Case 1: no prior knowledge about the translation. What is your guess of the probabilities?
– p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5
• Case 2: 30% of the time either dans or en is used. What is your guess now?
– p(dans) = p(en) = 3/20, p(à) = p(au-cours-de) = p(pendant) = 7/30
– The uniform distribution is favored.
• Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used. What is your guess?
– A good probability distribution should satisfy the constraints and be as close to the uniform distribution as possible.

Maximum Entropy (MaxEnt)
• The uniformity of a distribution is measured by its entropy.
• Solution: choose the distribution that maximizes the entropy subject to the given constraints.
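To make the entropy comparison concrete, the sketch below evaluates H(p) = -Σᵢ pᵢ log pᵢ for two distributions that both satisfy the Case 2 constraint p(dans) + p(en) = 0.3; the skewed alternative is a made-up comparison point. The constrained-uniform guess from the slides comes out with the higher entropy, which is exactly the MaxEnt criterion.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i log p_i, in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

# Order: dans, en, à, au-cours-de, pendant. Both satisfy p(dans) + p(en) = 0.3.
uniform_within = [3/20, 3/20, 7/30, 7/30, 7/30]   # the slide's MaxEnt guess
skewed         = [0.25, 0.05, 0.50, 0.10, 0.10]   # made-up alternative, same constraint

print(entropy(uniform_within))   # ~1.59 nats: higher, i.e. more uniform
print(entropy(skewed))           # ~1.30 nats: lower, i.e. less uniform
```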
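Finally, the conditional exponential model above is a softmax over per-class linear scores. Here is a minimal sketch; the class count, feature dimension, and randomly drawn parameters are illustrative placeholders, not values from the lecture.

```python
import numpy as np

num_classes, dim = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(num_classes, dim))   # one weight vector w_y per class
c = rng.normal(size=num_classes)          # one bias c_y per class

def p_y_given_x(x):
    scores = c + W @ x                    # c_y + x . w_y for every class y
    scores -= scores.max()                # shift for numerical stability (Z is unchanged)
    expd = np.exp(scores)
    return expd / expd.sum()              # divide by Z(x) so the probabilities sum to 1

x = rng.normal(size=dim)                  # an illustrative input vector
probs = p_y_given_x(x)
print(probs, probs.sum())                 # a distribution over the classes; sums to 1
```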

