Machine Learning 10-701/15-781, Spring 2008

Logistic Regression (a generative versus discriminative classifier)

Eric Xing
Lecture 5, January 30, 2006
Reading: Chap. 3.1, 3.4, CB

Generative vs. Discriminative Classifiers

- Goal: we wish to learn f: X -> Y, e.g., P(Y|X).
- Generative classifiers (e.g., naive Bayes):
  - Assume some functional form for P(X|Y) and P(Y). This is a generative model of the data.
  - Estimate the parameters of P(X|Y) and P(Y) directly from the training data.
  - Use Bayes rule to calculate P(Y | X = x_i).
- Discriminative classifiers:
  - Directly assume some functional form for P(Y|X). This is a discriminative model of the data.
  - Estimate the parameters of P(Y|X) directly from the training data.

[Figures: generative model $Y_i \to X_i$; discriminative model $X_i \to Y_i$]

Consider a Gaussian Generative Classifier

- Learning f: X -> Y, where
  - X is a vector of real-valued features $\langle X_1, \ldots, X_n \rangle$
  - Y is boolean
- What does that imply about the form of P(Y|X)?
- The joint probability of a datum and its label is
  $$p(x_n, y_n^k = 1) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1) = \pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\Big(-\frac{1}{2\sigma^2}(x_n - \mu_k)^2\Big)$$
- Given a datum $x_n$, we predict its label using the conditional probability of the label given the datum:
  $$p(y_n^k = 1 \mid x_n) = \frac{\pi_k\, (2\pi\sigma^2)^{-1/2} \exp\!\big(-\frac{1}{2\sigma^2}(x_n - \mu_k)^2\big)}{\sum_{k'} \pi_{k'}\, (2\pi\sigma^2)^{-1/2} \exp\!\big(-\frac{1}{2\sigma^2}(x_n - \mu_{k'})^2\big)}$$

Naive Bayes Classifier

- When X is a multivariate Gaussian vector, the joint probability of a datum and its label is
  $$p(\vec{x}_n, y_n^k = 1) = p(y_n^k = 1)\, p(\vec{x}_n \mid y_n^k = 1) = \pi_k \frac{1}{(2\pi)^{m/2}|\Sigma_k|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(\vec{x}_n - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_n - \vec{\mu}_k)\Big)$$
- The naive Bayes simplification:
  $$p(\vec{x}_n, y_n^k = 1) = p(y_n^k = 1) \prod_j p(x_n^j \mid y_n^k = 1) = \pi_k \prod_j \frac{1}{(2\pi\sigma_{k,j}^2)^{1/2}} \exp\!\Big(-\frac{1}{2\sigma_{k,j}^2}(x_n^j - \mu_{k,j})^2\Big)$$
- More generally:
  $$p(x_n, y_n) = p(y_n) \prod_{j=1}^{m} p(x_n^j \mid y_n)$$
  where $p(\cdot \mid \cdot)$ is an arbitrary conditional (discrete or continuous) one-dimensional density.

[Figure: naive Bayes graphical model, $Y_i \to X_{i1}, X_{i2}, \ldots, X_{im}$]

The Predictive Distribution

- Understanding the predictive distribution:
  $$p(y_n^k = 1 \mid \vec{x}_n) = \frac{p(y_n^k = 1, \vec{x}_n)}{p(\vec{x}_n)} = \frac{\pi_k\, N(\vec{x}_n \mid \vec{\mu}_k, \Sigma_k)}{\sum_{k'} \pi_{k'}\, N(\vec{x}_n \mid \vec{\mu}_{k'}, \Sigma_{k'})}$$
- Under the naive Bayes assumption:
  $$p(y_n^k = 1 \mid \vec{x}_n) = \frac{\pi_k \exp\!\Big\{-\sum_j \Big(\frac{1}{2\sigma_{k,j}^2}(x_n^j - \mu_{k,j})^2 + \log \sigma_{k,j}\Big) + C\Big\}}{\sum_{k'} \pi_{k'} \exp\!\Big\{-\sum_j \Big(\frac{1}{2\sigma_{k',j}^2}(x_n^j - \mu_{k',j})^2 + \log \sigma_{k',j}\Big) + C\Big\}}$$
- For two classes (i.e., K = 2), and when the two classes have the same variance, this turns out to be the logistic function:
  $$p(y_n^1 = 1 \mid \vec{x}_n) = \frac{1}{1 + \exp\!\Big\{-\sum_j x_n^j\,\frac{\mu_{1j}-\mu_{2j}}{\sigma_j^2} + \sum_j \frac{\mu_{1j}^2-\mu_{2j}^2}{2\sigma_j^2} + \log\frac{1-\pi_1}{\pi_1}\Big\}} = \frac{1}{1 + e^{-\theta^T x_n}}$$
  (here $x_n$ is augmented with a constant feature $x_n^0 = 1$, so that $\theta_0$ absorbs the terms that do not depend on $x_n$).

The Decision Boundary

- The predictive distribution:
  $$p(y_n^1 = 1 \mid \vec{x}_n) = \frac{1}{1 + \exp\!\big(-\sum_{j=1}^{M}\theta_j x_n^j - \theta_0\big)} = \frac{1}{1 + e^{-\theta^T x_n}}$$
- The Bayes decision rule:
  $$\ln\frac{p(y_n^1 = 1 \mid \vec{x}_n)}{p(y_n^2 = 1 \mid \vec{x}_n)} = \ln\frac{\dfrac{1}{1 + e^{-\theta^T x_n}}}{\dfrac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}}} = \theta^T x_n$$
- For multiple classes (i.e., K > 2), this corresponds to a softmax function:
  $$p(y_n^k = 1 \mid x_n) = \frac{e^{\theta_k^T x_n}}{\sum_{j} e^{\theta_j^T x_n}}$$

Discussion: Generative and Discriminative Classifiers

- Generative: modeling the joint distribution of all the data.
- Discriminative: modeling only the points at the decision boundary.
- How? Regression!

Linear Regression

- The data: $\{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_N, y_N)\}$
- Both nodes are observed:
  - X is an input vector.
  - Y is a response vector (we first consider y as a generic continuous response vector; then we consider the special case of classification, where y is a discrete indicator).
- A regression scheme can be used to model $p(y \mid x)$ directly, rather than $p(x, y)$.

[Figure: graphical model $X_i \to Y_i$, plate over $N$]

Logistic Regression (sigmoid classifier)

- The conditional distribution is a Bernoulli:
  $$p(y \mid x) = \mu(x)^y\,(1 - \mu(x))^{1-y}$$
  where $\mu$ is a logistic function: $\mu(x) = \dfrac{1}{1 + e^{-\theta^T x}}$.
- We can use the brute-force gradient method as in linear regression.
- But we can also apply generic laws by observing that $p(y \mid x)$ is an exponential-family function, more specifically, a generalized linear model (see next lecture!).

Training Logistic Regression: MCLE

- Estimate the parameters $\theta = \langle \theta_0, \theta_1, \ldots, \theta_n \rangle$ to maximize the conditional likelihood of the training data.
- Training data: $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$
- Data likelihood $= \prod_n p(x_n, y_n)$
- Data conditional likelihood $= \prod_n p(y_n \mid x_n)$

Expressing the Conditional Log Likelihood

- Recall the logistic function $\mu(x) = \frac{1}{1 + e^{-\theta^T x}}$ and the conditional likelihood; taking logs gives
  $$l(\theta) = \sum_n \big[\, y_n \log \mu(x_n) + (1 - y_n)\log\big(1 - \mu(x_n)\big) \,\big]$$
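Before turning to maximizing the conditional likelihood, the derivation above (two-class Gaussian naive Bayes with shared per-feature variances inducing a logistic posterior) can be checked numerically. The sketch below is not from the lecture: all parameter values, the function names (`posterior_by_bayes_rule`, `posterior_by_sigmoid`), and the random data are invented for illustration.

```python
# Minimal numerical check (illustrative, not from the lecture): a two-class
# Gaussian naive Bayes model with shared per-feature variances yields a
# logistic (sigmoid) posterior p(y=1 | x) = 1 / (1 + exp(-theta^T x - theta0)).
import numpy as np

rng = np.random.default_rng(0)

n_features = 3
pi1 = 0.4                                   # class prior p(y=1); made-up value
mu1 = rng.normal(size=n_features)           # class-1 means (made up)
mu2 = rng.normal(size=n_features)           # class-2 means (made up)
sigma2 = rng.uniform(0.5, 2.0, n_features)  # shared per-feature variances

def log_gaussian(x, mu, var):
    """Sum of per-feature Gaussian log densities (naive Bayes factorization)."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

def posterior_by_bayes_rule(x):
    """p(y=1 | x) computed directly from the generative model via Bayes rule."""
    log_p1 = np.log(pi1) + log_gaussian(x, mu1, sigma2)
    log_p2 = np.log(1 - pi1) + log_gaussian(x, mu2, sigma2)
    return 1.0 / (1.0 + np.exp(log_p2 - log_p1))

# Logistic-regression parameters implied by the derivation:
#   theta_j = (mu1_j - mu2_j) / sigma_j^2
#   theta_0 = log(pi1 / (1 - pi1)) + sum_j (mu2_j^2 - mu1_j^2) / (2 sigma_j^2)
theta = (mu1 - mu2) / sigma2
theta0 = np.log(pi1 / (1 - pi1)) + np.sum((mu2 ** 2 - mu1 ** 2) / (2 * sigma2))

def posterior_by_sigmoid(x):
    """p(y=1 | x) computed as a logistic function of a linear score."""
    return 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))

x = rng.normal(size=n_features)
print(posterior_by_bayes_rule(x), posterior_by_sigmoid(x))  # the two agree
```

The point of the check is the one made in the slides: the generative (Gaussian naive Bayes) and discriminative (logistic regression) models share the same functional form for $p(y \mid x)$; they differ in how the parameters are estimated.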
Maximizing the Conditional Log Likelihood

- The objective: maximize $l(\theta) = \sum_n \big[ y_n \log \mu(x_n) + (1 - y_n)\log(1 - \mu(x_n)) \big]$ with respect to $\theta$.
- Good news: $l(\theta)$ is a concave function of $\theta$.
- Bad news: there is no closed-form solution for the maximizer of $l(\theta)$.

Gradient Ascent

- Property of the sigmoid function: for $\mu(a) = \frac{1}{1+e^{-a}}$, we have $\frac{d\mu}{da} = \mu(1-\mu)$.
- The gradient:
  $$\frac{\partial l(\theta)}{\partial \theta_i} = \sum_n \big(y_n - \mu(x_n)\big)\, x_n^i$$
- The gradient ascent algorithm: iterate until the change falls below a threshold; for all $i$, repeat
  $$\theta_i \leftarrow \theta_i + \eta \sum_n \big(y_n - \mu(x_n)\big)\, x_n^i$$

How About MAP?

- It is very common to use regularized maximum likelihood.
- One common approach is to define a prior on $\theta$: a normal distribution with zero mean and identity covariance.
- This helps avoid very large weights and overfitting.
- MAP estimate:
  $$\theta^* = \arg\max_\theta \Big[\log p(\theta) + \sum_n \log p(y_n \mid x_n, \theta)\Big]$$

MLE vs. MAP

- Maximum conditional likelihood estimate: $\theta^* = \arg\max_\theta \sum_n \log p(y_n \mid x_n, \theta)$
- Maximum a posteriori estimate: $\theta^* = \arg\max_\theta \big[\log p(\theta) + \sum_n \log p(y_n \mid x_n, \theta)\big]$

Logistic Regression: Practical Issues

- IRLS takes $O(Nd^3)$ per iteration, where N = number of training cases and d = dimension of the input x.
- Quasi-Newton methods that approximate the Hessian work faster.
- Conjugate gradient takes $O(Nd)$ per iteration, and usually works best in practice.
- Stochastic gradient descent can also be used if N is large (cf. the perceptron rule).

Naive Bayes vs. Logistic Regression

- Consider Y boolean and X continuous, $X = \langle X_1, \ldots, X_n \rangle$.
- Number of parameters to estimate: NB vs. LR.
- Estimation method:
  - NB parameter estimates are uncoupled.
  - LR parameter estimates are coupled.

Naive Bayes vs. Logistic Regression

- Asymptotic comparison (number of training examples -> infinity):
  - When the model assumptions are correct, NB and LR produce identical classifiers.
  - When the model assumptions are incorrect, LR is less biased (it does not assume conditional independence), and is therefore expected to outperform NB.

Naive Bayes vs. Logistic Regression

- Non-asymptotic analysis (see [Ng & Jordan, 2002]):
  - Convergence rate of the parameter estimates (how many training examples are needed to assure good estimates?):
    - NB: order log n (where n = number of attributes in X)
    - LR: order n
  - NB converges more quickly to its (perhaps less helpful) asymptotic estimates.

Rate of Convergence: Logistic Regression

- Let $h_{Dis,m}$ be logistic regression trained on m examples in n dimensions. Then, with high probability, its error is within a small additive term of the error of its asymptotic (infinite-data) classifier.
- Implication: if we want the error to be within $\epsilon_0$ of the asymptotic error, for some small constant $\epsilon_0 > 0$, it suffices to pick on the order of n examples.
- That is, LR converges to its asymptotic classifier in order n examples.
- The result follows from Vapnik's structural …
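To make the gradient-ascent and MAP ideas above concrete, here is a minimal sketch, not from the lecture: the synthetic data, the function names (`sigmoid`, `fit_logistic`), the learning rate, the iteration count, and the regularization strength `lam` are all invented for illustration. Setting `lam = 0` gives the MCLE; `lam > 0` corresponds to a zero-mean Gaussian prior on the weights (the MAP estimate).

```python
# Batch gradient ascent on the logistic-regression conditional log likelihood,
# optionally penalized by a zero-mean Gaussian prior on theta (MAP).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, lr=0.1, n_iters=2000, lam=0.0):
    """Maximize sum_n [y_n log mu_n + (1 - y_n) log(1 - mu_n)] - lam/2 * ||theta||^2.

    lam = 0 gives maximum conditional likelihood (MCLE);
    lam > 0 gives the MAP estimate under a zero-mean Gaussian prior.
    """
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])      # prepend a constant bias feature
    theta = np.zeros(d + 1)
    for _ in range(n_iters):
        mu = sigmoid(Xb @ theta)              # predicted p(y=1 | x_n)
        grad = Xb.T @ (y - mu) - lam * theta  # gradient of the penalized log likelihood
        theta += lr * grad / N                # gradient ascent step
    return theta

# Synthetic two-class data for a quick sanity check (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
theta_mle = fit_logistic(X, y)            # maximum conditional likelihood
theta_map = fit_logistic(X, y, lam=1.0)   # regularized (MAP) estimate
print(theta_mle, theta_map)
```

Note that this uses plain batch gradient ascent, the simplest of the optimizers listed under practical issues; IRLS, quasi-Newton, conjugate gradient, or stochastic gradient descent would typically converge faster on real problems.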