Machine Learning 10-701/15-781, Spring 2008
Logistic Regression: generative versus discriminative classifiers
Eric Xing
Lecture 5, January 30, 2006
Reading: Chap. 3.1.3-4 CB

Generative vs. Discriminative Classifiers

- Goal: we wish to learn f: X -> Y, e.g., P(Y|X)
- Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y) and P(Y). This is a 'generative' model of the data!
  - Estimate the parameters of P(X|Y), P(Y) directly from training data
  - Use Bayes rule to calculate P(Y | X = x_i)
- Discriminative classifiers:
  - Directly assume some functional form for P(Y|X). This is a 'discriminative' model of the data!
  - Estimate the parameters of P(Y|X) directly from training data

Consider a Gaussian Generative Classifier

- Learning f: X -> Y, where
  - X is a vector of real-valued features <X_1, ..., X_n>
  - Y is boolean
- What does that imply about the form of P(Y|X)?
- The joint probability of a datum and its label is:

    p(y_n^k = 1, x_n \mid \mu, \sigma) = p(y_n^k = 1 \mid \pi)\, p(x_n \mid y_n^k = 1, \mu, \sigma)
                                       = \pi_k \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(x_n - \mu_k)^2 \right\}

- Given a datum x_n, we predict its label using the conditional probability of the label given the datum:

    p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{ \pi_k \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(x_n - \mu_k)^2 \right\} }{ \sum_{k'} \pi_{k'} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(x_n - \mu_{k'})^2 \right\} }

Naïve Bayes Classifier

- When X is a multivariate Gaussian vector, the joint probability of a datum and its label is:

    p(y_n^k = 1, x_n \mid \mu, \Sigma) = p(y_n^k = 1 \mid \pi)\, p(x_n \mid y_n^k = 1, \mu, \Sigma)
                                       = \pi_k \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x_n - \mu_k)^T \Sigma^{-1} (x_n - \mu_k) \right\}

- The naïve Bayes simplification (features conditionally independent given the class, each with its own 1-D Gaussian):

    p(y_n^k = 1, x_n \mid \mu, \sigma) = p(y_n^k = 1 \mid \pi) \prod_j p(x_{n,j} \mid y_n^k = 1, \mu_{k,j}, \sigma_{k,j})
                                       = \pi_k \prod_j \frac{1}{\sqrt{2\pi\sigma_{k,j}^2}} \exp\left\{ -\frac{1}{2\sigma_{k,j}^2}(x_{n,j} - \mu_{k,j})^2 \right\}

- More generally (graphical model: class node Y_n with children X_{n,1}, ..., X_{n,m}):

    p(y_n, x_n \mid \pi, \eta) = p(y_n \mid \pi) \prod_{j=1}^m p(x_{n,j} \mid y_n, \eta_j)

  where p(. | .) is an arbitrary conditional (discrete or continuous) 1-D density.
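The Gaussian naïve Bayes prediction rule above (prior times class-conditional likelihood, normalized by Bayes rule) can be checked numerically. This is an illustrative NumPy sketch, not code from the lecture; the function name and parameter layout are assumptions:

```python
import numpy as np

def gaussian_nb_posterior(x, pi, mu, sigma2):
    """Posterior p(y^k = 1 | x) for a Gaussian naive Bayes classifier.

    pi:     (K,)   class priors p(y^k = 1)
    mu:     (K, m) per-class, per-feature means mu_{k,j}
    sigma2: (K, m) per-class, per-feature variances sigma^2_{k,j}
    """
    # log p(x | y=k) under the naive (feature-independence) assumption:
    # sum_j log N(x_j | mu_{k,j}, sigma^2_{k,j})
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2, axis=1)
    log_joint = np.log(pi) + log_lik   # log p(y^k = 1, x)
    log_joint -= log_joint.max()       # stabilize the exponentials
    post = np.exp(log_joint)
    return post / post.sum()           # Bayes rule: normalize over k

# Two classes, two features (toy numbers)
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [2.0, 2.0]])
sigma2 = np.ones((2, 2))
print(gaussian_nb_posterior(np.array([2.0, 2.0]), pi, mu, sigma2))
```

Working in log space before exponentiating avoids underflow when many per-feature likelihoods are multiplied together.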
The predictive distribution

- Understanding the predictive distribution:

    p(y_n^1 = 1 \mid x_n, \mu, \Sigma, \pi) = \frac{p(y_n^1 = 1, x_n \mid \mu, \Sigma, \pi)}{p(x_n \mid \mu, \Sigma, \pi)} = \frac{ \pi_1 N(x_n ; \mu_1, \Sigma) }{ \sum_k \pi_k N(x_n ; \mu_k, \Sigma) }    (*)

- Under the naïve Bayes assumption:

    p(y_n^1 = 1 \mid x_n, \mu, \sigma, \pi) = \frac{ \exp\left\{ -\sum_j \frac{(x_{n,j} - \mu_{1,j})^2}{2\sigma_{1,j}^2} - C_1 + \log \pi_1 \right\} }{ \sum_k \exp\left\{ -\sum_j \frac{(x_{n,j} - \mu_{k,j})^2}{2\sigma_{k,j}^2} - C_k + \log \pi_k \right\} }    (**)

  where C_k collects the log-normalizers \sum_j \frac{1}{2}\log 2\pi\sigma_{k,j}^2.

- For two classes (i.e., K = 2), and when the two classes have the same variance, (**) turns out to be the logistic function:

    p(y_n^1 = 1 \mid x_n) = \frac{1}{ 1 + \exp\left\{ -\sum_j \frac{\mu_{1,j} - \mu_{2,j}}{\sigma_j^2} x_{n,j} - \sum_j \frac{\mu_{2,j}^2 - \mu_{1,j}^2}{2\sigma_j^2} - \log\frac{\pi_1}{1 - \pi_1} \right\} } = \frac{1}{1 + e^{-\theta^T x_n}}

  (with x_n augmented by a constant feature so that \theta absorbs the bias term).

The decision boundary

- The predictive distribution:

    p(y_n^1 = 1 \mid x_n) = \frac{1}{ 1 + \exp\left\{ -\sum_{j=1}^M \theta_j x_{n,j} - \theta_0 \right\} } = \frac{1}{1 + e^{-\theta^T x_n}}

- The Bayes decision rule:

    \ln \frac{p(y_n^1 = 1 \mid x_n)}{p(y_n^2 = 1 \mid x_n)} = \ln \left( \frac{ \frac{1}{1 + e^{-\theta^T x_n}} }{ \frac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}} } \right) = \theta^T x_n

- For multiple classes (i.e., K > 2), (*) corresponds to a softmax function:

    p(y_n^k = 1 \mid x_n) = \frac{ e^{-\theta_k^T x_n} }{ \sum_j e^{-\theta_j^T x_n} }

Discussion: Generative and discriminative classifiers

- Generative: modeling the joint distribution of all the data
- Discriminative: modeling only the points at the decision boundary
- How? Regression!
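The claim that the two-class, shared-variance posterior collapses to a logistic function can be verified numerically: the Gaussian-Bayes-rule posterior and the sigmoid of the induced linear function agree exactly. A minimal sketch (illustrative names and numbers, assuming the shared-variance Gaussian naïve Bayes model above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Shared-variance Gaussian naive Bayes, K = 2 (toy parameters)
pi1, pi2 = 0.6, 0.4
mu1 = np.array([1.0, -1.0])
mu2 = np.array([-1.0, 2.0])
s2 = np.array([0.5, 2.0])     # sigma_j^2, shared across the two classes

# Logistic-regression parameters induced by the NB model:
# theta_j = (mu_{1,j} - mu_{2,j}) / sigma_j^2, plus a bias theta_0
theta = (mu1 - mu2) / s2
theta0 = np.sum((mu2**2 - mu1**2) / (2 * s2)) + np.log(pi1 / pi2)

def nb_posterior_class1(x):
    """p(y^1 = 1 | x) computed directly by Bayes rule."""
    l1 = pi1 * np.prod(np.exp(-(x - mu1)**2 / (2*s2)) / np.sqrt(2*np.pi*s2))
    l2 = pi2 * np.prod(np.exp(-(x - mu2)**2 / (2*s2)) / np.sqrt(2*np.pi*s2))
    return l1 / (l1 + l2)

x = np.array([0.3, 0.7])
print(nb_posterior_class1(x), sigmoid(theta @ x + theta0))  # the two agree
```

The shared variance is what makes the quadratic terms in x cancel between numerator and denominator, leaving a linear function of x inside the sigmoid.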
Linear regression

- The data: {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_N, y_N)}
- Both nodes are observed:
  - X is an input vector
  - Y is a response vector (we first consider y as a generic continuous response vector, then consider the special case of classification where y is a discrete indicator)
- A regression scheme can be used to model p(y|x) directly, rather than p(x, y)

Logistic regression (sigmoid classifier)

- The conditional distribution is a Bernoulli:

    p(y \mid x) = \mu(x)^y (1 - \mu(x))^{1-y}

  where \mu is a logistic function:

    \mu(x) = \frac{1}{1 + e^{-\theta^T x}}

- We can use the brute-force gradient method as in linear regression
- But we can also apply generic laws by observing that p(y|x) is an exponential-family distribution, more specifically a generalized linear model (see next lecture!)

Training Logistic Regression: MCLE

- Estimate parameters \theta = <\theta_0, \theta_1, ..., \theta_n> to maximize the conditional likelihood of the training data
- Training data: D = {(x_1, y_1), ..., (x_N, y_N)}
- Data likelihood = \prod_n p(x_n, y_n \mid \theta)
- Data conditional likelihood = \prod_n p(y_n \mid x_n, \theta)

Expressing the Conditional Log-Likelihood

- Recall the logistic function \mu(x) = 1/(1 + e^{-\theta^T x}) and the conditional likelihood p(y \mid x) = \mu(x)^y (1 - \mu(x))^{1-y}, so that

    l(\theta) = \log \prod_n p(y_n \mid x_n, \theta) = \sum_n \left[ y_n \log \mu(x_n) + (1 - y_n) \log(1 - \mu(x_n)) \right]

Maximizing the Conditional Log-Likelihood

- The objective: \theta^* = \arg\max_\theta l(\theta)
- Good news: l(\theta) is a concave function of \theta
- Bad news: there is no closed-form solution that maximizes l(\theta)

Gradient Ascent

- Property of the sigmoid function: for \mu(z) = 1/(1 + e^{-z}), \frac{d\mu}{dz} = \mu(1 - \mu)
- The gradient:

    \frac{\partial l(\theta)}{\partial \theta_i} = \sum_n x_{n,i} \left( y_n - \mu(x_n) \right)

- The gradient ascent algorithm iterates until the change is < \epsilon: for all i, repeat

    \theta_i \leftarrow \theta_i + \eta \sum_n x_{n,i} \left( y_n - \mu(x_n) \right)

How about MAP?

- It is very common to use regularized maximum likelihood.
- One common approach is to define a prior on \theta: a normal distribution with zero mean and identity covariance.
- This helps avoid very large weights and overfitting.
- MAP estimate:

    \theta^{MAP} = \arg\max_\theta \left[ \log p(\theta) + \sum_n \log p(y_n \mid x_n, \theta) \right]

MLE vs MAP

- Maximum conditional likelihood estimate: \theta^{MCLE} = \arg\max_\theta \sum_n \log p(y_n \mid x_n, \theta)
- Maximum a posteriori estimate (Gaussian prior): \theta^{MAP} = \arg\max_\theta \left[ -\frac{\lambda}{2}\|\theta\|^2 + \sum_n \log p(y_n \mid x_n, \theta) \right]

Logistic regression: practical issues

- IRLS takes O(Nd^3) per iteration, where N = number of training cases and d = dimension of input x.
- Quasi-Newton methods, which approximate the Hessian, work faster.
- Conjugate gradient takes O(Nd) per iteration and usually works best in practice.
- Stochastic gradient descent can also be used if N is large; cf. the perceptron rule.
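The gradient-ascent update above, together with the Gaussian prior from the MAP discussion, can be sketched as batch gradient ascent. This is an illustrative sketch, not the lecture's IRLS or conjugate-gradient methods; the function name, learning rate, and iteration count are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, lam=0.0, n_iter=2000):
    """Batch gradient ascent on the conditional log-likelihood l(theta).

    lam > 0 gives the MAP estimate under a zero-mean Gaussian prior,
    i.e. an L2 penalty (lam/2) * ||theta||^2 on the weights.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        mu = sigmoid(X @ theta)              # mu_n = p(y_n = 1 | x_n, theta)
        grad = X.T @ (y - mu) - lam * theta  # gradient of l(theta) (+ prior)
        theta += lr * grad / n               # ascent step (averaged over data)
    return theta

# Toy 1-D problem with an intercept column
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)
theta = fit_logistic(X, y)
acc = np.mean((sigmoid(X @ theta) > 0.5) == y)
print(theta, acc)
```

Because l(theta) is concave, this simple ascent converges to the global maximizer; the fancier methods in the practical-issues slide only get there faster.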
Naïve Bayes vs Logistic Regression

- Consider Y boolean, X continuous, X = <X_1, ..., X_n>
- Number of parameters to estimate:
  - NB: 4n + 1 (a mean and a variance per class per feature, plus the class prior)
  - LR: n + 1
- Estimation method:
  - NB parameter estimates are uncoupled
  - LR parameter estimates are coupled

Naïve Bayes vs Logistic Regression (continued)

- Asymptotic comparison (# training examples -> infinity):
  - When the model assumptions are correct, NB and LR produce identical classifiers
  - When the model assumptions are incorrect, LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB

Naïve Bayes vs Logistic Regression (continued)

- Non-asymptotic analysis (see [Ng & Jordan, 2002]):
  - Convergence rate of parameter estimates: how many training examples are needed to assure good estimates?
    - NB: order log n (where n = # of attributes in X)
    - LR: order n
  - NB converges more quickly to its (perhaps less helpful) asymptotic estimates

Rate of convergence: logistic regression

- Let h_{Dis,m} be logistic regression trained on m examples in n dimensions. Then, with high probability, its error approaches the error of the asymptotic classifier h_{Dis,infinity}.
- Implication: if we want the error of h_{Dis,m} to be within some small constant \epsilon_0 of the asymptotic error, it suffices to pick order n examples. That is, logistic regression converges to its asymptotic classifier in order n examples.
- The result follows from Vapnik's structural risk bound, plus the fact that the VC dimension of an n-dimensional linear classifier is n + 1.
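The "uncoupled" NB estimation contrasted with LR above means each NB parameter is a simple per-class sample statistic, computed in closed form with no iterative optimization. A minimal sketch (illustrative; assumes the Gaussian naïve Bayes setup with boolean Y):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Uncoupled ML estimates for Gaussian naive Bayes with boolean y.

    Each parameter is a per-class sample statistic, so the estimates
    decouple: 4n + 1 numbers for n features (2n means, 2n variances,
    and 1 class prior), versus n + 1 coupled weights for LR.
    """
    params = {"pi": y.mean()}                  # class prior p(y = 1)
    for k in (0, 1):
        Xk = X[y == k]
        params[f"mu{k}"] = Xk.mean(axis=0)     # per-feature means for class k
        params[f"var{k}"] = Xk.var(axis=0)     # per-feature variances
    return params

rng = np.random.default_rng(1)
n = 3
X = np.vstack([rng.normal(0, 1, (50, n)), rng.normal(2, 1, (50, n))])
y = np.repeat([0, 1], 50)
p = fit_gaussian_nb(X, y)
n_params = 2 * n + 2 * n + 1
print(p["pi"], n_params)
```

This closed-form estimation is one reason NB reaches its asymptotic estimates in order log n examples: each 1-D statistic is estimated independently of the others.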