Logistic Regression
Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
February 11, 2008

Required reading: Mitchell draft chapter on "Naïve Bayes and Logistic Regression" (available on the class website).

Naïve Bayes in a Nutshell

Bayes rule:

  P(Y = y_k | X1 … Xn) = P(Y = y_k) P(X1 … Xn | Y = y_k) / Σ_j P(Y = y_j) P(X1 … Xn | Y = y_j)

Assuming conditional independence among the Xi:

  P(Y = y_k | X1 … Xn) = P(Y = y_k) Π_i P(Xi | Y = y_k) / Σ_j P(Y = y_j) Π_i P(Xi | Y = y_j)

So, the classification rule for Xnew = <X1, …, Xn> is:

  Ynew ← argmax_{y_k} P(Y = y_k) Π_i P(Xi_new | Y = y_k)

Logistic Regression

Idea:
• Naïve Bayes implies some functional form for P(Y|X)
• Why not derive the form for P(Y|X), then learn its parameters directly?
• Consider learning f: X → Y, where
  – X is a vector of real-valued features <X1 … Xn>
  – Y is boolean
  – assume all Xi are conditionally independent given Y
  – model P(Xi | Y = y_k) as Gaussian N(μ_ik, σ_i) (class-dependent mean, shared variance)
  – model P(Y) as Bernoulli(π)
• What does that imply about the form of P(Y|X)?

Derive form for P(Y|X) for continuous Xi

Applying Bayes rule and dividing numerator and denominator by P(Y=0) P(X|Y=0), the Gaussian and Bernoulli assumptions make the log-odds linear in X:

  P(Y=1|X) = exp(w0 + Σ_i wi Xi) / (1 + exp(w0 + Σ_i wi Xi))

where

  w0 = ln(π / (1−π)) + Σ_i (μ_i0² − μ_i1²) / (2σ_i²)
  wi = (μ_i1 − μ_i0) / σ_i²

This is the logistic function 1 / (1 + e^(−z)) applied to z = w0 + Σ_i wi Xi.

Very convenient!

  P(Y=1|X) = exp(w0 + Σ_i wi Xi) / (1 + exp(w0 + Σ_i wi Xi))
  implies  P(Y=0|X) = 1 / (1 + exp(w0 + Σ_i wi Xi))
  implies  P(Y=1|X) / P(Y=0|X) = exp(w0 + Σ_i wi Xi)
  implies  ln [ P(Y=1|X) / P(Y=0|X) ] = w0 + Σ_i wi Xi

— a linear classification rule: assign Y=1 exactly when w0 + Σ_i wi Xi > 0.

Logistic regression more generally

Logistic regression in the more general case, where y ∈ {y1 … yR}: learn R−1 sets of weights.

  for k < R:  P(Y = y_k | X) = exp(w_k0 + Σ_i w_ki Xi) / (1 + Σ_{j=1}^{R−1} exp(w_j0 + Σ_i w_ji Xi))
  for k = R:  P(Y = y_R | X) = 1 / (1 + Σ_{j=1}^{R−1} exp(w_j0 + Σ_i w_ji Xi))

Training Logistic Regression: MCLE

• Choose parameters W = <w0, …, wn> to maximize the conditional likelihood of the training data
• Training data D = {<X^1, Y^1>, …, <X^L, Y^L>}
• Data likelihood = Π_l P(X^l, Y^l | W)
• Data conditional likelihood = Π_l P(Y^l | X^l, W)

where the conditional log likelihood is

  l(W) ≡ ln Π_l P(Y^l | X^l, W) = Σ_l ln P(Y^l | X^l, W)

Expressing Conditional Log Likelihood

  l(W) = Σ_l [ Y^l ln P(Y^l=1 | X^l, W) + (1 − Y^l) ln P(Y^l=0 | X^l, W) ]
       = Σ_l [ Y^l (w0 + Σ_i wi Xi^l) − ln(1 + exp(w0 + Σ_i wi Xi^l)) ]

Maximizing Conditional Log Likelihood

Good news: l(W) is a concave function of W.
Bad news: there is no closed-form solution that maximizes l(W).

Maximize Conditional Log Likelihood: Gradient Ascent

  ∂l(W)/∂wi = Σ_l Xi^l ( Y^l − P̂(Y^l=1 | X^l, W) )

Gradient ascent algorithm: iterate until the change in l(W) is < ε.
Repeat: for all i,

  wi ← wi + η Σ_l Xi^l ( Y^l − P̂(Y^l=1 | X^l, W) )

where η is a small positive step size.

That's all for M(C)LE. How about MAP?

• One common approach is to define a prior on W
  – Normal distribution, zero mean, identity covariance
• Helps avoid very large weights and overfitting
• MAP estimate:

  W ← argmax_W [ ln P(W) + Σ_l ln P(Y^l | X^l, W) ]

MLE vs MAP

• Maximum conditional likelihood estimate:

  W_MCLE = argmax_W Σ_l ln P(Y^l | X^l, W)

• Maximum a posteriori estimate (for the zero-mean Gaussian prior, ln P(W) contributes a penalty proportional to −Σ_i wi²):

  W_MAP = argmax_W [ ln P(W) + Σ_l ln P(Y^l | X^l, W) ]
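The MCLE gradient-ascent training described above (with the optional MAP penalty) can be sketched in plain Python. This is a minimal illustration, not the course's code: the function names, the toy data, and the fixed step size eta are all illustrative choices, and lam > 0 stands in for the Gaussian-prior regularizer.

```python
import math

def sigmoid(z):
    # P(Y=1|X) = exp(z)/(1+exp(z)) = 1/(1+exp(-z)), with z = w0 + sum_i wi*xi
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # avoid overflow for very negative z
    return e / (1.0 + e)

def train_logistic(X, Y, eta=0.1, lam=0.0, epochs=2000):
    """Batch gradient ascent on the conditional log likelihood l(W).
    lam > 0 adds the MAP penalty -(lam/2) * sum_i wi^2 (zero-mean Gaussian prior)."""
    n = len(X[0])
    w = [0.0] * (n + 1)      # w[0] is the bias term w0
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for x, y in zip(X, Y):
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            err = y - sigmoid(z)          # Y^l - P_hat(Y^l=1 | X^l, W)
            grad[0] += err
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        for i in range(1, n + 1):
            grad[i] -= lam * w[i]         # prior pulls the weights toward zero
        for i in range(n + 1):
            w[i] += eta * grad[i]         # ascent step
    return w

def predict(w, x):
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if z > 0 else 0              # the linear classification rule
```

On a separable toy set such as X = [[0.0], [1.0], [2.0], [3.0]], Y = [0, 0, 1, 1], the unregularized weights keep growing with more epochs, while a lam > 0 run keeps them bounded — one concrete view of why the prior "helps avoid very large weights."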
Generative vs. Discriminative Classifiers

Training classifiers involves estimating f: X → Y, or P(Y|X).

Generative classifiers (e.g., Naïve Bayes):
• Assume some functional form for P(X|Y), P(Y)
• Estimate parameters of P(X|Y), P(Y) from training data
• Use Bayes rule to calculate P(Y | X = x_i)

Discriminative classifiers (e.g., Logistic Regression):
• Assume some functional form for P(Y|X)
• Estimate parameters of P(Y|X) directly from training data

Naïve Bayes vs Logistic Regression

Consider Y boolean, Xi continuous, X = <X1 … Xn>.

Number of parameters to estimate:
• NB: 4n + 1 (for each of the n features, a mean and a variance per class, plus the prior π)
• LR: n + 1 (the weights <w0, …, wn>)

Estimation method:
• NB parameter estimates are uncoupled — each is a simple count or average computed independently
• LR parameter estimates are coupled — the weights must be optimized jointly

G. Naïve Bayes vs. Logistic Regression

• Generative and Discriminative classifiers
• Asymptotic comparison (as # training examples → infinity):
  – when model assumptions are correct
    • GNB and LR produce identical classifiers
    • in fact, the Bayes optimal classifier
  – when model assumptions are incorrect
    • LR is less biased — it does not assume conditional independence
    • therefore expected to outperform GNB
[Ng & Jordan, 2002]

• Non-asymptotic analysis (see [Ng & Jordan, 2002]):
  – convergence rate of parameter estimates — how many training examples are needed to assure good estimates?
    • GNB: order log n (where n = # of attributes in X)
    • LR: order n
  – GNB converges more quickly to its (perhaps less correct) asymptotic estimates

Rate of convergence: logistic regression

Let h_Dis,m be logistic regression trained on m examples in n dimensions.
Then with high probability:

  ε(h_Dis,m) ≤ ε(h_Dis,∞) + O( √( (n/m) log (m/n) ) )

Implication: if we want ε(h_Dis,m) ≤ ε(h_Dis,∞) + ε0 for some constant ε0, it suffices to pick m of order n examples → LR converges to its asymptotic classifier in order n examples.

(The result follows from Vapnik's structural risk bound, plus the fact that the VC dimension of n-dimensional linear separators is n.) [Ng & Jordan, 2002]

Rate of convergence: naïve Bayes parameters [Ng & Jordan, 2002]

Some experiments from UCI data sets

(Learning-curve figures comparing the two classifiers on UCI benchmark data sets are omitted here.)

What you should know:
• Logistic regression
  – Functional form follows from the Naïve Bayes assumptions
    • for Gaussian Naïve Bayes assuming variance σ_i,k = σ_i
    • for discrete-valued Naïve Bayes too
  – But the training procedure picks parameters without the conditional independence assumption
  – MLE training: pick W to maximize P(Y | X, W)
  – MAP training: pick W to maximize P(W | X, Y)
    • 'regularization'
    • helps reduce overfitting
• Gradient ascent/descent
  – general approach when closed-form solutions are unavailable
• Generative vs. Discriminative classifiers
  – Bias vs. variance tradeoff
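As a companion to the parameter-count comparison above, a Gaussian naïve Bayes fit can be sketched to show why its estimates are uncoupled: every one of the 4n + 1 parameters is a closed-form count or average, with no joint optimization. The helper names and toy data below are illustrative, not from the lecture; note this sketch estimates a separate variance per class per feature (the 4n + 1 count from the slides), whereas the logistic-regression derivation assumed a shared σ_i.

```python
import math

def fit_gnb(X, Y):
    """Gaussian naive Bayes: each parameter is estimated independently.
    For boolean Y and n continuous features this yields 4n + 1 parameters:
    one prior pi, plus a mean and a variance per feature per class."""
    n = len(X[0])
    params = {"pi": sum(Y) / len(Y)}          # Bernoulli parameter P(Y=1)
    for k in (0, 1):
        Xk = [x for x, y in zip(X, Y) if y == k]
        for i in range(n):
            vals = [x[i] for x in Xk]
            mu = sum(vals) / len(vals)                        # class-conditional mean
            var = sum((v - mu) ** 2 for v in vals) / len(vals)  # class-conditional variance
            params[("mu", i, k)] = mu
            params[("var", i, k)] = var
    return params

def gnb_predict(params, x):
    # classification rule: argmax_k P(Y=k) * prod_i P(x_i | Y=k), in log space
    best_k, best_score = None, -math.inf
    for k in (0, 1):
        prior = params["pi"] if k == 1 else 1 - params["pi"]
        score = math.log(prior)
        for i, xi in enumerate(x):
            mu, var = params[("mu", i, k)], params[("var", i, k)]
            score += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

Because each statistic is computed from its own subset of the data, adding a feature never changes the other estimates — the practical meaning of "uncoupled," and the reason GNB's parameters converge quickly.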