Logistic Regression
Machine Learning – 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
September 28th, 2009
© Carlos Guestrin 2005–2009

Logistic Regression
- Learn P(Y|X) directly: assume a particular functional form, the logistic function (or sigmoid) applied to a linear function of the data:
      P(Y=1 | x, w) = 1 / (1 + exp(-(w_0 + Σ_i w_i x_i)))
- Features X_i can be discrete or continuous.

Logistic Regression is a Linear Classifier
[Figure: the sigmoid 1/(1 + e^(-z)) plotted for z from -6 to 6; it rises smoothly from 0 to 1 and crosses 0.5 at z = 0.]
- Predict Y=1 whenever P(Y=1 | x, w) ≥ 0.5, i.e., whenever w_0 + Σ_i w_i x_i ≥ 0, so the decision boundary is the hyperplane w_0 + Σ_i w_i x_i = 0.

Logistic Regression for More Than 2 Classes
- In the more general case, where Y ∈ {y_1, ..., y_R}:
      P(Y = y_k | x, w) = exp(w_k0 + Σ_i w_ki x_i) / (1 + Σ_{j<R} exp(w_j0 + Σ_i w_ji x_i))   for k < R
      P(Y = y_R | x, w) = 1 / (1 + Σ_{j<R} exp(w_j0 + Σ_i w_ji x_i))   (normalization, so no weights for this class)
- Features can be discrete or continuous.

Loss Functions: Likelihood vs. Conditional Likelihood
- Generative models (e.g., Naïve Bayes) use the data likelihood Π_j P(x^j, y^j | w) as the loss function.
- Discriminative models cannot compute P(x^j | w). Instead, the discriminative (logistic regression) loss function is the conditional data likelihood:
      Π_j P(y^j | x^j, w)
- This doesn't waste effort learning P(X); it focuses on P(Y|X), which is all that matters for classification.

Expressing the Conditional Log Likelihood
- For binary Y ∈ {0, 1}:
      l(w) = Σ_j [ y^j ln P(Y=1 | x^j, w) + (1 − y^j) ln P(Y=0 | x^j, w) ]

Maximizing the Conditional Log Likelihood
- Good news: l(w) is a concave function of w, so there are no locally optimal (non-global) solutions.
- Bad news: there is no closed-form solution that maximizes l(w).
- Good news: concave functions are easy to optimize.

Optimizing a Concave Function – Gradient Ascent
- The conditional likelihood for logistic regression is concave, so we can find the optimum with gradient ascent.
- Gradient ascent is the simplest of optimization approaches; conjugate gradient ascent, for example, is much better in practice (see the reading).
- Gradient: ∇_w l(w) = [ ∂l(w)/∂w_0, ..., ∂l(w)/∂w_n ]
- Update rule, with learning rate η > 0:
      w_i ← w_i + η ∂l(w)/∂w_i

Gradient Ascent for Logistic Regression
- Differentiating the conditional log likelihood gives, for each weight (treating x_0^j ≡ 1 for the bias w_0):
      ∂l(w)/∂w_i = Σ_j x_i^j ( y^j − P(Y=1 | x^j, w) )
- Gradient ascent algorithm: iterate until the change is less than ε; on each iteration, for i = 0, ..., n, repeat:
      w_i ← w_i + η Σ_j x_i^j ( y^j − P(Y=1 | x^j, w) )
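The update above translates directly into code. What follows is a minimal sketch (not from the slides) of binary logistic regression trained with batch gradient ascent on the conditional log likelihood; the names fit_logistic_regression and predict, the learning rate eta, the tolerance eps, and the per-example averaging of the gradient are all illustrative choices.

    import numpy as np

    def sigmoid(z):
        # Logistic function: maps any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_regression(X, y, eta=0.1, eps=1e-6, max_iters=10000):
        """Batch gradient ascent on the conditional log likelihood l(w).

        X: (m, n) array of real-valued features; y: (m,) array of 0/1 labels.
        Returns a weight vector of length n + 1 whose first entry is the bias w_0.
        """
        m, n = X.shape
        Xb = np.hstack([np.ones((m, 1)), X])      # x_0 = 1, so w_0 is updated like any other weight
        w = np.zeros(n + 1)
        for _ in range(max_iters):
            p = sigmoid(Xb @ w)                   # P(Y=1 | x^j, w) for every training example
            grad = Xb.T @ (y - p)                 # dl/dw_i = sum_j x_i^j (y^j - P(Y=1 | x^j, w))
            w_new = w + eta * grad / m            # ascent step; dividing by m just rescales eta
            if np.max(np.abs(w_new - w)) < eps:   # stop once the weights barely change
                return w_new
            w = w_new
        return w

    def predict(w, X):
        # Linear classifier: predict Y=1 exactly when w_0 + w·x >= 0
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        return (Xb @ w >= 0).astype(int)

Note that nothing in this sketch keeps the weights small; the MAP view below adds exactly that.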
That's All M(C)LE — How About MAP?
- One common approach is to define a prior on w: a Normal distribution with zero mean and (scaled) identity covariance, p(w) = N(0, σ²I).
- This "pushes" the parameters toward zero and corresponds to regularization; it helps avoid very large weights and overfitting. More on this later in the semester.
- MAP estimate:
      w* = arg max_w [ ln p(w) + Σ_j ln P(y^j | x^j, w) ]

M(C)AP as Regularization
- With the zero-mean Gaussian prior, ln p(w) contributes −(λ/2) Σ_i w_i² (up to a constant), with λ = 1/σ². The objective therefore penalizes high weights; the same penalty is also applicable in linear regression.

Large Parameters → Overfitting
- If the data are linearly separable, the MLE weights go to infinity: scaling w up makes the sigmoid steeper and keeps increasing the conditional likelihood.
- This leads to overfitting. Penalizing high weights can prevent it — again, more on this later in the semester.

Gradient of M(C)AP
- The penalty term simply adds −λ w_i to each component of the gradient:
      ∂/∂w_i [ ln p(w) + l(w) ] = −λ w_i + Σ_j x_i^j ( y^j − P(Y=1 | x^j, w) )

MLE vs. MAP
- Maximum conditional likelihood estimate:
      w_i ← w_i + η Σ_j x_i^j ( y^j − P(Y=1 | x^j, w) )
- Maximum conditional a posteriori estimate:
      w_i ← w_i + η [ −λ w_i + Σ_j x_i^j ( y^j − P(Y=1 | x^j, w) ) ]

Logistic Regression vs. Naïve Bayes
- Consider learning f: X → Y, where X is a vector of real-valued features <X_1, ..., X_n> and Y is boolean.
- We could use a Gaussian Naïve Bayes (GNB) classifier:
  - assume all X_i are conditionally independent given Y,
  - model P(X_i | Y = y_k) as a Gaussian N(μ_ik, σ_i),
  - model P(Y) as Bernoulli(θ, 1−θ).
- What does that imply about the form of P(Y|X)? Cool!

Deriving the Form of P(Y|X) for Continuous X_i
- Writing P(Y=1|X) in terms of the ratio of class-conditional probabilities P(X|Y=1)P(Y=1) / P(X|Y=0)P(Y=0) and substituting the Gaussian densities, the quadratic terms cancel (because σ_i does not depend on the class), and P(Y=1|X) reduces to a sigmoid of a linear function of X — exactly the logistic regression form.

Gaussian Naïve Bayes vs. Logistic Regression
- Representation equivalence — but only in a special case: GNB with class-independent variances (feature variance independent of the class label).
- So what is the difference? The loss function. LR makes no assumptions about P(X|Y) in learning; the two methods optimize different objectives and so obtain different solutions.
- Every GNB classifier of this form corresponds to some setting of the logistic regression parameters, but not every setting of the LR parameters arises from a GNB model.

Naïve Bayes vs. Logistic Regression
- Consider Y boolean and X = <X_1, ..., X_n> continuous.
- Number of parameters: NB has 4n + 1 (a mean and a variance for each feature under each class, plus the class prior); LR has n + 1.
- Estimation method: NB parameter estimates are uncoupled (each is estimated independently in closed form); LR parameter estimates are coupled (the weights are fit jointly).

G. Naïve Bayes vs. Logistic Regression 1 [Ng & Jordan, 2002]
- Generative and discriminative classifiers: asymptotic comparison (# training examples → infinity).
- When the model is correct, GNB (with class-independent variances) and LR produce identical classifiers.
- When the model is incorrect, LR is less biased — it does not assume conditional independence — and is therefore expected to outperform GNB.

G. Naïve Bayes vs. Logistic Regression 2 [Ng & Jordan, 2002]
- Non-asymptotic analysis: convergence rate of the parameter estimates, with n = number of attributes in X.
- Size of training data needed to get close to the infinite-data solution: GNB needs O(log n) samples; LR needs O(n) samples.
- GNB converges more quickly to its (perhaps less helpful) asymptotic estimates.

Some Experiments from UCI Data Sets
[Figure: learning curves comparing Naïve Bayes and logistic regression on several UCI data sets.]

What You Should Know About Logistic Regression (LR)
- Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function.
- In general, NB and LR make different assumptions: NB assumes the features are independent given the class (an assumption on P(X|Y)); LR assumes a functional form for P(Y|X) and makes no assumption on P(X|Y).
- LR is a linear classifier: the decision rule is a hyperplane.
- LR is optimized by conditional likelihood: there is no closed-form solution, but the objective is concave, so gradient ascent reaches the global optimum.
- Maximum conditional a posteriori estimation corresponds to regularization.
- Convergence rates: GNB (usually) needs less data; LR (usually) gets to better solutions in the limit.
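As a rough illustration of the last two points, the sketch below fits both a Gaussian Naïve Bayes model with class-independent (pooled) variances and logistic regression on synthetic two-class data, comparing test accuracy as the training set grows. It reuses fit_logistic_regression and predict from the earlier sketch; the data-generating process, feature count, and sample sizes are illustrative assumptions, not results from the slides or from Ng & Jordan.

    import numpy as np

    def fit_gnb(X, y):
        """Gaussian Naive Bayes with class-independent variances.

        Returns (theta, means, var): P(Y=1), per-class feature means of shape (2, n),
        and one shared per-feature variance vector of shape (n,)."""
        theta = y.mean()
        means = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
        # Class-independent variance: pool squared deviations across both classes
        var = np.mean((X - means[y]) ** 2, axis=0)
        return theta, means, var

    def predict_gnb(params, X):
        theta, means, var = params
        # Log class prior plus the sum of per-feature Gaussian log densities
        def log_joint(c, prior):
            return np.log(prior) - 0.5 * np.sum((X - means[c]) ** 2 / var
                                                + np.log(2 * np.pi * var), axis=1)
        return (log_joint(1, theta) > log_joint(0, 1 - theta)).astype(int)

    rng = np.random.default_rng(0)
    n_features = 20

    def sample(m):
        # Balanced labels keep both classes present even for tiny training sets
        y = np.tile([0, 1], m // 2)
        X = rng.normal(loc=y[:, None] * 0.5, scale=1.0, size=(len(y), n_features))
        return X, y

    X_test, y_test = sample(5000)
    for m in (10, 50, 200, 1000):
        X_tr, y_tr = sample(m)
        gnb_acc = np.mean(predict_gnb(fit_gnb(X_tr, y_tr), X_test) == y_test)
        w = fit_logistic_regression(X_tr, y_tr)        # from the earlier sketch
        lr_acc = np.mean(predict(w, X_test) == y_test)
        print(f"m={m:5d}  GNB accuracy={gnb_acc:.3f}  LR accuracy={lr_acc:.3f}")

Under a setup like this one would typically see GNB do relatively better at the smallest training sizes and LR catch up or pass it as m grows, mirroring the discussion above; the exact numbers depend entirely on the assumed data-generating process.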