Logistic Regression
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
September 24th, 2007
© Carlos Guestrin 2005-2007

Generative v. Discriminative classifiers – Intuition

Want to learn h: X → Y
- X – features
- Y – target classes
Bayes optimal classifier – P(Y|X)

Generative classifier, e.g., Naïve Bayes:
- Assume some functional form for P(X|Y), P(Y)
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y|X=x)
- This is a 'generative' model
  - Indirect computation of P(Y|X) through Bayes rule
  - But it can generate a sample of the data: P(X) = Σ_y P(y) P(X|y)

Discriminative classifiers, e.g., Logistic Regression:
- Assume some functional form for P(Y|X)
- Estimate parameters of P(Y|X) directly from training data
- This is the 'discriminative' model
  - Directly learn P(Y|X)
  - But cannot obtain a sample of the data, because P(X) is not available

Logistic Regression

Logistic function (or sigmoid): learn P(Y|X) directly!
- Assume a particular functional form: a sigmoid applied to a linear function of the data,
    P(Y = 1 | X) = 1 / (1 + exp(−(w0 + Σ_i wi Xi)))
- Features can be discrete or continuous!

Understanding the sigmoid

[Sigmoid plots for (w0=0, w1=−1), (w0=−2, w1=−1), and (w0=0, w1=−0.5), showing the effect of shifting w0 and scaling w1.]

Logistic Regression – a Linear classifier

[Sigmoid plot; predictions threshold P(Y=1|X) at 0.5.]

Very convenient!

P(Y=1|X) > P(Y=0|X)
  implies P(Y=1|X) / P(Y=0|X) > 1
  implies exp(w0 + Σ_i wi Xi) > 1
  implies w0 + Σ_i wi Xi > 0
… a linear classification rule!

What if we have continuous Xi?

E.g., character recognition: Xi is the i-th pixel.
Gaussian Naïve Bayes (GNB): model each class-conditional as a Gaussian,
    P(Xi | Y = yk) = N(µik, σik)
Sometimes assume the variance is
- independent of Y (i.e., σi),
- independent of Xi (i.e., σk),
- or both (i.e., σ).

Example: GNB for classifying mental states (fMRI)

- ~1 mm resolution
- ~2 images per sec.
- 15,000 voxels/image
- non-invasive, safe
- measures the Blood Oxygen Level Dependent (BOLD) response; typical impulse response ~10 sec
[Mitchell et al.]

Learned Bayes Models – Means for P(BrainActivity | WordCategory)

[Images of the learned mean activations for animal words and for people words.]
Pairwise classification accuracy: 85% [Mitchell et al.]

Logistic regression v. Naïve Bayes

Consider learning f: X → Y, where
- X is a vector of real-valued features <X1 ... Xn>
- Y is boolean
Could use a Gaussian Naïve Bayes classifier:
- assume all Xi are conditionally independent given Y
- model P(Xi | Y = yk) as Gaussian N(µik, σi)
- model P(Y) as Bernoulli(θ, 1−θ)
What does that imply about the form of P(Y|X)? Cool!!!!

Derive form for P(Y|X) for continuous Xi
[Derivation slide: apply Bayes rule to express P(Y=1|X) in terms of the class-conditionals.]

Ratio of class-conditional probabilities
[Derivation slide: with class-independent variances σi, the log ratio ln[P(Y=1|X)/P(Y=0|X)] is linear in X.]

Derive form for P(Y|X) for continuous Xi
[Derivation slide: substituting back gives P(Y=1|X) in exactly the sigmoid-of-linear form.]

Gaussian Naïve Bayes v. Logistic Regression

- Representation equivalence, but only in a special case!!! (GNB with class-independent variances)
- But what's the difference???
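The representation equivalence just stated can be checked numerically: with class-independent variances, the GNB posterior P(Y=1|X) computed by Bayes rule coincides exactly with a sigmoid of a linear function of X, with weights read off from the derivation. A minimal sketch; all parameter values below (theta, mu0, mu1, sigma) are made up for illustration, and the standard sigmoid convention is assumed:

```python
import math

# Hypothetical GNB parameters for n=2 features, class-independent variances
theta = 0.6                            # P(Y=1)
mu0, mu1 = [0.0, 1.0], [2.0, -1.0]     # per-feature means for Y=0 and Y=1
sigma = [1.0, 2.0]                     # per-feature std devs, shared across classes

def gauss(x, mu, s):
    """Gaussian density N(mu, s^2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def gnb_posterior(x):
    """P(Y=1|x) via Bayes rule from the generative model."""
    p1 = theta * math.prod(gauss(xi, m, s) for xi, m, s in zip(x, mu1, sigma))
    p0 = (1 - theta) * math.prod(gauss(xi, m, s) for xi, m, s in zip(x, mu0, sigma))
    return p1 / (p0 + p1)

# Logistic-regression weights implied by the derivation:
#   w_i = (mu_i1 - mu_i0) / sigma_i^2
#   w_0 = ln(theta/(1-theta)) + sum_i (mu_i0^2 - mu_i1^2) / (2 sigma_i^2)
w0 = math.log(theta / (1 - theta)) + sum(
    (m0 * m0 - m1 * m1) / (2 * s * s) for m0, m1, s in zip(mu0, mu1, sigma))
w = [(m1 - m0) / (s * s) for m0, m1, s in zip(mu0, mu1, sigma)]

def lr_posterior(x):
    """P(Y=1|x) as a sigmoid of the linear function w0 + sum_i w_i x_i."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

For any input x, `gnb_posterior(x)` and `lr_posterior(x)` agree to floating-point precision, which is the representation equivalence in the special case of class-independent variances.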
- LR makes no assumptions about P(X|Y) in learning!!!
- Loss function!!! They optimize different functions, so they obtain different solutions
[Venn diagram: the set of Gaussian Naïve Bayes parameters (feature variance independent of class label) sits inside the set of Logistic Regression parameters.]

Logistic regression for more than 2 classes

Logistic regression in the more general case, where Y ∈ {Y1 ... YR}: learn R−1 sets of weights.

for k < R:
    P(Y = yk | X) = exp(wk0 + Σ_i wki Xi) / (1 + Σ_{j<R} exp(wj0 + Σ_i wji Xi))
for k = R (normalization, so no weights for this class):
    P(Y = yR | X) = 1 / (1 + Σ_{j<R} exp(wj0 + Σ_i wji Xi))
Features can be discrete or continuous!

Loss functions: Likelihood v. Conditional Likelihood

- Generative (Naïve Bayes) loss function: the data likelihood
    ln P(D | w) = Σ_j ln P(x^j, y^j | w)
- Discriminative models cannot compute P(x^j | w)!
- But the discriminative (logistic regression) loss function is the conditional data likelihood
    ln P(D_Y | D_X, w) = Σ_j ln P(y^j | x^j, w)
- It doesn't waste effort learning P(X); it focuses on P(Y|X), all that matters for classification

Expressing Conditional Log Likelihood

l(w) = Σ_j [ y^j ln P(Y=1 | x^j, w) + (1 − y^j) ln P(Y=0 | x^j, w) ]
     = Σ_j [ y^j (w0 + Σ_i wi xi^j) − ln(1 + exp(w0 + Σ_i wi xi^j)) ]

Maximizing Conditional Log Likelihood

- Good news: l(w) is a concave function of w, so there are no locally optimal solutions
- Bad news: there is no closed-form solution that maximizes l(w)
- Good news: concave functions are easy to optimize

Optimizing a concave function – Gradient ascent

- The conditional likelihood for logistic regression is concave, so we can find the optimum with gradient ascent
- Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading)

Gradient: ∇_w l(w) = [∂l(w)/∂w0, …, ∂l(w)/∂wn]
Update rule, with learning rate η > 0: w ← w + η ∇_w l(w)

Maximize Conditional Log Likelihood: Gradient ascent

∂l(w)/∂wi = Σ_j xi^j ( y^j − P̂(Y=1 | x^j, w) )

Gradient Ascent for LR

Gradient ascent algorithm: iterate until change < ε
    w0 ← w0 + η Σ_j ( y^j − P̂(Y=1 | x^j, w) )
    for i = 1 … n:
        wi ← wi + η Σ_j xi^j ( y^j − P̂(Y=1 | x^j, w) )
repeat
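The gradient ascent algorithm above can be turned into a short training loop. A minimal sketch of batch gradient ascent on the conditional log likelihood; the toy data, step size η, and fixed iteration count are illustrative choices, not the conjugate-gradient method the slides recommend:

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def train_lr(xs, ys, eta=0.1, iters=2000):
    """Batch gradient ascent: w <- w + eta * grad l(w), where
    grad_i l(w) = sum_j x_i^j (y^j - P(Y=1|x^j, w))."""
    n = len(xs[0])
    w0, w = 0.0, [0.0] * n
    for _ in range(iters):
        g0, g = 0.0, [0.0] * n
        for x, y in zip(xs, ys):
            # y^j - P(Y=1|x^j, w): the per-example gradient term
            err = y - sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))
            g0 += err
            for i, xi in enumerate(x):
                g[i] += err * xi
        w0 += eta * g0                 # ascent step on the bias
        w = [wi + eta * gi for wi, gi in zip(w, g)]
    return w0, w

# Toy 1-D data: the label flips from 0 to 1 between x=1 and x=2
xs, ys = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
w0, w = train_lr(xs, ys)
```

Because the objective is concave, this loop approaches the (unique) optimum regardless of initialization; on the separable toy data the learned boundary sits between x=1 and x=2.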
One common approach is to define priors on w:
- a normal distribution, zero mean, identity covariance
- "pushes" parameters towards zero
- corresponds to regularization
- helps avoid very large weights and overfitting
- more on this later in the semester

MAP estimate: w* = arg max_w ln [ p(w) Π_j P(y^j | x^j, w) ]

M(C)AP as Regularization

The zero-mean Gaussian prior contributes −(λ/2) Σ_i wi² to the log objective: it penalizes high weights. The same penalty is also applicable in linear regression.

Gradient of M(C)AP

∂/∂wi ln [ p(w) Π_j P(y^j | x^j, w) ] = −λ wi + Σ_j xi^j ( y^j − P̂(Y=1 | x^j, w) )

MLE vs MAP

- Maximum conditional likelihood estimate:
    wi ← wi + η Σ_j xi^j ( y^j − P̂(Y=1 | x^j, w) )
- Maximum conditional a posteriori estimate:
    wi ← wi + η [ −λ wi + Σ_j xi^j ( y^j − P̂(Y=1 | x^j, w) ) ]

Naïve Bayes vs Logistic Regression

Consider Y boolean, Xi continuous, X = <X1 ... Xn>.
Number of parameters:
- NB: 4n + 1 (a mean and a variance per feature per class, plus the class prior)
- LR: n + 1
Estimation method:
- NB parameter estimates are uncoupled
- LR parameter estimates are coupled

G. Naïve Bayes vs. Logistic Regression 1

Generative and discriminative classifiers: asymptotic comparison (# training examples → ∞)
- when the model is correct: GNB and LR produce identical classifiers
- when the model is incorrect: LR is less biased (it does not assume conditional independence), so LR is expected to outperform GNB
[Ng & Jordan, 2002]

G. Naïve Bayes vs. Logistic Regression 2

Generative and discriminative classifiers: non-asymptotic analysis, convergence rate of
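The MAP update differs from the MLE one only by the −λ wi term contributed by the Gaussian prior. A sketch of a single regularized gradient step; the value of λ and the choice to leave the bias w0 unpenalized are illustrative assumptions:

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def map_step(w0, w, xs, ys, eta=0.1, lam=1.0):
    """One gradient step on the penalized conditional log likelihood:
    w_i <- w_i + eta * (-lam * w_i + sum_j x_i^j (y^j - P(Y=1|x^j, w))).
    The bias w0 is left unpenalized (a common choice, assumed here)."""
    g0, g = 0.0, [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = y - sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))
        g0 += err
        for i, xi in enumerate(x):
            g[i] += err * xi
    return w0 + eta * g0, [wi + eta * (gi - lam * wi) for wi, gi in zip(w, g)]
```

With no data the likelihood gradient vanishes and the prior alone shrinks the weights, which is exactly the "pushes parameters towards zero" effect described above.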