Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections
Yi Zhang
10-701, Machine Learning, Spring 2011
February 3rd, 2011
Parts of the slides are from previous 10-701 lectures
New: Bias-variance decomposition, bias-variance tradeoff, overfitting, regularization, and feature selection

Outline
• Logistic regression
• Decision surface (boundary) of classifiers
• Generative vs. discriminative classifiers
• Linear regression
• Bias-variance decomposition and tradeoff
• Overfitting and regularization
• Feature selection

Outline
• Logistic regression
◦ Model assumptions: P(Y|X)
◦ Decision making
◦ Estimating the model parameters
◦ Multiclass logistic regression
• Decision surface (boundary) of classifiers
• Generative vs. discriminative classifiers
• Linear regression
• Bias-variance decomposition and tradeoff
• Overfitting and regularization
• Feature selection

Logistic regression: assumptions
• Binary classification: f: X = (X1, X2, …, Xn) → Y ∈ {0, 1}
• Logistic regression makes an assumption on P(Y|X):
  P(Y=1|X) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
• And thus:
  P(Y=0|X) = 1 − P(Y=1|X) = exp(−(w0 + Σi wi Xi)) / (1 + exp(−(w0 + Σi wi Xi)))
• The model assumption is only on the form of P(Y|X): hence "logistic" regression, since P(Y|X) is the logistic function applied to a linear function of X

Decision making
• Given a logistic regression w and an X, decide Y by comparing P(Y=1|X) and P(Y=0|X):
  predict Y = 1 iff w0 + Σi wi Xi > 0
• Linear decision boundary! [Aarti, 10-701]

Estimating the parameters w
• Given training data {(x^j, y^j)}, j = 1, …, N, where each x^j = (x1^j, …, xn^j)
• How to estimate w = (w0, w1, …, wn)? [Aarti, 10-701]
• Maximum conditional likelihood on the data!
◦ Logistic regression only models P(Y|X)
◦ So we only maximize P(Y|X), ignoring P(X)
◦ Maximize the conditional log-likelihood:
  l(w) = Σj ln P(y^j | x^j, w) = Σj [ y^j (w0 + Σi wi xi^j) − ln(1 + exp(w0 + Σi wi xi^j)) ]
• l(w) is a concave function (proof beyond the scope of class)
◦ No local optimum: gradient ascent (descent) finds the global maximum
  ∂l(w)/∂wi = Σj xi^j ( y^j − P(Y=1 | x^j, w) )
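The gradient-ascent maximization of the conditional log-likelihood described above can be sketched in plain Python. The toy dataset, the learning rate, the step count, and helper names such as `train_logistic` are illustrative assumptions, not taken from the slides.

```python
import math

def sigmoid(z):
    # P(Y=1 | x, w) under the logistic regression assumption
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, steps=1000):
    # w[0] is the intercept w0; w[1:] are the feature weights
    n = len(xs[0])
    w = [0.0] * (n + 1)
    for _ in range(steps):
        # gradient of the conditional log-likelihood:
        # dl/dw_i = sum_j x_i^j * (y^j - P(Y=1 | x^j, w))
        grad = [0.0] * (n + 1)
        for x, y in zip(xs, ys):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            grad[0] += y - p
            for i, xi in enumerate(x):
                grad[i + 1] += xi * (y - p)
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    # linear decision boundary: Y = 1 iff w0 + sum_i w_i x_i > 0
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else 0

# hypothetical 1-D data: Y = 1 for larger x
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [0, 0, 1, 1]
w = train_logistic(xs, ys)
print([predict(w, x) for x in xs])
```

Because the log-likelihood is concave, this simple ascent approaches the global optimum regardless of the starting point, which is exactly the "no local optimum" property noted above.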
Multiclass logistic regression
• Binary classification generalizes to K-class classification:
◦ For each class k < K:
  P(Y=k|X) = exp(wk0 + Σi wki Xi) / (1 + Σ_{j<K} exp(wj0 + Σi wji Xi))
◦ For class K:
  P(Y=K|X) = 1 / (1 + Σ_{j<K} exp(wj0 + Σi wji Xi))

Outline
• Logistic regression
• Decision surface (boundary) of classifiers
◦ Logistic regression
◦ Gaussian naïve Bayes
◦ Decision trees
• Generative vs. discriminative classifiers
• Linear regression
• Bias-variance decomposition and tradeoff
• Overfitting and regularization
• Feature selection

Logistic regression
• Model assumptions on P(Y|X)
• Deciding Y given X: predict Y = 1 iff w0 + Σi wi Xi > 0
• Linear decision boundary! [Aarti, 10-701]

Gaussian naïve Bayes
• Model assumptions P(X,Y) = P(Y)P(X|Y)
◦ Bernoulli on Y: P(Y=1) = π
◦ Conditional independence of X: P(X|Y) = Πi P(Xi|Y)
◦ Gaussian for Xi given Y: Xi | Y=k ~ N(μik, σik)
• Deciding Y given X: pick the class maximizing P(Y)P(X|Y)
[Figure: class-conditional densities P(X|Y=0) and P(X|Y=1)]

Gaussian naïve Bayes: nonlinear case
• Again, assume P(Y=1) = P(Y=0) = 0.5
[Figure: class-conditional densities P(X|Y=0) and P(X|Y=1) giving a nonlinear boundary]

Decision trees
• Decision making on Y: follow the tree structure to a leaf

Outline
• Logistic regression
• Decision surface (boundary) of classifiers
• Generative vs. discriminative classifiers
◦ Definitions
◦ How to compare them
◦ GNB-1 vs. logistic regression
◦ GNB-2 vs. logistic regression
• Linear regression
• Bias-variance decomposition and tradeoff
• Overfitting and regularization
• Feature selection

Generative and discriminative classifiers
• Generative classifiers
◦ Model the joint distribution P(X, Y)
◦ Usually via P(X,Y) = P(Y) P(X|Y)
◦ Example: Gaussian naïve Bayes
• Discriminative classifiers
◦ Model P(Y|X), or simply f: X → Y
◦ Do not care about P(X)
◦ Examples: logistic regression, support vector machines (later in this course)

Generative vs. discriminative
• How can we compare, say, Gaussian naïve Bayes and logistic regression?
◦ P(X,Y) = P(Y) P(X|Y) vs. P(Y|X)?
◦ Hint: decision making is based on P(Y|X)
◦ So compare the P(Y|X) each model can represent!

Two versions: GNB-1 and GNB-2
• Model assumptions on P(X,Y) = P(Y)P(X|Y)
◦ Bernoulli on Y: P(Y=1) = π
◦ Conditional independence of X: P(X|Y) = Πi P(Xi|Y)
◦ Gaussian on Xi|Y: Xi | Y=k ~ N(μik, σik)  (GNB-1)
◦ (Additionally,) class-independent variance: σi0 = σi1 = σi  (GNB-2)
[Figure: class-conditional densities P(X|Y=0) and P(X|Y=1) with class-dependent variances; impossible for GNB-2]

GNB-2 vs. logistic regression
• GNB-2: P(X,Y) = P(Y)P(X|Y)
◦ Bernoulli on Y; conditional independence of X, and Gaussian on Xi
◦ Additionally, class-independent variance
• It turns out, P(Y|X) of GNB-2 has the form:
  P(Y=1|X) = 1 / (1 + exp(−(w0 + Σi wi Xi)))  for some w determined by (π, μ, σ)
◦ See [Mitchell: Naïve Bayes and Logistic Regression], section 3.1 (pages 8-10)
• Recall: P(Y|X) of logistic regression has exactly this form
• So the set of P(Y|X) representable by GNB-2 is a subset of the set representable by LR
• Given infinite training data, we claim: LR >= GNB-2

GNB-1 vs. logistic regression
• GNB-1: P(X,Y) = P(Y)P(X|Y)
◦ Bernoulli on Y; conditional independence of X, and Gaussian on Xi (class-dependent variance)
• Logistic regression: P(Y|X)
• Neither encompasses the other
◦ First, there is a P(Y|X) from GNB-1 that cannot be represented by LR: LR only represents linear decision surfaces
[Figure: P(X|Y=0) and P(X|Y=1) with class-dependent variances, giving a nonlinear boundary]
◦ Second, there is a P(Y|X) represented by LR that cannot be derived from GNB-1 assumptions: GNB-1 cannot represent any correlated Gaussian, but the resulting P(Y|X) can still possibly be represented by LR (HW2)
[Figure: correlated Gaussian class-conditional densities P(X|Y=0) and P(X|Y=1)]

Outline
• Logistic regression
• Decision surface (boundary) of classifiers
• Generative vs.
discriminative classifiers
• Linear regression
◦ Regression problems
◦ Model assumptions: P(Y|X)
◦ Estimating the model parameters
• Bias-variance decomposition and tradeoff
• Overfitting and regularization
• Feature selection

Regression problems
• Predict Y given X
• Y is continuous
• General assumption: Y = f(X) + noise [Aarti, 10-701]

Linear regression: assumptions
• Y is generated from f(X) plus Gaussian noise: Y = f(X) + ε, ε ~ N(0, σ²)
• f(X) is a linear function: f(X) = w0 + Σi wi Xi
• Therefore, the assumption on P(Y|X, w) is:
  P(Y|X, w) = N(w0 + Σi wi Xi, σ²)
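Under the Gaussian-noise assumption above, maximizing the conditional likelihood P(Y|X, w) is equivalent to minimizing the sum of squared errors, so a least-squares fit gives the maximum-likelihood w. A minimal one-feature sketch follows; the function name `fit_simple_linear` and the noiseless toy dataset are made up for illustration.

```python
def fit_simple_linear(xs, ys):
    # Least squares for y = w0 + w1*x, which is the maximum-likelihood
    # fit under the assumption Y = w0 + w1*X + N(0, sigma^2)
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx
    return w0, w1

# noiseless toy data on the line y = 1 + 2x (hypothetical)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_simple_linear(xs, ys)
print(w0, w1)  # recovers intercept 1 and slope 2
```

With noisy data the recovered weights would only approximate the generating line, but the estimator is the same: the noise model changes the likelihood, not the algebra of the fit.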
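The Gaussian naïve Bayes classifier from the earlier slides can be sketched end-to-end: estimate the Bernoulli prior and the per-class, per-feature Gaussian parameters (the GNB-1 version, with class-dependent variances), then decide Y by maximizing log P(Y) + Σi log P(Xi|Y). The helper names and the toy dataset are hypothetical.

```python
import math

def fit_gnb(xs, ys):
    # Estimate P(Y) (Bernoulli) and one Gaussian per class per feature,
    # following the GNB-1 assumptions (class-dependent variances)
    model = {}
    for c in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == c]
        prior = len(rows) / len(xs)
        means = [sum(col) / len(rows) for col in zip(*rows)]
        vars_ = [sum((v - m) ** 2 for v in col) / len(rows)
                 for col, m in zip(zip(*rows), means)]
        model[c] = (prior, means, vars_)
    return model

def log_gaussian(x, mu, var):
    # log density of N(mu, var) at x
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict_gnb(model, x):
    # decide Y by maximizing log P(Y) + sum_i log P(X_i | Y)
    def score(c):
        prior, means, vars_ = model[c]
        return math.log(prior) + sum(log_gaussian(xi, m, v)
                                     for xi, m, v in zip(x, means, vars_))
    return max(model, key=score)

# hypothetical 1-D data: class 0 clustered near 0, class 1 near 4
xs = [[0.0], [0.5], [-0.5], [4.0], [4.5], [3.5]]
ys = [0, 0, 0, 1, 1, 1]
model = fit_gnb(xs, ys)
print(predict_gnb(model, [0.2]), predict_gnb(model, [4.2]))
```

Note that this generative sketch fits P(Y) and P(X|Y) and only derives P(Y|X) at decision time, whereas the logistic-regression sketch earlier fits P(Y|X) directly, which is exactly the generative/discriminative contrast drawn in the slides.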