10-601 Recitation #4: Gaussian Naive Bayes and Logistic Regression
October 5th, 2011
Shing-hon Lau
Office hours: Friday 3-4 PM

Agenda
- HW #2 due tomorrow at 5 PM: submit a written copy and post your code to Blackboard
- Gaussian Naive Bayes
- Logistic Regression
- Gradient Descent
- Discriminative vs. Generative classifiers
- Bias/Variance Tradeoff

Naive Bayes
- Features are conditionally independent given the class
- So far we assumed all variables were binary (or discrete)
- What if we have continuous features?

Gaussian Naive Bayes
- Still assume features are conditionally independent given the class
- Generally assume Gaussian features (why?)
- Other distributions can be used as well

Gaussian Distribution
- Also called the normal distribution
- You will compute the MLE and MAP estimates in HW2

Why Gaussian?
- Many natural phenomena are normally distributed, e.g., biological measurements (height, weight, etc.)
- The Central Limit Theorem implies that sample means tend to a normal distribution
- Mathematically easy to work with

Working with Continuous Variables
- Discrete variables: [formula slide: estimate a probability table for each P(X_j | Y)]
- Continuous variables: [formula slide: estimate a density, e.g., a Gaussian, for each P(X_j | Y)]

Generative vs. Discriminative Classifiers
- Generative classifiers learn P(X|Y) and use Bayes' rule to calculate P(Y|X)
- Discriminative classifiers learn P(Y|X) directly
- Which type is Naive Bayes?
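The slides themselves contain no code, but the Gaussian Naive Bayes recipe above (conditionally independent Gaussian features, per-class MLE mean and variance, Bayes' rule for prediction) can be sketched in plain Python. The function names and the toy data are my own illustration, not from the slides:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit_gnb(X, y):
    """MLE fit: a prior P(Y=c) plus a per-feature (mean, std) for each class."""
    params = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        feats = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col)  # MLE (biased) variance
            feats.append((mu, math.sqrt(var)))
        params[c] = (prior, feats)
    return params

def predict_gnb(params, x):
    """Pick the class maximizing log P(y) + sum_j log P(x_j | y)."""
    best, best_score = None, float("-inf")
    for c, (prior, feats) in params.items():
        score = math.log(prior) + sum(
            math.log(gaussian_pdf(xj, mu, sd)) for xj, (mu, sd) in zip(x, feats))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy data: one feature, class 0 clusters near 0, class 1 near 5.
params = fit_gnb([[0.0], [0.5], [1.0], [4.5], [5.0], [5.5]], [0, 0, 0, 1, 1, 1])
label = predict_gnb(params, [0.2])
```

Working in log space avoids underflow when many per-feature densities are multiplied together.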
Generative vs. Discriminative Classifiers (cont.)
- Discriminative classifiers are only good for classification
- Generative classifiers enable other tasks as well (e.g., generating data)
- Generally speaking, generative classifiers are more accurate with less data, discriminative classifiers with more data

Logistic Regression
- An example of a discriminative classifier
- Can handle arbitrarily many classes
- [formula slide: the sigmoid form of P(Y|X)]

Finding the Weights
- We were able to derive an analytic expression for the weights in the special Gaussian case
- How can we find the weights in the general case?

Good Weights / Maximum Conditional Likelihood Estimate
- [formula slides: choose the weights w that maximize the conditional likelihood of the labels given the features]

Maximizing l(w)
- No closed form for the maximum
- So how do we find the MCLE?

Gradient Descent
- Iterative optimization algorithm
- Basic idea: find the direction of greatest decrease, then take a small step in that direction
- Repeat these two steps until we are satisfied; usually stop when the change is very small
- Guaranteed to find the optimal point in some cases

1-D Gradient Descent
- In 1-D, the gradient is just the derivative
- Start at a random point, e.g., (1, 1)
- Take a step in the negative gradient direction
- Repeat the process
- Eventually we converge to the optimal point

Problems with Gradient Descent
- If the step size is too big, we can end up bouncing back and forth between two values
- If the function is not convex/concave, we may end up in a local optimum

Bad Step Size
- Start at (1, 1); step to (-1, 1); step back to (1, 1); and so on forever

Non-convexity
- [figure slides: gradient descent on a non-convex function can converge to a local optimum instead of the global one]

Gradient Descent
- Generally you have to shrink your step size as you continue to iterate
- Try to stick to convex/concave functions
- Do random restarts if you must use a non-convex/non-concave objective

n-dimensional Gradient Descent
- Use partial derivatives to compute the gradient
- The gradient is an n-dimensional vector; the gradient direction is the direction of greatest increase
- Gradient descent makes a first-order (i.e., linear) approximation; second-order Newton methods also exist
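The 1-D pictures above (start at (1, 1) and converge with a small step; bounce between (1, 1) and (-1, 1) with a bad one) are consistent with minimizing f(w) = w^2. A minimal sketch, where the step sizes 0.1 and 1.0 are illustrative choices of my own:

```python
def gradient_descent(grad, w0, step, iters):
    """Repeatedly step against the gradient: w <- w - step * grad(w)."""
    w = w0
    for _ in range(iters):
        w = w - step * grad(w)
    return w

# f(w) = w^2 has derivative f'(w) = 2w and its minimum at w = 0.
grad = lambda w: 2 * w

# Small step: w is multiplied by (1 - 2*0.1) = 0.8 each iteration, so it shrinks to 0.
good = gradient_descent(grad, w0=1.0, step=0.1, iters=50)

# Bad step: w <- w - 1.0 * 2w = -w, so the iterate flips sign forever: 1, -1, 1, ...
bad = gradient_descent(grad, w0=1.0, step=1.0, iters=51)
```

This is exactly the oscillation on the "Bad Step-Size" slides: f(1) = f(-1) = 1, so the point bounces between (1, 1) and (-1, 1) without ever descending.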
Logistic Regression and Gradient Descent
- The objective function is concave!

MAP and Logistic Regression
- Use a MAP estimate to avoid overfitting (just like in Naive Bayes)
- What can happen to the weights without a regularization term?
- What does the regularization term help do?

Other Forms of Regularization
- Can apply a Lasso or Ridge penalty to the weights
- Lasso drives many weights to exactly zero; Ridge shrinks all of the weights

Bias/Variance Tradeoff
- Simpler models are more biased because they make more assumptions
- More complex models have higher variance, since they depend on the particulars of the data provided
- We have to trade the two off to get the best possible classifier

Extra Slides

Correlated Features
- Worst-case scenario: duplicated features
- What will Naive Bayes do with duplicated features? What will Logistic Regression do?
- What if the features are merely correlated?

Estimating Some Features Jointly
- What if we are not willing to assume that all features are conditionally independent?
- How can we do Naive Bayes? What price do we pay for dropping the conditional independence assumption?
- Graphical models are a formalization of this idea; they allow things like Tree-Augmented Naive Bayes (TAN) and, more generally, an arbitrary set of conditional independence assumptions

Neural Network Preview
- One sigmoid function is good (Logistic Regression), so more must be better
- Chain them so that the outputs of one are the inputs to the next
- "Mimics" the brain (kind of), so such systems are termed Neural Networks

Numerical Gradient Descent
- What if we can only evaluate the function, but not its derivative?
- We can take tiny steps in each direction to estimate the gradient
- Generally much more expensive because of all the extra function evaluations