Logistic Regression William Cohen 1 Outline Quick review classification na ve Bayes perceptrons Learning as optimization Logistic regression via gradient descent Overfitting and regularization Discriminative vs generative classifiers 2 3 4 REVIEW 5 Quick Review Probabilities are useful for modeling uncertainty and nondeterministic processes A classifier takes as input feature values x x1 xn and produces a predicted class label y Classification algorithms can be based on probability model the joint distribution of X1 Xn Y and predict argmaxy Pr y x1 xn Model Pr X1 Xn Y with a big table brute force Bayes Model Pr X1 Xn Y with Bayes s rule and conditional independence assumptions na ve Bayes Another classifier incrementally modifies a hyperplane v until it stops making mistakes perceptron Both the perceptron and na ve Bayes have a linear decision boundary i e the classifier is of the form sign x w what s different is which weight vector w is used 6 LEARNING AS OPTIMIZATION 7 Learning as optimization The main idea today Make two assumptions the classifier is linear ie roughly of the form P y x w sign x w we want to find the classifier w that fits the data best Formalize fits the data best as a function fD w often f is called a loss function Find the argminw fD w This is an optimization problem 8 Learning as optimization warmup Goal Find the best parameter of a binomial Dataset D x1 xn xi is 0 or 1 k of them are 1 P D q q xi 1 q 1 xi qk 1 q n k i d d k P D q q 1 q n k dq dq d k n k k d q 1 q q 1 q n k dq dq kqk 1 1 q n k qk n k 1 q n k 1 1 qk 1 1 q n k 1 k 1 q q n k 9 Learning as optimization warmup Goal Find the best parameter of a binomial Dataset D x1 xn xi is 0 or 1 k of them are 1 0 0 1 k k n k 0 n k k n 10 Learning as optimization general procedure Goal Learn the parameter w of Dataset D x1 y1 xn yn Write down a loss function LossD w Set w to minimize Loss Usually we use numeric methods to find the optimum i e gradient descent repeatedly take a small step in the direction of the gradient 11 Gradient descent To find argminx f x Start with x0 For t 1 xt 1 xt f x t where is small 12 Gradient descent 13 Pros and cons of gradient descent Simple and often quite effective on ML tasks Only applies to smooth functions differentiable Might find a local minimum rather than a global one 14 Pros and cons of gradient descent There is only one local optimum if the function is convex 15 LEARNING LINEAR CLASSIFIERS WITH GRADIENT DESCENT PART 1 THE IDEA 16 Using gradient descent for linear classifiers 1 Replace sign x v with something differentiable e g the logistic x v 1 logistic u 1 e u sign x 1 P Y pos X x x w 1 e 17 Using gradient descent for linear classifiers 1 2 Replace sign x w with something differentiable e g the logistic x w Define a function on the data that formalizes how well w predicts the data loss function 1 P y X x 1 e x w 1 if yi 1 1 exp x w i P yi xi w 1 Assume y 0 or y 1 1 if yi 0 1 exp xi w Our loss function is log of conditional likelihood LossD w log P yi xi w i 18 Using gradient descent for linear classifiers 1 2 3 Replace sign x w with something differentiable e g the logistic x w Define a function on the data that formalizes how well w predicts the data loss function Differentiate the loss function and use gradient descent Start with w0 For t 1 wt 1 wt Loss D w t where is small Warning this is not a toy logistic regression LossD w log P yi xi w i 19 Using gradient descent for linear classifiers 1 2 3 Replace sign x w with something differentiable e g the logistic x w Define a function on the data that formalizes how well w predicts the data loss function is a a weight vector w classifier Differentiate the loss function and use gradient descent Start with w0 For t 1 wt 1 wt Loss D w t where should is small be a w that Final result fits the data well according to w1 LossD w w0 2 w 1 4 w0 20 Stochastic Gradient Descent SGD Goal Learn the parameter of Dataset D x1 xn or maybe D x1 y1 xn yn Write down Pr D as a function of For large datasets this is expensive we don t want to load all the data D into memory and the gradient depends on all the data An alternative B one pick a small subset of examples B D example is a very approximate the gradient using them on average this is the right direction take a step in that direction repeat popular choice 21 Using SGD for logistic regression 1 P y x logistic x w 2 Define the loss function LossD w log P yi xi w i 3 Differentiate the loss function and use gradient descent to minimize Start with w0 For t 1 T until convergence For each example x y in D wt 1 wt Loss x y w t where is small More steps noisier path toward the minimum but each step is 22 LEARNING LINEAR CLASSIFIERS WITH GRADIENT DESCENT PART 2 THE GRADIENT I will start by deriving the gradient for one example eg SGD and then move to batch gradient descent 23 oss on one example is We re going to dive into this thing here d dw p 24 f n nf n 1 f p 25 26 27 Breaking it down SGD for logistic regression 1 P y x logistic x w LossD w log P yi xi w 2 Define the loss function i 3 Differentiate the loss function and use gradient descent Start with w0 For t 1 T until convergence For each example x y in D pi 1 exp x w 1 wt 1 wt Loss x y w t wt l where is small y pi x 28 OBSERVATIONS ABOUT LOGISTIC REGRESSION 29 Sparsity and logistic regression Key computational point if xj 0 then the gradient of wj is zero so when processing an example you only need to update weights for the nonzero features of an example Similar computational tricks work for na ve Bayes and the perceptron algorithm 30 Perceptron updates and LR updates Weight updates are very similar wt 1 wt l y pi x y 0 1 wt 1 wt y x mistakes only y 1 1 After a false negative predict false really true wt 1 …
View Full Document
Unlocking...