10-601 Machine Learning

Logistic regression

Back to classification

1. Instance-based classifiers: use the observations directly (no models), e.g. K nearest neighbors.
2. Generative: build a generative statistical model, e.g. Bayesian networks.
3. Discriminative: directly estimate a decision rule or boundary, e.g. decision trees.

Generative vs. discriminative classifiers

When using generative classifiers we relied on all points to learn the generative model. When using discriminative classifiers we mainly care about the boundary.

[Figure: the generative model and the discriminative model, each drawn as a graph over Y and X]

Regression for classification

In some cases we can use linear regression for determining the appropriate boundary. However, since the output is usually binary or discrete, there are more efficient regression methods. Recall that for classification we are interested in the conditional probability $p(y \mid x; \theta)$, where $\theta$ are the parameters of our model. When using regression, $\theta$ represents the values of our regression coefficients ($w$).

Assume we would like to use linear regression to learn the parameters for $p(y \mid x; \theta)$. Problems?

$$w^T x \ge 0 \Rightarrow \text{classify as } 1, \qquad w^T x < 0 \Rightarrow \text{classify as } -1$$

[Figure: optimal regression model, with the output axis running from -1 to 1]

The sigmoid function

To classify using regression models we replace the linear function with the sigmoid function, which is always between 0 and 1:

$$g(h) = \frac{1}{1 + e^{-h}}$$

Using the sigmoid we set, for binary classification problems,

$$p(y = 0 \mid x; \theta) = g(-w^T x) = \frac{1}{1 + e^{w^T x}}$$

$$p(y = 1 \mid x; \theta) = 1 - g(-w^T x) = \frac{e^{w^T x}}{1 + e^{w^T x}}$$

Note that we are defining the probabilities in terms of $p(y \mid x)$. No need to use Bayes rule here.

In the rest of these notes, $g(x; w)$ is shorthand for $\frac{1}{1 + e^{w^T x}}$, so that $p(y = 0 \mid x; \theta) = g(x; w)$ and $p(y = 1 \mid x; \theta) = 1 - g(x; w)$.

Logistic regression vs. linear regression

$$p(y = 0 \mid x; \theta) = g(x; w) = \frac{1}{1 + e^{w^T x}}, \qquad p(y = 1 \mid x; \theta) = 1 - g(x; w) = \frac{e^{w^T x}}{1 + e^{w^T x}}$$

Determining parameters for logistic regression problems

So how do we find the parameters? Similar to other regression problems, we look for the MLE for $w$. The likelihood of the data given the model is

$$L(y \mid x; w) = \prod_i \left(1 - g(x^i; w)\right)^{y^i} g(x^i; w)^{(1 - y^i)}$$

Solving logistic regression problems

Taking the log of the likelihood we get

$$LL(y \mid x; w) = \sum_{i=1}^{N} y^i \ln\left(1 - g(x^i; w)\right) + (1 - y^i) \ln g(x^i; w)$$

$$= \sum_{i=1}^{N} y^i \ln \frac{1 - g(x^i; w)}{g(x^i; w)} + \ln g(x^i; w)$$

$$= \sum_{i=1}^{N} y^i w^T x^i - \ln\left(1 + e^{w^T x^i}\right)$$

The last step uses $\frac{1 - g(x; w)}{g(x; w)} = e^{w^T x}$ and $\ln g(x; w) = -\ln\left(1 + e^{w^T x}\right)$.

Maximum likelihood estimation

$$\frac{\partial}{\partial w_j} l(w) = \frac{\partial}{\partial w_j} \sum_{i=1}^{N} \left\{ y^i w^T x^i - \ln\left(1 + e^{w^T x^i}\right) \right\}$$

$$= \sum_{i=1}^{N} x_j^i \left\{ y^i - \left(1 - g(x^i; w)\right) \right\}$$

$$= \sum_{i=1}^{N} x_j^i \left\{ y^i - p(y^i = 1 \mid x^i; w) \right\}$$

Bad news: there is no closed-form solution. Good news: the function is concave.

Gradient ascent

The slope of $z$ with respect to $w$ is $\frac{\partial z}{\partial w}$. Moving $w$ in the direction of the slope leads to a larger $z$. But not too much, otherwise we would go beyond the optimal $w$.

[Figure: z plotted against w, with the slope marked]

Gradient descent

For $z = (f(w) - y)^2$ the slope is again $\frac{\partial z}{\partial w}$. Moving $w$ in the direction opposite to the slope leads to a smaller $z$. But not too much, otherwise we would go beyond the optimal $w$.

Gradient ascent for logistic regression

$$\frac{\partial}{\partial w_j} l(w) = \sum_{i=1}^{N} x_j^i \left\{ y^i - \left(1 - g(x^i; w)\right) \right\}$$

We use the gradient to adjust the value of $w$:

$$w_j \leftarrow w_j + \varepsilon \sum_{i=1}^{N} x_j^i \left\{ y^i - \left(1 - g(x^i; w)\right) \right\}$$

where $\varepsilon$ is a small constant.

Example

[Figure]

Algorithm for logistic regression

1. Choose $\varepsilon$.
2. Start with a guess for $w$.
3. For all $j$ set $w_j \leftarrow w_j + \varepsilon \sum_{i=1}^{N} x_j^i \left\{ y^i - \left(1 - g(x^i; w)\right) \right\}$.
4. If there is no improvement for $\sum_{i=1}^{N} \left( y^i - \left(1 - g(x^i; w)\right) \right)^2$, stop. Otherwise go to step 3.

Regularization

As with other estimation problems, we may not have enough data to learn good models. One way to overcome this is to regularize the model: impose additional constraints on the parameters we are fitting. For example, let us assume that $w_i$ comes from a Gaussian distribution with mean 0 and variance $\sigma^2$, where $\sigma^2$ is a user-defined parameter: $w_i \sim N(0, \sigma^2)$.

If we regularize the parameters, we need to take the prior into account when computing the posterior for our parameters:

$$p(y = 1, \theta \mid x) \propto p(y = 1 \mid x; \theta)\, p(\theta)$$

Here we use a Gaussian model for the prior. Thus, after removing terms that do not depend on $w$, the log likelihood changes to

$$LL(y; w \mid x) = \sum_{i=1}^{N} \left\{ y^i w^T x^i - \ln\left(1 + e^{w^T x^i}\right) \right\} - \sum_j \frac{w_j^2}{2 \sigma^2}$$

and the new update rule, after taking the derivative with respect to $w_j$, is

$$w_j \leftarrow w_j + \varepsilon \sum_{i=1}^{N} x_j^i \left\{ y^i - \left(1 - g(x^i; w)\right) \right\} - \varepsilon \frac{w_j}{\sigma^2}$$

where $\sigma^2$ is the variance of our prior model. The result is also known as the MAP estimate.

There are many other ways to regularize logistic regression. The Gaussian model leads to an L2 regularization (we are trying to minimize the square of $w$). Another popular regularization is L1, which tries to minimize $|w|$. This often leads to many $w_j$'s being 0, resulting in compact models.

The importance of the regularization parameter

Too small: does not have a big impact. Too large: overrides the data.

[Figure: an example of the training and test conditional log likelihoods as a function of the regularization parameter; the vertical axis is the average log likelihood for the data only]

Logistic regression for more than 2 classes

Logistic regression can be used to classify data from more than 2 classes. For $i < k$ we set

$$p(y = i \mid x; \theta) = g(w_{i0} + w_{i1} x_1 + \dots + w_{id} x_d) = g(w_i^T x)$$

where

$$g(z_i) = \frac{e^{z_i}}{1 + \sum_{j=1}^{k-1} e^{z_j}}, \qquad z_i = w_{i0} + w_{i1} x_1 + \dots + w_{id} x_d$$

And for $k$ we have

$$p(y = k \mid x; \theta) = 1 - \sum_{i=1}^{k-1} p(y = i \mid x; \theta) = \frac{1}{1 + \sum_{j=1}^{k-1} e^{z_j}}$$

Logistic …
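
The gradient-ascent algorithm for binary logistic regression described in these notes, together with its Gaussian-prior (MAP) variant, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the course's reference implementation: the function names (`p_y1`, `objective`, `fit`), the default step size, the tolerance, and the toy data in the usage note are all hypothetical. One deliberate change from step 4 of the algorithm: the sketch stops when the (regularized) log likelihood stops improving, rather than tracking the sum of squared errors, because the log likelihood is the quantity that gradient ascent increases.

```python
import math


def dot(w, x):
    """Inner product w^T x for two equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x))


def p_y1(w, x):
    """p(y = 1 | x; w) = e^{w^T x} / (1 + e^{w^T x}), i.e. 1 - g(x; w)."""
    h = dot(w, x)
    if h >= 0:
        return 1.0 / (1.0 + math.exp(-h))  # avoids overflow for large h
    e = math.exp(h)
    return e / (1.0 + e)


def objective(w, X, y, sigma2=None):
    """Log likelihood sum_i [y^i w^T x^i - ln(1 + e^{w^T x^i})],
    minus sum_j w_j^2 / (2 sigma^2) when a Gaussian prior is used."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = dot(w, xi)
        # ln(1 + e^h) computed in a numerically stable way
        total += yi * h - (max(h, 0.0) + math.log1p(math.exp(-abs(h))))
    if sigma2 is not None:
        total -= sum(wj * wj for wj in w) / (2.0 * sigma2)
    return total


def fit(X, y, eps=0.05, sigma2=None, max_iter=10000, tol=1e-10):
    """Gradient ascent on the log likelihood (MAP if sigma2 is given).

    X is a list of feature lists (include a constant 1.0 for a bias term),
    y is a list of 0/1 labels, eps is the step size (assumed small enough).
    """
    d = len(X[0])
    w = [0.0] * d  # step 2: initial guess
    prev = objective(w, X, y, sigma2)
    for _ in range(max_iter):
        # y^i - (1 - g(x^i; w)) for every training point
        errors = [yi - p_y1(w, xi) for xi, yi in zip(X, y)]
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) for j in range(d)]
        if sigma2 is not None:
            # derivative of the Gaussian prior term: -w_j / sigma^2
            grad = [gj - wj / sigma2 for gj, wj in zip(grad, w)]
        # step 3: update all coordinates simultaneously
        w = [wj + eps * gj for wj, gj in zip(w, grad)]
        cur = objective(w, X, y, sigma2)
        if abs(cur - prev) < tol:  # stopping rule (swapped in, see lead-in)
            break
        prev = cur
    return w
```

A possible usage, on made-up one-dimensional data with a bias feature: `X = [[1.0, float(v)] for v in range(6)]`, `y = [0, 0, 1, 0, 1, 1]`, then `w_mle = fit(X, y)` and `w_map = fit(X, y, sigma2=1.0)`. The MAP weights come out smaller in norm than the MLE weights, which is the shrinkage effect of the L2 penalty discussed under Regularization.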