Logistic Regression (cont.), Decision Trees
Machine Learning 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
September 26th, 2007

Logistic Regression
- Logistic function (or sigmoid): sigma(z) = 1 / (1 + e^(-z))
- Learn P(Y|X) directly: assume a particular functional form, the sigmoid applied to a linear function of the data z, i.e. P(Y=1|X) = 1 / (1 + exp(-(w_0 + Σ_i w_i X_i)))
- Features can be discrete or continuous

Loss functions: likelihood vs. conditional likelihood
- Generative (Naïve Bayes) loss function: the data likelihood, ln P(D | w) = Σ_j ln P(x^j, y^j | w)
- Discriminative models cannot compute P(x^j | w); the discriminative (logistic regression) loss function is instead the conditional data likelihood, ln P(D_Y | D_X, w) = Σ_j ln P(y^j | x^j, w)
- This doesn't waste effort learning P(X); it focuses on P(Y|X), which is all that matters for classification

Optimizing a concave function: gradient ascent
- The conditional likelihood for logistic regression is concave, so we can find the optimum with gradient ascent
- Gradient: ∇_w l(w) = [∂l(w)/∂w_0, ..., ∂l(w)/∂w_n]
- Learning rate η > 0; update rule: w^(t+1) = w^(t) + η ∇_w l(w) evaluated at w^(t)
- Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading)

Gradient ascent for LR
- Gradient ascent algorithm: iterate until the change is below a tolerance; for i = 1, ..., n, repeat
    w_i <- w_i + η Σ_j x_i^j [ y^j - P(Y=1 | x^j, w) ]
  (with an analogous update for w_0, without the x_i^j factor)
- Note: this equation is correct. In the last lecture the notation was inadvertently changed; sorry about the change, both definitions are really equivalent, and the equations on this slide are consistent with this definition.

That's all M(C)LE. How about MAP?
- One common approach is to define priors on w, e.g. a normal distribution with zero mean and identity covariance
- This "pushes" the parameters towards zero and corresponds to regularization: it helps avoid very large weights and overfitting (more on this later in the semester)
- MAP estimate: w* = arg max_w ln [ P(w) Π_j P(y^j | x^j, w) ]

M(C)AP as regularization
- The zero-mean Gaussian prior adds a penalty term -(λ/2) Σ_i w_i^2 to the conditional log-likelihood
- This penalizes high weights; the same idea is also applicable in linear regression

Large parameters lead to overfitting
- If the data is linearly separable, the weights go to infinity, which leads to overfitting
- Penalizing high weights can prevent overfitting (again, more on this later in the semester)

Gradient of M(C)AP
- ∂/∂w_i [ ln P(w) + Σ_j ln P(y^j | x^j, w) ] = -λ w_i + Σ_j x_i^j [ y^j - P(Y=1 | x^j, w) ]
- A code sketch of this regularized update appears after the Naïve Bayes comparison below

MLE vs. MAP
- Maximum conditional likelihood estimate: w* = arg max_w Σ_j ln P(y^j | x^j, w)
- Maximum conditional a posteriori estimate: w* = arg max_w [ ln P(w) + Σ_j ln P(y^j | x^j, w) ]

Gaussian Naïve Bayes vs. Logistic Regression (1)
- [Ng & Jordan, 2002] Generative and discriminative classifiers; focuses on the setting where GNB leads to a linear classifier: the variance σ_i depends on feature i, not on class k
- Asymptotic comparison (# training examples -> infinity):
  - when the GNB model is correct, GNB and LR produce identical classifiers
  - when the model is incorrect, LR is less biased (it does not assume conditional independence), so LR is expected to outperform GNB

Gaussian Naïve Bayes vs. Logistic Regression (2)
- [Ng & Jordan, 2002] Non-asymptotic analysis: convergence rate of the parameter estimates, with n = number of attributes in X
- Size of training data needed to get close to the infinite-data solution: GNB needs O(log n) samples, LR needs O(n) samples
- GNB converges more quickly to its (perhaps less helpful) asymptotic estimates

Naïve Bayes vs. Logistic Regression
- Some experiments from UCI data sets
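The regularized update above can be made concrete in a few lines of code. The following is a minimal sketch, not taken from the lecture: gradient ascent on the L2-regularized conditional log-likelihood (the M(C)AP objective) for logistic regression. The function name fit_logistic_map, the learning rate eta, the regularization strength lam, and the toy data are illustrative assumptions; dividing the step by the number of examples and leaving w_0 unpenalized are small practical choices, not from the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_map(X, y, eta=0.1, lam=0.1, iters=2000):
    """Gradient ascent on the L2-regularized conditional log-likelihood.

    X : (m, n) feature matrix, y : (m,) labels in {0, 1}.
    The zero-mean Gaussian prior on w gives the -lam * w_i term in the gradient.
    """
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend a column of 1s for w_0
    w = np.zeros(n + 1)
    for _ in range(iters):
        p = sigmoid(Xb @ w)                # P(Y=1 | x^j, w) for every example j
        grad = Xb.T @ (y - p)              # gradient of the conditional log-likelihood
        grad[1:] -= lam * w[1:]            # prior term; w_0 left unpenalized here
        w += (eta / m) * grad              # ascent step (the objective is concave)
    return w

# Toy usage: two Gaussian blobs, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w = fit_logistic_map(X, y)
p_hat = sigmoid(np.hstack([np.ones((100, 1)), X]) @ w)
print("training accuracy:", float(((p_hat > 0.5) == y).mean()))
```

Setting lam = 0 recovers the plain M(C)LE gradient ascent update; with linearly separable data and lam = 0 the weights would grow without bound, which is exactly the overfitting behaviour the regularizer is meant to prevent.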
What you should know about Logistic Regression (LR)
- Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function
- In general, NB and LR make different assumptions:
  - NB: features independent given the class, i.e. an assumption on P(X|Y)
  - LR: a functional form for P(Y|X), with no assumption on P(X|Y)
- LR is a linear classifier: the decision rule is a hyperplane
- LR is optimized by conditional likelihood: there is no closed-form solution, but the objective is concave, so gradient ascent finds the global optimum; maximum conditional a posteriori estimation corresponds to regularization
- Convergence rates: GNB usually needs less data, LR usually gets to better solutions in the limit

Linear separability
- A dataset is linearly separable iff there exists a separating hyperplane w such that:
  - w_0 + Σ_i w_i x_i > 0 if x = (x_1, ..., x_n) is a positive example
  - w_0 + Σ_i w_i x_i < 0 if x = (x_1, ..., x_n) is a negative example

Not linearly separable data
- Some datasets are not linearly separable

Addressing non-linearly separable data, Option 1: non-linear features
- Choose non-linear features, e.g.:
  - typical linear features: w_0 + Σ_i w_i x_i
  - example of non-linear features, degree-2 polynomials: w_0 + Σ_i w_i x_i + Σ_{i,j} w_ij x_i x_j
- The classifier h_w(x) is still linear in the parameters w, so it is just as easy to learn, and the data becomes linearly separable in the higher-dimensional feature space
- More discussion later this semester

Addressing non-linearly separable data, Option 2: non-linear classifier
- Choose a classifier h_w(x) that is non-linear in the parameters w, e.g. decision trees, neural networks, nearest neighbor
- More general than linear classifiers, but often harder to learn (non-convex/non-concave optimization required); still, often very useful
- BTW: later this semester we'll see that these options are not that different

A small dataset: Miles Per Gallon (MPG)
Suppose we want to predict MPG (good/bad). 40 records in total, from the UCI repository (thanks to Ross Quinlan); the records listed here are:

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe

A Decision Stump

Recursion Step
- Take the original dataset and partition it according to the value of the attribute we split on (a code sketch of this step follows):
  - records in which cylinders = 4
  - records in which cylinders = 5
  - records in which cylinders = 6
  - records in which ...
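As a rough illustration of the decision stump and the recursion step just described (this sketch is not from the lecture; the attribute names, helper functions, and the tiny dataset are made up for illustration), one can score candidate attributes by information gain and then partition the records on the winner:

```python
from collections import Counter, defaultdict
import math

def entropy(labels):
    """H(Y) for a list of class labels (e.g. 'good'/'bad' mpg)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, labels, attr):
    """IG(Y; attr) = H(Y) - sum_v P(attr = v) * H(Y | attr = v)."""
    groups = defaultdict(list)
    for rec, y in zip(records, labels):
        groups[rec[attr]].append(y)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def best_stump(records, labels, attrs):
    """Pick the attribute with the highest information gain (the decision stump),
    then partition the records by that attribute's values (the recursion step)."""
    attr = max(attrs, key=lambda a: information_gain(records, labels, a))
    partitions = defaultdict(list)
    for rec, y in zip(records, labels):
        partitions[rec[attr]].append((rec, y))
    return attr, partitions

# Toy usage on mpg-style records (values are illustrative, not the UCI data).
records = [{"cylinders": 4, "maker": "asia"},   {"cylinders": 8, "maker": "america"},
           {"cylinders": 4, "maker": "europe"}, {"cylinders": 6, "maker": "america"}]
labels = ["good", "bad", "good", "bad"]
attr, partitions = best_stump(records, labels, ["cylinders", "maker"])
print("split on:", attr)
for value, subset in partitions.items():
    print(f"  {attr} = {value}: {len(subset)} record(s)")
```

Recursing on each partition (and stopping when a partition is pure or no attributes remain) turns the stump into a full decision tree; that recursion is exactly the step the slide introduces.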