What's learning, revisited / Overfitting / Generative versus Discriminative / Logistic Regression
Machine Learning – 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
September 19th, 2007
©2005-2007 Carlos Guestrin

Bias-Variance Tradeoff
- Choice of hypothesis class introduces learning bias:
  - More complex class → less bias
  - More complex class → more variance

Training set error
- Given a dataset (training data), choose a loss function, e.g., squared error (L2) for regression.
- Training set error: for a particular set of parameters w, the loss function evaluated on the training data:
  error_train(w) = (1/N_train) Σ_{j=1..N_train} [t(x_j) − h_w(x_j)]²

[Figure: training set error as a function of model complexity]

Prediction error
- Training set error can be a poor measure of the quality of the solution.
- Prediction error: we really care about the error over all possible input points, not just the training data:
  error_true(w) = E_{x∼p(x)} [t(x) − h_w(x)]² = ∫ [t(x) − h_w(x)]² p(x) dx

[Figure: prediction error as a function of model complexity]

Computing prediction error
- Computing the prediction error requires a hard integral, and we may not know t(x) for every x.
- Monte Carlo integration (sampling approximation): sample a set of i.i.d. points x_1, …, x_M from p(x) and approximate the integral with the sample average:
  error_true(w) ≈ (1/M) Σ_{j=1..M} [t(x_j) − h_w(x_j)]²

Why training set error doesn't approximate prediction error
- Sampling approximation of prediction error: error_true(w) ≈ (1/M) Σ_{j=1..M} [t(x_j) − h_w(x_j)]²
- Training error: error_train(w) = (1/N_train) Σ_{j=1..N_train} [t(x_j) − h_w(x_j)]²
- Very similar equations! So why is the training set a bad measure of prediction error?
- Because you cheated! The training error is a good estimate for a single, fixed w, but you optimized w with respect to the training error, and found a w that is good for this particular set of samples.
- Training error is an optimistically biased estimate of prediction error.

Test set error
- Given a dataset, randomly split it into two parts: training data {x_1, …, x_Ntrain} and test data {x'_1, …, x'_Ntest}.
- Use the training data to optimize the parameters w.
- Test set error: for the final solution w*, evaluate the error on the held-out test points:
  error_test(w*) = (1/N_test) Σ_{j=1..N_test} [t(x'_j) − h_{w*}(x'_j)]²

[Figure: test set error as a function of model complexity]

Overfitting
- Overfitting: a learning algorithm overfits the training data if it outputs a solution w when there exists another solution w' such that:
  error_train(w) < error_train(w')  and  error_true(w') < error_true(w)

How many points do I use for training/testing?
- Very hard question to answer!
  - Too few training points: the learned w is bad.
  - Too few test points: you never know whether you reached a good solution.
- Bounds, such as Hoeffding's inequality, can help (more on this later this semester), but it is still hard to answer.
- Typically:
  - If you have a reasonable amount of data, pick a test set large enough for a reasonable estimate of the error, and use the rest for learning.
  - If you have little data, you need to pull out the "big guns", e.g., bootstrapping.

Error estimators
- Prediction error error_true(w), training set error error_train(w), test set error error_test(w).

[Figure: error as a function of the number of training examples, for a fixed model complexity – from little data to infinite data]

Error estimators – be careful!
- The test set error is only unbiased if you never, never, never, never do any, any, any, any learning on the test data.
- For example, if you use the test set to select the degree of the polynomial, it is no longer unbiased. We will address this problem later in the semester.
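The train/test protocol above is easy to try out. Below is a minimal numpy sketch (the synthetic data, split sizes, and function names are mine, not from the lecture): fit the polynomial coefficients w on the training split only, then compare training and test error across model complexities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: target t(x) = sin(x) plus noise, x drawn i.i.d. from p(x).
N = 30
x = rng.uniform(-3, 3, size=N)
t = np.sin(x) + 0.3 * rng.normal(size=N)

# Randomly split the dataset into training and test parts.
perm = rng.permutation(N)
train, test = perm[:20], perm[20:]

def avg_squared_error(w, xs, ts):
    """Average squared (L2) loss of the polynomial with coefficients w."""
    return np.mean((ts - np.polyval(w, xs)) ** 2)

for degree in (1, 3, 8):
    # w is optimized on the training split only -- this is the "cheating" described above.
    w = np.polyfit(x[train], t[train], deg=degree)
    err_train = avg_squared_error(w, x[train], t[train])
    err_test = avg_squared_error(w, x[test], t[test])
    print(f"degree {degree}: error_train = {err_train:.3f}, error_test = {err_test:.3f}")
```

Typically the training error keeps shrinking as the degree grows, while the held-out test error eventually rises – the optimistic bias of the training error in action.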
Announcements
- The first homework is out: a programming part and an analytic part.
- Remember the collaboration policy: you can discuss the questions, but you need to write your own solutions and code.
- Remember you are not allowed to look at previous years' solutions, search the web for solutions, use someone else's solutions, etc.
- Due Oct. 3rd, at the beginning of class. Start early!
- Recitation this week: Bayes optimal classifiers, Naïve Bayes.

What's (supervised) learning? More formally
- Given:
  - Dataset: instances {⟨x_1, t(x_1)⟩, …, ⟨x_N, t(x_N)⟩}, e.g., ⟨(GPA = 3.9, IQ = 120, MLscore = 99), 150K⟩
  - Hypothesis space H, e.g., polynomials of degree 8
  - Loss function: measures the quality of a hypothesis h ∈ H, e.g., squared error for regression
- Obtain:
  - Learning algorithm: obtain the h ∈ H that minimizes the loss function, e.g., using matrix operations for regression
  - We want to minimize prediction error, but we can only minimize the error on the dataset.

Types of (supervised) learning problems, revisited
- Regression, e.g.: dataset ⟨position, temperature⟩; hypothesis space: …; loss function: …
- Density estimation, e.g.: dataset ⟨grades⟩; hypothesis space: …; loss function: …
- Classification, e.g.: dataset ⟨brain image, verb v. noun⟩; hypothesis space: …; loss function: …

Learning is (simply) function approximation
- The general (supervised) learning problem: given some data (including features), a hypothesis space, and a loss function.
- Learning is no magic! We are simply trying to find a function that fits the data.
- Regression, density estimation, classification: not surprisingly, seemingly different problems have very similar solutions.

What is NB really optimizing?
- Naïve Bayes assumption: features are independent given the class:
  P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)
- More generally:
  P(X_1, …, X_n | Y) = Π_i P(X_i | Y)
- NB classifier:
  Y* = arg max_y P(Y = y) Π_i P(X_i = x_i | Y = y)

MLE for the parameters of NB
- Given a dataset, let Count(A = a, B = b) = number of examples where A = a and B = b.
- MLE for NB is simply counting:
  - Prior: P(Y = y) = Count(Y = y) / N
  - Likelihood: P(X_i = x_i | Y = y) = Count(X_i = x_i, Y = y) / Count(Y = y)

What is NB really optimizing? Let's use an example.

Generative v. discriminative classifiers – intuition
- Want to learn h: X → Y, where X are the features and Y the target classes.
- Bayes optimal classifier: P(Y | X).
- Generative classifier, e.g., Naïve Bayes:
  - Assume some functional form for P(X | Y) and P(Y).
  - Estimate the parameters of P(X | Y) and P(Y) directly from the training data.
  - Use Bayes rule to calculate P(Y | X = x).
  - This is a "generative" model: computation of P(Y | X) is indirect, through Bayes rule, but the model can generate a sample of the data, since P(X) = Σ_y P(y) P(X | y).
- Discriminative classifier, e.g., logistic regression:
  - Assume some functional form for P(Y | X).
  - Estimate the parameters of P(Y | X) directly from the training data.
  - This is the "discriminative" model: it directly learns P(Y | X), but it cannot generate a sample of the data, because P(X) is not available.

Logistic regression – the logistic function (or sigmoid)
- Learn P(Y | X) directly!
- Assume a particular functional form: a sigmoid applied to a linear function z of the data, where
  logistic(z) = 1 / (1 + e^(−z))
- Features can be discrete or continuous!

Understanding the sigmoid
[Figure: three sigmoid curves for different weight settings, labeled w0 = 2, w1 = 1; w0 = 0, w1 = 1; and w0 = 0, w1 = 0.5; x axis from -6 to 6, y axis from 0 to 1]
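The curves themselves do not survive in this transcript. The sketch below (mine, not from the lecture) evaluates sigmoids like the ones described, using the common convention z = w0 + w1·x with logistic(z) = 1/(1 + e^(−z)); the slide's own sign convention for w0 and w1 may differ, and the weight values are taken as listed above.

```python
import numpy as np

def logistic(z):
    """The logistic (sigmoid) function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Weight settings listed for the three panels of the slide;
# w0 shifts the curve along x, |w1| controls how sharp the transition is.
settings = [(2.0, 1.0), (0.0, 1.0), (0.0, 0.5)]

x = np.linspace(-6, 6, 7)  # same x range as the slide's plots
for w0, w1 in settings:
    p = logistic(w0 + w1 * x)  # P(Y = 1 | x) under this convention
    print(f"w0 = {w0}, w1 = {w1}:", np.round(p, 2))
```

With matplotlib available, plotting p against x for each setting reproduces curves of the kind shown on the slide.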
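Finally, to make the generative-versus-discriminative contrast concrete: the sketch below (the toy data and all names are mine, not from the lecture) fits Naïve Bayes with the counting MLE described earlier and then, because a generative model specifies P(Y) and P(X | Y), draws a synthetic sample from it – exactly what a discriminative model such as logistic regression cannot do, since it never models P(X).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dataset: two binary features X1, X2 and a binary class label Y (made up for illustration).
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 1], [0, 0]])
Y = np.array([1, 1, 0, 0, 1, 0])

# MLE for Naive Bayes is just counting:
#   prior       P(Y = y)          = Count(Y = y) / N
#   likelihood  P(Xi = 1 | Y = y) = Count(Xi = 1, Y = y) / Count(Y = y)
classes = np.unique(Y)
prior = {y: np.mean(Y == y) for y in classes}
likelihood = {y: X[Y == y].mean(axis=0) for y in classes}  # P(Xi = 1 | Y = y), one entry per feature

def predict(x):
    """NB classifier: argmax_y P(y) * prod_i P(xi | y)."""
    def score(y):
        p_xi = np.where(x == 1, likelihood[y], 1.0 - likelihood[y])
        return prior[y] * np.prod(p_xi)
    return max(classes, key=score)

def sample():
    """Generate one synthetic example -- possible only because we model P(Y) and P(X | Y)."""
    y = rng.choice(classes, p=[prior[c] for c in classes])
    x = (rng.random(X.shape[1]) < likelihood[y]).astype(int)
    return x, y

print("prediction for [1, 0]:", predict(np.array([1, 0])))
print("generated sample:", sample())
```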