Machine Learning 10-701/15-781, Spring 2008

Overfitting and Model Selection

Eric Xing

Lecture 13, February 27, 2008
Reading: Chap. 1, 2 CB & Chap. 5, 6 TM

Outline

- Overfitting
  - Instance-based learning
  - Regression
- Bias-variance decomposition
- The battle against overfitting: each learning algorithm has some free knobs that one can tune (i.e., "hack") to make the algorithm generalize better to test data. But is there a more principled way?
  - Cross validation
  - Regularization
  - Model selection (Occam's razor)
  - Model averaging
- The Bayesian-frequentist debate
  - Bayesian learning: weight models by their posterior probabilities

Recall: Vector Space Representation

- Each document is a vector, one component for each term (word):

              Doc 1   Doc 2   Doc 3
    Word 1      3       0       0
    Word 2      0       8       1
    Word 3     12       1      10
    ...         0       1       3
    ...         0       0       0

- Normalize each vector to unit length.
- High-dimensional vector space:
  - Terms are axes: 10,000 dimensions, or even 100,000.
  - Docs are vectors in this space.

Classes in a Vector Space
[Figure: Sports, Science, Arts]

Test Document
[Figure: Sports, Science, Arts]

K Nearest Neighbor (kNN) classifier
[Figure: Sports, Science, Arts]

Nearest Neighbor Learning Algorithm

- Learning is just storing the representations of the training examples in D.
- Testing instance x:
  - Compute the similarity between x and all examples in D.
  - Assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes.
- Also called:
  - Case-based learning
  - Memory-based learning
  - Lazy learning

Overfitting
[Figure]

Another example: Regression
[Figure]

Overfitting, cont'd
- The models [figure]
- Test errors [figure]

What is a good model?
[Figure: low quality vs. high robustness vs. low robustness; legend: robust model, model built, known data, new data]

Bias-variance decomposition

- Now let's look more closely into two sources of error in a functional approximator.
- In the following we show the bias-variance decomposition using LR (linear regression) as an example.

Loss functions for regression

- Let t be the true target output and y(x) be our estimate. The expected squared loss is

    E[L] = \int\int L(t, y(x)) \, p(x,t) \, dx \, dt = \int\int (y(x) - t)^2 \, p(x,t) \, dx \, dt

- Our goal is to choose the y(x) that minimizes E[L].
- Calculus of variations:

    \frac{\delta E[L]}{\delta y(x)} = 2 \int (y(x) - t) \, p(x,t) \, dt = 0

    \Rightarrow y(x) \int p(x,t) \, dt = \int t \, p(x,t) \, dt

    \Rightarrow y(x) = \frac{\int t \, p(x,t) \, dt}{p(x)} = \int t \, p(t|x) \, dt = E_t[t|x]

Expected loss

- Let h(x) = E[t|x] be the optimal predictor, and y(x) our actual predictor, which will incur the following expected loss:

    E[(y(x) - t)^2] = \int\int (y(x) - h(x) + h(x) - t)^2 \, p(x,t) \, dx \, dt
                    = \int\int [ (y(x) - h(x))^2 + 2(y(x) - h(x))(h(x) - t) + (h(x) - t)^2 ] \, p(x,t) \, dx \, dt
                    = \int (y(x) - h(x))^2 \, p(x) \, dx + \int\int (h(x) - t)^2 \, p(x,t) \, dx \, dt

  where the cross term vanishes because \int (h(x) - t) \, p(x,t) \, dt = (h(x) - E[t|x]) \, p(x) = 0. (Note: there is an error on pp. 47.)

- \int\int (h(x) - t)^2 \, p(x,t) \, dx \, dt is a noise term and we can do no better than this; thus it is a lower bound on the expected loss.
- The other part of the error comes from \int (y(x) - h(x))^2 \, p(x) \, dx, so let's take a closer look at it.
- We will assume y(x) = y(x; w) is a parametric model whose parameters w are fit to a training set D; we therefore write y(x; D).

Bias-variance decomposition

- For one data set D and one test point x: since the predictor y depends on the training data D, write E_D[y(x; D)] for the expected predictor over the ensemble of data sets. Then, using the same trick as before,

    (y(x;D) - h(x))^2 = (y(x;D) - E_D[y(x;D)] + E_D[y(x;D)] - h(x))^2
                      = (y(x;D) - E_D[y(x;D)])^2 + (E_D[y(x;D)] - h(x))^2
                        + 2 (y(x;D) - E_D[y(x;D)]) (E_D[y(x;D)] - h(x))

- Surely this error term depends on the training data, so we take an expectation over training sets (the cross term then vanishes):

    E_D[(y(x;D) - h(x))^2] = (E_D[y(x;D)] - h(x))^2 + E_D[(y(x;D) - E_D[y(x;D)])^2]
                           = bias^2 + variance

- Putting things together (illustrated numerically in the sketch below):

    expected loss = bias^2 + variance + noise
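To make the decomposition concrete, here is a minimal numerical sketch (not from the original slides): it fits a regularized linear regression with Gaussian basis functions to many independently simulated training sets and estimates bias^2 and variance by averaging over those fits. The sinusoidal target h(x), the noise level, the basis functions, and the choice of 100 data sets are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # "True" optimal predictor h(x) = E[t|x]; the sinusoid is an illustrative choice.
    return np.sin(2 * np.pi * x)

def design(x, centers, s=0.1):
    # Gaussian basis functions plus a bias column.
    Phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / s) ** 2)
    return np.hstack([np.ones((len(x), 1)), Phi])

def fit_ridge(Phi, t, lam):
    # Regularized least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t
    d = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ t)

L, N, noise_sd = 100, 25, 0.3              # 100 training sets of 25 points each
centers = np.linspace(0, 1, 9)
x_test = np.linspace(0, 1, 200)
Phi_test = design(x_test, centers)

for lam in [1e-3, 1e-1, 10.0]:             # small lam: complex model; large lam: simple model
    preds = np.empty((L, len(x_test)))
    for l in range(L):
        x = rng.uniform(0, 1, N)
        t = h(x) + rng.normal(0, noise_sd, N)
        w = fit_ridge(design(x, centers), t, lam)
        preds[l] = Phi_test @ w            # y(x; D_l) for the l-th training set
    avg_pred = preds.mean(axis=0)          # estimate of E_D[y(x; D)]
    bias2 = np.mean((avg_pred - h(x_test)) ** 2)    # (E_D[y] - h)^2 averaged over x
    variance = np.mean(preds.var(axis=0))           # E_D[(y - E_D[y])^2] averaged over x
    print(f"lambda={lam:g}  bias^2={bias2:.4f}  variance={variance:.4f}  "
          f"bias^2+variance={bias2 + variance:.4f}")
```

As the regularization parameter lam grows, the estimated bias^2 should rise while the variance falls, which is the tradeoff discussed next.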
Regularized Regression

Bias-variance tradeoff

- λ is a regularization term in LR; the smaller λ is, the more complex the model (why?).
- Simple (highly regularized) models have low variance but high bias.
- Complex models have low bias but high variance.
- You are inspecting an empirical average over 100 training sets.
- The actual E_D cannot be computed.

Bias^2 + variance vs. regularizer

- Bias^2 + variance predicts the shape of the test error quite well.
- However, bias and variance cannot be computed, since they rely on knowing the true distribution of x and t (and hence h(x) = E[t|x]).

The battle against overfitting

Model Selection

- Suppose we are trying to select among several different models for a learning problem.
- Examples:
  1. Polynomial regression: h_\theta(x) = g(\theta_0 + \theta_1 x + \theta_2 x^2 + ... + \theta_k x^k)
     Model selection: we wish to automatically and objectively decide whether k should be, say, 0, 1, or 10.
  2. Locally weighted regression
     Model selection: we want to automatically choose the bandwidth parameter.
  3. Mixture models and hidden Markov models
     Model selection: we want to decide the number of hidden states.

The Problem

- Given a model family F = {M_1, M_2, ..., M_I}, find M_i \in F s.t.

    M_i = \arg\max_{M \in F} J(D; M)

Cross Validation

- We are given training data D and test data D_test, and we would like to fit this data with a model p_i(x; θ) from the family F (e.g., an LR), which is indexed by i and parameterized by θ.
- K-fold cross-validation (CV); a code sketch follows after the example below:
  - Set aside αN samples of D (where N = |D|). This is known as the held-out data and will be used to evaluate different values of i.
  - For each candidate model i, fit the optimal hypothesis p_i(x; θ*) to the remaining (1 - α)N samples in D (i.e., hold i fixed and find the best θ).
  - Evaluate each model p_i(x; θ*) on the held-out data using some pre-specified risk function.
  - Repeat the above K times, choosing a different held-out data set each time; the scores are averaged for each model p_i over all held-out data sets. This gives an estimate of the risk curve of the models over different i.
  - For the model with the lowest risk, say p_i*, we use all of D to find the parameter values for p_i*(x; θ).

Example

- When α = 1/N, the algorithm is known as Leave-One-Out Cross-Validation (LOOCV).
- MSE_LOOCV(M1) = 2.12, MSE_LOOCV(M2) = 0.962
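The following is a minimal sketch of the K-fold procedure described above (it is not the lecture's code). The candidate models M_i are polynomial regressions of increasing degree and the pre-specified risk function is mean squared error; the toy data, the degrees tried, and K = 5 are illustrative assumptions.

```python
import numpy as np

def kfold_cv_risk(x, t, degree, K=5, seed=0):
    """Estimate the held-out MSE of a degree-`degree` polynomial model via K-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))                  # shuffle once, then split into K folds
    folds = np.array_split(idx, K)
    risks = []
    for k in range(K):
        held_out = folds[k]                        # evaluation fold for this round
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(x[train], t[train], degree)   # fit on the remaining samples
        pred = np.polyval(coeffs, x[held_out])
        risks.append(np.mean((pred - t[held_out]) ** 2))  # pre-specified risk: MSE
    return float(np.mean(risks))                   # average score over the K held-out sets

# Toy data; with K = len(x) each fold holds one point and this reduces to LOOCV.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 40)
t = 1.5 * x - 0.8 * x**2 + rng.normal(0, 0.2, len(x))

for d in [0, 1, 2, 5, 9]:                          # candidate models M_i: polynomial degrees
    print(f"degree {d}: CV risk = {kfold_cv_risk(x, t, d):.4f}")
```

The degree with the lowest averaged risk would then be refit on all of D, mirroring the last step of the procedure.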
Practical issues …