Machine Learning 10-701/15-781, Fall 2006
Overfitting and Model Selection
Eric Xing
Lecture 7, October 3, 2006
Reading: Chap. 1-2, CB; Chap. 5-6, TM

Outline
- Overfitting
- Instance-based learning
- Regression
- Bias-variance decomposition
- The battle against overfitting: each learning algorithm has some free knobs that one can tune (i.e., "hack") to make the algorithm generalize better to test data. But is there a more principled way?
  - Cross validation
  - Regularization
  - Model selection (Occam's razor)
  - Model averaging
  - The Bayesian-frequentist debate
  - Bayesian learning: weight models by their posterior probabilities

Recall: Vector Space Representation
- Each document is a vector, one component for each term (word), e.g.:

            Doc 1   Doc 2   Doc 3
  Word 1      3       0       0
  Word 2      0       8       1
  Word 3     12       1      10
  ...         0       1       3
  ...         0       0       0

- Normalize each vector to unit length.
- High-dimensional vector space:
  - Terms are axes: 10,000 dimensions, or even 100,000.
  - Docs are vectors in this space.

Classes in a Vector Space
[Figure: documents in the vector space, grouped into the classes Sports, Science, Arts]

Test Document
[Figure: a test document placed in the vector space among the classes Sports, Science, Arts]

K-Nearest-Neighbor (kNN) Classifier
[Figure: kNN classification of the test document among Sports, Science, Arts]

kNN Is an Instance of Instance-Based Learning
What makes an instance-based learner:
- A distance metric
- How many nearby neighbors to look at
- A weighting function (optional)
- How to relate to the local points

Euclidean Distance Metric
- $D(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_i (x_i - x'_i)^2}$
- Or equivalently, $D(\mathbf{x}, \mathbf{x}') = \sqrt{(\mathbf{x} - \mathbf{x}')^T (\mathbf{x} - \mathbf{x}')}$
- Other metrics:
  - L1 norm: $\sum_i |x_i - x'_i|$
  - L-infinity norm: $\max_i |x_i - x'_i|$ (elementwise)
  - Mahalanobis: $\sqrt{(\mathbf{x} - \mathbf{x}')^T \Sigma^{-1} (\mathbf{x} - \mathbf{x}')}$, where $\Sigma$ is full and symmetric
  - Correlation
  - Angle
  - Hamming distance, Manhattan distance
  - ...

1-, 2-, 3-, and 5-Nearest-Neighbor Classifiers
[Figures: the test document classified using 1, 2, 3, and 5 nearest neighbors among Sports, Science, Arts]

Nearest Neighbor Learning Algorithm
- Learning is just storing the representations of the training examples in D.
- Testing instance x:
  - Compute the similarity between x and all examples in D.
  - Assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes.
- Also called:
  - Case-based learning
  - Memory-based learning
  - Lazy learning
(A minimal code sketch of this procedure appears below.)

kNN Is Close to Optimal
- Cover and Hart (1967).
- Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate (the error rate of a classifier that knows the model that generated the data).
- In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
- Decision boundary [figure].
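To make the lazy-learning procedure concrete, here is a minimal sketch of a kNN classifier (my illustration, not code from the lecture): Euclidean distance plus an unweighted majority vote, in plain NumPy. The function name knn_predict and the toy class labels are invented for illustration.

```python
# Minimal k-nearest-neighbor classifier sketch (illustrative only, not from the
# lecture). Assumes documents are already fixed-length numeric vectors (e.g.,
# term-count vectors) and uses Euclidean distance with an unweighted majority vote.
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of a single query vector x from its k nearest neighbors."""
    # Euclidean distances D(x, x') = sqrt((x - x')^T (x - x')) to every stored example
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k closest training examples ("lazy" learning: nothing was fit)
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage: three tiny "documents" in a 3-term vector space
# (counts taken from the table above; the labels are made up for illustration)
X_train = np.array([[3.0, 0.0, 12.0],
                    [0.0, 8.0, 1.0],
                    [0.0, 1.0, 10.0]])
y_train = ["Sports", "Science", "Science"]
print(knn_predict(X_train, y_train, np.array([0.0, 2.0, 9.0]), k=1))
```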
Overfitting: Another Example
- Regression [figure]

Overfitting, cont'd
- The models [figure]
- Test errors [figure]

Bias-Variance Decomposition
- Now let's look more closely into the two sources of error in a function approximator.
- In the following we show the bias-variance decomposition using LR (linear regression) as an example.

Loss Functions for Regression
- Let t be the true (target) output and y(x) be our estimate. The expected squared loss is
  $E[L] = \iint L(t, y(\mathbf{x})) \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt = \iint (y(\mathbf{x}) - t)^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt$
- Our goal is to choose the y(x) that minimizes E[L].
- Calculus of variations:
  $\frac{\delta E[L]}{\delta y(\mathbf{x})} = 2 \int (y(\mathbf{x}) - t) \, p(\mathbf{x}, t) \, dt = 0$
  $\Rightarrow\; y(\mathbf{x}) \int p(\mathbf{x}, t) \, dt = \int t \, p(\mathbf{x}, t) \, dt$
  $\Rightarrow\; y(\mathbf{x}) = \frac{\int t \, p(\mathbf{x}, t) \, dt}{p(\mathbf{x})} = \int t \, p(t \mid \mathbf{x}) \, dt = E_t[t \mid \mathbf{x}]$

Expected Loss
- Let $h(\mathbf{x}) = E[t \mid \mathbf{x}]$ be the optimal predictor and $y(\mathbf{x})$ our actual predictor, which will incur the following expected loss:
  $E[(y(\mathbf{x}) - t)^2] = \iint \{ y(\mathbf{x}) - h(\mathbf{x}) + h(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt$
  $= \iint \{ (y(\mathbf{x}) - h(\mathbf{x}))^2 + 2 (y(\mathbf{x}) - h(\mathbf{x}))(h(\mathbf{x}) - t) + (h(\mathbf{x}) - t)^2 \} \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt$
  $= \int (y(\mathbf{x}) - h(\mathbf{x}))^2 \, p(\mathbf{x}) \, d\mathbf{x} + \iint (h(\mathbf{x}) - t)^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt$
  since the cross term vanishes when $h(\mathbf{x}) = E[t \mid \mathbf{x}]$. (There is an error on p. 47.)
- $\iint (h(\mathbf{x}) - t)^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt$ is a noise term, and we can do no better than this; it is a lower bound on the expected loss.
- The other part of the error comes from $\int (y(\mathbf{x}) - h(\mathbf{x}))^2 \, p(\mathbf{x}) \, d\mathbf{x}$, so let's take a closer look at it.
- We will assume $y(\mathbf{x}) = y(\mathbf{x}, \mathbf{w})$ is a parametric model whose parameters $\mathbf{w}$ are fit to a training set D; we therefore write $y(\mathbf{x}; D)$.

Bias-Variance Decomposition (cont'd)
- Consider one data set D and one test point x.
- Since the predictor y depends on the training data D, write $E_D[y(\mathbf{x}; D)]$ for the expected predictor over the ensemble of data sets. Then, using the same trick as above,
  $\{ y(\mathbf{x}; D) - h(\mathbf{x}) \}^2 = \{ y(\mathbf{x}; D) - E_D[y(\mathbf{x}; D)] + E_D[y(\mathbf{x}; D)] - h(\mathbf{x}) \}^2$
  $= \{ y(\mathbf{x}; D) - E_D[y(\mathbf{x}; D)] \}^2 + \{ E_D[y(\mathbf{x}; D)] - h(\mathbf{x}) \}^2 + 2 \{ y(\mathbf{x}; D) - E_D[y(\mathbf{x}; D)] \} \{ E_D[y(\mathbf{x}; D)] - h(\mathbf{x}) \}$
- This error term depends on the particular training data, so we take an expectation over training sets:
  $E_D[\{ y(\mathbf{x}; D) - h(\mathbf{x}) \}^2] = \{ E_D[y(\mathbf{x}; D)] - h(\mathbf{x}) \}^2 + E_D[\{ y(\mathbf{x}; D) - E_D[y(\mathbf{x}; D)] \}^2]$
- Putting things together:
  expected loss = (bias)^2 + variance + noise

Regularized Regression
[Slide: regularized (ridge) linear regression; details not recovered from the source]

Bias-Variance Tradeoff
- $\lambda$ is the regularization term in LR; the smaller $\lambda$ is, the more complex the model (why?).
- Simple (highly regularized) models have low variance but high bias.
- Complex models have low bias but high variance.
- You are inspecting an empirical average over 100 training sets; the actual $E_D$ cannot be computed. (A simulation sketch of such an empirical average appears at the very end of these notes.)

Bias^2 + Variance vs. Regularizer
- Bias^2 + variance predicts the shape of the test error quite well.
- However, bias and variance cannot be computed in practice, since they rely on knowing the true distribution of x and t (and hence $h(\mathbf{x}) = E[t \mid \mathbf{x}]$).

The Battle Against Overfitting: Model Selection
- Suppose we are trying to select among several different models for a learning problem. Examples:
  1. Polynomial regression, $h_\theta(x) = g(\theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_k x^k)$.
     Model selection: we wish to automatically and objectively decide whether k should be, say, 0, 1, or 10.
  2. Locally weighted regression.
     Model selection: we want to automatically choose the bandwidth parameter.
  3. Mixture models and hidden Markov models.
     Model selection: we want to decide the number of hidden states.

The Problem
- Given a model family $\mathcal{F} = \{ M_1, M_2, \ldots, M_I \}$, find $M_i \in \mathcal{F}$ such that
  $M_i = \arg\max_{M \in \mathcal{F}} J(D; M)$

Cross Validation
- We are given training data D and test data D_test, and we would like to fit the data with a model $p_i(x; \theta)$ from the family $\mathcal{F}$ (e.g., an LR), which is indexed by i and parameterized by $\theta$.
- K-fold cross-validation (CV):
  - Set ...
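As a concrete illustration of K-fold CV used for model selection, here is a minimal sketch (my illustration, not the lecture's procedure): it chooses a polynomial degree by average held-out squared error, with NumPy's polyfit/polyval standing in for the regression-fitting step. The function name kfold_cv_error and the synthetic data are invented for this example.

```python
# Minimal K-fold cross-validation sketch (illustrative, not from the lecture).
# Selects a polynomial degree for 1-D regression by average held-out squared error.
import numpy as np

def kfold_cv_error(x, t, degree, K=5):
    """Average held-out squared error of a degree-`degree` polynomial under K-fold CV."""
    idx = np.random.permutation(len(x))      # shuffle, then split into K folds
    folds = np.array_split(idx, K)
    errs = []
    for i in range(K):
        val = folds[i]                                            # held-out fold
        trn = np.concatenate([folds[j] for j in range(K) if j != i])
        w = np.polyfit(x[trn], t[trn], degree)                    # fit on the other K-1 folds
        errs.append(np.mean((np.polyval(w, x[val]) - t[val]) ** 2))
    return np.mean(errs)

# Toy usage: pick the degree with the smallest CV error on synthetic data
np.random.seed(0)
x = np.random.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + np.random.normal(0, 0.3, 40)
best_degree = min(range(8), key=lambda d: kfold_cv_error(x, t, d))
print("selected polynomial degree:", best_degree)
```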
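Returning to the bias-variance tradeoff above: the "empirical average over 100 training sets" can be mimicked on synthetic data, where $h(\mathbf{x})$ is known. The sketch below (my illustration, assuming a sinusoidal generating process with Gaussian noise; not the lecture's experiment) fits a ridge-regularized polynomial to each of 100 sampled training sets and approximates $E_D[y(\mathbf{x}; D)]$, the squared bias, and the variance.

```python
# Bias-variance simulation sketch (illustrative; not the lecture's plots).
# Because the generating process is synthetic, h(x) = sin(2*pi*x) is known, so
# squared bias and variance can be approximated by averaging over training sets.
import numpy as np

def ridge_fit_predict(x_train, t_train, x_test, degree=9, lam=1e-3):
    """Fit ridge-regularized polynomial regression and predict at x_test."""
    Phi = np.vander(x_train, degree + 1)                     # polynomial features
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ t_train)
    return np.vander(x_test, degree + 1) @ w

np.random.seed(0)
x_test = np.linspace(0, 1, 25)
h = np.sin(2 * np.pi * x_test)                 # the optimal predictor h(x) = E[t|x]

preds = []                                     # y(x; D) for 100 sampled training sets D
for _ in range(100):
    x_d = np.random.uniform(0, 1, 25)
    t_d = np.sin(2 * np.pi * x_d) + np.random.normal(0, 0.3, 25)
    preds.append(ridge_fit_predict(x_d, t_d, x_test))
preds = np.array(preds)

y_bar = preds.mean(axis=0)                     # empirical E_D[y(x; D)]
bias2 = np.mean((y_bar - h) ** 2)              # (bias)^2, averaged over test points
variance = np.mean(preds.var(axis=0))          # variance, averaged over test points
print(f"bias^2 ~ {bias2:.4f}   variance ~ {variance:.4f}")
```

Varying the assumed penalty lam in this sketch reproduces the qualitative tradeoff described above: large lam (heavy regularization) raises the bias term, while small lam raises the variance term.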