Machine Learning 10-701/15-781, Fall 2006
Overfitting and Model Selection
Eric Xing, Lecture 7, October 3, 2006
Reading: Chap. 1-2, CB; Chap. 5-6, TM

Outline
- Overfitting
- Instance-based learning
- Regression
- Bias-variance decomposition
- The battle against overfitting: each learning algorithm has some free knobs that one can tune (i.e., hack) to make the algorithm generalize better to test data. But is there a more principled way?
  - Cross validation
  - Regularization
  - Model selection (Occam's razor)
  - Model averaging
  - The Bayesian-frequentist debate
  - Bayesian learning: weight models by their posterior probabilities

Recall: Vector Space Representation
- Each document is a vector, one component for each term (word):

             Doc 1   Doc 2   Doc 3
    Word 1     3       0       0
    Word 2     0       8       1
    Word 3    12       1      10
    ...        0       1       3
    ...        0       0       0

- Normalize to unit length.
- High-dimensional vector space:
  - Terms are axes: 10,000 dimensions, or even 100,000.
  - Docs are vectors in this space.

[Figure: classes in a vector space - Sports, Science, Arts]
[Figure: a test document placed among the Sports, Science, and Arts classes]
[Figure: K-nearest-neighbor (kNN) classification of the test document]

kNN Is an Instance of Instance-Based Learning
- What makes an instance-based learner:
  - A distance metric
  - How many nearby neighbors to look at
  - A weighting function (optional)
  - How to relate to the local points

Euclidean Distance Metric
- $D(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_i \sigma_i^2 (x_i - x_i')^2}$, or equivalently $D(\mathbf{x}, \mathbf{x}') = \sqrt{(\mathbf{x} - \mathbf{x}')^\top \Sigma (\mathbf{x} - \mathbf{x}')}$ with $\Sigma$ diagonal.
- Other metrics:
  - $L_1$ norm: $\sum_i |x_i - x_i'|$
  - $L_\infty$ norm: $\max_i |x_i - x_i'|$ (elementwise)
  - Mahalanobis distance (where $\Sigma$ is full and symmetric)
  - Correlation
  - Angle
  - Hamming distance, Manhattan distance

[Figures: 1-, 2-, 3-, and 5-nearest-neighbor classification of the Sports, Science, and Arts documents]

Nearest Neighbor Learning Algorithm
- Learning is just storing the representations of the training examples in D.
- Testing on an instance x:
  - Compute the similarity between x and all examples in D.
  - Assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes.
- Also called:
  - Case-based learning
  - Memory-based learning
  - Lazy learning

kNN Is Close to Optimal
- Cover and Hart (1967): asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate (the error rate of a classifier that knows the model that generated the data).
- In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
- Decision boundary. [Figure]
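Not part of the original slides, but as a concrete sketch of the "store, then compare" recipe above: a minimal kNN classifier using plain Euclidean distance and a majority vote over the k closest stored examples. The class name and the toy document vectors are made up for illustration.

```python
import numpy as np
from collections import Counter

class KNNClassifier:
    """k-nearest-neighbor classifier: training just stores the data ("lazy learning")."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Learning is just storing the training examples.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, X_test):
        preds = []
        for x in np.asarray(X_test, dtype=float):
            # Euclidean distance from x to every stored example.
            dists = np.sqrt(((self.X - x) ** 2).sum(axis=1))
            nearest = np.argsort(dists)[: self.k]
            # Majority vote among the k nearest neighbors.
            preds.append(Counter(self.y[nearest]).most_common(1)[0][0])
        return np.array(preds)

# Toy usage on unit-normalized "document" vectors.
X = np.array([[3, 0, 12], [0, 8, 1], [0, 1, 10]], dtype=float)
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize to unit length
y = np.array(["Sports", "Science", "Arts"])
print(KNNClassifier(k=1).fit(X, y).predict(X))  # -> the stored training labels
```

With unit-length vectors, Euclidean distance and cosine similarity give the same neighbor ranking, which is why the "normalize to unit length" step above is harmless here.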
Overfitting: Another Example
- Regression. [Figure]

Overfitting (cont'd)
- The models [Figure]
- Test errors [Figure]

Bias-Variance Decomposition
- Now let's look more closely into two sources of error in a function approximator.
- In the following we show the bias-variance decomposition using LR as an example.

Loss Functions for Regression
- Let $t$ be the true target output and $y(\mathbf{x})$ be our estimate. The expected squared loss is
  $$E[L] = \iint L(t, y(\mathbf{x}))\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt = \iint \big(y(\mathbf{x}) - t\big)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$
- Our goal is to choose $y(\mathbf{x})$ to minimize $E[L]$.
- Calculus of variations:
  $$\frac{\partial E[L]}{\partial y(\mathbf{x})} = 2 \int \big(y(\mathbf{x}) - t\big)\, p(\mathbf{x}, t)\, dt = 0$$
  $$\Rightarrow\quad y(\mathbf{x}) \int p(\mathbf{x}, t)\, dt = \int t\, p(\mathbf{x}, t)\, dt$$
  $$\Rightarrow\quad y(\mathbf{x}) = \frac{\int t\, p(\mathbf{x}, t)\, dt}{p(\mathbf{x})} = \int t\, p(t \mid \mathbf{x})\, dt = E_t[t \mid \mathbf{x}]$$

Expected Loss
- Let $h(\mathbf{x}) = E[t \mid \mathbf{x}]$ be the optimal predictor, and $y(\mathbf{x})$ our actual predictor, which will incur the following expected loss:
  $$E\big[(y(\mathbf{x}) - t)^2\big] = \iint \big(y(\mathbf{x}) - h(\mathbf{x}) + h(\mathbf{x}) - t\big)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$
  $$= \iint \Big[\big(y(\mathbf{x}) - h(\mathbf{x})\big)^2 + 2\big(y(\mathbf{x}) - h(\mathbf{x})\big)\big(h(\mathbf{x}) - t\big) + \big(h(\mathbf{x}) - t\big)^2\Big] p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$
  $$= \int \big(y(\mathbf{x}) - h(\mathbf{x})\big)^2 p(\mathbf{x})\, d\mathbf{x} + \iint \big(h(\mathbf{x}) - t\big)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt$$
  (the cross term vanishes when integrated over $t$, since $h(\mathbf{x}) = E[t \mid \mathbf{x}]$). (Note: there is an error on p. 47.)
- The term $\iint (h(\mathbf{x}) - t)^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$ is a noise term, and we can do no better than this; thus it is a lower bound on the expected loss.
- The other part of the error comes from $\int (y(\mathbf{x}) - h(\mathbf{x}))^2\, p(\mathbf{x})\, d\mathbf{x}$; let's take a closer look at it.
- We will assume $y(\mathbf{x}) = y(\mathbf{x}; \mathbf{w})$ is a parametric model and the parameters $\mathbf{w}$ are fit to a training set $D$; thus we write $y(\mathbf{x}; D)$.

Bias-Variance Decomposition
- Consider one data set $D$ and one test point $\mathbf{x}$. Since the predictor $y$ depends on the training data $D$, write $E_D[y(\mathbf{x}; D)]$ for the expected predictor over the ensemble of data sets; then, using the same trick, we have
  $$\big(y(\mathbf{x}; D) - h(\mathbf{x})\big)^2 = \big(y(\mathbf{x}; D) - E_D[y(\mathbf{x}; D)] + E_D[y(\mathbf{x}; D)] - h(\mathbf{x})\big)^2$$
  $$= \big(y(\mathbf{x}; D) - E_D[y(\mathbf{x}; D)]\big)^2 + \big(E_D[y(\mathbf{x}; D)] - h(\mathbf{x})\big)^2 + 2\big(y(\mathbf{x}; D) - E_D[y(\mathbf{x}; D)]\big)\big(E_D[y(\mathbf{x}; D)] - h(\mathbf{x})\big)$$
- Surely this error term depends on the training data, so we take an expectation over it:
  $$E_D\Big[\big(y(\mathbf{x}; D) - h(\mathbf{x})\big)^2\Big] = \big(E_D[y(\mathbf{x}; D)] - h(\mathbf{x})\big)^2 + E_D\Big[\big(y(\mathbf{x}; D) - E_D[y(\mathbf{x}; D)]\big)^2\Big]$$
- Putting things together:
  expected loss = (bias)$^2$ + variance + noise

Regularized Regression
[Figure: regularized regression]

Bias-Variance Tradeoff
- $\lambda$ is a regularization term in LR; the smaller $\lambda$ is, the more complex the model (why?).
- Simple (highly regularized) models have low variance but high bias.
- Complex models have low bias but high variance.
- You are inspecting an empirical average over 100 training sets.
- The actual $E_D$ cannot be computed.

Bias$^2$ + Variance vs. the Regularizer
- Bias$^2$ + variance predicts the shape of the test error quite well.
- However, bias and variance cannot be computed, since they rely on knowing the true distribution of $\mathbf{x}$ and $t$ (and hence $h(\mathbf{x}) = E[t \mid \mathbf{x}]$).
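The "empirical average over 100 training sets" can be mimicked with a short simulation. The sketch below is not from the lecture; the true function, noise level, polynomial features, and λ grid are assumptions chosen for illustration. It fits ridge-regularized (L2) linear regression to many sampled training sets and reports the estimated bias² and variance at a grid of test points.

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)   # assumed h(x) = E[t|x]
noise_sd = 0.3                              # assumed noise level
x_test = np.linspace(0, 1, 50)

def features(x, degree=9):
    # Polynomial basis expansion (assumed model class).
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(x, t, lam):
    # Ridge regression: w = (Phi^T Phi + lam*I)^{-1} Phi^T t
    Phi = features(x)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

for lam in (1e-6, 1e-2, 10.0):    # small lambda = complex model, large = simple
    preds = []
    for _ in range(100):           # 100 training sets, as on the slide
        x = rng.uniform(0, 1, 25)
        t = true_fn(x) + rng.normal(0, noise_sd, x.size)
        preds.append(features(x_test) @ fit_ridge(x, t, lam))
    preds = np.array(preds)                    # shape: (100, n_test)
    mean_pred = preds.mean(axis=0)             # estimate of E_D[y(x; D)]
    bias2 = np.mean((mean_pred - true_fn(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"lambda={lam:g}: bias^2={bias2:.3f}, variance={variance:.3f}")
```

With these (arbitrary) settings, the variance term should shrink and the bias² term should grow as λ increases, tracing out the tradeoff described above; the decomposition is only computable here because the simulation knows the true h(x).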


The Battle Against Overfitting: Model Selection
- Suppose we are trying to select among several different models for a learning problem.
- Examples:
  1. Polynomial regression, $h_\theta(x) = g(\theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_k x^k)$.
     Model selection: we wish to automatically and objectively decide whether $k$ should be, say, 0, 1, or 10.
  2. Locally weighted regression.
     Model selection: we want to automatically choose the bandwidth parameter.
  3. Mixture models and hidden Markov models.
     Model selection: we want to decide the number of hidden states.

The Problem
- Given a model family $F = \{M_1, M_2, \ldots, M_I\}$, find $M_i \in F$ such that
  $$M_i = \arg\max_{M \in F} J(D; M)$$

Cross Validation
- We are given training data $D$ and test data $D_{\text{test}}$, and we would like to fit this data with a model $p_i(x; \theta)$ from the family $F$ (e.g., an LR), which is indexed by $i$ and parameterized by $\theta$.
- K-fold cross-validation (CV):
  - Set …
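A generic K-fold cross-validation sketch (not from the slides), assuming the usual recipe: split D into K folds, train on K-1 of them, score on the held-out fold, and average the K validation scores. The `k_fold_cv` helper, the squared-error score, and the ridge model reused from the earlier sketch are all illustrative assumptions.

```python
import numpy as np

def k_fold_cv(x, t, fit, score, K=5, seed=0):
    """Average held-out score of `fit` over K folds of (x, t)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))              # shuffle once, then split into K folds
    folds = np.array_split(idx, K)
    scores = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(x[train], t[train])        # train on the other K-1 folds
        scores.append(score(model, x[val], t[val]))
    return float(np.mean(scores))

# Placeholder data and model: choose the ridge penalty lambda by 5-fold CV.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

def make_fit(lam, degree=9):
    phi = lambda x: np.vander(x, degree + 1, increasing=True)
    def fit(x_tr, t_tr):
        w = np.linalg.solve(phi(x_tr).T @ phi(x_tr) + lam * np.eye(degree + 1),
                            phi(x_tr).T @ t_tr)
        return phi, w
    return fit

mse = lambda model, x_v, t_v: float(np.mean((model[0](x_v) @ model[1] - t_v) ** 2))
for lam in (1e-6, 1e-3, 1e-1, 10.0):
    print(f"lambda={lam:g}: 5-fold CV MSE = {k_fold_cv(x, t, make_fit(lam), mse):.4f}")
```

Picking the λ (or the polynomial order k, the bandwidth, the number of hidden states, ...) with the best average held-out score is one concrete way to make the "automatic and objective" choice the slides ask for.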
