Machine Learning 10-701/15-781, Spring 2008
Overfitting and Model Selection
Eric Xing
Reading: Chap. 1-2, CB & Chap. 5-6, TM
Lecture 13, February 27, 2008

Outline

- Overfitting
  - Instance-based learning
  - Regression
- Bias-variance decomposition
- The battle against overfitting: each learning algorithm has some free knobs that one can tune (i.e., hack) to make the algorithm generalize better to test data. But is there a more principled way?
  - Cross validation
  - Regularization
  - Model selection (Occam's razor)
  - Model averaging
- The Bayesian-frequentist debate
  - Bayesian learning: weight models by their posterior probabilities

Recall: Vector Space Representation

- Each document is a vector, one component for each term (= word):

             Doc 1   Doc 2   Doc 3
    Word 1     3       0       0
    Word 2     0       8       1
    Word 3    12       1      10
    ...        0       1       3
    ...        0       0       0

- Normalize to unit length.
- High-dimensional vector space:
  - Terms are axes: 10,000 dimensions, or even 100,000.
  - Docs are vectors in this space.

Classes in a Vector Space

[Figure: training documents from the Sports, Science, and Arts classes plotted as points in the vector space.]

Test Document

[Figure: a test document plotted among the Sports, Science, and Arts classes.]

K-Nearest Neighbor (kNN) Classifier

[Figure: the test document assigned to a class by its nearest neighbors among Sports, Science, and Arts.]

Nearest Neighbor Learning Algorithm

- Learning is just storing the representations of the training examples in D.
- Testing instance x:
  - Compute the similarity between x and all examples in D.
  - Assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes.
- Also called:
  - Case-based learning
  - Memory-based learning
  - Lazy learning
  (A code sketch of this classifier appears at the end of these notes.)

Overfitting

[Figure.]

Another Example

- Regression

[Figure.]

Overfitting, cont'd

- The models
- Test errors

[Figures.]

What Is a Good Model?

[Figure: two fitted models compared on known data and new data; one generalizes (high robustness), the other is low quality (low robustness). Legend: robust model, model built, known data, new data.]

Bias-Variance Decomposition

- Now let's look more closely into two sources of error in a functional approximator.
- In the following we show the bias-variance decomposition using LR as an example.

Loss Functions for Regression

- Let t be the true (target) output and y(x) be our estimate. The expected squared loss is

    E[L] = \iint L(t, y(x)) \, p(x, t) \, dx \, dt = \iint \{y(x) - t\}^2 \, p(x, t) \, dx \, dt

- Our goal is to choose the y(x) that minimizes E[L].
- Calculus of variations:

    \frac{\delta E[L]}{\delta y(x)} = 2 \int \{y(x) - t\} \, p(x, t) \, dt = 0

    \Rightarrow \; y(x) \int p(x, t) \, dt = \int t \, p(x, t) \, dt

    \Rightarrow \; y(x) = \frac{\int t \, p(x, t) \, dt}{p(x)} = \int t \, p(t \mid x) \, dt = E_t[t \mid x]

Expected Loss

- Let h(x) = E[t|x] be the optimal predictor, and y(x) our actual predictor, which will incur the following expected loss:

    E[\{y(x) - t\}^2] = \iint \{y(x) - h(x) + h(x) - t\}^2 \, p(x, t) \, dx \, dt
                      = \iint [\{y(x) - h(x)\}^2 + 2\{y(x) - h(x)\}\{h(x) - t\} + \{h(x) - t\}^2] \, p(x, t) \, dx \, dt
                      = \int \{y(x) - h(x)\}^2 \, p(x) \, dx + \iint \{h(x) - t\}^2 \, p(x, t) \, dx \, dt

  (The cross term vanishes after integrating over t, because h(x) = E[t|x].) (There is an error on p. 47.)

- \iint \{h(x) - t\}^2 \, p(x, t) \, dx \, dt is a noise term, and we can do no better than this; thus it is a lower bound on the expected loss.
- The other part of the error comes from \int \{y(x) - h(x)\}^2 \, p(x) \, dx, so let's take a closer look at it.
- We will assume y(x) = y(x; w) is a parametric model whose parameters w are fit to a training set D; thus we write y(x; D).

Bias-Variance Decomposition

- Consider one data set D and one test point x. Since the predictor y depends on the training data D, write E_D[y(x; D)] for the expected predictor over the ensemble of data sets. Then, using the same trick as before:

    \{y(x; D) - h(x)\}^2 = \{y(x; D) - E_D[y(x; D)] + E_D[y(x; D)] - h(x)\}^2
                         = \{y(x; D) - E_D[y(x; D)]\}^2 + \{E_D[y(x; D)] - h(x)\}^2
                           + 2\{y(x; D) - E_D[y(x; D)]\}\{E_D[y(x; D)] - h(x)\}

- Surely this error term depends on the training data, so we take an expectation over it (the cross term again averages to zero):

    E_D[\{y(x; D) - h(x)\}^2] = \{E_D[y(x; D)] - h(x)\}^2 + E_D[\{y(x; D) - E_D[y(x; D)]\}^2]

- Putting things together:

    expected loss = (bias)^2 + variance + noise

  where (bias)^2 = \int \{E_D[y(x; D)] - h(x)\}^2 \, p(x) \, dx, variance = \int E_D[\{y(x; D) - E_D[y(x; D)]\}^2] \, p(x) \, dx, and noise = \iint \{h(x) - t\}^2 \, p(x, t) \, dx \, dt.
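The decomposition can be checked numerically. Below is a minimal sketch, assuming a sinusoidal ground truth h(x), Gaussian noise, and a ridge-regularized polynomial model; all names and settings are illustrative, not from the lecture. It estimates E_D[.] by averaging fits over many simulated training sets and compares (bias)^2 + variance + noise against the empirical expected loss.

```python
# Empirical check of: expected loss = (bias)^2 + variance + noise.
# A minimal sketch under assumed settings (none of these names or numbers
# come from the lecture): draw many training sets D from a known h(x),
# fit a ridge-regularized linear model y(x; D) to each, and average over
# the ensemble of data sets to estimate E_D[.].
import numpy as np

rng = np.random.default_rng(0)
h = lambda x: np.sin(2 * np.pi * x)   # "true" optimal predictor h(x) = E[t|x]
noise_sd = 0.3                        # irreducible noise term = noise_sd**2
lam = 1e-3                            # regularization strength (lambda)
n_datasets, n_train = 200, 25

def features(x, degree=9):
    """Polynomial basis: the model is linear in the weights w."""
    return np.vander(x, degree + 1, increasing=True)

x_test = np.linspace(0, 1, 100)
Phi_test = features(x_test)
preds = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x = rng.uniform(0, 1, n_train)
    t = h(x) + rng.normal(0, noise_sd, n_train)
    Phi = features(x)
    # Ridge solution: w = (Phi^T Phi + lam*I)^{-1} Phi^T t
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
    preds[d] = Phi_test @ w           # y(x; D) for this training set D

y_bar = preds.mean(axis=0)                    # E_D[y(x; D)]
bias2 = np.mean((y_bar - h(x_test)) ** 2)     # {E_D[y] - h}^2, averaged over x
var = np.mean((preds - y_bar) ** 2)           # E_D[{y - E_D[y]}^2]
t_test = h(x_test) + rng.normal(0, noise_sd, preds.shape)
print(f"bias^2 + variance + noise = {bias2 + var + noise_sd**2:.4f}")
print(f"empirical expected loss   = {np.mean((preds - t_test) ** 2):.4f}")
```

Sweeping lam in this sketch previews the tradeoff discussed next: a smaller lambda (a more complex model) lowers bias^2 but raises variance, and vice versa.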
Regularized Regression

[Figure.]

Bias-Variance Tradeoff

- λ is a regularization term in LR; the smaller λ is, the more complex the model (why?).
- Simple (highly regularized) models have low variance but high bias.
- Complex models have low bias but high variance.
- You are inspecting an empirical average over 100 training sets.
- The actual E_D cannot be computed.

Bias^2 + Variance vs. Regularizer

- Bias^2 + variance predicts the shape of the test error quite well.
- However, bias and variance cannot be computed, since they rely on knowing the true distribution of x and t (and hence h(x) = E[t|x]).

The Battle Against Overfitting: Model Selection

- Suppose we are trying to select among several different models for a learning problem.
- Examples:
  1. Polynomial regression: h_\theta(x) = g(\theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_k x^k).
     Model selection: we wish to automatically and objectively decide whether k should be, say, 0, 1, ..., or 10.
  2. Locally weighted regression.
     Model selection: we want to automatically choose the bandwidth parameter.
  3. Mixture models and hidden Markov models.
     Model selection: we want to decide the number of hidden states.
- The problem: given a model family F = \{M_1, M_2, \ldots, M_I\}, find M_i \in F s.t.

    M_i = \arg\max_{M \in F} J(D; M)

Cross Validation

- We are given training data D and test data D_test, and we would like to fit this data with a model p_i(x; θ) from the family F (e.g., an LR), which is indexed by i and parameterized by θ.
- K-fold cross validation (CV); a code sketch appears after the LOOCV example below:
  - Set aside αN samples of D (where N = |D|). This is known as the held-out data and will be used to evaluate different values of i.
  - For each candidate model i, fit the optimal hypothesis p_i(x; θ) to the remaining (1 - α)N samples in D (i.e., hold i fixed and find the best θ).
  - Evaluate each model p_i(x; θ) on the held-out data using some pre-specified risk function.
  - Repeat the above K times, choosing a different held-out data set each time; the scores are averaged for each model p_i over all held-out data sets. This gives an estimate of the risk curve of the models over different i.
- For the model with the lowest risk, say p_i*, we use all of D to find the parameter values for p_i*(x; θ).

Example

- When α = 1/N, the algorithm is known as leave-one-out cross validation (LOOCV).

    MSE_LOOCV(M_1) = 2.12        MSE_LOOCV(M_2) = 0.962

Practical issues …
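Here is the promised sketch of the K-fold procedure, applied to choosing the polynomial degree k from the model-selection example; the data set and all names are illustrative assumptions, not code from the lecture. LOOCV is the special case K = N (i.e., α = 1/N).

```python
# A minimal sketch of K-fold cross validation used to pick the polynomial
# degree k (the model index i). The toy data and names are illustrative
# assumptions, not from the lecture. LOOCV is the special case K = N.
import numpy as np

rng = np.random.default_rng(1)
N = 40
x = rng.uniform(-1, 1, N)
t = np.sin(3 * x) + rng.normal(0, 0.2, N)      # assumed toy regression data

K = 5                                          # number of folds; K = N gives LOOCV
folds = np.array_split(rng.permutation(N), K)  # one held-out set per round

def cv_risk(k):
    """Estimated risk of model M_k: average held-out MSE over the K folds."""
    risks = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(N), held_out)
        theta = np.polyfit(x[train], t[train], k)         # hold k fixed, fit theta
        pred = np.polyval(theta, x[held_out])
        risks.append(np.mean((pred - t[held_out]) ** 2))  # pre-specified risk: MSE
    return np.mean(risks)

risk = {k: cv_risk(k) for k in range(11)}      # risk curve over k = 0, 1, ..., 10
best_k = min(risk, key=risk.get)
theta_final = np.polyfit(x, t, best_k)         # refit the winner on all of D
print(f"selected k = {best_k}, estimated risk = {risk[best_k]:.3f}")
```

The last two lines mirror the lecture's recipe: the risk curve over i is estimated on held-out data only, and then the lowest-risk model is refit on all of D.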
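Returning to the nearest-neighbor classifier from the start of the lecture, here is a minimal sketch on unit-normalized document vectors. The dot product (cosine similarity) stands in for the similarity function, which the slides leave unspecified, and the class labels attached to the example table are illustrative.

```python
# A minimal sketch of the nearest-neighbor classifier described earlier:
# learning just stores the training vectors; testing assigns x the label of
# its most similar stored example. Similarity here is the dot product on
# unit-length vectors (cosine similarity), an illustrative choice since the
# lecture leaves the similarity function unspecified.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# "Learning": store the (normalized) document vectors and their labels.
docs = np.array([[3, 0, 12, 0, 0],      # Doc 1 (term counts, as in the table)
                 [0, 8,  1, 1, 0],      # Doc 2
                 [0, 1, 10, 3, 0]])     # Doc 3
labels = ["Sports", "Science", "Arts"]  # illustrative class assignments
D = np.array([normalize(d) for d in docs])

def knn_classify(x):
    """Assign x the category of the most similar example in D (1-NN)."""
    sims = D @ normalize(x)             # cosine similarity to every example
    return labels[int(np.argmax(sims))]

print(knn_classify(np.array([1, 0, 9, 1, 0])))
```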

