CMU CS 10701 - lecture


Machine Learning 10-701/15-781, Spring 2008
Overfitting and Model Selection
Eric Xing
Lecture 13, February 27, 2008
Reading: Chap. 1 & 2, CB; Chap. 5 & 6, TM

Outline
- Overfitting
- Instance-based learning
- Regression
- Bias-variance decomposition
- The battle against overfitting: each learning algorithm has some "free knobs" that one can "tune" (i.e., hack) to make the algorithm generalize better to test data. But is there a more principled way?
  - Cross validation
  - Regularization
  - Model selection: Occam's razor
  - Model averaging
  - The Bayesian-frequentist debate
  - Bayesian learning (weight models by their posterior probabilities)

Recall: Vector Space Representation
- Each document is a vector, one component for each term (= word).
- Normalize to unit length.
- High-dimensional vector space:
  - Terms are axes: 10,000+ dimensions, or even 100,000+.
  - Docs are vectors in this space.
[Slide table: example term-document count matrix for Word 1, Word 2, Word 3 across Doc 1, Doc 2, Doc 3.]

Classes in a Vector Space
[Slide figure: documents from the classes Sports, Science, and Arts plotted in the vector space.]

Test Document = ?
[Slide figure: an unlabeled test document placed among the Sports, Science, and Arts clusters.]

K-Nearest Neighbor (kNN) classifier
[Slide figure: the test document labeled by the classes of its nearest neighbors among Sports, Science, and Arts.]

Nearest-Neighbor Learning Algorithm
- Learning is just storing the representations of the training examples in D.
- Testing instance x:
  - Compute the similarity between x and all examples in D.
  - Assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes.
- Also called:
  - Case-based learning
  - Memory-based learning
  - Lazy learning
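The lazy-learning procedure above maps almost directly to code. Here is a minimal sketch in Python (hypothetical names; it assumes the documents are already unit-length term vectors, so cosine similarity reduces to a dot product):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=1):
        # Similarity of x to every stored training document (dot product
        # equals cosine similarity because all vectors have unit length).
        sims = X_train @ x
        # Indices of the k most similar training examples.
        nearest = np.argsort(sims)[-k:]
        # Majority vote among their categories; with k=1 this is exactly
        # "assign the category of the most similar example".
        return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

With k = 1 this implements the single-nearest-neighbor rule on the slide; larger k votes over a small neighborhood of similar documents.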
Overfitting

Another example: regression
[Slide figures: regression fits illustrating overfitting.]

Overfitting, cont'd
- The models: [slide figure]
- Test errors: [slide figure]

What is a good model?
[Slide figure comparing fits: low quality / high robustness, low robustness, and a robust model. Legend: model built, known data, new data.]

Bias-variance decomposition
- Now let's look more closely at two sources of error in a functional approximator.
- In the following we show the bias-variance decomposition using LR as an example.

Loss functions for regression
- Let t be the true (target) output and y(x) our estimate. The expected squared loss is
    E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt = ∫∫ (t - y(x))² p(x, t) dx dt
- Our goal is to choose the y(x) that minimizes E[L]. By calculus of variations,
    ∂E[L]/∂y(x) = 2 ∫ (y(x) - t) p(x, t) dt = 0
    y(x) ∫ p(x, t) dt = ∫ t p(x, t) dt
    y*(x) = ∫ t p(x, t) dt / p(x) = ∫ t p(t|x) dt = E_t[t|x]

Expected loss
- Let h(x) = E[t|x] be the optimal predictor and y(x) our actual predictor, which incurs the expected loss
    E[L] = ∫∫ (y(x) - h(x) + h(x) - t)² p(x, t) dx dt
         = ∫∫ [(y(x) - h(x))² + 2 (y(x) - h(x))(h(x) - t) + (h(x) - t)²] p(x, t) dx dt
         = ∫ (y(x) - h(x))² p(x) dx + ∫∫ (h(x) - t)² p(x, t) dx dt
  since the cross term vanishes when h(x) = E[t|x]. (Note: there is an error on pp. 47.)
- The term ∫∫ (h(x) - t)² p(x, t) dx dt is noise, and we can do no better than this; it is a lower bound on the expected loss.
- The other part of the error comes from ∫ (y(x) - h(x))² p(x) dx, so let's take a closer look at it.
- We will assume y(x) = y(x|w) is a parametric model whose parameters w are fit to a training set D (thus we write y(x; D)).

Bias-variance decomposition
- For one data set D and one test point x: since the predictor y depends on the training data D, write E_D[y(x; D)] for the expected predictor over the ensemble of data sets. Then (using the same trick) we have
    (y(x; D) - h(x))² = (y(x; D) - E_D[y(x; D)] + E_D[y(x; D)] - h(x))²
                      = (y(x; D) - E_D[y(x; D)])² + (E_D[y(x; D)] - h(x))² + 2 (y(x; D) - E_D[y(x; D)])(E_D[y(x; D)] - h(x))
- This error term depends on the training data, so we take an expectation over data sets (the cross term vanishes):
    E_D[(y(x; D) - h(x))²] = (E_D[y(x; D)] - h(x))² + E_D[(y(x; D) - E_D[y(x; D)])²]
- Putting things together:
    expected loss = (bias)² + variance + noise

Regularized Regression

Bias-variance tradeoff
- λ is a "regularization" term in LR; the smaller the λ, the more complex the model (why?).
- Simple (highly regularized) models have low variance but high bias.
- Complex models have low bias but high variance.
- Here you are inspecting an empirical average over 100 training sets.
- The actual E_D cannot be computed.

Bias² + variance vs. regularizer
- Bias² + variance predicts the (shape of the) test error quite well.
- However, bias and variance cannot be computed, since they rely on knowing the true distribution of x and t (and hence h(x) = E[t|x]).
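Although E_D cannot be computed on real data, the decomposition can be estimated on synthetic data where the true h(x) is known, by averaging predictors over many training sets as on the slide. A minimal sketch (assumed setup: sin(2πx) targets, degree-9 polynomial features, ridge regularization; all names are illustrative):

    import numpy as np

    def poly_features(x, degree=9):
        # Columns 1, x, x^2, ..., x^degree
        return np.vander(x, degree + 1, increasing=True)

    def fit_ridge(X, t, lam):
        # Ridge solution w = (X^T X + lam*I)^{-1} X^T t
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)

    rng = np.random.default_rng(0)
    h = lambda x: np.sin(2 * np.pi * x)        # true regression function h(x) = E[t|x]
    x_test = np.linspace(0, 1, 50)

    preds = []
    for _ in range(100):                       # 100 training sets, as on the slide
        x = rng.uniform(0, 1, 25)
        t = h(x) + rng.normal(0, 0.3, size=25) # noisy targets
        w = fit_ridge(poly_features(x), t, lam=1e-3)
        preds.append(poly_features(x_test) @ w)

    preds = np.array(preds)                    # y(x; D) for each data set D
    bias2 = np.mean((preds.mean(axis=0) - h(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"bias^2 = {bias2:.4f}  variance = {variance:.4f}")

Re-running with a larger λ should show the variance shrinking while the bias grows, which is the trade-off described above.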
The battle against overfitting

Model Selection
- Suppose we are trying to select among several different models for a learning problem. Examples:
  1. Polynomial regression, h_θ(x) = g(θ_0 + θ_1 x + θ_2 x² + ... + θ_k x^k).
     Model selection: we wish to automatically and objectively decide whether k should be, say, 0, 1, ..., or 10.
  2. Locally weighted regression.
     Model selection: we want to automatically choose the bandwidth parameter τ.
  3. Mixture models and hidden Markov models.
     Model selection: we want to decide the number of hidden states.
- The problem: given a model family F = {M_1, M_2, ..., M_I}, find the M_i ∈ F such that
    M = argmax_{M_i ∈ F} J(D, M_i)

Cross Validation
- We are given training data D and test data D_test, and we would like to fit these data with a model p_i(x; θ) from the family F (e.g., an LR), which is indexed by i and parameterized by θ.
- K-fold cross-validation (CV):
  - Set aside αN samples of D (where N = |D|). This is known as the held-out data and will be used to evaluate different values of i.
  - For each candidate model i, fit the optimal hypothesis p_i(x; θ*) to the remaining (1 - α)N samples in D (i.e., hold i fixed and find the best θ).
  - Evaluate each model p_i(x; θ*) on the held-out data using some pre-specified risk function.
  - Repeat the above K times, choosing a different held-out data set each time, and average the scores for each model p_i(·) over all held-out data sets. This gives an estimate of the risk curve of the models over different i.
  - For the model with the lowest risk, say p_i*(·), use all of D to find the parameter values for p_i*(x; θ*).
- Example: when α = 1/N, the algorithm is known as Leave-One-Out Cross-Validation (LOOCV); here MSE_LOOCV(M1) = 2.12 and MSE_LOOCV(M2) = 0.962.

Practical issues for CV
- How to decide the values for K and α:
  - Commonly used: K = 10 and α = 0.1.
  - When the data set is small relative to the number of models being evaluated, we need to decrease α and increase K.
  - K needs to be large for the variance to be small enough, but this makes CV time-consuming.
- Bias-variance trade-off: a small α usually leads to low bias. In principle, LOOCV provides an almost unbiased estimate of the generalization ability of a classifier, especially when the number of available training samples is severely limited; but it can also have high variance.
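A minimal sketch of the K-fold procedure described above, using the polynomial degree as the model index i and mean squared error as the risk function (the data set D here is a toy example assumed for illustration; setting K = N gives LOOCV):

    import numpy as np

    def cv_risk(x, t, degree, K=10, seed=0):
        # Estimated held-out risk (MSE) of a polynomial model of the given degree.
        idx = np.random.default_rng(seed).permutation(len(x))
        risks = []
        for held_out in np.array_split(idx, K):          # K different held-out sets
            train = np.setdiff1d(idx, held_out)
            w = np.polyfit(x[train], t[train], degree)   # fit theta* on (1 - alpha)N samples
            pred = np.polyval(w, x[held_out])
            risks.append(np.mean((pred - t[held_out]) ** 2))
        return np.mean(risks)                            # average score over the K folds

    # Toy data set D (assumed for illustration).
    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, 40)
    t = np.sin(3 * x) + rng.normal(0, 0.2, size=40)

    # Pick the model with the lowest estimated risk, then refit it on all of D.
    degrees = range(0, 11)
    best = min(degrees, key=lambda d: cv_risk(x, t, d, K=10))
    w_final = np.polyfit(x, t, best)
    print("selected degree:", best)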