CMU CS 10701 - lecture

Machine Learning 10-701/15-781, Spring 2008
Model Selection
Eric Xing
Lecture 14, March 2, 2008
Reading: Chap. 1 & 2, CB & Chap. 5, 6, TM

Bias-variance decomposition
- For one data set D and one test point x, the expected loss decomposes as (bias)^2 + variance + noise:

    E_{x,t,D}[(y(x;D) - t)^2]
      = \int (E_D[y(x;D)] - h(x))^2 p(x)\,dx                  (bias)^2
      + \int E_D[(y(x;D) - E_D[y(x;D)])^2] p(x)\,dx           variance
      + \iint (h(x) - t)^2 p(x,t)\,dx\,dt                     noise

  where h(x) = E[t | x] is the optimal regression function.
- Recall the VC bound.

Minimizing Empirical Risk & Structural Risk
[Figure: risk vs. model complexity h. The total risk is the sum of the empirical risk and the confidence interval; the best model h* minimizes the total risk.]

SRM & ERM in practice
- There are many SRM-based strategies to build models.
- In the case of linear models

    y = <w|x> + b,

  one wants to make ||w|| a controlled parameter: let us call H_C the linear model function family satisfying the constraint ||w|| < C.
- Vapnik's major theorem: when C decreases, d(H_C) decreases (for inputs bounded by ||x|| < R).

Regularization
- Maximum-likelihood estimates are not always the best (James and Stein showed a counterexample in the early 60's).
- Alternative: we "regularize" the likelihood objective (also known as penalized likelihood, shrinkage, smoothing, etc.) by adding to it a penalty term:

    \hat{\theta}_{shrinkage} = \arg\max_{\theta} [ l(\theta; D) - \lambda ||\theta|| ]

  where \lambda > 0 and ||\theta|| might be the L1 or L2 norm.
- The choice of norm has an effect:
  - using the L2 norm pulls directly towards the origin,
  - while using the L1 norm pulls towards the coordinate axes, i.e., it tries to set some of the coordinates to 0.
  - This second approach can be useful in a feature-selection setting.

Bayesian and Frequentist
- Frequentist interpretation of probability
  - Probabilities are objective properties of the real world, and refer to limiting relative frequencies (e.g., the number of times I have observed heads). Hence one cannot write P(Katrina could have been prevented | D), since the event will never repeat.
  - Parameters of models are fixed, unknown constants. Hence one cannot write P(\theta | D), since \theta does not have a probability distribution. Instead one can only write P(D | \theta).
  - One computes point estimates of parameters using various estimators, \theta^* = f(D), which are designed to have various desirable qualities when averaged over future data D (assumed to be drawn from the "true" distribution).
- Bayesian interpretation of probability
  - Probability describes degrees of belief, not limiting frequencies.
  - Parameters of models are hidden variables, so one can compute P(\theta | D) or P(f(\theta) | D) for some function f.
  - One estimates parameters by computing P(\theta | D) using Bayes' rule:

      p(\theta | D) = p(D | \theta) p(\theta) / p(D)

Bayesian interpretation of regularization
- Regularized linear regression.
- Recall that using squared error as the cost function results in the LMS estimate.
- Assuming iid data and Gaussian noise, LMS is equivalent to MLE of \theta:

    l(\theta) = n \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2

- Now assume that the vector \theta follows a normal prior with zero mean and a diagonal covariance matrix:

    \theta \sim N(0, \tau^2 I)

- What is the posterior distribution of \theta?

    p(\theta | D) \propto p(D, \theta) = p(D | \theta) p(\theta)
      = (2\pi\sigma^2)^{-n/2} \exp\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \} \times C \exp\{ -\theta^T\theta / (2\tau^2) \}

Bayesian interpretation of regularization, cont'd
- The posterior distribution of \theta:

    p(\theta | D) \propto \exp\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \} \times \exp\{ -\theta^T\theta / (2\tau^2) \}

- This leads to a new objective:

    l_{MAP}(\theta; D) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 - \frac{1}{2\tau^2} \sum_{k=1}^{K} \theta_k^2 = l(\theta; D) - \lambda ||\theta||^2

- This is L2-regularized LR --- a MAP estimation of \theta!
- What about L1-regularized LR? (homework)
- How to choose \lambda? Cross-validation!
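To make the MAP/ridge connection concrete, here is a minimal numpy sketch (not part of the lecture): it computes the L2-regularized least-squares (MAP) estimate in closed form and picks \lambda by k-fold cross-validation, as the slides suggest. The synthetic data and function names are illustrative assumptions; with noise variance \sigma^2 and prior variance \tau^2, the penalty below corresponds to \lambda = \sigma^2/\tau^2.

```python
# Sketch only: ridge regression as the MAP estimate under a Gaussian prior,
# with lambda chosen by k-fold cross-validation. Data and names are illustrative.
import numpy as np

def ridge_map(X, y, lam):
    """Closed-form MAP / ridge estimate: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, k=5, seed=0):
    """Average held-out squared error of ridge_map over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        theta = ridge_map(X[train], y[train], lam)
        errs.append(np.mean((X[fold] @ theta - y[fold]) ** 2))
    return float(np.mean(errs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 50, 20
    X = rng.normal(size=(n, d))
    theta_true = np.concatenate([rng.normal(size=3), np.zeros(d - 3)])  # few relevant features
    y = X @ theta_true + 0.5 * rng.normal(size=n)
    lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
    best = min(lambdas, key=lambda lam: cv_error(X, y, lam))
    print("selected lambda:", best)
    print("theta_MAP:", ridge_map(X, y, best))
```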
Feature Selection
- Imagine that you have a supervised learning problem where the number of features n is very large (perhaps n >> #samples), but you suspect that only a small number of features are "relevant" to the learning task.
- VC theory can tell you that this scenario is likely to lead to high generalization error: the learned model will potentially overfit unless the training set is fairly large.
- So let's get rid of the useless parameters!

How to score features
- How do you know which features can be pruned?
- Given labeled data, we can compute some simple score S(i) that measures how informative each feature x_i is about the class labels y.
- Ranking criteria:
  - Mutual information: score each feature by its mutual information with respect to the class labels (a sketch of this score appears at the end of these notes):

      MI(x_i, y) = \sum_{x_i \in \{0,1\}} \sum_{y \in \{0,1\}} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\, p(y)}

  - Bayes error
  - Redundancy (Markov-blanket score) ...
- We need to estimate the relevant p(.)'s from data, e.g., using MLE.

Feature Ranking
- Bayes error of each gene
- Information gain of each gene with respect to the given partition
- KL divergence of each removed gene w.r.t. its Markov blanket (MB)

Feature selection schemes
- Given n features, there are 2^n possible feature subsets (why?)
- Thus feature selection can be posed as a model selection problem over 2^n possible models.
- For large values of n, it is usually too expensive to explicitly enumerate over and compare all 2^n models, so some heuristic search procedure is used to find a good feature subset.
- Three general approaches:
  - Filter: direct feature ranking, taking no account of the subsequent learning algorithm.
    - Add (from the empty set) or remove (from the full set) features one by one based on S(i).
    - Cheap, but subject to local optima and may not be robust across different classifiers.
  - Wrapper: decide on the inclusion or removal of features based on performance under the learning algorithm to be used (see the next slide; a sketch of the forward search also appears at the end of these notes).
  - Simultaneous learning and feature selection: e.g., L1-regularized LR, Bayesian feature selection (not covered in this class), etc.

Wrapper
- Forward search:
  1. Initialize F = Ø.
  2. Repeat:
     - For i = 1, ..., n: if i ∉ F, let F_i = F ∪ {i}, and use some version of cross-validation to evaluate the feature set F_i (i.e., train your learning algorithm using only the features in F_i, and estimate its generalization error).
     - Set F to be the best feature subset found in this step.
  3. Select and output the best feature subset that was evaluated during the entire search procedure.
- Backward search:
  1. Initialize F = the full feature set.
  2. ...

Case study [Xing et al., 2001]
- The case:
  - 7130 genes from a microarray dataset
  - 72 samples
  - 47 type I leukemias (called ALL) and 25 type II leukemias (called AML)
- Three classifiers:
  - kNN
  - Gaussian classifier
  - Logistic regression

Regularization vs. Feature Selection
- Explicit feature selection often outperforms regularization.
  [Figure: performance of feature selection vs. regularized regression on the case-study data.]

Model Selection
- Suppose we are trying to select among several different models for a learning problem.
- Examples:
  1. polynomial ...
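Referring back to the "How to score features" slide: below is a minimal sketch (not from the lecture) of the mutual-information ranking score for binary features and binary labels, with the p(.)'s estimated by MLE, i.e. empirical frequencies. The function names are illustrative.

```python
# Sketch only: mutual-information feature scoring for binary features and labels.
# Probabilities are plug-in MLE (empirical frequency) estimates.
import numpy as np

def mutual_information(x, y):
    """MI(x_i, y) = sum_{a,b in {0,1}} p(a, b) * log( p(a, b) / (p(a) * p(b)) )."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            if p_ab > 0:                      # 0 * log 0 is taken to be 0
                p_a, p_b = np.mean(x == a), np.mean(y == b)
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def rank_features(X, y):
    """Score every column of a binary matrix X; return indices sorted best-first."""
    scores = np.array([mutual_information(X[:, i], y) for i in range(X.shape[1])])
    return np.argsort(-scores), scores
```

A filter scheme would then simply keep the top-ranked columns before training any classifier.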

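A sketch of the forward wrapper search from the "Wrapper" slide is below. It assumes scikit-learn is available and uses the cross-validated accuracy of a logistic-regression classifier as the evaluation criterion; the lecture does not fix a particular classifier or library, so both choices are assumptions.

```python
# Sketch only: forward wrapper feature selection, scoring each candidate subset F_i
# by cross-validated accuracy of a logistic-regression classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=None, cv=5):
    n_features = X.shape[1]
    max_features = n_features if max_features is None else max_features
    selected = []                       # F, the current feature subset
    best_overall = ([], -np.inf)        # best subset seen during the whole search
    while len(selected) < max_features:
        best_step = None
        for i in range(n_features):
            if i in selected:
                continue
            candidate = selected + [i]  # F_i = F ∪ {i}
            score = cross_val_score(LogisticRegression(max_iter=1000),
                                    X[:, candidate], y, cv=cv).mean()
            if best_step is None or score > best_step[1]:
                best_step = (candidate, score)
        selected = best_step[0]         # F <- best F_i found in this pass
        if best_step[1] > best_overall[1]:
            best_overall = best_step
    return best_overall                 # (feature indices, their CV accuracy)
```

Backward search is the mirror image: start from the full feature set and repeatedly drop the feature whose removal hurts the cross-validated score the least.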

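The preview cuts off at the "Model Selection" slide, whose first example is a polynomial model. Under the assumption that the example concerns choosing a polynomial degree, here is a small numpy-only sketch that selects the degree by k-fold cross-validation; the data and degree range are illustrative.

```python
# Sketch only: model selection over polynomial degrees by k-fold cross-validation.
import numpy as np

def cv_mse(x, y, degree, k=5, seed=0):
    """Average held-out squared error of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coeffs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coeffs, x[fold]) - y[fold]) ** 2))
    return float(np.mean(errs))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = np.linspace(-1, 1, 40)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.shape)  # noisy samples of a smooth curve
    best = min(range(1, 10), key=lambda d: cv_mse(x, y, d))
    print("selected polynomial degree:", best)
```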