DOC PREVIEW
CMU CS 10701 - Lecture

This preview shows page 1-2-3-4 out of 12 pages.

Save
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Machine Learning 1010 701 15701 15 781 Spring 2008 Model Feature Selection Eric Xing Lecture 14 March 3 2008 Reading Chap 1 2 CB Chap 5 6 TM Bias variance decomposition z For one data set D and one test point x E x t D y x t 2 E y x D h x p x dx E y x D E y x D p x dx h x t p x t dxdt 2 D 2 D D 2 z expected loss bias 2 variance noise Recall the VC bound 1 Minimizing Empirical Risk Structural Risk Risk Best Model Total Risk Empirical Risk Confidence interval In h L h Model Complexity SRM ERM in practice z There are many SRM based strategies to build models z In the case of linear models y w x b one wants to make w a controlled parameter let us call HC the linear model function family satisfying the constraint w C Vapnik Major theorem When C decreases d HC decreases x R 2 Regularization z Maximum likelihood estimates are not always the best James and Stein showed a counter example in the early 60 s z Alternative we regularize the likelihood objective also known as penalized likelihood shrinkage smoothing etc by adding to it a penalty term shrinkage arg max l D where 0 and might be the L1 or L2 norm z The choice of norm has an effect z using the L2 norm pulls directly towards the origin z while using the L1 norm pulls towards the coordinate axes i e it tries to set some of the coordinates to 0 z This second approach can be useful in a feature selection setting Bayesian and Frequentist z z Frequentist interpretation of probability z Probabilities are objective properties of the real world and refer to limiting relative frequencies e g number of times I have observed heads Hence one cannot write P Katrina could have been prevented D since the event will never repeat z Parameters of models are fixed unknown constants Hence one cannot write P D since does not have a probability distribution Instead one can only write P D z One computes point estimates of parameters using various estimators f D which are designed to have various desirable qualities when averaged over future data D assumed to be drawn from the true distribution Bayesian interpretation of probability z Probability describes degrees of belief not limiting frequencies z Parameters of models are hidden variables so one can compute P D or P f D for some function f z One estimates parameters by computing P D using Bayes rule p D p D p p D 3 Bayesian interpretation of regulation z Regularized Linear Regression z Recall that using squared error as the cost function results in the LMS estimate z And assume iid data and Gaussian noise LMS is equivalent to MLE of l n log z 1 1 1 n 2 i 1 yi T x i 2 2 2 Now assume that vector follows a normal prior with 0 mean and a diagonal covariance matrix N 0 2 I z What is the posterior distribution of p D p D p D p 2 2 n 2 2 1 n exp 2 yn T xi C exp T 2 2 2 i 1 Bayesian interpretation of regulation con d z The posterior distribution of 2 1 n p D exp 2 yn T xi exp T 2 2 2 i 1 z This leads to a now objective 1 1 n 1 1 K yi T x i 2 2 k 1 k2 2 i 1 2 2 2 l D lMAP D z z This is L2 regularized LR a MAP estimation of z What about L1 regularized LR homework How to choose z cross validation 4 Feature Selection z Imagine that you have a supervised learning problem where the number of features n is very large perhaps n samples but you suspect that there is only a small number of features that are relevant to the learning task z VC theory can tell you that this scenario is likely to lead to high generalization error the learned model will potentially overfit unless the training set is fairly large z So lets get rid of useless parameters How to score features z How do you know which features can be pruned z Given labeled data we can compute some simple score S i that measures how informative each feature xi is about the class labels y z Ranking criteria z Mutual Information score each feature by its mutual information with respect to the class labels MI xi y z xi 0 1 y 0 1 p xi y log z Bayes error z Redundancy Markov blank score p xi y p xi p y We need estimate the relevant p s from data e g using MLE 5 Feature Ranking z Bayes error of each gene z information gain for each genes with respect to the given partition z KL of each removal gene w r t to its MB Feature selection schemes z Given n features there are 2n possible feature subsets why z Thus feature selection can be posed as a model selection problem over 2n possible models z For large values of n it s usually too expensive to explicitly enumerate over and compare all 2n models Some heuristic search procedure is used to find a good feature subset z Three general approaches z Filter i e direct feature ranking but taking no consideration of the subsequent learning algorithm z add from empty set or remove from the full set features one by one based on S i z Cheap but is subject to local optimality and may be unrobust under different classifiers z Wrapper determine the inclusion or removal of features based on performance under the learning algorithms to be used See next slide z Simultaneous learning and feature selection z E x L1 regularized LR Bayesian feature selection will not cover in this class etc 6 Wrapper z Forward 1 Initialize F 2 Repeat z For i 1 n if i F let Fi F i and use some version of cross validation to evaluate features F i I e train your learning algorithm using only the features in F i and estimate its generalization error z 3 z Set F to be the best feature subset found on the last step step Select and output the best feature subset that was evaluated during the entire search procedure Backward search 1 Initialize F full set 2 Case study z Xing et al 2001 The case z 7130 genes from a microarray dataset z 72 samples z 47 type I Leukemias called ALL and 25 type II Leukemias called AML z Three classifier z kNN z Gaussian classifier z Logistic regression 7 Regularization vs Feature Selection Explicit feature selection often outperform regularization Feature Selection regression z Model Selection z Suppose we are trying select among several different models for a learning problem z Examples 1 polynomial regression h x g 0 1 x 2 x 2 K k …


View Full Document

CMU CS 10701 - Lecture

Documents in this Course
lecture

lecture

12 pages

lecture

lecture

17 pages

HMMs

HMMs

40 pages

lecture

lecture

15 pages

lecture

lecture

20 pages

Notes

Notes

10 pages

Notes

Notes

15 pages

Lecture

Lecture

22 pages

Lecture

Lecture

13 pages

Lecture

Lecture

24 pages

Lecture9

Lecture9

38 pages

lecture

lecture

26 pages

lecture

lecture

13 pages

Lecture

Lecture

5 pages

lecture

lecture

18 pages

lecture

lecture

22 pages

Boosting

Boosting

11 pages

lecture

lecture

16 pages

lecture

lecture

20 pages

Lecture

Lecture

20 pages

Lecture

Lecture

39 pages

Lecture

Lecture

14 pages

Lecture

Lecture

18 pages

Lecture

Lecture

13 pages

Exam

Exam

10 pages

Lecture

Lecture

27 pages

Lecture

Lecture

15 pages

Lecture

Lecture

24 pages

Lecture

Lecture

16 pages

Lecture

Lecture

23 pages

Lecture6

Lecture6

28 pages

Notes

Notes

34 pages

lecture

lecture

15 pages

Midterm

Midterm

11 pages

lecture

lecture

11 pages

lecture

lecture

23 pages

Boosting

Boosting

35 pages

Lecture

Lecture

49 pages

Lecture

Lecture

22 pages

Lecture

Lecture

16 pages

Lecture

Lecture

18 pages

Lecture

Lecture

35 pages

lecture

lecture

22 pages

lecture

lecture

24 pages

Midterm

Midterm

17 pages

exam

exam

15 pages

Lecture12

Lecture12

32 pages

lecture

lecture

19 pages

Lecture

Lecture

32 pages

boosting

boosting

11 pages

pca-mdps

pca-mdps

56 pages

bns

bns

45 pages

mdps

mdps

42 pages

svms

svms

10 pages

Notes

Notes

12 pages

lecture

lecture

42 pages

lecture

lecture

29 pages

lecture

lecture

15 pages

Lecture

Lecture

24 pages

Lecture

Lecture

22 pages

Midterm

Midterm

5 pages

mdps-rl

mdps-rl

26 pages

Load more
Download Lecture
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?