Machine Learning 10-701/15-781, Spring 2008

Model & Feature Selection

Eric Xing
Lecture 14, March 3, 2008

Reading: Chap. 1-2, CB; Chap. 5-6, TM

Bias-variance decomposition

- For one data set D and one test point x, the expected squared loss decomposes as

  $$ E_{x,t,D}\big[(y(x;D) - t)^2\big] = \int \big(E_D[y(x;D)] - h(x)\big)^2 p(x)\,dx + \int E_D\big[(y(x;D) - E_D[y(x;D)])^2\big]\, p(x)\,dx + \iint \big(h(x) - t\big)^2 p(x,t)\,dx\,dt $$

  where $h(x) = E[t \mid x]$ is the optimal regression function.
- In words: expected loss = (bias)$^2$ + variance + noise.

Recall the VC bound

[Figure: risk vs. model complexity h. The total (structural) risk is the sum of the empirical risk and a confidence interval that grows with complexity; the best model minimizes this sum (SRM), not the empirical risk alone (ERM).]

SRM/ERM in practice

- There are many SRM-based strategies to build models.
- In the case of linear models $y = \langle w, x \rangle + b$, one wants to make $\|w\|$ a controlled parameter: let us call $H_C$ the linear model function family satisfying the constraint $\|w\| < C$.
- Vapnik's major theorem: when C decreases, the VC dimension $d(H_C)$ decreases (for inputs satisfying $\|x\| < R$).

Regularization

- Maximum likelihood estimates are not always the best (James and Stein showed a counterexample in the early 60's).
- Alternative: we regularize the likelihood objective (also known as penalized likelihood, shrinkage, smoothing, etc.) by adding a penalty term:

  $$ \hat{\theta}_{\text{shrinkage}} = \arg\max_{\theta}\; \ell(\theta; D) - \lambda \|\theta\| $$

  where $\lambda > 0$ and $\|\cdot\|$ might be the L1 or the L2 norm.
- The choice of norm has an effect:
  - using the L2 norm pulls $\theta$ directly towards the origin,
  - while using the L1 norm pulls $\theta$ towards the coordinate axes, i.e., it tries to set some of the coordinates to 0.
  - This second approach can be useful in a feature-selection setting.

Bayesian and Frequentist

- Frequentist interpretation of probability:
  - Probabilities are objective properties of the real world and refer to limiting relative frequencies (e.g., the number of times I have observed heads). Hence one cannot write P(Katrina could have been prevented | D), since the event will never repeat.
  - Parameters of models are fixed, unknown constants. Hence one cannot write P(θ | D), since θ does not have a probability distribution; one can only write P(D | θ).
  - One computes point estimates of parameters using various estimators, $\hat{\theta} = f(D)$, which are designed to have desirable qualities when averaged over future data D (assumed to be drawn from the true distribution).
- Bayesian interpretation of probability:
  - Probability describes degrees of belief, not limiting frequencies.
  - Parameters of models are hidden variables, so one can compute P(θ | D), or P(f(θ) | D) for some function f.
  - One estimates parameters by computing P(θ | D) using Bayes' rule:

  $$ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} $$

Bayesian interpretation of regularization

- Regularized linear regression:
  - Recall that using squared error as the cost function results in the LMS estimate.
  - Assuming iid data and Gaussian noise, LMS is equivalent to the MLE of θ:

  $$ \ell(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 $$

  - Now assume that the vector θ follows a normal prior with zero mean and a diagonal covariance matrix: $\theta \sim N(0, \tau^2 I)$.
  - What is the posterior distribution of θ?

  $$ p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) = C\, \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \Big\} \exp\Big\{ -\frac{1}{2\tau^2}\, \theta^T \theta \Big\} $$

Bayesian interpretation of regularization, cont'd

- Taking the log of the posterior above leads to a new objective:

  $$ \ell_{\text{MAP}}(\theta; D) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 - \frac{1}{2\tau^2} \sum_{k=1}^{K} \theta_k^2 $$

- This is L2-regularized LR! (a MAP estimation of θ; a sketch follows below)
- What about L1-regularized LR? (homework)
- How to choose λ? Cross-validation! (also sketched below)
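To make the MAP reading concrete: a minimal numpy sketch (not part of the original slides; the data and settings are invented) that computes the MAP estimate under the $N(0, \tau^2 I)$ prior in closed form, i.e., ridge regression with $\lambda = \sigma^2/\tau^2$, and checks it against the penalized objective above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative only).
n, K = 50, 5
X = rng.normal(size=(n, K))
theta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
sigma, tau = 0.5, 1.0                      # noise std and prior std
y = X @ theta_true + sigma * rng.normal(size=n)

# Setting the gradient of l_MAP to zero gives the ridge normal equations
# with lambda = sigma^2 / tau^2.
lam = sigma**2 / tau**2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

def l_map(theta):
    """The penalized log-likelihood l_MAP(theta; D), up to constants."""
    resid = y - X @ theta
    return -(resid @ resid) / (2 * sigma**2) - (theta @ theta) / (2 * tau**2)

# The closed-form solution should beat small random perturbations of itself.
assert all(l_map(theta_map) >= l_map(theta_map + 0.01 * rng.normal(size=K))
           for _ in range(100))
print("MAP / ridge estimate:", theta_map)
```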
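The contrast between the two norms is easiest to see for an orthonormal design, where both penalized problems decouple across coordinates: the L2 penalty rescales every coordinate of the least-squares estimate toward the origin, while the L1 penalty soft-thresholds, driving small coordinates exactly to 0. A sketch under that (assumed) orthonormality:

```python
import numpy as np

def ridge_shrink(theta_ls, lam):
    # L2 penalty (orthonormal design): rescale each coordinate of the
    # least-squares estimate toward the origin; nothing becomes exactly 0.
    return theta_ls / (1.0 + lam)

def lasso_shrink(theta_ls, lam):
    # L1 penalty (orthonormal design): soft-thresholding, which sets
    # coordinates with |theta| <= lam exactly to 0 (feature selection).
    return np.sign(theta_ls) * np.maximum(np.abs(theta_ls) - lam, 0.0)

theta_ls = np.array([3.0, 0.4, -0.2, 1.5])   # unregularized LS estimate
print(ridge_shrink(theta_ls, lam=0.5))       # all shrunk, none zero
print(lasso_shrink(theta_ls, lam=0.5))       # [ 2.5  0. -0.  1. ]
```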
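And a minimal K-fold cross-validation loop for picking λ, as the last bullet suggests; the grid of λ values and the synthetic data are arbitrary choices for illustration:

```python
import numpy as np

def ridge_fit(X, y, lam):
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

def cv_error(X, y, lam, n_folds=5):
    """Average held-out squared error of ridge regression across folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errs = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        theta = ridge_fit(X[train], y[train], lam)
        resid = y[held_out] - X[held_out] @ theta
        errs.append(np.mean(resid**2))
    return np.mean(errs)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.3 * rng.normal(size=100)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print("lambda chosen by 5-fold CV:", best)
```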
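Finally, stepping back to the bias-variance decomposition at the top of this lecture: a small simulation that estimates the (bias)$^2$ and variance terms by refitting polynomials of different degrees over many random datasets (the true function, noise level, and degrees are my choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
h = np.sin                  # "true" regression function h(x) = E[t|x]
sigma = 0.3                 # noise standard deviation
x_test = np.linspace(0, 2 * np.pi, 50)

def fit_predict(degree):
    """Fit a degree-d polynomial to one random dataset D; predict at x_test."""
    x = rng.uniform(0, 2 * np.pi, size=30)
    t = h(x) + sigma * rng.normal(size=30)
    return np.polyval(np.polyfit(x, t, degree), x_test)

for degree in (1, 3, 7):
    preds = np.stack([fit_predict(degree) for _ in range(500)])
    mean_pred = preds.mean(axis=0)                 # E_D[y(x; D)]
    bias2 = np.mean((mean_pred - h(x_test)) ** 2)  # (bias)^2 term
    variance = np.mean(preds.var(axis=0))          # variance term
    # Total expected loss ~ bias^2 + variance + sigma^2 (the noise term).
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```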
Feature Selection

- Imagine that you have a supervised learning problem where the number of features n is very large (perhaps much larger than the number of samples), but you suspect that only a small number of features are relevant to the learning task.
- VC theory tells you that this scenario is likely to lead to high generalization error: the learned model will potentially overfit unless the training set is fairly large.
- So let's get rid of the useless parameters!

How to score features

- How do you know which features can be pruned?
- Given labeled data, we can compute some simple score S(i) that measures how informative each feature $x_i$ is about the class labels y.
- Ranking criteria:
  - Mutual information: score each feature by its mutual information with respect to the class labels (a sketch appears at the end of this section):

  $$ \mathrm{MI}(x_i, y) = \sum_{x_i \in \{0,1\}} \sum_{y \in \{0,1\}} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\, p(y)} $$

  - Bayes error
  - Redundancy (Markov-blanket score)
- We need to estimate the relevant p(·)'s from data, e.g., using MLE.

Feature Ranking

- Bayes error of each gene
- Information gain for each gene with respect to the given partition
- KL divergence of each removed gene w.r.t. its Markov blanket (MB)

Feature selection schemes

- Given n features, there are $2^n$ possible feature subsets (why?).
- Thus feature selection can be posed as a model selection problem over $2^n$ possible models.
- For large values of n, it is usually too expensive to explicitly enumerate and compare all $2^n$ models, so some heuristic search procedure is used to find a good feature subset.
- Three general approaches:
  - Filter: direct feature ranking, taking no account of the subsequent learning algorithm. Add (starting from the empty set) or remove (starting from the full set) features one by one based on S(i). Cheap, but subject to local optima and may be unrobust under different classifiers.
  - Wrapper: determine the inclusion or removal of features based on performance under the learning algorithm to be used (see the next slide, and the sketch at the end of this section).
  - Simultaneous learning and feature selection: e.g., L1-regularized LR, Bayesian feature selection (not covered in this class), etc.

Wrapper

- Forward search:
  1. Initialize F = ∅.
  2. Repeat:
     - For each i = 1, ..., n with i ∉ F, let $F_i = F \cup \{i\}$ and use some version of cross-validation to evaluate the features $F_i$ (i.e., train your learning algorithm using only the features in $F_i$ and estimate its generalization error).
     - Set F to be the best feature subset found in the last step.
  3. Select and output the best feature subset that was evaluated during the entire search procedure.
- Backward search:
  1. Initialize F = the full feature set.
  2. Proceed as above, removing one feature at a time instead of adding one.

Case study [Xing et al., 2001]

- The data:
  - 7130 genes from a microarray dataset
  - 72 samples: 47 type I leukemias (called ALL) and 25 type II leukemias (called AML)
- Three classifiers:
  - kNN
  - Gaussian classifier
  - Logistic regression

Regularization vs. Feature Selection

- Explicit feature selection often outperforms regularization.

[Figure: error curves comparing explicit feature selection with regularization.]

Model Selection

- Suppose we are trying to select among several different models for a learning problem.
- Examples:
  1. Polynomial regression: $h(x) = g(\theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_k x^k)$ …
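A minimal sketch of the mutual-information ranking criterion from the "How to score features" slide, for binary features, with all p(·)'s plugged in as MLE (empirical) frequencies; the data here are synthetic and only for illustration:

```python
import numpy as np

def mutual_information(x, y):
    """MI(x_i, y) for a binary feature x and binary labels y, with all
    probabilities estimated by MLE, i.e., empirical frequencies."""
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:                    # 0 log 0 = 0 by convention
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=200)
X = rng.integers(0, 2, size=(200, 10))
X[:, 3] = y ^ (rng.random(200) < 0.1)       # feature 3 tracks the label

scores = [mutual_information(X[:, i], y) for i in range(X.shape[1])]
print("features ranked by MI:", np.argsort(scores)[::-1])  # 3 comes first
```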
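And a sketch of the forward wrapper search, using the cross-validated error of a 1-nearest-neighbor classifier as the evaluation score (the choice of classifier and the synthetic data are mine; any of the learners from the case study could be plugged in):

```python
import numpy as np

def knn_cv_error(X, y, n_folds=5):
    """Cross-validated error rate of a 1-NN classifier on these features."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        # 1-NN: each held-out point takes the label of its nearest neighbor.
        dists = ((X[held_out][:, None, :] - X[train][None, :, :]) ** 2).sum(-1)
        preds = y[train][dists.argmin(axis=1)]
        errors.append(np.mean(preds != y[held_out]))
    return np.mean(errors)

def forward_search(X, y, max_features):
    F, best = [], (np.inf, [])
    while len(F) < max_features:
        # Evaluate F + {i} for every feature i not yet in F.
        scored = [(knn_cv_error(X[:, F + [i]], y), i)
                  for i in range(X.shape[1]) if i not in F]
        err, i = min(scored)
        F = F + [i]                 # keep the best subset of this size
        best = min(best, (err, list(F)))
    return best                     # best subset seen over the whole search

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 8))
y = (X[:, 0] + X[:, 5] > 0).astype(int)   # only features 0 and 5 matter
err, subset = forward_search(X, y, max_features=3)
print("selected features:", sorted(subset), "CV error:", err)
```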