Machine Learning 10-701/15-781, Spring 2008

Model & Feature Selection

Eric Xing
Lecture 14, March 3, 2008

Reading: Chap. 1-2, CB; Chap. 5-6, TM

Bias-variance decomposition

- For one data set D and one test point x, the expected squared loss decomposes as

  $$ E_{x,t,D}\big[(y(x;D) - t)^2\big] = \int \big(E_D[y(x;D)] - h(x)\big)^2 p(x)\,dx + \int E_D\big[(y(x;D) - E_D[y(x;D)])^2\big]\, p(x)\,dx + \iint \big(h(x) - t\big)^2 p(x,t)\,dx\,dt $$

  where $h(x) = E[t \mid x]$ is the optimal regression function.
- In words: expected loss = (bias)$^2$ + variance + noise.

Recall the VC bound

[Figure: risk vs. model complexity h. The total (structural) risk is the sum of the empirical risk and a confidence interval that grows with complexity; the best model minimizes this sum (SRM), not the empirical risk alone (ERM).]

SRM/ERM in practice

- There are many SRM-based strategies to build models.
- In the case of linear models $y = \langle w, x \rangle + b$, one wants to make $\|w\|$ a controlled parameter: let us call $H_C$ the linear model function family satisfying the constraint $\|w\| < C$.
- Vapnik's major theorem: when C decreases, the VC dimension $d(H_C)$ decreases (for inputs satisfying $\|x\| < R$).

Regularization

- Maximum likelihood estimates are not always the best (James and Stein showed a counterexample in the early 60's).
- Alternative: we regularize the likelihood objective (also known as penalized likelihood, shrinkage, smoothing, etc.) by adding a penalty term:

  $$ \hat{\theta}_{\text{shrinkage}} = \arg\max_{\theta}\; \ell(\theta; D) - \lambda \|\theta\| $$

  where $\lambda > 0$ and $\|\cdot\|$ might be the L1 or the L2 norm.
- The choice of norm has an effect:
  - using the L2 norm pulls $\theta$ directly towards the origin,
  - while using the L1 norm pulls $\theta$ towards the coordinate axes, i.e., it tries to set some of the coordinates to 0.
  - This second approach can be useful in a feature-selection setting.

Bayesian and Frequentist

- Frequentist interpretation of probability:
  - Probabilities are objective properties of the real world and refer to limiting relative frequencies (e.g., the number of times I have observed heads). Hence one cannot write P(Katrina could have been prevented | D), since the event will never repeat.
  - Parameters of models are fixed, unknown constants. Hence one cannot write P(θ | D), since θ does not have a probability distribution; one can only write P(D | θ).
  - One computes point estimates of parameters using various estimators, $\hat{\theta} = f(D)$, which are designed to have desirable qualities when averaged over future data D (assumed to be drawn from the true distribution).
- Bayesian interpretation of probability:
  - Probability describes degrees of belief, not limiting frequencies.
  - Parameters of models are hidden variables, so one can compute P(θ | D), or P(f(θ) | D) for some function f.
  - One estimates parameters by computing P(θ | D) using Bayes' rule:

  $$ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} $$

Bayesian interpretation of regularization

- Regularized linear regression:
  - Recall that using squared error as the cost function results in the LMS estimate.
  - Assuming iid data and Gaussian noise, LMS is equivalent to the MLE of θ:

  $$ \ell(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 $$

  - Now assume that the vector θ follows a normal prior with zero mean and a diagonal covariance matrix: $\theta \sim N(0, \tau^2 I)$.
  - What is the posterior distribution of θ?

  $$ p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) = C\, \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \Big\} \exp\Big\{ -\frac{1}{2\tau^2}\, \theta^T \theta \Big\} $$

Bayesian interpretation of regularization, cont'd

- Taking the log of the posterior above leads to a new objective:

  $$ \ell_{\text{MAP}}(\theta; D) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 - \frac{1}{2\tau^2} \sum_{k=1}^{K} \theta_k^2 $$

- This is L2-regularized LR! (a MAP estimation of θ; a sketch follows below)
- What about L1-regularized LR? (homework)
- How to choose λ? Cross-validation! (also sketched below)
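To make the MAP reading concrete: a minimal numpy sketch (not part of the original slides; the data and settings are invented) that computes the MAP estimate under the $N(0, \tau^2 I)$ prior in closed form, i.e., ridge regression with $\lambda = \sigma^2/\tau^2$, and checks it against the penalized objective above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative only).
n, K = 50, 5
X = rng.normal(size=(n, K))
theta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
sigma, tau = 0.5, 1.0                      # noise std and prior std
y = X @ theta_true + sigma * rng.normal(size=n)

# Setting the gradient of l_MAP to zero gives the ridge normal equations
# with lambda = sigma^2 / tau^2.
lam = sigma**2 / tau**2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

def l_map(theta):
    """The penalized log-likelihood l_MAP(theta; D), up to constants."""
    resid = y - X @ theta
    return -(resid @ resid) / (2 * sigma**2) - (theta @ theta) / (2 * tau**2)

# The closed-form solution should beat small random perturbations of itself.
assert all(l_map(theta_map) >= l_map(theta_map + 0.01 * rng.normal(size=K))
           for _ in range(100))
print("MAP / ridge estimate:", theta_map)
```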
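The contrast between the two norms is easiest to see for an orthonormal design, where both penalized problems decouple across coordinates: the L2 penalty rescales every coordinate of the least-squares estimate toward the origin, while the L1 penalty soft-thresholds, driving small coordinates exactly to 0. A sketch under that (assumed) orthonormality:

```python
import numpy as np

def ridge_shrink(theta_ls, lam):
    # L2 penalty (orthonormal design): rescale each coordinate of the
    # least-squares estimate toward the origin; nothing becomes exactly 0.
    return theta_ls / (1.0 + lam)

def lasso_shrink(theta_ls, lam):
    # L1 penalty (orthonormal design): soft-thresholding, which sets
    # coordinates with |theta| <= lam exactly to 0 (feature selection).
    return np.sign(theta_ls) * np.maximum(np.abs(theta_ls) - lam, 0.0)

theta_ls = np.array([3.0, 0.4, -0.2, 1.5])   # unregularized LS estimate
print(ridge_shrink(theta_ls, lam=0.5))       # all shrunk, none zero
print(lasso_shrink(theta_ls, lam=0.5))       # [ 2.5  0. -0.  1. ]
```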
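And a minimal K-fold cross-validation loop for picking λ, as the last bullet suggests; the grid of λ values and the synthetic data are arbitrary choices for illustration:

```python
import numpy as np

def ridge_fit(X, y, lam):
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

def cv_error(X, y, lam, n_folds=5):
    """Average held-out squared error of ridge regression across folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errs = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        theta = ridge_fit(X[train], y[train], lam)
        resid = y[held_out] - X[held_out] @ theta
        errs.append(np.mean(resid**2))
    return np.mean(errs)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.3 * rng.normal(size=100)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print("lambda chosen by 5-fold CV:", best)
```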
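Finally, stepping back to the bias-variance decomposition at the top of this lecture: a small simulation that estimates the (bias)$^2$ and variance terms by refitting polynomials of different degrees over many random datasets (the true function, noise level, and degrees are my choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
h = np.sin                  # "true" regression function h(x) = E[t|x]
sigma = 0.3                 # noise standard deviation
x_test = np.linspace(0, 2 * np.pi, 50)

def fit_predict(degree):
    """Fit a degree-d polynomial to one random dataset D; predict at x_test."""
    x = rng.uniform(0, 2 * np.pi, size=30)
    t = h(x) + sigma * rng.normal(size=30)
    return np.polyval(np.polyfit(x, t, degree), x_test)

for degree in (1, 3, 7):
    preds = np.stack([fit_predict(degree) for _ in range(500)])
    mean_pred = preds.mean(axis=0)                 # E_D[y(x; D)]
    bias2 = np.mean((mean_pred - h(x_test)) ** 2)  # (bias)^2 term
    variance = np.mean(preds.var(axis=0))          # variance term
    # Total expected loss ~ bias^2 + variance + sigma^2 (the noise term).
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```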
Feature Selection

- Imagine that you have a supervised learning problem where the number of features n is very large (perhaps much larger than the number of samples), but you suspect that only a small number of features are relevant to the learning task.
- VC theory tells you that this scenario is likely to lead to high generalization error: the learned model will potentially overfit unless the training set is fairly large.
- So let's get rid of the useless parameters!

How to score features

- How do you know which features can be pruned?
- Given labeled data, we can compute some simple score S(i) that measures how informative each feature $x_i$ is about the class labels y.
- Ranking criteria:
  - Mutual information: score each feature by its mutual information with respect to the class labels (a sketch appears at the end of this section):

  $$ \mathrm{MI}(x_i, y) = \sum_{x_i \in \{0,1\}} \sum_{y \in \{0,1\}} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\, p(y)} $$

  - Bayes error
  - Redundancy (Markov-blanket score)
- We need to estimate the relevant p(·)'s from data, e.g., using MLE.

Feature Ranking

- Bayes error of each gene
- Information gain for each gene with respect to the given partition
- KL divergence of each removed gene w.r.t. its Markov blanket (MB)

Feature selection schemes

- Given n features, there are $2^n$ possible feature subsets (why?).
- Thus feature selection can be posed as a model selection problem over $2^n$ possible models.
- For large values of n, it is usually too expensive to explicitly enumerate and compare all $2^n$ models, so some heuristic search procedure is used to find a good feature subset.
- Three general approaches:
  - Filter: direct feature ranking, taking no account of the subsequent learning algorithm. Add (starting from the empty set) or remove (starting from the full set) features one by one based on S(i). Cheap, but subject to local optima and may be unrobust under different classifiers.
  - Wrapper: determine the inclusion or removal of features based on performance under the learning algorithm to be used (see the next slide, and the sketch at the end of this section).
  - Simultaneous learning and feature selection: e.g., L1-regularized LR, Bayesian feature selection (not covered in this class), etc.

Wrapper

- Forward search:
  1. Initialize F = ∅.
  2. Repeat:
     - For each i = 1, ..., n with i ∉ F, let $F_i = F \cup \{i\}$ and use some version of cross-validation to evaluate the features $F_i$ (i.e., train your learning algorithm using only the features in $F_i$ and estimate its generalization error).
     - Set F to be the best feature subset found in the last step.
  3. Select and output the best feature subset that was evaluated during the entire search procedure.
- Backward search:
  1. Initialize F = the full feature set.
  2. Proceed as above, removing one feature at a time instead of adding one.

Case study [Xing et al., 2001]

- The data:
  - 7130 genes from a microarray dataset
  - 72 samples: 47 type I leukemias (called ALL) and 25 type II leukemias (called AML)
- Three classifiers:
  - kNN
  - Gaussian classifier
  - Logistic regression

Regularization vs. Feature Selection

- Explicit feature selection often outperforms regularization.

[Figure: error curves comparing explicit feature selection with regularization.]

Model Selection

- Suppose we are trying to select among several different models for a learning problem.
- Examples:
  1. Polynomial regression: $h(x) = g(\theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_k x^k)$ …
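A minimal sketch of the mutual-information ranking criterion from the "How to score features" slide, for binary features, with all p(·)'s plugged in as MLE (empirical) frequencies; the data here are synthetic and only for illustration:

```python
import numpy as np

def mutual_information(x, y):
    """MI(x_i, y) for a binary feature x and binary labels y, with all
    probabilities estimated by MLE, i.e., empirical frequencies."""
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:                    # 0 log 0 = 0 by convention
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=200)
X = rng.integers(0, 2, size=(200, 10))
X[:, 3] = y ^ (rng.random(200) < 0.1)       # feature 3 tracks the label

scores = [mutual_information(X[:, i], y) for i in range(X.shape[1])]
print("features ranked by MI:", np.argsort(scores)[::-1])  # 3 comes first
```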
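And a sketch of the forward wrapper search, using the cross-validated error of a 1-nearest-neighbor classifier as the evaluation score (the choice of classifier and the synthetic data are mine; any of the learners from the case study could be plugged in):

```python
import numpy as np

def knn_cv_error(X, y, n_folds=5):
    """Cross-validated error rate of a 1-NN classifier on these features."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        # 1-NN: each held-out point takes the label of its nearest neighbor.
        dists = ((X[held_out][:, None, :] - X[train][None, :, :]) ** 2).sum(-1)
        preds = y[train][dists.argmin(axis=1)]
        errors.append(np.mean(preds != y[held_out]))
    return np.mean(errors)

def forward_search(X, y, max_features):
    F, best = [], (np.inf, [])
    while len(F) < max_features:
        # Evaluate F + {i} for every feature i not yet in F.
        scored = [(knn_cv_error(X[:, F + [i]], y), i)
                  for i in range(X.shape[1]) if i not in F]
        err, i = min(scored)
        F = F + [i]                 # keep the best subset of this size
        best = min(best, (err, list(F)))
    return best                     # best subset seen over the whole search

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 8))
y = (X[:, 0] + X[:, 5] > 0).astype(int)   # only features 0 and 5 matter
err, subset = forward_search(X, y, max_features=3)
print("selected features:", sorted(subset), "CV error:", err)
```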