Machine Learning 10-701/15-781, Spring 2008
Model Selection
Eric Xing
Lecture 14, March 2, 2008
Reading: Chap. 1 & 2, CB & Chap. 5, 6, TM

Bias-variance decomposition
- For one data set D and one test point x, the expected loss decomposes as (bias)^2 + variance + noise:

    E_{x,t}[(y(x;D) - t)^2] = \int (E_D[y(x;D)] - h(x))^2 p(x) dx
                            + \int E_D[(y(x;D) - E_D[y(x;D)])^2] p(x) dx
                            + \iint (h(x) - t)^2 p(x,t) dx dt

- Recall the VC bound: total risk = empirical risk + confidence interval. [Figure: risk vs. model complexity h; the empirical risk decreases with h while the confidence interval grows, and the total risk is minimized at the best model h*.]

Minimizing Empirical Risk & Structural Risk: SRM & ERM in practice
- There are many SRM-based strategies to build models.
- In the case of linear models

    y = <w|x> + b,

  one wants to make ||w|| a controlled parameter: let us call H_C the linear model function family satisfying the constraint ||w|| < C (with inputs bounded by ||x|| < R).
- Vapnik's major theorem: when C decreases, d(H_C) decreases.

Regularization
- Maximum-likelihood estimates are not always the best (James and Stein showed a counterexample in the early 60's).
- Alternative: we "regularize" the likelihood objective (also known as penalized likelihood, shrinkage, smoothing, etc.) by adding to it a penalty term:

    \hat{\theta}_{shrinkage} = argmax_\theta [ l(\theta; D) - \lambda ||\theta|| ],

  where \lambda > 0 and ||\theta|| may be the L1 or L2 norm.
- The choice of norm has an effect:
  - using the L2 norm pulls \theta directly towards the origin,
  - while using the L1 norm pulls towards the coordinate axes, i.e., it tries to set some of the coordinates to 0.
- This second approach can be useful in a feature-selection setting.

Bayesian and Frequentist
- Frequentist interpretation of probability:
  - Probabilities are objective properties of the real world, and refer to limiting relative frequencies (e.g., number of times I have observed heads). Hence one cannot write P(Katrina could have been prevented | D), since the event will never repeat.
  - Parameters of models are fixed, unknown constants. Hence one cannot write P(θ|D), since θ does not have a probability distribution.
  - Instead one can only write P(D|θ).
  - One computes point estimates of parameters using various estimators, θ* = f(D), which are designed to have various desirable qualities when averaged over future data D (assumed to be drawn from the "true" distribution).
- Bayesian interpretation of probability:
  - Probability describes degrees of belief, not limiting frequencies.
  - Parameters of models are hidden variables, so one can compute P(θ|D) or P(f(θ)|D) for some function f.
  - One estimates parameters by computing P(θ|D) using Bayes' rule:

      p(θ|D) = p(D|θ) p(θ) / p(D)

Bayesian interpretation of regularization
- Regularized linear regression.
- Recall that using squared error as the cost function results in the LMS estimate.
- Assuming iid data and Gaussian noise, LMS is equivalent to the MLE of θ:

    l(θ) = n log (1 / (\sqrt{2π} σ)) - (1/(2σ^2)) \sum_{i=1}^n (y_i - x_i^T θ)^2

- Now assume that the vector θ follows a normal prior with 0 mean and a diagonal covariance matrix:

    θ ~ N(0, τ^2 I)

- What is the posterior distribution of θ?

    p(θ|D) = p(D, θ) / p(D) ∝ p(D|θ) p(θ)
           = (2πσ^2)^{-n/2} exp{ -(1/(2σ^2)) \sum_{i=1}^n (y_i - x_i^T θ)^2 } × C exp{ -θ^T θ / (2τ^2) }

Bayesian interpretation of regularization, cont'd
- The posterior distribution of θ:

    p(θ|D) ∝ exp{ -(1/(2σ^2)) \sum_{i=1}^n (y_i - x_i^T θ)^2 } × exp{ -θ^T θ / (2τ^2) }

- This leads to a new objective:

    θ_MAP = argmax_θ [ -(1/(2σ^2)) \sum_{i=1}^n (y_i - x_i^T θ)^2 - (1/(2τ^2)) \sum_{k=1}^K θ_k^2 ]
          = argmax_θ [ l(θ; D) - λ ||θ||^2 ]

- This is L2-regularized LR: a MAP estimation of θ.
- What about L1-regularized LR? (homework)
- How to choose λ?
  - Cross-validation!

Feature Selection
- Imagine that you have a supervised learning problem where the number of features n is very large (perhaps n >> #samples), but you suspect that only a small number of features are "relevant" to the learning task.
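The Gaussian-prior MAP estimate discussed above has a well-known closed form: it is ridge regression with λ = σ²/τ². A minimal sketch, not from the lecture, with synthetic data and assumed values for σ and τ:

```python
import numpy as np

# Sketch (not the lecture's code): under Gaussian noise N(0, sigma^2) and
# a Gaussian prior N(0, tau^2 I), the MAP estimate is ridge regression,
#   theta_MAP = (X^T X + lambda I)^{-1} X^T y,  with lambda = sigma^2 / tau^2.

rng = np.random.default_rng(0)
n, d = 50, 5
sigma, tau = 0.5, 1.0                 # assumed noise std and prior std
theta_true = rng.normal(0.0, tau, size=d)

X = rng.normal(size=(n, d))
y = X @ theta_true + rng.normal(0.0, sigma, size=n)

lam = sigma**2 / tau**2               # regularization strength implied by the prior
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

# The prior shrinks the MAP estimate toward the origin relative to the MLE.
print(np.linalg.norm(theta_map) <= np.linalg.norm(theta_mle))  # prints True
```

The shrinkage is guaranteed for any λ > 0: in the SVD basis of X, each coefficient is scaled by s/(s² + λ) instead of 1/s.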
- VC theory can tell you that this scenario is likely to lead to high generalization error: the learned model will potentially overfit unless the training set is fairly large.
- So let's get rid of useless parameters!

How to score features
- How do you know which features can be pruned?
- Given labeled data, we can compute some simple score S(i) that measures how informative each feature x_i is about the class labels y.
- Ranking criteria:
  - Mutual information: score each feature by its mutual information with respect to the class labels:

      MI(x_i, y) = \sum_{x_i ∈ {0,1}} \sum_{y ∈ {0,1}} p(x_i, y) log [ p(x_i, y) / (p(x_i) p(y)) ]

  - Bayes error
  - Redundancy (Markov-blanket score) ...
- We need to estimate the relevant p()'s from data, e.g., using MLE.

Feature Ranking
- Bayes error of each gene
- Information gain of each gene with respect to the given partition
- KL divergence of each removed gene w.r.t. its Markov blanket

Feature selection schemes
- Given n features, there are 2^n possible feature subsets (why?).
- Thus feature selection can be posed as a model selection problem over 2^n possible models.
- For large values of n, it is usually too expensive to explicitly enumerate and compare all 2^n models; some heuristic search procedure is used to find a good feature subset.
- Three general approaches:
  - Filter: direct feature ranking, taking no account of the subsequent learning algorithm.
    - Add (from the empty set) or remove (from the full set) features one by one based on S(i).
    - Cheap, but subject to local optimality and may be unrobust under different classifiers.
  - Wrapper: determine the inclusion or removal of features based on performance under the learning algorithm to be used. See next slide.
  - Simultaneous learning and feature selection.
    - E.g., L1-regularized LR, Bayesian feature selection (not covered in this class), etc.

Wrapper
- Forward search:
  1. Initialize F = Ø.
  2. Repeat:
     - For i = 1, ..., n: if i ∉ F, let F_i = F ∪ {i}, and use some version of cross-validation to evaluate the feature set F_i.
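The mutual-information score S(i) described in this section, for binary features and binary labels with the p()'s estimated by MLE (empirical counts), can be sketched as follows. Variable names are illustrative, not from the lecture:

```python
import numpy as np

def mi_score(x, y):
    """MI(x_i, y) = sum over x,y in {0,1} of p(x,y) log[ p(x,y) / (p(x) p(y)) ],
    with all probabilities estimated by MLE (relative frequencies)."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_xy = np.mean((x == a) & (y == b))
            p_x, p_y = np.mean(x == a), np.mean(y == b)
            if p_xy > 0:              # convention: 0 log 0 = 0
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# A feature identical to the label is maximally informative (MI = log 2);
# a constant feature carries no information about the label (MI = 0).
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])
print(round(mi_score(y, y), 4))                       # prints 0.6931
print(round(mi_score(np.zeros(8, dtype=int), y), 4))  # prints 0.0
```

Ranking features by this score and keeping the top k is exactly the cheap "filter" approach discussed below.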
       (I.e., train your learning algorithm using only the features in F_i, and estimate its generalization error.)
     - Set F to be the best feature subset found in the last step.
  3. Select and output the best feature subset that was evaluated during the entire search procedure.
- Backward search:
  1. Initialize F = full set.
  2. ...

Case study [Xing et al, 2001]
- The case:
  - 7130 genes from a microarray dataset
  - 72 samples
  - 47 type I leukemias (called ALL) and 25 type II leukemias (called AML)
- Three classifiers:
  - kNN
  - Gaussian classifier
  - Logistic regression

Regularization vs. Feature Selection
- Explicit feature selection often outperforms regularization. [Figure: classification performance of regression with feature selection vs. regularization.]

Model Selection
- Suppose we are trying to select among several different models for a learning problem.
- Examples:
  1. polynomial
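The forward wrapper search above can be sketched as follows. The nearest-centroid classifier, the simple 2-fold split, and the synthetic data are stand-ins chosen for brevity, not the lecture's choices:

```python
import numpy as np

def cv_error(X, y, feats, folds=2):
    """Cross-validated error rate using only the features in `feats`,
    with a nearest-centroid classifier (an assumed stand-in learner)."""
    idx = np.arange(len(y))
    err = 0
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        Xtr, Xte = X[np.ix_(train, feats)], X[np.ix_(test, feats)]
        c0 = Xtr[y[train] == 0].mean(axis=0)   # class-0 centroid
        c1 = Xtr[y[train] == 1].mean(axis=0)   # class-1 centroid
        pred = (np.linalg.norm(Xte - c1, axis=1) <
                np.linalg.norm(Xte - c0, axis=1)).astype(int)
        err += np.sum(pred != y[test])
    return err / len(y)

def forward_select(X, y, max_feats):
    """Forward search: grow F greedily, return the best subset evaluated."""
    F, best_overall = [], ([], 1.0)
    for _ in range(max_feats):
        candidates = [i for i in range(X.shape[1]) if i not in F]
        e, i = min((cv_error(X, y, F + [i]), i) for i in candidates)
        F = F + [i]                            # keep the best F_i = F ∪ {i}
        if e < best_overall[1]:
            best_overall = (list(F), e)
    return best_overall[0]

rng = np.random.default_rng(1)
n = 40
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 6))
X[:, 2] += 3.0 * y                             # only feature 2 is informative
print(forward_select(X, y, max_feats=2))
```

On this toy data the informative feature (index 2) is picked first; the search then considers adding a second feature and reports whichever evaluated subset had the lowest cross-validated error.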