Machine learning lecture 9 Tommi S Jaakkola MIT CSAIL tommi csail mit edu Topics Feature selection motivation examples information value greedy selection regularization Combination methods forward backward fitting boosting Tommi Jaakkola MIT CSAIL 2 Feature selection Suppose we consider only a finite collection of possible basis functions i x i 1 m such as the input components i x xi We try to find a small subset S of basis functions that are sufficient to solve the regression or classification problem P y 1 x w g w0 k X wi si x i 1 where the indexes S s1 sk identify the selected basis functions Tommi Jaakkola MIT CSAIL 3 Feature selection Suppose we consider only a finite collection of possible basis functions i x i 1 m such as the input components i x xi We try to find a small subset S of basis functions that are sufficient to solve the regression or classification problem P y 1 x w g w0 k X wi si x i 1 where the indexes S s1 sk identify the selected basis functions There are many ways to find appropriate basis functions information value greedy selection regularization Tommi Jaakkola MIT CSAIL 4 Information value Let s first try to select the basis functions independently of the classifier i e gauge how informative they are in general about the class label Text classification example x is a document and the basis functions 1 x m x are word indicators 1 if document x contains word i i x 0 otherwise each document is represented by a binary vector T x 0 1 0 0 1 z m bits we will derive a score for each feature bit based on how much information it contains about the class label Tommi Jaakkola MIT CSAIL 5 Information value cont d Let s focus on a single feature e g the first one 1 0 1 0 1 y 1 1 1 1 To assess how the feature values relate to the labels we can calculate the frequency of occurence of different combinations of values P y P 1 P 1 y For example of docs such that 1 x 0 and y 1 P 1 0 y 1 n Tommi Jaakkola MIT CSAIL 6 Information value cont d Let s focus on a single feature e g the first one 1 0 1 0 1 y 1 1 1 1 To assess how the feature values relate to the labels we can calculate the frequency of occurence of different combinations of values P y P 1 P 1 y The mutual information score for each feature is given by I 1 y X X 1 0 1 y 1 1 P 1 y log2 P 1 y P y P 1 This score is zero if the label is independent of the feature value large but 1 if they are deterministically related Tommi Jaakkola MIT CSAIL 7 Selection by information value We rank the features according to their mutual information scores in the descending order of the score I 1 y X X 1 0 1 y 1 1 P 1 y log2 P 1 y P y P 1 how many features to include redundant features coordination among the features which classifier can make use of these features Tommi Jaakkola MIT CSAIL 8 Greedy selection 1 Find s1 and w w0 w1 T such that P y 1 x w g w0 w1 s1 x leads to the best classifier Tommi Jaakkola MIT CSAIL 9 Greedy selection 1 Find s1 and w w0 w1 T such that P y 1 x w g w0 w1 s1 x leads to the best classifier 2 Find s2 and w w0 w1 w2 T such that P y 1 x w g w0 w1 s1 x w2 s2 x gives the best performing classifier Tommi Jaakkola MIT CSAIL 10 Greedy selection 1 Find s1 and w w0 w1 T such that P y 1 x w g w0 w1 s1 x leads to the best classifier 2 Find s2 and w w0 w1 w2 T such that P y 1 x w g w0 w1 s1 x w2 s2 x gives the best performing classifier 3 Etc stopping criterion over fitting Tommi Jaakkola MIT CSAIL 11 Regularization We can also consider all of the basis functions at once P y 1 x w g w0 w1 1 x wm m x and introduce a regularization penalty that tries to set the weights to zero unless the corresponding basis functions are useful Tommi Jaakkola MIT CSAIL 12 Regularization We can also consider all of the basis functions at once P y 1 x w g w0 w1 1 x wm m x and introduce a regularization penalty that tries to set the weights to zero unless the corresponding basis functions are useful J w n X i 1 log P yi x w m X wi i 1 In other words we regularize the 1 norm not Euclidean norm of the weights w0 is not penalized Tommi Jaakkola MIT CSAIL 13 Regularization The effect of the regularization penalty depends on its derivative at w 0 2 1 5 1 0 5 0 2 1 0 1 2 w2 2 versus w J w n X i 1 Tommi Jaakkola MIT CSAIL log P yi x w m X wi i 1 14 Combination of methods Similarly to feature selection we can select simple weak classification or regression methods and combine them into a single strong method Example techniques forward fitting regression boosting classification Tommi Jaakkola MIT CSAIL 15 Combination of regression methods We want to combine multiple weak regression methods into a single strong method f x f x 1 f x m Suppose we are given a family simple regression methods f x w k x where k w specifies the identity of the basis function as well as the associated weight Forward fitting sequentially introduce new simple regression methods to reduce the remaining prediction error Tommi Jaakkola MIT CSAIL 16 Forward fitting cont d Simple family f x w k x k w We can fit each new component to reduce the prediction error in each iteration we solve the same type of estimation problem n X Step 1 1 argmin yi f xi 2 Tommi Jaakkola MIT CSAIL i 1 17 Forward fitting cont d Simple family f x w k x k w We can fit each new component to reduce the prediction error in each iteration we solve the same type of estimation problem n X Step 1 1 argmin yi f xi 2 Step 2 Tommi Jaakkola MIT CSAIL 2 argmin i 1 n X y i f z xi 1 f xi 2 i 1 error 18 Forward fitting cont d Simple family f x w k x k w We can fit each new component to reduce the prediction error in each iteration we solve the same type of estimation problem n X Step 1 1 argmin yi f xi 2 Step 2 2 argmin Step 3 i 1 n X y i f z xi 1 f xi …
View Full Document
Unlocking...