Unformatted text preview:

6 867 Machine learning lecture 12 Jaakkola 1 Lecture topics model selection criteria Feature subset selection cont d information criterion conditional log likelihood and description length regularization Combination of classi ers boosting Feature subset selection cont d We have already discussed how to carry out feature selection with the Naive Bayes model This model is easy to specify and estimate when the input vectors x are discrete here vectors with binary 1 1 components When we elect to use only a subset J of features for classi cation our model can be written as P x y P xi y P xi P y 1 i J i J Note that the features not used for classi cation indexed by j J are assumed indepen dent of the label This way we still have a distribution over the original input vectors x and labels but assert that only some of the components of x are relevant for classi cation The probabilities involved such as P xi y are obtained directly from empirical counts involving xi and y see previous lecture The selection criterion we arrived at indicated that as far as the log likelihood of all the data is concerned we should include features replace P xi with P xi y in the above model in the decreasing order of I Xi Y H Xi H Xi Y 2 where the entropies H Xi and H Xi Y are evaluated on the basis of the estimated prob abilities for the Naive Bayes model The disadvantage of this criterion is that the features are selected individually i e without regard to how e ective they might be in speci c combinations with each other This is a direct result of the Naive Bayes model as well as the fact that we opted to nd features that maximally help increase the log likelihood of all the data This is clearly slightly di erent from trying to classify examples accurately Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 12 Jaakkola 2 For example from the point of view of classi cation it is not necessary to model the distri bution over the feature vectors x All we care about is the conditional probability P y x Why should our estimation and the feature selection criterion then aim to increase the log likelihood of generating the feature vectors as well We will now turn to improving the feature selection part with MDL while still estimating the probabilities P xi y and P y in closed form as before Feature subset selection with MDL Let s assume we have estimated the parameters in the Naive Bayes model i e P xi y and P y as before by maximizing the log likelihood of all the data Let s see what this model implies in terms of classi cation P y 1 x P x y 1 P x y 1 P x y 1 1 1 exp log P x y 1 P x y 1 1 1 P x y 1 P x y 1 1 1 exp f x 3 4 where P x y 1 f x log P x y 1 5 is the discriminant function arising from the Naive Bayes model So for example when f x 0 the logistic function implies that P y 1 x 0 5 and we would classify the example as positive If f x is a linear function of the inputs x then the form of the conditional probability from the Naive Bayes model would be exactly as in a logistic regression model Let s see if this is indeed so P x y 1 P xi y 1 P y 1 f x log log log P x y 1 i J P xi y 1 P y 1 6 Note that only terms tried to the labels remain Is this a linear function of the binary features xi Yes it is We can write each term as a linear function of xi as follows log P xi y 1 P xi y 1 xi 1 log P xi 1 y 1 P xi 1 y 1 xi 1 log P xi 1 y 1 7 P xi 1 y 1 xi 1 P xi 1 y 1 1 xi P xi 1 y 1 log log 8 2 2 P xi 1 y 1 P xi 1 y 1 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 12 Jaakkola 3 By combining these terms we can write the discriminant function in the usual linear form f x w i xi w 0 9 i J where the parameters w i and w 0 are functions of the Naive Bayes conditional probabilities Note that while we ended up with a logistic regression model the parameters of this conditional distribution w i and w 0 have been estimated quite di erently from the logistic model We can now evaluate the conditional probability P y x J from the Naive Bayes model corresponding to any subset J of relevant features Let s go back to the feature selection problem The subset J should be optimized to maximize the conditional log likelihood of labels given examples Equivalently we can minimize the description length n DL data J log P yt xt J 10 t 1 with respect to J This is a criterion that evaluates how useful the features are for classi cation and it can be no longer reduced to evaluating features independently of others This also means that the optimization problem for nding J is a di cult one The simplest way of approximately minimizing this criterion would be to start with no features then include the single best feature followed by the second feature that works best with the rst one and so on Note that in this simple setting the classi er parameters associated with di erent subsets of features are xed by the Naive Bayes model we only optimize over the subset of features The above sequential optimization of the criterion would yield features that are more useful for classi cation than ranking them by mutual information the rst feature to include would be the same however But we do not yet have a criterion for deciding how many features to include In the MDL terminology we need to describe the model as well The more features we include the more bits we need to describe both the set and the parameters associated with using that set the Naive Bayes conditional probabilities So rst we need to communicate the set size and the elements The any integer J can be communicated with the cost of log J log J log log J 11 nats The elements in the set assuming a priori that they are drawn uniformly at random from d possible features require d log 12 J Cite as Tommi …


View Full Document

MIT 6 867 - Lecture Notes

Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?