6.867 Machine Learning, Lecture 12 (Jaakkola)

Lecture topics: model selection criteria
• Feature subset selection (cont'd)
  – information criterion
  – conditional log-likelihood and description length
  – regularization
• Combination of classifiers, boosting

Feature subset selection (cont'd)

We have already discussed how to carry out feature selection with the Naive Bayes model. This model is easy to specify and estimate when the input vectors x are discrete (here vectors with binary {-1, 1} components). When we elect to use only a subset J of the features for classification, our model can be written as

    P(x, y) = \Big[ \prod_{i \in J} P(x_i|y) \Big] \Big[ \prod_{i \notin J} P(x_i) \Big] P(y)    (1)

Note that the features not used for classification, indexed by i \notin J, are assumed independent of the label. This way we still have a distribution over the original input vectors x and labels, but assert that only some of the components of x are relevant for classification. The probabilities involved, such as P(x_i|y), are obtained directly from empirical counts involving x_i and y (see the previous lecture).

The selection criterion we arrived at indicated that, as far as the log-likelihood of all the data is concerned, we should include features (i.e., replace P(x_i) with P(x_i|y) in the above model) in decreasing order of

    \hat{I}(X_i; Y) = \hat{H}(X_i) - \hat{H}(X_i|Y)    (2)

where the entropies \hat{H}(X_i) and \hat{H}(X_i|Y) are evaluated on the basis of the estimated probabilities for the Naive Bayes model. The disadvantage of this criterion is that the features are selected individually, i.e., without regard to how effective they might be in specific combinations with each other. This is a direct result of the Naive Bayes model, as well as of the fact that we opted to find features that maximally increase the log-likelihood of all the data. This is slightly different from trying to classify examples accurately.
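As a concrete illustration of the ranking criterion in Eq. (2), here is a minimal sketch, not taken from the notes, that estimates \hat{I}(X_i; Y) for binary {-1, 1} features directly from empirical frequencies and sorts the features in decreasing order. The function names (empirical_entropy, mutual_information, rank_features) and the use of NumPy are my own choices for the example; the notes only define the criterion itself.

import numpy as np

def empirical_entropy(p):
    """Entropy (in nats) of a discrete distribution given as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(xi, y):
    """Estimate I(X_i; Y) = H(X_i) - H(X_i|Y) from samples of a binary feature xi and labels y."""
    # Marginal entropy H(X_i)
    p_x = np.array([np.mean(xi == v) for v in (-1, 1)])
    h_x = empirical_entropy(p_x)
    # Conditional entropy H(X_i|Y) = sum_y P(y) H(X_i | Y = y)
    h_x_given_y = 0.0
    for label in (-1, 1):
        mask = (y == label)
        if mask.any():
            p_y = mask.mean()
            p_x_y = np.array([np.mean(xi[mask] == v) for v in (-1, 1)])
            h_x_given_y += p_y * empirical_entropy(p_x_y)
    return h_x - h_x_given_y

def rank_features(X, y):
    """Return feature indices sorted by decreasing estimated mutual information with the label."""
    scores = np.array([mutual_information(X[:, i], y) for i in range(X.shape[1])])
    return np.argsort(-scores), scores

Under these assumptions, rank_features(X, y) returns the indices in exactly the order in which Eq. (2) would have us add features, one at a time and independently of each other.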
For example, from the point of view of classification, it is not necessary to model the distribution over the feature vectors x. All we care about is the conditional probability P(y|x). Why should our estimation and the feature selection criterion then aim to increase the log-likelihood of generating the feature vectors as well? We will now turn to improving the feature selection part with MDL, while still estimating the probabilities P(x_i|y) and P(y) in closed form as before.

Feature subset selection with MDL

Let's assume we have estimated the parameters in the Naive Bayes model, i.e., \hat{P}(x_i|y) and \hat{P}(y), as before by maximizing the log-likelihood of all the data. Let's see what this model implies in terms of classification:

    \hat{P}(y=1|x) = \frac{\hat{P}(x, y=1)}{\hat{P}(x, y=1) + \hat{P}(x, y=-1)} = \frac{1}{1 + \hat{P}(x, y=-1)/\hat{P}(x, y=1)}    (3)

                  = \frac{1}{1 + \exp\big( -\log \frac{\hat{P}(x, y=1)}{\hat{P}(x, y=-1)} \big)} = \frac{1}{1 + \exp(-\hat{f}(x))}    (4)

where

    \hat{f}(x) = \log \frac{\hat{P}(x, y=1)}{\hat{P}(x, y=-1)}    (5)

is the discriminant function arising from the Naive Bayes model. So, for example, when \hat{f}(x) > 0, the logistic function implies that \hat{P}(y=1|x) > 0.5 and we would classify the example as positive. If \hat{f}(x) is a linear function of the inputs x, then the form of the conditional probability from the Naive Bayes model would be exactly as in a logistic regression model. Let's see if this is indeed so.

    \hat{f}(x) = \log \frac{\hat{P}(x, y=1)}{\hat{P}(x, y=-1)} = \sum_{i \in J} \log \frac{\hat{P}(x_i|y=1)}{\hat{P}(x_i|y=-1)} + \log \frac{\hat{P}(y=1)}{\hat{P}(y=-1)}    (6)

Note that only the terms tied to the labels remain. Is this a linear function of the binary features x_i? Yes, it is. We can write each term as a linear function of x_i as follows:

    \log \frac{\hat{P}(x_i|y=1)}{\hat{P}(x_i|y=-1)} = \delta(x_i, 1) \log \frac{\hat{P}(x_i=1|y=1)}{\hat{P}(x_i=1|y=-1)} + \delta(x_i, -1) \log \frac{\hat{P}(x_i=-1|y=1)}{\hat{P}(x_i=-1|y=-1)}    (7)

                                                     = \frac{x_i + 1}{2} \log \frac{\hat{P}(x_i=1|y=1)}{\hat{P}(x_i=1|y=-1)} + \frac{1 - x_i}{2} \log \frac{\hat{P}(x_i=-1|y=1)}{\hat{P}(x_i=-1|y=-1)}    (8)

By combining these terms we can write the discriminant function in the usual linear form

    \hat{f}(x) = \sum_{i \in J} \hat{w}_i x_i + \hat{w}_0    (9)

where the parameters \hat{w}_i and \hat{w}_0 are functions of the Naive Bayes conditional probabilities. Note that while we ended up with a logistic regression model, the parameters of this conditional distribution, \hat{w}_i and \hat{w}_0, have been estimated quite differently from the logistic model. We can now evaluate the conditional probability \hat{P}(y|x, J) from the Naive Bayes model corresponding to any subset J of relevant features.

Let's go back to the feature selection problem. The subset J should be optimized to maximize the conditional log-likelihood of the labels given the examples. Equivalently, we can minimize the description length

    DL-data(J) = -\sum_{t=1}^{n} \log \hat{P}(y_t|x_t, J)    (10)

with respect to J. This is a criterion that evaluates how useful the features are for classification, and it can no longer be reduced to evaluating features independently of one another. This also means that the optimization problem for finding J is a difficult one. The simplest way of approximately minimizing this criterion would be to start with no features, then include the single best feature, followed by the second feature that works best with the first one, and so on. Note that in this simple setting the classifier parameters associated with different subsets of features are fixed by the Naive Bayes model; we only optimize over the subset of features. The above sequential optimization of the criterion would yield features that are more useful for classification than ranking them by mutual information (the first feature to include would be the same, however). But we do not yet have a criterion for deciding how many features to include. In the MDL terminology, we need to describe the model as well. The more features we include, the more bits we need to describe both the set and the parameters.

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology.
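To make the derivation concrete, the following sketch (my illustration, not part of the original notes) estimates the Naive Bayes probabilities from {-1, 1} data, converts them into the weights \hat{w}_i and \hat{w}_0 of Eq. (9), and grows the subset J greedily by minimizing DL-data(J) from Eq. (10). The Laplace-style smoothing constant alpha and the fixed max_features budget are assumptions added so the sketch runs on finite data; the notes leave the choice of how many features to include to the MDL model-description term.

import numpy as np

def nb_estimates(X, y, alpha=1.0):
    """Estimate P(x_i = 1 | y) for each feature and P(y = 1) from +/-1 data.

    alpha adds a small smoothing count so no probability is exactly 0 or 1
    (an assumption made for this sketch, not part of the notes).
    """
    p_y1 = np.mean(y == 1)
    p_xi1 = {}  # p_xi1[c] is an array with entries P(x_i = 1 | y = c)
    for c in (-1, 1):
        Xc = X[y == c]
        p_xi1[c] = (np.sum(Xc == 1, axis=0) + alpha) / (len(Xc) + 2 * alpha)
    return p_xi1, p_y1

def nb_linear_weights(p_xi1, p_y1, J):
    """Turn the Naive Bayes probabilities into the weights of Eq. (9) for a subset J."""
    w = np.zeros(len(p_xi1[1]))
    w0 = np.log(p_y1 / (1 - p_y1))
    for i in J:
        log_pos = np.log(p_xi1[1][i] / p_xi1[-1][i])              # log ratio when x_i = +1
        log_neg = np.log((1 - p_xi1[1][i]) / (1 - p_xi1[-1][i]))  # log ratio when x_i = -1
        w[i] = 0.5 * (log_pos - log_neg)   # coefficient of x_i from Eq. (8)
        w0 += 0.5 * (log_pos + log_neg)    # constant part of Eq. (8)
    return w, w0

def dl_data(X, y, p_xi1, p_y1, J):
    """Description length of the labels, Eq. (10): -sum_t log P(y_t | x_t, J)."""
    w, w0 = nb_linear_weights(p_xi1, p_y1, J)
    f = X @ w + w0
    # P(y_t | x_t, J) = 1 / (1 + exp(-y_t * f(x_t))) for y_t in {-1, +1}
    return np.sum(np.logaddexp(0.0, -y * f))

def greedy_select(X, y, max_features):
    """Sequentially add the feature that most reduces DL-data(J)."""
    p_xi1, p_y1 = nb_estimates(X, y)
    J, remaining = [], list(range(X.shape[1]))
    for _ in range(max_features):
        best = min(remaining, key=lambda i: dl_data(X, y, p_xi1, p_y1, J + [i]))
        J.append(best)
        remaining.remove(best)
    return J

Note that, as in the lecture, the classifier parameters are fixed by the Naive Bayes estimates; only the subset J is optimized, and the greedy loop is merely an approximation to the hard combinatorial problem of minimizing DL-data(J).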

