6.867 Machine Learning, Lecture 12 (Jaakkola)

Lecture topics: model selection criteria
• Feature subset selection (cont'd)
  – information criterion
  – conditional log-likelihood and description length
  – regularization
• Combination of classifiers, boosting

Feature subset selection (cont'd)

We have already discussed how to carry out feature selection with the Naive Bayes model. This model is easy to specify and estimate when the input vectors x are discrete (here vectors with binary {-1, 1} components). When we elect to use only a subset J of the features for classification, our model can be written as

    P(x, y) = \Big[ \prod_{i \in J} P(x_i|y) \Big] \Big[ \prod_{i \notin J} P(x_i) \Big] P(y)    (1)

Note that the features not used for classification, indexed by i \notin J, are assumed independent of the label. This way we still have a distribution over the original input vectors x and labels, but assert that only some of the components of x are relevant for classification. The probabilities involved, such as P(x_i|y), are obtained directly from empirical counts involving x_i and y (see the previous lecture).

The selection criterion we arrived at indicated that, as far as the log-likelihood of all the data is concerned, we should include features (i.e., replace P(x_i) with P(x_i|y) in the above model) in decreasing order of

    \hat{I}(X_i; Y) = \hat{H}(X_i) - \hat{H}(X_i|Y)    (2)

where the entropies \hat{H}(X_i) and \hat{H}(X_i|Y) are evaluated on the basis of the estimated probabilities for the Naive Bayes model. The disadvantage of this criterion is that the features are selected individually, i.e., without regard to how effective they might be in specific combinations with each other. This is a direct result of the Naive Bayes model, as well as of the fact that we opted to find features that maximally increase the log-likelihood of all the data. This is slightly different from trying to classify examples accurately.
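As a concrete illustration of the ranking criterion in Eq. (2), here is a minimal sketch, not taken from the notes, that estimates \hat{I}(X_i; Y) for binary {-1, 1} features directly from empirical frequencies and sorts the features in decreasing order. The function names (empirical_entropy, mutual_information, rank_features) and the use of NumPy are my own choices for the example; the notes only define the criterion itself.

import numpy as np

def empirical_entropy(p):
    """Entropy (in nats) of a discrete distribution given as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(xi, y):
    """Estimate I(X_i; Y) = H(X_i) - H(X_i|Y) from samples of a binary feature xi and labels y."""
    # Marginal entropy H(X_i)
    p_x = np.array([np.mean(xi == v) for v in (-1, 1)])
    h_x = empirical_entropy(p_x)
    # Conditional entropy H(X_i|Y) = sum_y P(y) H(X_i | Y = y)
    h_x_given_y = 0.0
    for label in (-1, 1):
        mask = (y == label)
        if mask.any():
            p_y = mask.mean()
            p_x_y = np.array([np.mean(xi[mask] == v) for v in (-1, 1)])
            h_x_given_y += p_y * empirical_entropy(p_x_y)
    return h_x - h_x_given_y

def rank_features(X, y):
    """Return feature indices sorted by decreasing estimated mutual information with the label."""
    scores = np.array([mutual_information(X[:, i], y) for i in range(X.shape[1])])
    return np.argsort(-scores), scores

Under these assumptions, rank_features(X, y) returns the indices in exactly the order in which Eq. (2) would have us add features, one at a time and independently of each other.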
For example, from the point of view of classification, it is not necessary to model the distribution over the feature vectors x. All we care about is the conditional probability P(y|x). Why should our estimation and the feature selection criterion then aim to increase the log-likelihood of generating the feature vectors as well? We will now turn to improving the feature selection part with MDL, while still estimating the probabilities P(x_i|y) and P(y) in closed form as before.

Feature subset selection with MDL

Let's assume we have estimated the parameters in the Naive Bayes model, i.e., \hat{P}(x_i|y) and \hat{P}(y), as before by maximizing the log-likelihood of all the data. Let's see what this model implies in terms of classification:

    \hat{P}(y=1|x) = \frac{\hat{P}(x, y=1)}{\hat{P}(x, y=1) + \hat{P}(x, y=-1)} = \frac{1}{1 + \hat{P}(x, y=-1)/\hat{P}(x, y=1)}    (3)

                  = \frac{1}{1 + \exp\big( -\log \frac{\hat{P}(x, y=1)}{\hat{P}(x, y=-1)} \big)} = \frac{1}{1 + \exp(-\hat{f}(x))}    (4)

where

    \hat{f}(x) = \log \frac{\hat{P}(x, y=1)}{\hat{P}(x, y=-1)}    (5)

is the discriminant function arising from the Naive Bayes model. So, for example, when \hat{f}(x) > 0, the logistic function implies that \hat{P}(y=1|x) > 0.5 and we would classify the example as positive. If \hat{f}(x) is a linear function of the inputs x, then the form of the conditional probability from the Naive Bayes model would be exactly as in a logistic regression model. Let's see if this is indeed so.

    \hat{f}(x) = \log \frac{\hat{P}(x, y=1)}{\hat{P}(x, y=-1)} = \sum_{i \in J} \log \frac{\hat{P}(x_i|y=1)}{\hat{P}(x_i|y=-1)} + \log \frac{\hat{P}(y=1)}{\hat{P}(y=-1)}    (6)

Note that only the terms tied to the labels remain. Is this a linear function of the binary features x_i? Yes, it is. We can write each term as a linear function of x_i as follows:

    \log \frac{\hat{P}(x_i|y=1)}{\hat{P}(x_i|y=-1)} = \delta(x_i, 1) \log \frac{\hat{P}(x_i=1|y=1)}{\hat{P}(x_i=1|y=-1)} + \delta(x_i, -1) \log \frac{\hat{P}(x_i=-1|y=1)}{\hat{P}(x_i=-1|y=-1)}    (7)

                                                     = \frac{x_i + 1}{2} \log \frac{\hat{P}(x_i=1|y=1)}{\hat{P}(x_i=1|y=-1)} + \frac{1 - x_i}{2} \log \frac{\hat{P}(x_i=-1|y=1)}{\hat{P}(x_i=-1|y=-1)}    (8)

By combining these terms we can write the discriminant function in the usual linear form

    \hat{f}(x) = \sum_{i \in J} \hat{w}_i x_i + \hat{w}_0    (9)

where the parameters \hat{w}_i and \hat{w}_0 are functions of the Naive Bayes conditional probabilities. Note that while we ended up with a logistic regression model, the parameters of this conditional distribution, \hat{w}_i and \hat{w}_0, have been estimated quite differently from the logistic model. We can now evaluate the conditional probability \hat{P}(y|x, J) from the Naive Bayes model corresponding to any subset J of relevant features.

Let's go back to the feature selection problem. The subset J should be optimized to maximize the conditional log-likelihood of the labels given the examples. Equivalently, we can minimize the description length

    DL-data(J) = -\sum_{t=1}^{n} \log \hat{P}(y_t|x_t, J)    (10)

with respect to J. This is a criterion that evaluates how useful the features are for classification, and it can no longer be reduced to evaluating features independently of one another. This also means that the optimization problem for finding J is a difficult one. The simplest way of approximately minimizing this criterion would be to start with no features, then include the single best feature, followed by the second feature that works best with the first one, and so on. Note that in this simple setting the classifier parameters associated with different subsets of features are fixed by the Naive Bayes model; we only optimize over the subset of features. The above sequential optimization of the criterion would yield features that are more useful for classification than ranking them by mutual information (the first feature to include would be the same, however). But we do not yet have a criterion for deciding how many features to include. In the MDL terminology, we need to describe the model as well. The more features we include, the more bits we need to describe both the set and the parameters.

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology.
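To make the derivation concrete, the following sketch (my illustration, not part of the original notes) estimates the Naive Bayes probabilities from {-1, 1} data, converts them into the weights \hat{w}_i and \hat{w}_0 of Eq. (9), and grows the subset J greedily by minimizing DL-data(J) from Eq. (10). The Laplace-style smoothing constant alpha and the fixed max_features budget are assumptions added so the sketch runs on finite data; the notes leave the choice of how many features to include to the MDL model-description term.

import numpy as np

def nb_estimates(X, y, alpha=1.0):
    """Estimate P(x_i = 1 | y) for each feature and P(y = 1) from +/-1 data.

    alpha adds a small smoothing count so no probability is exactly 0 or 1
    (an assumption made for this sketch, not part of the notes).
    """
    p_y1 = np.mean(y == 1)
    p_xi1 = {}  # p_xi1[c] is an array with entries P(x_i = 1 | y = c)
    for c in (-1, 1):
        Xc = X[y == c]
        p_xi1[c] = (np.sum(Xc == 1, axis=0) + alpha) / (len(Xc) + 2 * alpha)
    return p_xi1, p_y1

def nb_linear_weights(p_xi1, p_y1, J):
    """Turn the Naive Bayes probabilities into the weights of Eq. (9) for a subset J."""
    w = np.zeros(len(p_xi1[1]))
    w0 = np.log(p_y1 / (1 - p_y1))
    for i in J:
        log_pos = np.log(p_xi1[1][i] / p_xi1[-1][i])              # log ratio when x_i = +1
        log_neg = np.log((1 - p_xi1[1][i]) / (1 - p_xi1[-1][i]))  # log ratio when x_i = -1
        w[i] = 0.5 * (log_pos - log_neg)   # coefficient of x_i from Eq. (8)
        w0 += 0.5 * (log_pos + log_neg)    # constant part of Eq. (8)
    return w, w0

def dl_data(X, y, p_xi1, p_y1, J):
    """Description length of the labels, Eq. (10): -sum_t log P(y_t | x_t, J)."""
    w, w0 = nb_linear_weights(p_xi1, p_y1, J)
    f = X @ w + w0
    # P(y_t | x_t, J) = 1 / (1 + exp(-y_t * f(x_t))) for y_t in {-1, +1}
    return np.sum(np.logaddexp(0.0, -y * f))

def greedy_select(X, y, max_features):
    """Sequentially add the feature that most reduces DL-data(J)."""
    p_xi1, p_y1 = nb_estimates(X, y)
    J, remaining = [], list(range(X.shape[1]))
    for _ in range(max_features):
        best = min(remaining, key=lambda i: dl_data(X, y, p_xi1, p_y1, J + [i]))
        J.append(best)
        remaining.remove(best)
    return J

Note that, as in the lecture, the classifier parameters are fixed by the Naive Bayes estimates; only the subset J is optimized, and the greedy loop is merely an approximation to the hard combinatorial problem of minimizing DL-data(J).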

