Massachusetts Institute of Technology
6.867 Machine Learning, Fall 2006

Problem Set 4
Due Date: Thursday, Nov 9, 12:00 noon. You may submit your solutions in class or in the box.

1. We will explore here the use of the Bayesian Information Criterion (BIC) for model selection, specifically to select features for a Naive Bayes model. The input consists of d-dimensional binary feature vectors x = [x_1, ..., x_d]^T, where x_i ∈ {−1, 1}, and binary class labels y ∈ {−1, 1}. A Naive Bayes model may be used, for example, to classify emails as spam (y = 1) or not-spam (y = −1). The body of each email can be represented as a bag of words (i.e., we ignore frequency, placement, grammar, etc.). We could restrict ourselves to a dictionary of d words {w_1, ..., w_d}, and coordinate x_i could serve as an indicator of word w_i: x_i = 1 if w_i is present in the email and x_i = −1 if it is absent.

Recall that the Naive Bayes model assumes that the features are conditionally independent given the label, so that

    P(x, y) = \left[ \prod_{i=1}^{d} P(x_i | y) \right] P(y)    (1)

We parameterize our model by introducing separate parameters for each component:

    P(x | y, \theta) = \prod_{i=1}^{d} P(x_i | y, \theta_i), where    (2)

    P(x_i | y, \theta_i) = \theta_{i|y}^{(x_i + 1)/2} (1 - \theta_{i|y})^{(1 - x_i)/2}    (3)

Here θ = [θ_1, ..., θ_d]^T and θ_i = [θ_{i|1}, θ_{i|−1}]^T, so that

    \theta_{i|1} = P(x_i = 1 | y = 1),    \theta_{i|-1} = P(x_i = 1 | y = -1)    (4)

In class, we have already discussed how to compute the Maximum Likelihood estimates of these parameters. We will take a Bayesian approach, however, and introduce a prior probability over the parameters θ as:

    P(\theta) = \prod_{i=1}^{d} \prod_{y \in \{1, -1\}} P(\theta_{i|y})    (5)

    P(\theta_{i|y}) = \frac{\Gamma(r_+ + r_- + 2)}{\Gamma(r_+ + 1)\,\Gamma(r_- + 1)} \, \theta_{i|y}^{r_+} (1 - \theta_{i|y})^{r_-}    (6)

where r_+ and r_− are hyper-parameters, common across all the θ_{i|y}'s, and, for integer k, Γ(k + 1) = k!. The hyper-parameters characterize our beliefs about the parameters prior to seeing any data. You may assume here that r_+ and r_− are non-negative integers.

(a) Eqn 6 specifies the prior P(θ_{i|y}) as a Beta distribution. This choice of prior is particularly convenient, as we will now see. Show that the posterior distribution P(θ | D), given n training examples {(x^j, y^j) | j = 1, ..., n}, has the following form:

    P(\theta | D) \propto L(D; \theta) P(\theta) = \prod_{i=1}^{d} \prod_{y \in \{-1, 1\}} \theta_{i|y}^{m^+_{i|y}} (1 - \theta_{i|y})^{m^-_{i|y}}

where L(D; θ) is the likelihood function. What are m^+_{i|y} and m^-_{i|y}? Your answer should use the \hat{n}_y and \hat{n}_{iy}(x_i, y) notation used in the lectures; e.g., \hat{n}_{ky}(1, −1) is the number of examples where y = −1 and x_k = 1.

You have just shown that the Beta distribution is a conjugate prior to the Binomial distribution. In other words, if the prior probability P(θ) has the form of a Beta distribution, and the likelihood P(D | θ) has the form of a Binomial distribution, then the posterior probability P(θ | D) will be a Beta distribution. When the class labels and features are not binary but take on k values, the same relationship holds between the Dirichlet and Multinomial distributions.

(b) Recall that, in the Bayesian scheme for model selection, our aim is to find the model that maximizes the marginal likelihood P(D | F_d), which is the normalization constant for the posterior:

    P(D | F_d) = \int L(D; \theta) P(\theta) \, d\theta

where F_d denotes the model. Compute a closed-form expression for P(D | F_d). (Hint: use the form of the normalization constant for the Beta distribution.)
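As a companion to parts (a) and (b), here is a minimal Python sketch of the Beta-Binomial bookkeeping, assuming an illustrative data layout (X as an n×d NumPy array of ±1 feature values, y as a length-n array of ±1 labels). The function names log_beta, feature_counts, and log_marginal_feature_term are invented for this sketch, not taken from the course materials.

```python
# Hedged sketch: the Beta-Binomial bookkeeping behind parts (a) and (b).
# Assumes X is an (n, d) array of +/-1 feature values, y a length-n array of
# +/-1 labels; r_plus, r_minus are the hyper-parameters of Eqn 6.
# All names here are illustrative, not prescribed by the problem set.
import numpy as np
from scipy.special import gammaln


def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return gammaln(a) + gammaln(b) - gammaln(a + b)


def feature_counts(X, y, i, label):
    """n_hat_{iy}(1, label) and n_hat_{iy}(-1, label): how many examples with
    the given label have x_i = +1 and x_i = -1, respectively."""
    column = X[y == label, i]
    return int(np.sum(column == 1)), int(np.sum(column == -1))


def log_marginal_feature_term(X, y, i, r_plus, r_minus):
    """Log of the factors of P(D | F_d) that involve feature i: for each label,
    the posterior Beta normalizer (exponents m^+ = n_hat(1, y) + r_plus and
    m^- = n_hat(-1, y) + r_minus) divided by the prior Beta normalizer."""
    total = 0.0
    for label in (-1, 1):
        n_plus, n_minus = feature_counts(X, y, i, label)
        total += (log_beta(n_plus + r_plus + 1, n_minus + r_minus + 1)
                  - log_beta(r_plus + 1, r_minus + 1))
    return total
```

The factors involving P(y) and the other features are left out of this sketch; they are identical in the two models compared in part (c), so only the feature-i terms matter for that comparison.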
(c) Let us now use our results to choose between two models, F_1 and F_2. The only difference between them is in how they use feature i: F_2 involves a P(x_i | y) term that connects the feature to the label y, whereas F_1 only has P(x_i), assuming that feature x_i is independent of the label. In F_2 the distribution of x_i is parameterized by the two parameters [θ_{i|1}, θ_{i|−1}]^T as described before. In contrast, F_1 only requires a single parameter θ'_i:

    P(x_i | \theta'_i) = {\theta'_i}^{(x_i + 1)/2} (1 - \theta'_i)^{(1 - x_i)/2}    (7)

The prior probability over θ'_i has the same form as Eqn 6:

    P(\theta'_i) \propto {\theta'_i}^{r_+} (1 - \theta'_i)^{r_-}    (8)

The expressions for the marginal likelihood under the two models differ only in the terms involving feature i, so to compare the two models we can ignore all the other terms. From your results in part (b), extract from P(D | F_2) the factors involving feature i. Use a similar analysis to evaluate the terms involving feature i in P(D | F_1). Using these, evaluate the "decision rule" log[P(D | F_2) / P(D | F_1)] > 0 for including feature x_i (see the code sketch following part (e) below).

(d) Optional (Extra Credit): For many models the marginal likelihood is difficult to compute exactly and we have to resort to an asymptotic approximation:

    BIC(D, F_r) = \log L(D; \hat{\theta}_{ML}) - \frac{r}{2} \log n

We will now see (roughly speaking) how to relate BIC and P(D | F) in the general case. Our intuition is as follows: as the number of examples n increases, the posterior distribution P(θ | D) becomes more and more peaked around the maximum likelihood estimate θ̂_ML. In the limit n → ∞, the posterior distribution becomes a point estimate at θ̂_ML. We then approximate the integral by evaluating it only in the neighbourhood of θ̂_ML. We start by writing

    P(D | F_r) = \int \exp(\log L(D; \theta)) \, P(\theta) \, d\theta

and perform a Taylor series expansion of log L(D; θ) around θ̂_ML:

    \log L(D; \theta) \approx \log L(D; \hat{\theta}_{ML}) + (\theta - \hat{\theta}_{ML})^T A_1 + \frac{1}{2} (\theta - \hat{\theta}_{ML})^T A_2 (\theta - \hat{\theta}_{ML}),

where

    A_1 = \left. \frac{\partial \log L(D; \theta)}{\partial \theta} \right|_{\hat{\theta}_{ML}},    A_2 = \left. \frac{\partial^2 \log L(D; \theta)}{\partial \theta \, \partial \theta^T} \right|_{\hat{\theta}_{ML}}

The matrix A_2 has interesting properties (it's called the Fisher Information Matrix), but its most relevant property here is that the determinant satisfies |A_2| ≈ n^r C(r), where C(r) does not depend on the number of examples n. We have performed a second-order Taylor series expansion, i.e., up to the second derivative. What would the problem be if we had performed only a first-order expansion?

(e) Optional (Extra Credit): Now, assume that the prior is uninformative, i.e., P(θ) ...
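For part (c), the following is a hedged sketch of how the resulting decision rule could be evaluated numerically, under the same illustrative data layout as in the sketch above; include_feature is an invented name, and the log_beta helper is repeated so the block runs on its own.

```python
# Hedged sketch of the part (c) decision rule: include feature i when
# log P(D | F_2) - log P(D | F_1) > 0. Data layout is illustrative: X is an
# (n, d) array of +/-1 feature values, y a length-n array of +/-1 labels.
import numpy as np
from scipy.special import gammaln


def log_beta(a, b):
    return gammaln(a) + gammaln(b) - gammaln(a + b)


def include_feature(X, y, i, r_plus, r_minus):
    """True when the marginal likelihood favors the class-conditional model F_2
    over the label-independent model F_1 for feature i."""
    log_prior_norm = log_beta(r_plus + 1, r_minus + 1)

    # F_2: one posterior/prior Beta-normalizer ratio per class label.
    log_f2 = 0.0
    for label in (-1, 1):
        column = X[y == label, i]
        n_plus, n_minus = np.sum(column == 1), np.sum(column == -1)
        log_f2 += log_beta(n_plus + r_plus + 1, n_minus + r_minus + 1) - log_prior_norm

    # F_1: a single ratio with the counts pooled over both labels (Eqns 7-8).
    n_plus, n_minus = np.sum(X[:, i] == 1), np.sum(X[:, i] == -1)
    log_f1 = log_beta(n_plus + r_plus + 1, n_minus + r_minus + 1) - log_prior_norm

    return (log_f2 - log_f1) > 0
```

Used for feature selection, this would amount to something like keep = [i for i in range(X.shape[1]) if include_feature(X, y, i, r_plus, r_minus)], i.e., retaining exactly those features for which the class-conditional model wins the marginal-likelihood comparison.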

