Massachusetts Institute of Technology 6 867 Machine Learning Fall 2006 Problem Set 4 Due Date Thursday Nov 9 12 00 noon You may submit your solutions in class or in the box 1 We will explore here the use of the Bayesian Information Criterion BIC for model selection speci cally to select features for a Naive Bayes model The input consists of d dimensional binary feature vectors x x1 xd T where xi 1 1 and binary class labels y 1 1 A Naive Bayes model may be used for example to classify emails as spam y 1 or not spam y 1 The body of each email can be represented as a bag of words i e we ignore frequency placement and grammar etc We could restrict ourselves to a dictionary of d words w1 wd and coordinate xi could serve as an indicator of word wi xi 1 if wi is present in the email and xi 1 if it is absent Recall that the Naive Bayes model assumes that the features are conditionally independent given the label so that P x y d P xi y P y 1 i 1 We parameterize our model by introducing separate parameters for each component d P x y P xi y i where 2 i 1 xi 1 1 xi P xi y i i y2 1 i y2 3 4 Here 1 d T and i i 1 i 1 T so that i 1 P xi 1 y 1 i 1 P xi 1 y 1 In class we have already discussed how to compute the Maximum Likelihood estimates of these param eters We will take a Bayesian approach however and introduce a prior probability over the parameters as P P i y d P i y i 1 y 1 1 r r r 2 ir y 1 i y r 1 r 1 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 5 6 where r and r are hyper parameters common across all the i y s and for integer k k 1 k The hyper parameters characterize our beliefs about the parameters prior to seeing any data You may assume here that r and r are non negative integers a Eqn 5 speci es the prior P i y as a Beta distribution This choice of the prior is particularly convenient as we will see now Show that the posterior distribution P D given n training examples xj yj j 1 n has the following form P D L D P d m m i y i yi y 1 i y i 1 y 1 1 where L D is the likelihood function What are m i y and mi y Your answer should use the n y and n iy xi y notation used in the lectures e g n ky 1 1 is the number of examples where y 1 and xk 1 You have just shown that the Beta distribution is a conjugate prior to the Binomial distribution In other words if the prior probability P has the form of a Beta distribution and the likelihood P D has the form of a Binomial distribution then the posterior probability P D will be a Beta distribution When the class labels and features are not binary but take on k values the same relationship holds between the Dirichlet and Multinomial distributions b Recall that in the Bayesian scheme for model selection our aim is to nd the model that maximizes the marginal likelihood P D Fd which is the normalization constant for the posterior P D Fd L D P d where Fd denotes the model Compute a closed form expression for P D Fd Hint use the form of the normalization constant for the beta distribution c Let us now use our results to choose between two models F1 and F2 The only di erence between them is in how they use feature i F2 involves P xi y term that connects the feature to the label y whereas F1 only has P xi assuming that feature xi is independent of the label In F2 the distribution of xi s is parameterized by two parameters i 1 i 1 T as described before In contrast F1 only requires a single parameter i P xi i i xi 1 2 1 i 1 xi 2 7 The prior probability over i is the same as Eqn 6 P i i r 1 i r 8 The expressions of the marginal likelihood from the two models will di er only in terms of feature i To compare the two models we can ignore all the other terms From your results in part b extract from P D F2 the factors involving feature i Use a similar analysis to evaluate the terms involving feature i in P D F1 Using these evaluate the decision rule log P D F2 P D F1 0 to include feature xi Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY d Optional Extra Credit For many models the marginal likelihood is di cult to compute exactly and we have to an asymptotic approximation BIC D Fr log L D M L r log n 2 We will now see roughly speaking how to relate BIC and P D F in the general case Our intuition is as follows as the number of examples n increases the posterior distribution P D becomes more and more peaked around the maximum likelihood estimate M L In the limit n the posterior distribution becomes a point estimate at M L We then approximate the integral by evaluating it only in the neighbourhood of M L We start by writing P D Fr exp log L D P d and perform a Taylor series expansion of log L D around M L 1 log L D log L D M L M L T A1 M L T A2 M L where 2 log L D A1 M L A2 2 log L D T ML The matrix A2 has interesting properties it s called the Fisher Information Matrix but its most relevant property here is that the determinant A2 nC r where C r does not depend on the number of examples n We have performed a second order Taylor series expansion i e up to the second derivative What would the problem be if we had performed only a rst order expansion e Optional Extra Credit Now assume that the prior is uninformative i e P 1 strictly speaking it should be 1 d but the constant factor will not matter Show that with the second order approximation L D P d L D M L 2 n d 2 C1 r where C1 r consists of terms that do not depend on the number of examples n f Optional Extra Credit Using your result from part g argue that lim …
View Full Document
Unlocking...