UW-Madison COMPSCI 838 Topic - Advanced NLP - The EM Algorithm

CS838-1 Advanced NLP: The EM Algorithm
Xiaojin Zhu, 2007
Send comments to [email protected]

“Nice intuitions have nice mathematical explanations.”
“Not every iterative procedure can be called EM.”

1 Naive Bayes Revisited

Recall in Naive Bayes models, we are given a training set $\{(x, y)_{1:n}\}$, and our goal is to train a classifier that classifies any new document $x$ into one of $K$ classes. In the case of MLE, this is achieved by estimating parameters $\Theta = \{\pi, \theta_1, \ldots, \theta_K\}$, where $p(y = k) = \pi_k$ and $p(x \mid y = k) = \theta_k$, to maximize the joint log likelihood of the training data:
\[
\Theta = \arg\max_{\pi, \theta_1, \ldots, \theta_K} \log p(\{(x, y)_{1:n}\} \mid \pi, \theta_1, \ldots, \theta_K). \tag{1}
\]
The solution is
\[
\pi_k = \frac{\sum_{i=1}^n [y_i = k]}{n}, \qquad
\theta_{kw} = \frac{\sum_{i: y_i = k} x_{iw}}{\sum_{i: y_i = k} \sum_{u=1}^V x_{iu}}, \qquad k = 1, \ldots, K. \tag{2}
\]
Note $\pi$ is a probability vector of length $K$ over classes, and each $\theta_k$ is a probability vector of length $V$ over the vocabulary.

Classification is done by computing $p(y \mid x)$:
\begin{align}
\hat{y} &= \arg\max_k p(y = k \mid x) \tag{3} \\
&= \arg\max_k p(y = k)\, p(x \mid y = k) && \text{; Bayes rule, ignore constant denominator} \tag{4} \\
&= \arg\max_k \pi_k \prod_{w=1}^V \theta_{kw}^{x_w} \tag{5} \\
&= \arg\max_k \log \pi_k + \sum_{w=1}^V x_w \log \theta_{kw} && \text{; log is monotonic.} \tag{6}
\end{align}

2 K-means Clustering

So far so good. What if the training data is mostly unlabeled? For example, let’s assume we have only one labeled example per class:
\[
\{(x_1, y_1 = 1), (x_2, y_2 = 2), \ldots, (x_K, y_K = K), x_{K+1}, \ldots, x_n\}.
\]
We can simply ignore the unlabeled data $\{x_{K+1}, \ldots, x_n\}$ and train our Naive Bayes classifier on the labeled data. Can we do better?

Here is an iterative procedure, known as K-means clustering, which we apply to our classification task:

1. Estimate $\Theta^{(t=0)}$ from labeled examples only.
2. Repeat until things no longer change:
   (a) Classify every example (labeled and unlabeled) $x$ by its most likely class under the current model: $\hat{y} = \arg\max_k p(y = k \mid x, \Theta^{(t)})$.
   (b) Now we have a fully labeled dataset $\{(x, \hat{y})_{1:n}\}$. Retrain $\Theta^{(t+1)}$ on it. Let $t = t + 1$.

There are a couple of details:

• We re-classify all the labeled points. This is mostly for notational convenience. It is certainly fine (and probably more desirable) to fix their labels during the iterations. The derivation follows similarly, with separate terms for the labeled data.
• We use the K-means clustering algorithm for classification, but it was originally designed for clustering. For example, one can randomly pick $\theta_{1:K}$ to start with, and the algorithm will converge to $K$ final clusters. In that case, we do not have the correspondence between clusters and classes.
• K-means clustering is usually presented on mixtures of Gaussian distributions. Here we instead have a mixture of multinomial distributions. In either case, $\theta_k$ is the ‘centroid’ of cluster $k$. The distance is measured differently, though [1].

3 The EM Algorithm

In K-means we made a hard classification: each $x$ is assigned a unique label $\hat{y}$. A soft version would be to use the posterior $p(y \mid x)$. Intuitively, each $x_i$ is split into $K$ copies, but copy $k$ has weight $\gamma_{ik} = p(y_i = k \mid x_i)$ instead of one. We now have a dataset with fractional counts, but that poses no difficulty in retraining the parameters. One can show that the MLE is the weighted version of (2):
\[
\pi_k = \frac{\sum_{i=1}^n \gamma_{ik}}{n}, \qquad
\theta_{kw} = \frac{\sum_{i=1}^n \gamma_{ik} x_{iw}}{\sum_{i=1}^n \sum_{u=1}^V \gamma_{ik} x_{iu}}, \qquad k = 1, \ldots, K. \tag{7}
\]
The change from hard to soft leads us to the EM (Expectation Maximization) algorithm:

1. Estimate $\Theta^{(t=0)}$ from labeled examples only.
2. Repeat until convergence:
   (a) E-step: for $i = 1, \ldots, n$ and $k = 1, \ldots, K$, compute $\gamma_{ik} = p(y_i = k \mid x_i, \Theta^{(t)})$.
   (b) M-step: compute $\Theta^{(t+1)}$ from (7). Let $t = t + 1$.
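The following is a minimal NumPy sketch of this E-step/M-step loop for the mixture-of-multinomials model above: $\Theta$ is initialized from the labeled documents via (2), and the loop then alternates between the posteriors $\gamma_{ik}$ and the weighted MLE (7). The data layout (a document-term count matrix X of shape n x V, with y = -1 marking unlabeled documents) and the tiny smoothing constant eps are illustrative assumptions, not part of the notes.

```python
# A minimal sketch of the E-step / M-step loop for the multinomial mixture.
# Assumptions (not from the notes): X is an (n, V) count matrix, y holds class
# indices 0..K-1 for labeled documents and -1 for unlabeled ones, and a small
# smoothing constant eps keeps theta strictly positive.
import numpy as np

def m_step(X, gamma, eps=1e-12):
    """Weighted MLE (7): pi_k and theta_kw from fractional counts gamma_ik."""
    pi = gamma.sum(axis=0) / gamma.shape[0]                    # (K,)
    counts = gamma.T @ X                                       # (K, V) weighted word counts
    theta = (counts + eps) / (counts.sum(axis=1, keepdims=True) + eps * X.shape[1])
    return pi, theta

def initialize(X, y, K):
    """Step 1: estimate Theta^(0) from the labeled examples only, via (2)."""
    labeled = y >= 0                                           # y = -1 marks unlabeled docs
    gamma = np.zeros((labeled.sum(), K))
    gamma[np.arange(labeled.sum()), y[labeled]] = 1.0          # hard one-hot counts
    return m_step(X[labeled], gamma)

def e_step(X, pi, theta):
    """Posterior responsibilities gamma_ik = p(y_i = k | x_i, Theta), via (6)."""
    log_joint = np.log(pi) + X @ np.log(theta).T               # (n, K)
    log_joint -= log_joint.max(axis=1, keepdims=True)          # stabilize before exp
    gamma = np.exp(log_joint)
    return gamma / gamma.sum(axis=1, keepdims=True)

def em(X, y, K, iters=50):
    pi, theta = initialize(X, y, K)
    for _ in range(iters):
        gamma = e_step(X, pi, theta)                           # E-step on all documents
        pi, theta = m_step(X, gamma)                           # M-step: weighted version of (2)
    return pi, theta
```

Replacing the soft e_step with a hard arg max (a one-hot gamma row per document) recovers the K-means procedure of Section 2.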
4 Analysis of EM

EM might look like a heuristic method. However, as we show now, it has a rigorous foundation: EM is guaranteed to find a local optimum of the data log likelihood.

Recall that if we have complete data $\{(x, y)_{1:n}\}$ (as in standard Naive Bayes), the joint log likelihood under parameters $\Theta$ is
\[
\log p((x, y)_{1:n} \mid \Theta) = \sum_{i=1}^n \log p(y_i \mid \Theta)\, p(x_i \mid y_i, \Theta). \tag{8}
\]
However, now $y_{1:n}$ are hidden variables. We instead maximize the marginal log likelihood
\begin{align}
\ell(\Theta) &= \log p(x_{1:n} \mid \Theta) \tag{9} \\
&= \sum_{i=1}^n \log p(x_i \mid \Theta) \tag{10} \\
&= \sum_{i=1}^n \log \sum_{y=1}^K p(x_i, y \mid \Theta) \tag{11} \\
&= \sum_{i=1}^n \log \sum_{y=1}^K p(y \mid \Theta)\, p(x_i \mid y, \Theta). \tag{12}
\end{align}
We note that there is a summation inside the log. This couples the $\Theta$ parameters. If we try to maximize the marginal log likelihood by setting the gradient to zero, we will find that there is no longer a nice closed form solution, unlike the joint log likelihood with complete data. The reader is encouraged to attempt this to see the difference.

EM is an iterative procedure to maximize the marginal log likelihood $\ell(\Theta)$. It constructs a concave, easy-to-optimize lower bound $Q(\Theta, \Theta^{(t)})$, where $\Theta$ is the variable and $\Theta^{(t)}$ is the previous, fixed, parameter. The lower bound has an interesting property: $Q(\Theta^{(t)}, \Theta^{(t)}) = \ell(\Theta^{(t)})$. Therefore the new parameter $\Theta^{(t+1)}$ that maximizes $Q$ is guaranteed to have $Q(\Theta^{(t+1)}, \Theta^{(t)}) \ge \ell(\Theta^{(t)})$. Since $Q$ lower-bounds $\ell$, we have $\ell(\Theta^{(t+1)}) \ge \ell(\Theta^{(t)})$.

The lower bound is obtained via Jensen’s inequality
\[
\log \sum_i p_i f_i \ge \sum_i p_i \log f_i, \tag{13}
\]
which holds if the $p_i$’s form a probability distribution (i.e., non-negative and sum to 1). This follows from the concavity of log.
\begin{align}
\ell(\Theta) &= \sum_{i=1}^n \log \sum_{y=1}^K p(x_i, y \mid \Theta) \tag{14} \\
&= \sum_{i=1}^n \log \sum_{y=1}^K p(y \mid x_i, \Theta^{(t)}) \frac{p(x_i, y \mid \Theta)}{p(y \mid x_i, \Theta^{(t)})} \tag{15} \\
&\ge \sum_{i=1}^n \sum_{y=1}^K p(y \mid x_i, \Theta^{(t)}) \log \frac{p(x_i, y \mid \Theta)}{p(y \mid x_i, \Theta^{(t)})} \tag{16} \\
&\equiv Q(\Theta, \Theta^{(t)}). \tag{17}
\end{align}
Note we introduced a probability distribution $p(y \mid x_i, \Theta^{(t)}) \equiv \gamma_{iy}$ separately for each example $x_i$. This is what the E-step is computing.

The M-step maximizes the lower bound $Q(\Theta, \Theta^{(t)})$. It is worth noting that now we can set the gradient of $Q$ to zero and obtain a closed form solution. In fact the solution is simply (7), and we call it $\Theta^{(t+1)}$.

It is easy to see that
\[
Q(\Theta^{(t)}, \Theta^{(t)}) = \sum_{i=1}^n \log p(x_i \mid \Theta^{(t)}) = \ell(\Theta^{(t)}). \tag{18}
\]
Since $\Theta^{(t+1)}$ maximizes $Q$, we have
\[
Q(\Theta^{(t+1)}, \Theta^{(t)}) \ge Q(\Theta^{(t)}, \Theta^{(t)}) = \ell(\Theta^{(t)}). \tag{19}
\]
On the other hand, $Q$ lower-bounds $\ell$. Therefore
\[
\ell(\Theta^{(t+1)}) \ge Q(\Theta^{(t+1)}, \Theta^{(t)}) \ge Q(\Theta^{(t)}, \Theta^{(t)}) = \ell(\Theta^{(t)}). \tag{20}
\]
This shows that $\Theta^{(t+1)}$ is indeed a better (or no worse) parameter than $\Theta^{(t)}$ in terms of the marginal log likelihood $\ell$. By iterating, we arrive at a local maximum of $\ell$.

5 Deeper Analysis of EM

You might have noticed that we never referred to the concrete model $p(x \mid y)$, $p(y)$ (Naive Bayes) in the above analysis, except when we said that the solution is simply (7). Does this suggest that EM is more general than Naive Bayes? Besides, where did the particular probability distribution $p(y \mid x_i, \Theta^{(t)})$ come from in (15)?

The answer to the first question is yes. EM applies to joint probability models in which some random variables are missing. It is advantageous when the marginal likelihood is hard to optimize but the complete data likelihood is easy.
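To see the guarantee (20) concretely, one can track $\ell(\Theta)$ of (12) across iterations and check that it never decreases. Here is a small sketch that does this, reusing the hypothetical X, y, initialize, e_step, and m_step names from the earlier sketch; the multinomial coefficient inside $p(x_i \mid y, \Theta)$ is dropped since it does not depend on $\Theta$.

```python
# A small numerical check of (20): the marginal log likelihood l(Theta) of (12)
# should never decrease across EM iterations. Reuses the hypothetical
# X (n x V counts), y, initialize, e_step, and m_step from the sketch above.
import numpy as np
from scipy.special import logsumexp

def marginal_log_likelihood(X, pi, theta):
    """l(Theta) = sum_i log sum_y p(y | Theta) p(x_i | y, Theta), as in (12).
    The multinomial coefficient of p(x_i | y, Theta) is dropped; it is constant in Theta."""
    log_joint = np.log(pi) + X @ np.log(theta).T     # log p(x_i, y | Theta), shape (n, K)
    return logsumexp(log_joint, axis=1).sum()        # log-sum-exp over y, summed over i

def em_with_trace(X, y, K, iters=50):
    """Run EM and record l(Theta^(t)) after every M-step."""
    pi, theta = initialize(X, y, K)
    trace = [marginal_log_likelihood(X, pi, theta)]
    for _ in range(iters):
        gamma = e_step(X, pi, theta)                 # E-step
        pi, theta = m_step(X, gamma)                 # M-step, i.e. (7)
        trace.append(marginal_log_likelihood(X, pi, theta))
    # (20) in action, up to floating-point noise and the tiny smoothing in m_step.
    assert all(b >= a - 1e-8 for a, b in zip(trace, trace[1:]))
    return pi, theta, trace
```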

