EM for HMMs
a.k.a. the Baum-Welch Algorithm

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
April 11th, 2007
©2005-2007 Carlos Guestrin

The general learning problem with missing data
- Marginal likelihood – x is observed, z is missing:
  ℓ(θ; D) = Σ_j log P(x_j | θ) = Σ_j log Σ_z P(x_j, z | θ)

EM is coordinate ascent
- M-step: Fix Q, maximize F over θ (a lower bound on the log likelihood ℓ(θ))
- E-step: Fix θ, maximize F over Q – "realigns" F with the likelihood

What you should know about EM
- K-means for clustering: the algorithm converges because it's coordinate ascent
- EM for mixtures of Gaussians: how to "learn" maximum likelihood parameters (locally maximum likelihood) in the case of unlabeled data
- Be happy with this kind of probabilistic analysis
- Remember, EM can get stuck in local minima, and empirically it DOES
- EM is coordinate ascent
- General case for EM

Learning HMMs from fully observable data is easy
[HMM diagram: hidden states X1,…,X5, each taking values in {a,…,z}; one observation O1,…,O5 per state]
- Learn 3 distributions: the starting state probability P(X1), the transition probability P(Xt | Xt-1), and the observation probability P(Ot | Xt)
- What if O is observed, but X is hidden?

Log likelihood for HMMs when X is hidden
- Marginal likelihood – O is observed, X is missing
- For simplicity of notation, the training data consists of only one sequence:
  ℓ(θ) = log P(o | θ) = log Σ_x P(x, o | θ)
- If there were m sequences:
  ℓ(θ) = Σ_{j=1}^m log Σ_x P(x, o_j | θ)

Computing the log likelihood for HMMs when X is hidden – variable elimination
- Can compute efficiently with variable elimination

EM for HMMs when X is hidden
- E-step: Use inference (forwards-backwards algorithm)
- M-step: Recompute parameters
with weighted data

The E-step
- The E-step computes the probability of the hidden variables x given o
- This corresponds to inference – use the forwards-backwards algorithm!

The M-step
- Maximization step: use expected counts instead of counts
- If learning requires Count(x, o), use E_{Q(t+1)}[Count(x, o)]

Decomposition of likelihood revisited
- Likelihood optimization decomposes: since P(x, o) = P(x1) Π_t P(xt | xt-1) Π_t P(ot | xt), the expected log likelihood splits into separate terms for the starting state, transition, and observation distributions, and each can be maximized independently

Starting state probability P(X1)
- Using expected counts:
  P(X1 = a) = θ_{X1=a} = Σ_j Q(X1 = a | o_j) / m

Transition probability P(Xt | Xt-1)
- Using expected counts:
  P(Xt = a | Xt-1 = b) = θ_{Xt=a | Xt-1=b} = Σ_j Σ_t Q(Xt = a, Xt-1 = b | o_j) / Σ_j Σ_t Q(Xt-1 = b | o_j)

Observation probability P(Ot | Xt)
- Using expected counts:
  P(Ot = a | Xt = b) = θ_{Ot=a | Xt=b} = Σ_j Σ_{t: o_t = a} Q(Xt = b | o_j) / Σ_j Σ_t Q(Xt = b | o_j)

E-step revisited
- The E-step computes the probability of the hidden variables x given o
- Must compute:
  - Q(xt = a | o) – marginal probability of each position
  - Q(xt+1 = a, xt = b | o) – joint distribution between pairs of positions

The forwards-backwards algorithm
- Forwards pass – initialization: α_1(x1) = P(x1) P(o1 | x1)
- For i = 2 to n, generate a forwards factor by eliminating Xi-1:
  α_i(xi) = Σ_{x_{i-1}} α_{i-1}(x_{i-1}) P(xi | x_{i-1}) P(oi | xi)
- Backwards pass – initialization: β_n(xn) = 1
- For i = n-1 down to 1, generate a backwards factor by eliminating Xi+1:
  β_i(xi) = Σ_{x_{i+1}} P(x_{i+1} | xi) P(o_{i+1} | x_{i+1}) β_{i+1}(x_{i+1})
- ∀ i, the probability is: P(xi, o) = α_i(xi) β_i(xi), so Q(xi | o) ∝ α_i(xi) β_i(xi)

E-step revisited
- The E-step computes the probability of the hidden variables x given o
- Must compute:
  - Q(xt = a | o) – marginal probability of each position: just forwards-backwards!
  - Q(xt+1 = a, xt = b | o) – joint distribution between pairs of positions:
    Q(xt+1 = a, xt = b | o) ∝ α_t(b) P(xt+1 = a | xt = b) P(ot+1 | xt+1 = a) β_{t+1}(a)

What can you do with EM for HMMs? 1 – Clustering sequences
- Independent clustering
- Sequence clustering

What can you do with EM for HMMs? 2 – Exploiting unlabeled data
- Labeling data is hard work → save (graduate-student) time by using both labeled and unlabeled data
- Labeled data: <X = "brace", O = [image of handwritten word]>
- Unlabeled data: <X = ?????, O = [image of handwritten word]>

Exploiting unlabeled data in clustering
- A few data points are labeled <x, o>
- Most points are unlabeled <?, o>
- In the E-step of EM:
  - If the i-th point is unlabeled: compute Q(X | oi) as usual
  - If the i-th point is labeled: set Q(X = x | oi) = 1 and Q(X ≠ x | oi) = 0
- M-step as usual

[Figure: 20 Newsgroups data – advantage of adding unlabeled data]

[Figure: 20 Newsgroups data – effect of additional unlabeled data]

Exploiting unlabeled data in HMMs
- A few data points are labeled <x, o>; most points are unlabeled <?, o>
- In the E-step of EM:
  - If the i-th point is unlabeled: compute Q(X | oi) as usual
  - If the i-th point is labeled: set Q(X = x | oi) = 1 and Q(X ≠ x | oi) = 0
- M-step as usual
- Speed up by remembering the counts for the labeled data

What you need to know
- Baum-Welch = EM for HMMs
- E-step: inference using the forwards-backwards algorithm
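The forwards-backwards recursions above can be sketched in NumPy as follows. This is a minimal illustrative sketch, not the lecture's code; the function name and the convention that states and observation symbols are integers are assumptions. It returns exactly the two quantities the E-step needs: the per-position marginals Q(xt | o) and the pairwise posteriors Q(xt, xt+1 | o).

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Forwards-backwards (alpha-beta) recursions for a discrete HMM.

    pi[a]   = P(X1 = a)             (starting state probability)
    A[b, a] = P(Xt = a | Xt-1 = b)  (transition probability)
    B[a, o] = P(Ot = o | Xt = a)    (observation probability)
    obs     = observed sequence o1..on as integer symbols
    """
    n, k = len(obs), len(pi)
    alpha = np.zeros((n, k))  # alpha[t, a] = P(x_t = a, o_1..o_t)
    beta = np.zeros((n, k))   # beta[t, a]  = P(o_t+1..o_n | x_t = a)

    # Forwards pass: eliminate X_{i-1} at each step
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backwards pass: eliminate X_{i+1} at each step
    beta[n - 1] = 1.0
    for t in range(n - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    # Single-position posteriors: Q(xt | o) ∝ alpha_t * beta_t
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Pairwise posteriors: Q(xt = b, xt+1 = a | o)
    xi = np.zeros((n - 1, k, k))
    for t in range(n - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    return gamma, xi
```

Note that marginalizing the pairwise posterior over either position recovers the corresponding single-position marginal, which is a useful sanity check on any implementation.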
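Putting the pieces together, one full Baum-Welch iteration runs forwards-backwards per sequence (E-step) and then re-estimates the three distributions from expected counts (M-step). The sketch below is a hypothetical minimal implementation under the same parameter conventions as above (`pi`, `A`, `B` row-stochastic, integer observation symbols); it is not the lecture's code.

```python
import numpy as np

def hmm_loglike(pi, A, B, sequences):
    """Marginal log likelihood log P(o | θ) summed over sequences (forwards pass)."""
    total = 0.0
    for obs in sequences:
        alpha = pi * B[:, obs[0]]
        for t in range(1, len(obs)):
            alpha = (alpha @ A) * B[:, obs[t]]
        total += np.log(alpha.sum())
    return total

def baum_welch_step(pi, A, B, sequences):
    """One EM (Baum-Welch) iteration: E-step via forwards-backwards,
    M-step via maximum likelihood with expected counts in place of counts."""
    k, v = B.shape
    pi_c = np.zeros(k)        # expected Count(X1 = a)
    A_c = np.zeros((k, k))    # expected Count(Xt-1 = b, Xt = a)
    B_c = np.zeros((k, v))    # expected Count(Xt = a, Ot = o)
    for obs in sequences:
        n = len(obs)
        alpha = np.zeros((n, k))
        beta = np.ones((n, k))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, n):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        for t in range(n - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        like = alpha[-1].sum()              # P(o | θ)
        gamma = alpha * beta / like         # Q(xt | o)
        # Accumulate expected counts for this sequence
        pi_c += gamma[0]
        for t in range(n - 1):              # pairwise posteriors Q(xt, xt+1 | o)
            A_c += alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / like
        for t in range(n):
            B_c[:, obs[t]] += gamma[t]
    # M-step: normalize expected counts into distributions
    return (pi_c / pi_c.sum(),
            A_c / A_c.sum(axis=1, keepdims=True),
            B_c / B_c.sum(axis=1, keepdims=True))
```

Iterating `baum_welch_step` to convergence is the full algorithm; each iteration is guaranteed not to decrease the marginal likelihood, though, as the slides stress, it can converge to a local optimum.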