EM for HMMs
a.k.a. The Baum-Welch Algorithm

Machine Learning – 10-701/15-781
Carlos Guestrin
Carnegie Mellon University
April 11th, 2007
©2005-2007 Carlos Guestrin

The general learning problem with missing data
- Marginal likelihood – x is observed, z is missing:
    l(θ; D) = Σ_i log P(x_i | θ) = Σ_i log Σ_z P(x_i, z | θ)

EM is coordinate ascent
- On the lower bound F(θ, Q) = Σ_i Σ_z Q(z|x_i) log [ P(x_i, z | θ) / Q(z|x_i) ]
- M-step: fix Q, maximize F over θ (a lower bound on the log likelihood l(θ)):
    θ^(t+1) = argmax_θ F(θ, Q^(t+1))
- E-step: fix θ, maximize F over Q:
    Q^(t+1)(z|x_i) = P(z | x_i, θ^(t))
  This "realigns" F with the likelihood: F(θ^(t), Q^(t+1)) = l(θ^(t); D)

What you should know about EM
- K-means for clustering: the algorithm converges because it is coordinate ascent
- EM for mixtures of Gaussians: how to "learn" maximum likelihood parameters (a local maximum of the likelihood) in the case of unlabeled data
- Be comfortable with this kind of probabilistic analysis
- Remember: EM can get stuck in local optima, and empirically it DOES
- EM is coordinate ascent
- The general case for EM

Learning HMMs from fully observable data is easy

[Figure: an HMM with hidden states X1, ..., X5, each ranging over {a, ..., z}, emitting observed handwritten-character images O1, ..., O5]

- Learn 3 distributions: the starting state probability P(X1), the transition probability P(Xt|Xt-1), and the observation probability P(Ot|Xt) – each estimated by counting in the fully observed data
- But what if O is observed while X is hidden?

Log likelihood for HMMs when X is hidden
- Marginal likelihood – O is observed, X is missing
- For simplicity of notation, assume the training data consists of only one sequence:
    l(θ; D) = log P(o | θ) = log Σ_x P(o, x | θ)
- If there were m sequences:
    l(θ; D) = Σ_{j=1..m} log Σ_x P(o_j, x | θ)

Computing the log likelihood for HMMs when X is hidden
- Expanding the HMM factorization (with the convention P(x1|x0) = P(x1)):
    log Σ_{x1,...,xn} Π_{t=1..n} P(xt | xt-1) P(ot | xt)
- The sum ranges over exponentially many assignments to x1, ..., xn, but it can be computed efficiently with variable elimination: eliminate X1, then X2, and so on – exactly the forwards pass described below

EM for HMMs when X is hidden
- E-step: use inference (the forwards-backwards algorithm)
- M-step: recompute the parameters with weighted data

The E-step
- The E-step computes the probability of the hidden variables x given o:
    Q^(t+1)(x | o) = P(x | o, θ^(t))
- This corresponds to inference – use the forward-backward algorithm!

The M-step
- Maximization step:
    θ^(t+1) = argmax_θ Σ_x Q^(t+1)(x | o) log P(x, o | θ)
- Use expected counts instead of counts: if learning requires Count(x,o), use E_{Q^(t+1)}[Count(x,o)]
- (a code sketch of these expected-count updates appears after the closing summary)

Decomposition of likelihood revisited
- Likelihood optimization decomposes:
    log P(x, o | θ) = log P(x1) + Σ_{t=2..n} log P(xt | xt-1) + Σ_{t=1..n} log P(ot | xt)
  so each of the three distributions can be optimized separately

Starting state probability P(X1)
- Using expected counts:
    P(X1=a) = θ_{X1=a} ∝ E_Q[Count(X1=a)] = Σ_j Q(x1=a | o_j)

Transition probability P(Xt|Xt-1)
- Using expected counts:
    P(Xt=a|Xt-1=b) = θ_{Xt=a|Xt-1=b} = E_Q[Count(Xt-1=b, Xt=a)] / E_Q[Count(Xt-1=b)]

Observation probability P(Ot|Xt)
- Using expected counts:
    P(Ot=a|Xt=b) = θ_{Ot=a|Xt=b} = E_Q[Count(Xt=b, Ot=a)] / E_Q[Count(Xt=b)]

E-step revisited
- The E-step computes the probability of the hidden variables x given o
- Must compute:
  - Q(xt=a | o) – the marginal probability of each position
  - Q(xt+1=a, xt=b | o) – the joint distribution over pairs of adjacent positions

The forwards-backwards algorithm
- Forwards pass. Initialization:
    α_1(x1) = P(x1) P(o1 | x1)
  For i = 2 to n, generate a forwards factor by eliminating Xi-1:
    α_i(xi) = P(oi | xi) Σ_{xi-1} α_{i-1}(xi-1) P(xi | xi-1)
- Backwards pass. Initialization:
    β_n(xn) = 1
  For i = n-1 down to 1, generate a backwards factor by eliminating Xi+1:
    β_i(xi) = Σ_{xi+1} P(xi+1 | xi) P(oi+1 | xi+1) β_{i+1}(xi+1)
- For every i, the probability is:
    Q(xi | o) ∝ α_i(xi) β_i(xi)
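The recursions above translate almost line-for-line into code. Below is a minimal NumPy sketch of the forwards-backwards E-step for a single sequence. The function name, the array-shape conventions (A[b, a] = P(Xt=a | Xt-1=b), B[a, o] = P(Ot=o | Xt=a)), and the absence of numerical rescaling are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def forwards_backwards(pi, A, B, obs):
    """E-step inference for one observation sequence.

    pi:  (S,) initial state distribution P(X1)
    A:   (S, S) transitions, A[b, a] = P(Xt=a | Xt-1=b)
    B:   (S, V) emissions,   B[a, o] = P(Ot=o | Xt=a)
    obs: length-n list of observation indices in {0, ..., V-1}

    Returns per-position marginals Q(xt | o) and pairwise
    marginals Q(xt, xt+1 | o).
    """
    n, S = len(obs), len(pi)
    alpha = np.zeros((n, S))  # forwards factors
    beta = np.zeros((n, S))   # backwards factors

    # Forwards pass: alpha_1(x) = P(x) P(o1 | x), then eliminate X_{i-1}.
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backwards pass: beta_n(x) = 1, then eliminate X_{i+1}.
    beta[-1] = 1.0
    for t in range(n - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    # Position marginals: Q(xt | o) ∝ alpha_t(x) beta_t(x).
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Pairwise marginals:
    #   Q(xt=b, xt+1=a | o) ∝ alpha_t(b) A[b, a] B[a, o_{t+1}] beta_{t+1}(a)
    xi = np.zeros((n - 1, S, S))
    for t in range(n - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()

    return gamma, xi
```

For long sequences the unscaled α and β factors underflow; a real implementation would rescale each factor (or work in log space), which leaves the normalized marginals unchanged.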
E-step revisited
- Q(xt=a | o) – the marginal probability of each position: just forwards-backwards!
- Q(xt+1=a, xt=b | o) – the joint distribution over pairs of adjacent positions:
    Q(xt=b, xt+1=a | o) ∝ α_t(b) P(xt+1=a | xt=b) P(ot+1 | xt+1=a) β_{t+1}(a)

What can you do with EM for HMMs? 1 – Clustering sequences
- Independent clustering: cluster each observation on its own, ignoring the transition structure (a plain mixture model)
- Sequence clustering: use the HMM transitions, so each position's cluster assignment also depends on its neighbors

What can you do with EM for HMMs? 2 – Exploiting unlabeled data
- Labeling data is hard work → save (graduate-student) time by using both labeled and unlabeled data
- Labeled data: <X="brace", O=[images of the five handwritten characters]>
- Unlabeled data: <X=?????, O=[images]>

Exploiting unlabeled data in clustering
- A few data points are labeled <x,o>; most points are unlabeled <?,o>
- In the E-step of EM:
  - if the i'th point is unlabeled: compute Q(X|o_i) as usual
  - if the i'th point is labeled: set Q(X=x|o_i) = 1 and Q(X≠x|o_i) = 0
- M-step as usual

[Figure: 20 Newsgroups data – advantage of adding unlabeled data]

[Figure: 20 Newsgroups data – effect of additional unlabeled data]

Exploiting unlabeled data in HMMs
- The same recipe: a few sequences are labeled <x,o>; most are unlabeled <?,o>
- In the E-step of EM:
  - if the i'th sequence is unlabeled: compute Q(X|o_i) as usual
  - if the i'th sequence is labeled: set Q(X=x|o_i) = 1 and Q(X≠x|o_i) = 0
- M-step as usual
- Speed up by remembering the counts for the labeled data: they never change across iterations, so compute them once

What you need to know
- Baum-Welch = EM for HMMs
- E-step: inference using forwards-backwards
- M-step: use weighted (expected) counts
- Exploiting unlabeled data: some unlabeled data can help
- (sketches of the expected-count M-step and a full Baum-Welch loop follow below)
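To make the expected-count updates concrete, here is a minimal NumPy sketch of the M-step, consuming the gamma/xi outputs of the forwards_backwards sketch above across m sequences. The function name, argument layout, and the eps guard for states with zero expected count are illustrative assumptions.

```python
import numpy as np

def m_step(gammas, xis, obs_seqs, S, V, eps=1e-12):
    """M-step: re-estimate all three distributions from expected counts.

    gammas, xis: per-sequence outputs of forwards_backwards above
    obs_seqs:    the observation sequences themselves
    S, V:        number of hidden states / observation symbols
    """
    pi = np.zeros(S)
    trans = np.zeros((S, S))
    emit = np.zeros((S, V))

    # Accumulate expected counts E_Q[Count(...)] over all sequences.
    for gamma, xi, obs in zip(gammas, xis, obs_seqs):
        pi += gamma[0]               # E[Count(X1 = a)]
        trans += xi.sum(axis=0)      # E[Count(Xt-1 = b, Xt = a)]
        for t, o in enumerate(obs):
            emit[:, o] += gamma[t]   # E[Count(Xt = b, Ot = o)]

    # Normalize expected counts into distributions; eps guards
    # against states with zero expected count.
    pi /= pi.sum()
    trans /= trans.sum(axis=1, keepdims=True) + eps
    emit /= emit.sum(axis=1, keepdims=True) + eps
    return pi, trans, emit
```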
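Finally, a sketch of how the two pieces combine into the full Baum-Welch loop, with a comment marking where the Q-clamping from the "Exploiting unlabeled data in HMMs" slide would go. The random Dirichlet initialization and fixed iteration count are arbitrary choices for illustration, and the loop relies on the forwards_backwards and m_step sketches above.

```python
import numpy as np

def baum_welch(obs_seqs, S, V, iters=50, seed=0):
    """Baum-Welch = EM for HMMs: alternate E- and M-steps."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(S))          # P(X1)
    A = rng.dirichlet(np.ones(S), size=S)   # rows: P(Xt | Xt-1 = b)
    B = rng.dirichlet(np.ones(V), size=S)   # rows: P(Ot | Xt = a)

    for _ in range(iters):
        # E-step: run forwards-backwards on every sequence.
        pairs = [forwards_backwards(pi, A, B, o) for o in obs_seqs]
        gammas, xis = zip(*pairs)
        # Semi-supervised variant: for a labeled sequence, skip the
        # inference call and instead set its gamma/xi to one-hot
        # indicators of the known state sequence (Q clamped to 1/0).
        # M-step: re-estimate parameters from expected counts.
        pi, A, B = m_step(gammas, xis, obs_seqs, S, V)
    return pi, A, B
```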