EM for HMMs
a.k.a. the Baum-Welch Algorithm

Recommended reading: "An Introduction to HMMs and Bayesian Networks,"
Z. Ghahramani, Int. Journal of Pattern Recognition and AI, 15(1):9-42, 2001
(especially Section 4).

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
April 12th, 2006

[Running example, shown as a figure on most slides: an HMM with hidden states
X1, ..., X5, each taking values in {a, ..., z}, and observations O1, ..., O5,
which are images of handwritten characters.]

Learning HMMs from fully observable data is easy
- When both X and O are observed, learn 3 distributions by simple counting:
  the starting state probability P(X1), the transition probability P(Xt | Xt-1),
  and the observation probability P(Ot | Xt).
- What if O is observed, but X is hidden?

Log likelihood for HMMs when X is hidden
- Marginal likelihood: O is observed, X is missing.
- For simplicity of notation, assume the training data consists of only one
  sequence o = (o1, ..., on):
      l(θ) = log P(o | θ) = log Σ_x P(x, o | θ)
- If there were m sequences:
      l(θ) = Σ_{j=1..m} log Σ_x P(x, o^(j) | θ)

Computing the log likelihood for HMMs when X is hidden – variable elimination
- The sum over x has exponentially many terms, but the chain structure
      P(x, o) = P(x1) P(o1 | x1) Π_{t=2..n} P(xt | xt-1) P(ot | xt)
  lets us compute it efficiently with variable elimination, summing out one
  Xi at a time.

EM for HMMs when X is hidden
- E-step: use inference (the forwards-backwards algorithm).
- M-step: recompute the parameters with weighted data.

E-step
- The E-step computes the probability of the hidden variables x given o.
- This corresponds to inference: use the forward-backward algorithm!

The M-step
- Maximization step: use expected counts instead of counts.
- If learning from fully observed data requires Count(x, o), use
  E_{Q^(t+1)}[Count(x, o)] instead.

Decomposition of likelihood revisited
- The likelihood optimization decomposes: the expected complete log-likelihood
  splits into separate terms for the starting state, transition, and
  observation probabilities,
      E_Q[log P(x, o)] = E_Q[log P(x1)] + Σ_t E_Q[log P(xt | xt-1)]
                         + Σ_t E_Q[log P(ot | xt)],
  so each set of parameters can be maximized independently.

Starting state probability P(X1)
- Using expected counts:
      P(X1=a) = θ_{X1=a} = (1/m) Σ_j Q^(t+1)(X1=a | o^(j))

Transition probability P(Xt | Xt-1)
- Using expected counts:
      P(Xt=a | Xt-1=b) = θ_{Xt=a | Xt-1=b}
                       = E_Q[Count(Xt-1=b, Xt=a)] / E_Q[Count(Xt-1=b)]

Observation probability P(Ot | Xt)
- Using expected counts:
      P(Ot=a | Xt=b) = θ_{Ot=a | Xt=b}
                     = E_Q[Count(Xt=b, Ot=a)] / E_Q[Count(Xt=b)]
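In code, these updates are just normalized sums of the posteriors produced by
the E-step. The sketch below is a minimal illustration (not taken from the
slides) of the M-step for a discrete-observation HMM; it assumes the E-step
has already produced, for each training sequence, the position marginals
gamma[t, a] = Q(Xt=a | o) and pairwise marginals xi[t, b, a] = Q(Xt=b, Xt+1=a | o).
All names (m_step, gammas, xis, pi, trans, emit) are my own choices.

```python
import numpy as np

def m_step(gammas, xis, observations, n_states, n_obs_symbols, eps=1e-12):
    """Re-estimate HMM parameters from expected counts (Baum-Welch M-step).

    gammas[j]: (T_j, n_states) array of posteriors Q(X_t = a | o^(j)).
    xis[j]:    (T_j - 1, n_states, n_states) array of posteriors
               Q(X_t = b, X_{t+1} = a | o^(j)), indexed [t, b, a].
    observations[j]: length-T_j array of observed symbol indices.
    """
    pi = np.zeros(n_states)                      # P(X1)
    trans = np.zeros((n_states, n_states))       # P(X_{t+1}=a | X_t=b), [b, a]
    emit = np.zeros((n_states, n_obs_symbols))   # P(O_t=o | X_t=b), [b, o]

    for gamma, xi, obs in zip(gammas, xis, observations):
        pi += gamma[0]                           # expected Count(X1 = a)
        trans += xi.sum(axis=0)                  # expected Count(X_t=b, X_{t+1}=a)
        for t, o in enumerate(obs):
            emit[:, o] += gamma[t]               # expected Count(X_t=b, O_t=o)

    # Normalize each distribution (divide by the corresponding expected counts);
    # eps only guards against division by zero for states that never occur.
    pi /= pi.sum() + eps
    trans /= trans.sum(axis=1, keepdims=True) + eps
    emit /= emit.sum(axis=1, keepdims=True) + eps
    return pi, trans, emit
```

A common practical variant adds a small pseudocount before normalizing to keep
probabilities away from zero, but that smoothing is not part of the algorithm
as presented in the slides.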
E-step revisited
- The E-step computes the probability of the hidden variables x given o.
- Must compute:
  - Q(xt=a | o): the marginal probability of each position.
  - Q(xt+1=a, xt=b | o): the joint distribution over pairs of neighboring
    positions.

The forwards-backwards algorithm
- Forwards pass: after an initialization step at position 1, for i = 2 to n
  generate a forwards factor by eliminating Xi-1.
- Backwards pass: after an initialization step at position n, for i = n-1
  down to 1 generate a backwards factor by eliminating Xi+1.
- For all i, the probability Q(Xi | o) is obtained by multiplying the forwards
  and backwards factors at position i and normalizing. (A code sketch of this
  computation appears at the end of these notes.)

E-step revisited
- Must compute:
  - Q(xt=a | o), the marginal probability of each position: just
    forwards-backwards!
  - Q(xt+1=a, xt=b | o), the joint distribution over pairs of neighboring
    positions: homework! ☺

What can you do with EM for HMMs? 1 – Clustering sequences
- Independent clustering: drop the transition arcs, so each hidden variable is
  an independent cluster assignment (a plain mixture model).
- Sequence clustering: keep the HMM structure, so neighboring positions share
  information about their cluster assignments.

What can you do with EM for HMMs? 2 – Exploiting unlabeled data
- Labeling data is hard work → save (graduate student) time by using both
  labeled and unlabeled data.
- Labeled data: <X = "brace", O = images of the handwritten letters>
- Unlabeled data: <X = ?????, O = images of the handwritten letters>

Exploiting unlabeled data in clustering
- A few data points are labeled <x, o>; most points are unlabeled <?, o>.
- In the E-step of EM:
  - If the i'th point is unlabeled: compute Q(X | oi) as usual.
  - If the i'th point is labeled: set Q(X=x | oi) = 1 and Q(X≠x | oi) = 0.
- M-step as usual.

[Figure: 20 Newsgroups data – advantage of adding unlabeled data.]
[Figure: 20 Newsgroups data – effect of additional unlabeled data.]

Exploiting unlabeled data in HMMs
- Same recipe as for clustering: a few sequences are labeled <x, o>, most are
  unlabeled <?, o>.
- In the E-step of EM:
  - If the i'th sequence is unlabeled: compute Q(X | oi) as usual.
  - If the i'th sequence is labeled: set Q(X=x | oi) = 1 and Q(X≠x | oi) = 0.
- M-step as usual.
- Speed up by remembering the counts for the labeled data, since those never
  change across iterations. (A sketch of the clamping step appears at the end
  of these notes.)

What you need to know
- Baum-Welch = EM for HMMs.
  - E-step: inference using forwards-backwards.
  - M-step: use weighted (expected) counts.
- Exploiting unlabeled data:
  - Some unlabeled data can help classification.
  - It requires only a small change to the EM algorithm.
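For reference, here is a minimal sketch (not from the slides) of the
forwards-backwards E-step for a discrete-observation HMM. It uses the common
rescaling trick so that long sequences do not underflow; the parameter names
(pi, trans, emit) and that scaling choice are my own, not something the slides
specify.

```python
import numpy as np

def forward_backward(pi, trans, emit, obs):
    """E-step for one observation sequence.

    pi[a]       = P(X1 = a)
    trans[b, a] = P(X_{t+1} = a | X_t = b)
    emit[a, o]  = P(O_t = o | X_t = a)
    obs         = array of observed symbol indices, length T.

    Returns gamma[t, a] = Q(X_t=a | o), xi[t, b, a] = Q(X_t=b, X_{t+1}=a | o),
    and log P(o).
    """
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))   # forwards factors, renormalized at each step
    beta = np.zeros((T, S))    # backwards factors
    scale = np.zeros(T)        # normalizers, so log P(o) = sum(log(scale))

    # Forwards pass: alpha[t] ∝ P(o_1..o_t, X_t).
    alpha[0] = pi * emit[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backwards pass: beta[t] ∝ P(o_{t+1}..o_T | X_t).
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= scale[t + 1]

    # Position marginals: multiply forwards and backwards factors, normalize.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Pairwise marginals over neighboring positions.
    xi = np.zeros((T - 1, S, S))
    for t in range(T - 1):
        unnorm = alpha[t][:, None] * trans * (emit[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] = unnorm / unnorm.sum()

    return gamma, xi, np.log(scale).sum()
```

Running forward_backward over every training sequence and then re-estimating
the parameters with the M-step sketch given earlier constitutes one full
Baum-Welch (EM) iteration.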
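Finally, the "small change to EM" for exploiting labeled data is just clamping
the E-step posteriors of the labeled points. The snippet below is a
hypothetical illustration for the clustering case; the function name and
arguments are invented for this sketch.

```python
import numpy as np

def e_step_with_labels(posteriors, labels, n_clusters):
    """Clamp E-step posteriors for labeled points (semi-supervised EM).

    posteriors[i] = Q(X | o_i) computed as usual, shape (n_clusters,).
    labels[i]     = known cluster index for point i, or None if unlabeled.
    """
    clamped = []
    for q, label in zip(posteriors, labels):
        if label is None:
            clamped.append(q)            # unlabeled: keep Q(X | o_i) as usual
        else:
            one_hot = np.zeros(n_clusters)
            one_hot[label] = 1.0         # labeled: Q(X = x | o_i) = 1, else 0
            clamped.append(one_hot)
    return clamped
```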