Slide 1: EM for HMMs, a.k.a. the Baum-Welch Algorithm
Machine Learning – 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, April 12th, 2006.
Recommended reading: "An Introduction to HMMs and Bayesian Networks," Z. Ghahramani, Int. Journal of Pattern Recognition and AI, 15(1):9-42, 2001 (especially Section 4).

Slides 2–3: Learning HMMs from fully observable data is easy
[HMM over five positions: hidden states X1,…,X5, each taking values in {a,…,z}, with observations O1,…,O5.]
- Learn 3 distributions: the starting-state distribution P(X1), the transition distribution P(Xt | Xt-1), and the observation distribution P(Ot | Xt).
- What if O is observed, but X is hidden?

Slide 4: Log likelihood for HMMs when X is hidden
- Marginal likelihood: O is observed, X is missing.
- For simplicity of notation, the training data consists of only one sequence: log P(o | θ) = log Σx P(x, o | θ).
- If there were m sequences: log P(D | θ) = Σj log Σx P(x, o(j) | θ).

Slide 5: Computing the log likelihood for HMMs when X is hidden
- The sum Σx P(x, o) ranges over every joint assignment to X1,…,X5, so computing it naively is exponential in the sequence length.

Slide 6: Computing the log likelihood for HMMs when X is hidden – variable elimination
- Can compute efficiently with variable elimination: sum out X1, then X2, and so on, passing a factor along the chain.

Slide 7: EM for HMMs when X is hidden
- E-step: use inference (the forwards-backwards algorithm).
- M-step: recompute the parameters with weighted data.

Slide 8: E-step
- The E-step computes the probability of the hidden variables x given o.
- This corresponds to inference: use the forwards-backwards algorithm!

Slide 9: The M-step
- Maximization step: use expected counts instead of counts.
- If learning requires Count(x, o), use EQ(t+1)[Count(x, o)].

Slide 10: Decomposition of likelihood revisited
- Likelihood optimization decomposes: the expected complete log likelihood separates into terms for P(X1), P(Xt | Xt-1), and P(Ot | Xt), so each distribution can be re-estimated independently (next three slides).

Slide 11: Starting state probability P(X1)
- Using expected counts: P(X1 = a) = θX1=a ∝ EQ(t+1)[Count(X1 = a)].

Slide 12: Transition probability P(Xt | Xt-1)
- Using expected counts: P(Xt = a | Xt-1 = b) = θXt=a|Xt-1=b ∝ EQ(t+1)[Count(Xt = a, Xt-1 = b)].

Slide 13: Observation probability P(Ot | Xt)
- Using expected counts: P(Ot = a | Xt = b) = θOt=a|Xt=b ∝ EQ(t+1)[Count(Ot = a, Xt = b)].

Slide 14: E-step revisited
- The E-step computes the probability of the hidden variables x given o. Must compute:
  - Q(xt = a | o) – the marginal probability of each position;
  - Q(xt+1 = a, xt = b | o) – the joint distribution over pairs of adjacent positions.

Slide 15: The forwards-backwards algorithm
- Forwards pass:
  - Initialization: α1(x1) = P(x1) P(o1 | x1).
  - For i = 2 to n, generate a forwards factor by eliminating Xi-1: αi(xi) = P(oi | xi) Σxi-1 P(xi | xi-1) αi-1(xi-1).
- Backwards pass:
  - Initialization: βn(xn) = 1.
  - For i = n-1 to 1, generate a backwards factor by eliminating Xi+1: βi(xi) = Σxi+1 P(xi+1 | xi) P(oi+1 | xi+1) βi+1(xi+1).
- ∀ i, the probability is Q(xi | o) ∝ αi(xi) βi(xi).

Slide 16: E-step revisited
- Q(xt = a | o), the marginal probability of each position: just forwards-backwards!
- Q(xt+1 = a, xt = b | o), the joint distribution over pairs of adjacent positions: homework! ☺
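Pulling slides 7–16 together, the following is a minimal sketch of one Baum-Welch iteration: the E-step runs forwards-backwards to get the per-position and pairwise posteriors, and the M-step re-normalizes expected counts. It is not code from the slides; the function name, the (pi, A, B) parameterization, and the list-of-integer-sequences data format are assumptions made for illustration, and the sketch omits the rescaling needed for long sequences.

```python
import numpy as np

def baum_welch_step(pi, A, B, sequences):
    """One EM (Baum-Welch) iteration for a discrete HMM (illustrative sketch).
    pi[i]   : P(X_1 = i)               -- starting-state distribution
    A[i, j] : P(X_t = j | X_{t-1} = i) -- transition distribution
    B[i, k] : P(O_t = k | X_t = i)     -- observation distribution
    sequences: list of observation sequences, each a list of symbol indices.
    Returns re-estimated (pi, A, B) built from expected counts."""
    S, K = B.shape
    exp_start = np.zeros(S)        # E_Q[Count(X_1 = i)]
    exp_trans = np.zeros((S, S))   # E_Q[Count(X_{t-1} = i, X_t = j)]
    exp_emit = np.zeros((S, K))    # E_Q[Count(X_t = i, O_t = k)]

    for obs in sequences:
        n = len(obs)
        # E-step: forwards-backwards (variable elimination along the chain).
        alpha = np.zeros((n, S))
        beta = np.zeros((n, S))
        alpha[0] = pi * B[:, obs[0]]                    # forwards initialization
        for t in range(1, n):                           # eliminate X_{t-1}
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0                                  # backwards initialization
        for t in range(n - 2, -1, -1):                  # eliminate X_{t+1}
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

        like = alpha[-1].sum()                          # P(o), the marginal likelihood
        gamma = alpha * beta / like                     # Q(x_t = i | o), per position

        # Accumulate expected counts (the "weighted data") for the M-step.
        exp_start += gamma[0]
        for t in range(n):
            exp_emit[:, obs[t]] += gamma[t]
        for t in range(n - 1):
            # Pairwise posterior Q(x_t = i, x_{t+1} = j | o).
            xi = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
            exp_trans += xi / like

    # M-step: maximum likelihood with expected counts in place of counts.
    pi_new = exp_start / exp_start.sum()
    A_new = exp_trans / exp_trans.sum(axis=1, keepdims=True)
    B_new = exp_emit / exp_emit.sum(axis=1, keepdims=True)
    return pi_new, A_new, B_new
```

In practice one iterates this step until the log likelihood stops improving, rescales α and β (or works in log space) to avoid underflow on long sequences, and smooths the expected counts to avoid zero-probability entries.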
Slide 17: What can you do with EM for HMMs? 1 – Clustering sequences
- Independent clustering: [diagram]
- Sequence clustering: [diagram]

Slide 18: What can you do with EM for HMMs? 2 – Exploiting unlabeled data
- Labeling data is hard work → save (graduate student) time by using both labeled and unlabeled data.
- Labeled data: <X = "brace", O = …>
- Unlabeled data: <X = ?????, O = …>

Slide 19: Exploiting unlabeled data in clustering
- A few data points are labeled <x, o>; most points are unlabeled <?, o>.
- In the E-step of EM:
  - If the i'th point is unlabeled: compute Q(X | oi) as usual.
  - If the i'th point is labeled: set Q(X = x | oi) = 1 and Q(X ≠ x | oi) = 0.
- M-step as usual.

Slide 21: 20 Newsgroups data – advantage of adding unlabeled data [plot]

Slide 22: 20 Newsgroups data – effect of additional unlabeled data [plot]

Slide 23: Exploiting unlabeled data in HMMs
- A few data points are labeled <x, o>; most points are unlabeled <?, o>.
- In the E-step of EM:
  - If the i'th point is unlabeled: compute Q(X | oi) as usual.
  - If the i'th point is labeled: set Q(X = x | oi) = 1 and Q(X ≠ x | oi) = 0.
- M-step as usual.
- Speed up by remembering the counts for the labeled data.

Slide 24: What you need to know
- Baum-Welch = EM for HMMs.
  - E-step: inference using forwards-backwards.
  - M-step: use weighted counts.
- Exploiting unlabeled data:
  - Some unlabeled data can help classification.
  - Small change to the EM algorithm: in the E-step, only use inference for the unlabeled data.

Slide 25: Acknowledgements
- Experiments combining labeled and unlabeled data provided by Tom Mitchell.

Slide 26: EM for Bayes Nets
Machine Learning – 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, April 12th, 2006.

Slide 27: Data likelihood for BNs
[Bayes net: Flu and Allergy are parents of Sinus; Sinus is the parent of Headache and Nose.]
- Given the structure, the log likelihood of fully observed data decomposes per CPT: log P(D | θ) = Σj Σi log P(xi(j) | PaXi(j), θ).

Slide 28: Marginal likelihood
- What if S (Sinus) is hidden?

Slide 29: Log likelihood for BNs with hidden data
- Marginal likelihood: O is observed, H is hidden: log P(D | θ) = Σj log Σh P(h, o(j) | θ).

Slide 30: E-step for BNs
- The E-step computes the probability of the hidden variables h given o.
- This corresponds to inference in the BN.

Slide 31: The M-step for BNs
- Maximization step: use expected counts instead of counts.
- If learning requires Count(h, o), use EQ(t+1)[Count(h, o)].

Slide 32: M-step for each CPT
- The M-step decomposes per CPT.
- Standard MLE: θXi=x|PaXi=u = Count(Xi = x, PaXi = u) / Count(PaXi = u).
- The M-step uses expected counts: θXi=x|PaXi=u = ExCount(Xi = x, PaXi = u) / ExCount(PaXi = u).

Slide 33: Computing expected counts
- The M-step requires expected counts: for a set of variables A, must compute ExCount(A = a).
- Some of A in example j will be observed; denote these by AO = …
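The small E-step change described on the "Exploiting unlabeled data" slides (19 and 23) amounts to clamping the posterior for labeled points. A minimal sketch follows; the (label, o) point format and the posterior_fn argument are assumptions for illustration, not from the slides.

```python
import numpy as np

def semi_supervised_e_step(points, n_states, posterior_fn):
    """E-step over a mix of labeled and unlabeled points (illustrative sketch).
    points      : list of (label, o) pairs; label is a state/cluster index,
                  or None if the point is unlabeled.
    posterior_fn: computes Q(X | o) under the current parameters, e.g. via
                  forwards-backwards for an HMM or inference in a BN.
    Returns the Q matrix used as weights in the usual M-step."""
    Q = np.zeros((len(points), n_states))
    for i, (label, o) in enumerate(points):
        if label is None:
            Q[i] = posterior_fn(o)     # unlabeled: inference as usual
        else:
            Q[i, label] = 1.0          # labeled: Q(X = x | o) = 1, Q(X != x | o) = 0
    return Q
```

Since the labeled rows of Q never change across iterations, their contribution to the expected counts can be computed once and cached, which is the speed-up mentioned on slide 23.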
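To make the per-CPT M-step on slide 32 concrete, here is a small sketch for the Sinus CPT of the Flu/Allergy/Sinus network, assuming binary variables, that Sinus is the only hidden variable, and that the E-step has already produced Q(Sinus | o_j) for every example; the data layout and function name are illustrative, not from the slides.

```python
import numpy as np

def m_step_sinus_cpt(examples, q_sinus):
    """Re-estimate P(Sinus | Flu, Allergy) from expected counts (sketch).
    examples[j] = (flu, allergy, headache, nose): observed 0/1 values.
    q_sinus[j, s] = Q(Sinus = s | o_j) from the E-step (BN inference)."""
    ex_count = np.zeros((2, 2, 2))   # ExCount(Flu = f, Allergy = a, Sinus = s)
    for j, (flu, allergy, _, _) in enumerate(examples):
        for s in (0, 1):
            ex_count[flu, allergy, s] += q_sinus[j, s]
    # MLE with expected counts: theta_{s|f,a} = ExCount(f, a, s) / ExCount(f, a).
    return ex_count / ex_count.sum(axis=2, keepdims=True)
```

A real implementation would add a small prior (smoothing) so that parent configurations with zero expected count do not produce divisions by zero.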