EM for HMMs, a.k.a. the Baum-Welch Algorithm
Machine Learning 10-701/15-781, Carlos Guestrin, Carnegie Mellon University
April 11th, 2007 (© 2005-2007 Carlos Guestrin)

The general learning problem with missing data
- Marginal likelihood: x is observed, z is missing:
  log-likelihood(theta : D) = sum_j log P(x_j | theta) = sum_j log sum_z P(x_j, z | theta)

EM is coordinate ascent
- M-step: fix Q, maximize the lower bound F over the parameters theta.
- E-step: fix theta, maximize F over Q; this realigns F with the likelihood.

What you should know about EM
- K-means for clustering: the algorithm, and why it converges (it is coordinate ascent).
- EM for mixtures of Gaussians: how to learn (locally) maximum-likelihood parameters from unlabeled data.
- Be comfortable with this kind of probabilistic analysis.
- Remember: EM can get stuck in local optima, and empirically it DOES.
- EM is coordinate ascent; know the general case of EM.

Learning HMMs from fully observable data is easy
[HMM graphical model: hidden chain X1 -> X2 -> X3 -> X4 -> X5, each X_i taking values a..z, with one observation O_i per hidden state]
- With both X and O observed, learning reduces to estimating 3 distributions from counts: the starting-state distribution P(X1), the transition distribution P(X_t | X_t-1), and the observation distribution P(O_t | X_t).
- But what if O is observed and X is hidden?

Log-likelihood for HMMs when X is hidden
- Marginal likelihood: O is observed, X is missing.
- For simplicity of notation, assume the training data consists of only one sequence:
  log P(o | theta) = log sum_x P(x, o | theta)
- If there were m sequences, the log-likelihood would be a sum of one such term per sequence.

Computing the log-likelihood when X is hidden
- The sum over hidden state sequences x can be computed efficiently with variable elimination.

EM for HMMs when X is hidden
- E-step: use inference (the forwards-backwards algorithm).
- M-step: recompute the parameters with weighted (expected) data.

The E-step
- The E-step computes the probability of the hidden variables x given o.
- This corresponds to inference: use the forward-backward algorithm.

The M-step
- Maximization step: use expected counts instead of observed counts.
- If learning from fully observed data requires Count(x, o), use E_{Q^(t+1)}[Count(x, o)] instead.

Decomposition of likelihood revisited
- The likelihood optimization decomposes: each of the three distributions is estimated separately from its own expected counts.

Starting-state probability P(X1)
- Using expected counts: P(X1 = a) = theta_{X1=a} is proportional to E_Q[Count(X1 = a)]
  (with a single training sequence, this is just Q(x1 = a | o)).

Transition probability P(X_t | X_t-1)
- Using expected counts:
  P(X_t = a | X_t-1 = b) = theta_{Xt=a | Xt-1=b} = E_Q[Count(X_t = a, X_t-1 = b)] / E_Q[Count(X_t-1 = b)]

Observation probability P(O_t | X_t)
- Using expected counts:
  P(O_t = a | X_t = b) = theta_{Ot=a | Xt=b} = E_Q[Count(O_t = a, X_t = b)] / E_Q[Count(X_t = b)]

The E-step revisited
- The E-step computes the probability of the hidden variables x given o. It must compute:
  - Q(x_t = a | o), the marginal probability of each position, and
  - Q(x_t+1 = a, x_t = b | o), the joint distribution over pairs of adjacent positions.

The forwards-backwards algorithm
- Forwards pass: initialize, then for i = 2 to n generate a forwards factor by eliminating X_i-1.
- Backwards pass: initialize, then for i = n-1 down to 1 generate a backwards factor by eliminating X_i+1.
- For every i, the posterior probability of position i is proportional to the product of its forwards and backwards factors.

The E-step revisited (continued)
- Q(x_t = a | o), the marginal probability of each position: just forwards-backwards!
- Q(x_t+1 = a, x_t = b | o), the joint distribution over adjacent positions: also obtained from the forwards and backwards factors (a code sketch follows below).
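The slides keep the forwards-backwards pass and the expected-count updates at a high level. Below is a minimal numpy sketch of one Baum-Welch iteration for a discrete-observation HMM; the names pi (starting-state distribution), A (transition matrix), B (observation matrix), and obs (one sequence of symbol indices), as well as the scaling scheme, are my own choices rather than the lecture's notation.

```python
# Minimal sketch of one Baum-Welch (EM) iteration for a discrete-observation HMM.
# Assumed (not from the slides): pi[s], A[s, s'], B[s, symbol], obs = list of symbol indices.
import numpy as np

def forward_backward(pi, A, B, obs):
    """E-step: compute Q(x_t | o) and Q(x_t, x_{t+1} | o) with forwards-backwards,
    scaled at each step to avoid numerical underflow."""
    n, S = len(obs), len(pi)
    alpha = np.zeros((n, S))          # forwards factors
    beta = np.zeros((n, S))           # backwards factors
    scale = np.zeros(n)

    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, n):             # eliminate X_{t-1} going forwards
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    beta[n - 1] = 1.0
    for t in range(n - 2, -1, -1):    # eliminate X_{t+1} going backwards
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta              # Q(x_t = a | o); rows already sum to 1
    xi = np.zeros((n - 1, S, S))      # Q(x_t = a, x_{t+1} = b | o)
    for t in range(n - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= scale[t + 1]
    return gamma, xi, scale

def baum_welch_step(pi, A, B, obs):
    """M-step: re-estimate the three distributions from expected counts."""
    gamma, xi, scale = forward_backward(pi, A, B, obs)
    pi_new = gamma[0]                                            # expected starting-state counts
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]     # expected transition counts
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):                                  # expected observation counts
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new, np.log(scale).sum()             # last value = log-likelihood
```

Iterating baum_welch_step until the returned log-likelihood stops improving gives a locally maximal fit, consistent with the local-optima caveat above.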
What can you do with EM for HMMs? (1) Clustering sequences
- Independent clustering: cluster each observation on its own, ignoring the chain structure.
- Sequence clustering: use the HMM's transitions so that neighboring positions influence each other's cluster assignments.

What can you do with EM for HMMs? (2) Exploiting unlabeled data
- Labeling data is hard work: save graduate-student time by using both labeled and unlabeled data.
- Labeled data: both X and O are observed, {<x, o>}.
- Unlabeled data: only O is observed, {<?, o>}.

Exploiting unlabeled data in clustering
- A few data points are labeled <x, o>; most points are unlabeled <?, o>.
- In the E-step of EM:
  - If the i-th point is unlabeled, compute Q(X | o_i) as usual.
  - If the i-th point is labeled, set Q(X = x | o_i) = 1 and Q(X ≠ x | o_i) = 0.
- M-step as usual. (A code sketch of this modified E-step appears at the end of these notes.)

20 Newsgroups data
[Figures: the advantage of adding unlabeled data, and the effect of additional unlabeled data, on 20 Newsgroups classification accuracy.]

Exploiting unlabeled data in HMMs
- Same E-step modification as above: unlabeled points get Q(X | o_i) from inference, labeled points get an indicator distribution; M-step as usual.
- Speed-up: remember the (fixed) counts for the labeled data, so they need not be recomputed every iteration.

What you need to know
- Baum-Welch = EM for HMMs.
  - E-step: inference using forwards-backwards.
  - M-step: use weighted (expected) counts.
- Exploiting unlabeled data:
  - Some unlabeled data can help classification.
  - Only a small change to the EM algorithm: in the E-step, run inference only for the unlabeled data.

Acknowledgements
- Experiments combining labeled and unlabeled data provided by Tom Mitchell.
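As a footnote to the "Exploiting unlabeled data" slides, here is a minimal sketch of the modified E-step. It assumes a hypothetical posterior(o) routine (forwards-backwards for an HMM, Bayes' rule for a clustering model) and that unlabeled points carry the label None; these names are mine, not from the lecture.

```python
# Minimal sketch of the semi-supervised E-step: inference for unlabeled points,
# an indicator distribution for labeled ones. `posterior` is a hypothetical
# callable returning Q(X | o) as a length-num_classes array.
import numpy as np

def e_step(observations, labels, posterior, num_classes):
    Q = []
    for o, x in zip(observations, labels):
        if x is None:                      # unlabeled: compute Q(X | o_i) as usual
            Q.append(posterior(o))
        else:                              # labeled: Q(X = x | o_i) = 1, 0 elsewhere
            q = np.zeros(num_classes)
            q[x] = 1.0
            Q.append(q)
    return Q
```

The M-step is unchanged; as noted above, the expected counts contributed by the labeled points never change, so they can be computed once and cached.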