Expectation Maximization
Machine Learning 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
April 10th, 2006

Announcements
Reminder: Project milestone due Wednesday, at the beginning of class.

Coordinate descent algorithms
Want: min_a min_b F(a, b).
Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat.
Converges (if F is bounded) to a (often good) local optimum, as we saw in the applet (play with it!).
K-means is a coordinate descent algorithm.

Expectation Maximization

Back to Unsupervised Learning of GMMs: a simple case
Remember: we have unlabeled data x_1, x_2, \ldots, x_m; we know there are k classes; we know P(y=1), P(y=2), \ldots, P(y=k); we don't know \mu_1, \ldots, \mu_k.
We can write the likelihood of the data:
P(\text{data} \mid \mu_1 \ldots \mu_k) = p(x_1 \ldots x_m \mid \mu_1 \ldots \mu_k)
  = \prod_{j=1}^m p(x_j \mid \mu_1 \ldots \mu_k)
  = \prod_{j=1}^m \sum_{i=1}^k p(x_j \mid \mu_i) P(y=i)
  \propto \prod_{j=1}^m \sum_{i=1}^k \exp\left(-\tfrac{1}{2\sigma^2}(x_j - \mu_i)^2\right) P(y=i)

EM for the simple case of GMMs: the E-step
If we know \mu_1, \ldots, \mu_k, we can easily compute the probability that point x_j belongs to class y=i:
P(y=i \mid x_j, \mu_1 \ldots \mu_k) \propto \exp\left(-\tfrac{1}{2\sigma^2}(x_j - \mu_i)^2\right) P(y=i)

EM for the simple case of GMMs: the M-step
If we know the probability that point x_j belongs to class y=i, the MLE for \mu_i is a weighted average (imagine k copies of each x_j, each with weight P(y=i \mid x_j)):
\mu_i = \frac{\sum_{j=1}^m P(y=i \mid x_j)\, x_j}{\sum_{j=1}^m P(y=i \mid x_j)}

E.M. for GMMs
E-step: compute the "expected" classes of all datapoints for each class:
P(y=i \mid x_j) \propto \exp\left(-\tfrac{1}{2\sigma^2}(x_j - \mu_i)^2\right) P(y=i)
(Just evaluate a Gaussian at x_j.)
M-step: compute the maximum-likelihood means given our data's class membership distributions:
\mu_i = \frac{\sum_j P(y=i \mid x_j)\, x_j}{\sum_j P(y=i \mid x_j)}

E.M. for General GMMs
p_i^{(t)} is shorthand for the estimate of P(y=i) on the t'th iteration.
Iterate: on the t'th iteration let our estimates be
\lambda_t = \{\mu_1^{(t)}, \ldots, \mu_k^{(t)}, \Sigma_1^{(t)}, \ldots, \Sigma_k^{(t)}, p_1^{(t)}, \ldots, p_k^{(t)}\}
E-step: compute the "expected" classes of all datapoints for each class:
P(y=i \mid x_j, \lambda_t) \propto p_i^{(t)}\, p(x_j \mid \mu_i^{(t)}, \Sigma_i^{(t)})
(Just evaluate a Gaussian at x_j.)
M-step: compute the maximum-likelihood parameters given our data's class membership distributions:
\mu_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\, x_j}{\sum_j P(y=i \mid x_j, \lambda_t)}
\Sigma_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\, [x_j - \mu_i^{(t+1)}][x_j - \mu_i^{(t+1)}]^T}{\sum_j P(y=i \mid x_j, \lambda_t)}
p_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)}{m},  where m = number of records

Gaussian Mixture Example (figures): start; after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations.

Some Bio Assay data (figure)
GMM clustering of the assay data (figure)
Resulting Density Estimator (figure)
Three classes of assay, each learned with its own mixture model (figure)
Resulting Bayes Classifier (figure)
Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness (figure): yellow means anomalous, cyan means ambiguous.

The general learning problem with missing data
Marginal likelihood: x is observed, z is missing:
\ell(\theta; \mathcal{D}) = \sum_{j=1}^m \log P(x_j \mid \theta) = \sum_{j=1}^m \log \sum_z P(x_j, z \mid \theta)

E-step
x is observed, z is missing.
Compute the probability of the missing data given the current choice of \theta: Q(z \mid x_j) for each x_j.
E.g., the class-membership probability computed for each point; this corresponds to the "classification" step in K-means.

Jensen's inequality
Theorem: \log \sum_z P(z) f(z) \geq \sum_z P(z) \log f(z)

Applying Jensen's inequality
Use \log \sum_z P(z) f(z) \geq \sum_z P(z) \log f(z) to lower-bound the marginal log likelihood.

The M-step maximizes a lower bound on weighted data
The lower bound from Jensen's inequality corresponds to a weighted dataset:
(x_1, z=1) with weight Q^{(t+1)}(z=1 \mid x_1)
(x_1, z=2) with weight Q^{(t+1)}(z=2 \mid x_1)
(x_1, z=3) with weight Q^{(t+1)}(z=3 \mid x_1)
(x_2, z=1) with weight Q^{(t+1)}(z=1 \mid x_2)
(x_2, z=2) with weight Q^{(t+1)}(z=2 \mid x_2)
(x_2, z=3) with weight Q^{(t+1)}(z=3 \mid x_2)

The M-step
Maximization step:
\theta^{(t+1)} \leftarrow \arg\max_\theta \sum_j \sum_z Q^{(t+1)}(z \mid x_j) \log P(z, x_j \mid \theta)
Use expected counts instead of counts: if learning requires Count(x, z), use E_{Q^{(t+1)}}[Count(x, z)].

Convergence of EM
Define the potential function F(\theta, Q):
\ell(\theta; \mathcal{D}) \geq F(\theta, Q) = \sum_j \sum_z Q(z \mid x_j) \log \frac{P(z, x_j \mid \theta)}{Q(z \mid x_j)}
EM corresponds to coordinate ascent on F.
Thus it maximizes a lower bound on the marginal log likelihood.

The M-step is easy
Using the potential function: fixing Q and maximizing F(\theta, Q^{(t+1)}) over \theta is exactly the weighted-data maximization above.

The E-step also doesn't decrease the potential function (1)
Fixing \theta to \theta^{(t)}.

KL divergence
Measures distance between distributions:
KL(Q \| P) = \sum_z Q(z) \log \frac{Q(z)}{P(z)} \geq 0
KL(Q \| P) is zero if and only if Q = P.
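To make the coordinate-ascent picture concrete, here is a minimal sketch of the simple-case EM updates from the GMM slides above (1-D data, known shared variance sigma^2, known uniform class priors, only the means are learned), which also tracks the lower bound F(theta, Q) just defined. This is an illustration, not part of the lecture: the function name em_simple_gmm, the toy data, and all constants are made up. Because the E-step sets Q to the exact posterior, the printed bound coincides with the log likelihood, which is the point the next few slides make.

```python
# Minimal sketch: EM for the "simple case" GMM (1-D data, known shared variance,
# known uniform class priors; only the means mu_i are learned).  Also prints the
# Jensen lower bound F(theta, Q) to illustrate the coordinate-ascent view.
import numpy as np

def em_simple_gmm(x, k, sigma=1.0, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    m = len(x)
    prior = np.full(k, 1.0 / k)                  # known P(y=i)
    mu = rng.choice(x, size=k, replace=False)    # initialize means at data points
    for t in range(iters):
        # E-step: Q(y=i | x_j) proportional to exp(-(x_j - mu_i)^2 / (2 sigma^2)) P(y=i)
        logw = -0.5 * ((x[:, None] - mu[None, :]) ** 2) / sigma**2 + np.log(prior)
        logZ = np.logaddexp.reduce(logw, axis=1, keepdims=True)   # log-sum-exp over classes
        Q = np.exp(logw - logZ)                                   # m x k responsibilities
        # Lower bound F(theta, Q) = sum_j sum_i Q log( p(x_j, y=i) / Q ).
        const = -0.5 * np.log(2 * np.pi * sigma**2)               # Gaussian normalizer
        F = np.sum(Q * (logw + const - np.log(Q + 1e-300)))
        loglik = np.sum(logZ) + m * const                         # marginal log likelihood
        # M-step: each mean is the weighted average of the points assigned to it
        mu = (Q.T @ x) / Q.sum(axis=0)
        print(f"iter {t}: lower bound F = {F:.3f}, log-likelihood = {loglik:.3f}")
    return mu

# Toy data from two well-separated Gaussians; the learned means should approach -3 and 3
x = np.concatenate([np.random.normal(-3, 1, 100), np.random.normal(3, 1, 100)])
print(em_simple_gmm(x, k=2))
```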
The E-step also doesn't decrease the potential function (2)
Fixing \theta to \theta^{(t)}.

The E-step also doesn't decrease the potential function (3)
Fixing \theta to \theta^{(t)}: maximizing F(\theta^{(t)}, Q) over Q sets Q to the posterior probability:
Q^{(t+1)}(z \mid x_j) = P(z \mid x_j, \theta^{(t)})
Note that F(\theta^{(t)}, Q^{(t+1)}) then equals the marginal log likelihood \ell(\theta^{(t)}; \mathcal{D}).

EM is coordinate ascent
M-step: fix Q, maximize F over \theta (a lower bound on \ell(\theta; \mathcal{D})).
E-step: fix \theta, maximize F over Q. "Realigns" F with the likelihood.

What you should know
K-means for clustering: the algorithm; it converges because it is coordinate ascent.
EM for mixtures of Gaussians: how to "learn" maximum likelihood parameters (locally maximum likelihood) in the case of unlabeled data.
Be happy with this kind of probabilistic analysis.
Remember, E.M. can get stuck in local minima, and empirically it DOES.
EM is coordinate ascent.
The general case for EM.

Acknowledgements
The K-means and Gaussian mixture models presentation contains material from the excellent tutorial by Andrew Moore: http://www.autonlab.org/tutorials/
K-means applet: http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html
Gaussian mixture models applet: http://www.neurosci.aist.go.jp/%7Eakaho/MixtureEM.html

EM for HMMs, a.k.a. the Baum-Welch Algorithm
Machine Learning 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
April 10th, 2006

Learning HMMs from fully observable data is easy
[Diagram: hidden-state chain X_1, X_2, X_3, X_4, X_5 (each taking values a through z) with observations O_1, ..., O_5.]
Learn 3 distributions: the initial state distribution P(X_1), the transition model P(X_t \mid X_{t-1}), and the observation model P(O_t \mid X_t).

Learning HMMs from fully observable data is easy
[Same diagram.]
Learn 3 distributions. What if O is observed, but X is hidden?

Log likelihood for HMMs with hidden X
Marginal likelihood: O is observed, X is missing.
For simplicity of notation, we'll consider training data consisting of only one sequence:
\ell(\theta; \mathcal{D}) = \log P(\mathcal{D} \mid \theta) = \log \sum_x P(\mathcal{D}, x \mid \theta)
If there were m sequences, we would sum this log-marginal over the m sequences.

E-step
[Diagram: hidden-state chain X_1, ..., X_5 with observations O_1, ..., O_5.]
The E-step computes the probability of the hidden variables x given o.
This will correspond to inference: use the forward-backward algorithm (see the sketch below).

The M-step
[Diagram: X_1, X_2, X_3, ...] ...
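Since the E-step above reduces to forward-backward inference, here is a minimal sketch of that computation for a discrete HMM, assuming an initial distribution pi, a transition matrix A, and an emission matrix B. The function name forward_backward, the scaling trick, and the toy example are illustrative choices, not part of the lecture.

```python
# Minimal sketch of forward-backward: computes the state posteriors
# gamma[t, i] = P(X_t = i | o_1..o_T) that the HMM E-step needs, for a discrete
# HMM with initial distribution pi, transition matrix A[i, j] = P(X_{t+1}=j | X_t=i),
# and emission matrix B[i, o] = P(O_t = o | X_t = i).  Per-step scaling avoids underflow.
import numpy as np

def forward_backward(pi, A, B, obs):
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))   # scaled forward messages
    beta = np.zeros((T, S))    # scaled backward messages
    scale = np.zeros(T)
    # Forward pass
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    # Backward pass (scaled with the same factors)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]
    gamma = alpha * beta                # state posteriors; each row sums to 1
    loglik = np.sum(np.log(scale))      # log P(o_1..o_T) as a byproduct
    return gamma, loglik

# Tiny 2-state, 2-symbol example
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
gamma, loglik = forward_backward(pi, A, B, obs=[0, 1, 0, 0, 1])
print(gamma, loglik)
```

The returned per-timestep posteriors are exactly the "expected" hidden-state assignments the E-step uses; the M-step would then re-estimate the three distributions from these expected counts.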