EM (cont.)
Machine Learning 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
November 26th, 2007
(2005-2007 Carlos Guestrin)

Silly Example
Let events be grades in a class:
  w1 = Gets an A:  P(A) = 1/2
  w2 = Gets a B:   P(B) = $\mu$
  w3 = Gets a C:   P(C) = $2\mu$
  w4 = Gets a D:   P(D) = $1/2 - 3\mu$
  (Note: $0 \le \mu \le 1/6$.)
Assume we want to estimate $\mu$ from data. In a given class there were a A's, b B's, c C's, d D's.
What's the maximum likelihood estimate of $\mu$ given a, b, c, d?

Trivial Statistics
P(A) = 1/2, P(B) = $\mu$, P(C) = $2\mu$, P(D) = $1/2 - 3\mu$
$$P(a,b,c,d \mid \mu) = K \left(\tfrac{1}{2}\right)^{a} \mu^{b} (2\mu)^{c} \left(\tfrac{1}{2}-3\mu\right)^{d}$$
$$\log P(a,b,c,d \mid \mu) = \log K + a\log\tfrac{1}{2} + b\log\mu + c\log 2\mu + d\log\left(\tfrac{1}{2}-3\mu\right)$$
FOR MAX LIKE $\mu$, SET $\partial \log P / \partial \mu = 0$:
$$\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{1/2 - 3\mu} = 0$$
Gives max like $\mu = \dfrac{b+c}{6(b+c+d)}$.
So if the class got A: 14, B: 6, C: 9, D: 10, the max like $\mu = 1/10$.
Boring, but true!

Same Problem with Hidden Information
Someone tells us that:
  Number of High grades (A's + B's) = h
  Number of C's = c
  Number of D's = d
REMEMBER: P(A) = 1/2, P(B) = $\mu$, P(C) = $2\mu$, P(D) = $1/2 - 3\mu$
What is the max like estimate of $\mu$ now?
We can answer this question circularly:
EXPECTATION: If we knew the value of $\mu$ we could compute the expected values of a and b:
  $a = \dfrac{1/2}{1/2+\mu}\,h$,  $b = \dfrac{\mu}{1/2+\mu}\,h$
  (since the ratio a : b should be the same as the ratio 1/2 : $\mu$)
MAXIMIZATION: If we knew the expected values of a and b we could compute the maximum likelihood value of $\mu$:
  $\mu = \dfrac{b+c}{6(b+c+d)}$

E-M for our Trivial Problem
We begin with a guess for $\mu$ and iterate between EXPECTATION and MAXIMIZATION to improve our estimates of $\mu$ and of a and b.
Define $\mu^{(t)}$ = the estimate of $\mu$ on the t'th iteration, $b^{(t)}$ = the estimate of b on the t'th iteration.
  $\mu^{(0)}$ = initial guess
  E-step: $b^{(t)} = \dfrac{\mu^{(t)} h}{1/2 + \mu^{(t)}} = E\bigl[b \mid \mu^{(t)}\bigr]$
  M-step: $\mu^{(t+1)} = \dfrac{b^{(t)}+c}{6\bigl(b^{(t)}+c+d\bigr)}$ = max like estimate of $\mu$ given $b^{(t)}$
Continue iterating until converged.
Good news: converging to a local optimum is assured.
Bad news: I said "local" optimum.

E-M Convergence
The convergence proof is based on the fact that P(data | $\mu$) must increase or remain the same between iterations [NOT OBVIOUS]. But it can never exceed 1 [OBVIOUS]. So it must therefore converge [OBVIOUS].
In our example, suppose we had h = 20, c = 10, d = 10, $\mu^{(0)} = 0$:

  t    mu(t)     b(t)
  0    0         0
  1    0.0833    2.857
  2    0.0937    3.158
  3    0.0947    3.185
  4    0.0948    3.187
  5    0.0948    3.187
  6    0.0948    3.187

Convergence is generally linear: the error decreases by a constant factor each time step.

Back to Unsupervised Learning of GMMs: a simple case
We have unlabeled data x1, x2, ..., xm.
We know there are k classes.
We know P(y=1), P(y=2), ..., P(y=k).
We don't know $\mu_1, \mu_2, \ldots, \mu_k$.
We can write P(data | $\mu_1 \ldots \mu_k$):
$$p(x_1 \ldots x_m \mid \mu_1 \ldots \mu_k) = \prod_{j=1}^{m} p(x_j \mid \mu_1 \ldots \mu_k) = \prod_{j=1}^{m} \sum_{i=1}^{k} p(x_j \mid \mu_i)\, P(y=i) = \prod_{j=1}^{m} \sum_{i=1}^{k} K \exp\!\left(-\frac{1}{2\sigma^2}(x_j - \mu_i)^2\right) P(y=i)$$
(K is the Gaussian normalizing constant.)

EM for simple case of GMMs: The E-step
If we know $\mu_1, \ldots, \mu_k$ we can easily compute the probability that point xj belongs to class y=i:
$$P(y=i \mid x_j, \mu_1 \ldots \mu_k) \propto \exp\!\left(-\frac{1}{2\sigma^2}(x_j - \mu_i)^2\right) P(y=i)$$

EM for simple case of GMMs: The M-step
If we know the probability that point xj belongs to class y=i, the MLE for $\mu_i$ is the weighted average (imagine k copies of each xj, each with weight $P(y=i \mid x_j)$):
$$\mu_i = \frac{\sum_{j=1}^{m} P(y=i \mid x_j)\, x_j}{\sum_{j=1}^{m} P(y=i \mid x_j)}$$

E-M for GMMs
E-step: compute "expected" classes of all datapoints for each class:
$$P(y=i \mid x_j, \mu_1 \ldots \mu_k) \propto \exp\!\left(-\frac{1}{2\sigma^2}(x_j - \mu_i)^2\right) P(y=i)$$
Just evaluate a Gaussian at xj.
M-step: compute the max like $\mu$ given our data's class membership distributions:
$$\mu_i = \frac{\sum_j P(y=i \mid x_j)\, x_j}{\sum_j P(y=i \mid x_j)}$$

E-M Convergence
EM is coordinate ascent on an interesting potential function.
Coordinate ascent on a bounded potential function guarantees convergence to a local optimum.
See the Neal & Hinton reading on the class webpage.
This algorithm is REALLY USED, and in high-dimensional state spaces too, e.g. Vector Quantization for Speech Data.
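For concreteness, here is a minimal Python sketch of the E/M iteration for the trivial grades problem; the function name and the printing are my own, but the two updates are exactly the formulas on the slides above, and with h = 20, c = 10, d = 10 it reproduces the convergence table.

```python
# EM for the "silly" grades example:
#   P(A) = 1/2, P(B) = mu, P(C) = 2*mu, P(D) = 1/2 - 3*mu
# Observed: h = number of A's + B's (the split is hidden), c = #C's, d = #D's.

def em_grades(h, c, d, mu0=0.0, iters=7):
    mu = mu0
    for t in range(iters):
        # E-step: expected number of B's, splitting h in the ratio 1/2 : mu
        b = mu * h / (0.5 + mu)
        print(f"t={t}  mu={mu:.4f}  b={b:.3f}")
        # M-step: maximum-likelihood mu given the expected count b
        mu = (b + c) / (6.0 * (b + c + d))
    return mu

em_grades(h=20, c=10, d=10)   # converges to mu ~ 0.0948, as in the table above
```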
E-M for Axis-Aligned GMMs
Iterate. On the t'th iteration let our estimates be
$$\lambda_t = \{\mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_k^{(t)},\ \Sigma_1^{(t)}, \Sigma_2^{(t)}, \ldots, \Sigma_k^{(t)},\ p_1^{(t)}, p_2^{(t)}, \ldots, p_k^{(t)}\}$$
where each covariance matrix is constrained to be diagonal (axis-aligned):
$$\Sigma_i^{(t)} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_m^2 \end{bmatrix}$$
$p_i^{(t)}$ is shorthand for the estimate of P(y=i) on the t'th iteration.
E-step: compute "expected" classes of all datapoints for each class:
$$P(y=i \mid x_j, \lambda_t) \propto p_i^{(t)}\, p\bigl(x_j \mid \mu_i^{(t)}, \Sigma_i^{(t)}\bigr)$$
Just evaluate a Gaussian at xj.
M-step: compute the max like $\mu$ given our data's class membership distributions:
$$\mu_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\, x_j}{\sum_j P(y=i \mid x_j, \lambda_t)} \qquad p_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)}{m} \quad (m = \#\text{records})$$

E-M for General GMMs
$p_i^{(t)}$ is shorthand for the estimate of P(y=i) on the t'th iteration.
Iterate. On the t'th iteration let our estimates be
$$\lambda_t = \{\mu_1^{(t)}, \ldots, \mu_k^{(t)},\ \Sigma_1^{(t)}, \ldots, \Sigma_k^{(t)},\ p_1^{(t)}, \ldots, p_k^{(t)}\}$$
E-step: compute "expected" classes of all datapoints for each class:
$$P(y=i \mid x_j, \lambda_t) \propto p_i^{(t)}\, p\bigl(x_j \mid \mu_i^{(t)}, \Sigma_i^{(t)}\bigr)$$
Just evaluate a Gaussian at xj.
M-step: compute the max like estimates given our data's class membership distributions:
$$\mu_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\, x_j}{\sum_j P(y=i \mid x_j, \lambda_t)} \qquad \Sigma_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)\,\bigl(x_j - \mu_i^{(t+1)}\bigr)\bigl(x_j - \mu_i^{(t+1)}\bigr)^{T}}{\sum_j P(y=i \mid x_j, \lambda_t)}$$
$$p_i^{(t+1)} = \frac{\sum_j P(y=i \mid x_j, \lambda_t)}{m} \quad (m = \#\text{records})$$
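The E- and M-step formulas above map directly to code. Below is a minimal NumPy sketch of EM for a general GMM; the function name, the random initialization, and the small ridge added to each covariance for numerical stability are my own choices, not part of the lecture.

```python
import numpy as np

def em_gmm(X, k, iters=100, seed=0, ridge=1e-6):
    """EM for a Gaussian mixture with general (full) covariances.
    X: (m, n) data matrix with m records; k: number of classes."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, size=k, replace=False)]         # initial means: k random points
    Sigma = np.array([np.cov(X.T) + ridge * np.eye(n) for _ in range(k)])
    p = np.full(k, 1.0 / k)                               # p_i = estimate of P(y=i)

    for _ in range(iters):
        # E-step: r[j, i] = P(y=i | x_j, lambda_t), i.e. evaluate a Gaussian at x_j
        logr = np.empty((m, k))
        for i in range(k):
            diff = X - mu[i]
            quad = np.einsum('ja,ab,jb->j', diff, np.linalg.inv(Sigma[i]), diff)
            logdet = np.linalg.slogdet(Sigma[i])[1]
            logr[:, i] = np.log(p[i]) - 0.5 * (quad + logdet + n * np.log(2 * np.pi))
        r = np.exp(logr - logr.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)                 # normalize over classes

        # M-step: weighted max-likelihood estimates given the membership distributions
        Nk = r.sum(axis=0)                                # "effective" count per class
        mu = (r.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            Sigma[i] = (r[:, i, None] * diff).T @ diff / Nk[i] + ridge * np.eye(n)
        p = Nk / m
    return mu, Sigma, p, r
```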
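Continuing the sketch above, a quick sanity check on synthetic data (again a hypothetical example, not from the lecture) might look like:

```python
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([5.0, 5.0], 1.0, size=(100, 2))])
mu, Sigma, p, r = em_gmm(X, k=2)
print(mu)   # means should land near (0,0) and (5,5), in some order
print(p)    # mixing weights near [0.5, 0.5]
```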