Machine Learning 10-701/15-781, Spring 2008
Expectation Maximization
Eric Xing, Lecture 16, March 19, 2008
Reading: Chap. 9 of C. Bishop's book

Clustering

Unobserved Variables
- A variable can be unobserved (latent) because:
  - it is an imaginary quantity meant to provide some simplified and abstractive view of the data-generation process (e.g., speech recognition models, mixture models);
  - it is a real-world object and/or phenomenon, but difficult or impossible to measure (e.g., the temperature of a star, causes of a disease, evolutionary ancestors);
  - it is a real-world object and/or phenomenon, but sometimes wasn't measured because of faulty sensors, or was measured with a noisy channel, etc. (e.g., traffic radio, aircraft signal on a radar screen).
- Discrete latent variables can be used to partition/cluster data into sub-groups (mixture models, forthcoming).
- Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc., later lectures).

Mixture Models

Mixture Models (cont'd)
- A density model p(x) may be multi-modal.
- We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians).
- Each mode may correspond to a different sub-population (e.g., male and female).

Gaussian Mixture Models (GMMs)
- Consider a mixture of K Gaussian components.
- Z is a latent class indicator vector:
  $p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}$
- X is a conditional Gaussian variable with a class-specific mean/covariance:
  $p(x_n \mid z_n^k = 1) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right)$
- The likelihood of a sample:
  $p(x_n \mid \mu, \Sigma) = \sum_k p(z^k = 1 \mid \pi)\, p(x \mid z^k = 1, \mu, \Sigma) = \sum_{z_n} \prod_k \big( \pi_k\, N(x_n : \mu_k, \Sigma_k) \big)^{z_n^k} = \sum_k \pi_k\, N(x \mid \mu_k, \Sigma_k)$
  where $\pi_k$ is the mixture proportion and $N(x \mid \mu_k, \Sigma_k)$ the mixture component.

Gaussian Mixture Models (GMMs)
- Consider a mixture of K Gaussian components:
  $p(x_n) = \sum_k \pi_k\, N(x \mid \mu_k, \Sigma_k)$
  (mixture proportions $\pi_k$, mixture components $N(x \mid \mu_k, \Sigma_k)$).
- This model can be used for unsupervised clustering.
- This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.

Why is Learning Harder?
- In fully observed iid settings, the log-likelihood decomposes into a sum of local terms:
  $\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$
- With latent variables, all the parameters become coupled together via marginalization:
  $\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$

Gradient Learning for mixture models
- We can learn mixture densities using gradient descent on the log-likelihood. The gradients are quite interesting:
  $\ell(\theta) = \log p(x \mid \theta) = \log \sum_k \pi_k\, p_k(x \mid \theta_k)$
  $\frac{\partial \ell}{\partial \theta} = \frac{1}{p(x \mid \theta)} \sum_k \pi_k \frac{\partial p_k(x \mid \theta_k)}{\partial \theta} = \sum_k \frac{\pi_k\, p_k(x \mid \theta_k)}{p(x \mid \theta)} \frac{\partial \log p_k(x \mid \theta_k)}{\partial \theta} = \sum_k \pi_k\, r_k\, \frac{\partial \ell_k}{\partial \theta_k}, \quad r_k = \frac{p_k(x \mid \theta_k)}{p(x \mid \theta)}$
- In other words, the gradient is the responsibility-weighted sum of the individual log-likelihood gradients.
- Can pass this to a conjugate-gradient routine.

Parameter Constraints
- Often we have constraints on the parameters, e.g., $\sum_k \pi_k = 1$ and $\Sigma$ symmetric positive definite (hence $\Sigma_{ii} > 0$).
- We can use constrained optimization, or we can reparameterize in terms of unconstrained values:
  - For normalized weights, use the softmax transform: $\pi_k = \frac{\exp(\gamma_k)}{\sum_j \exp(\gamma_j)}$
  - For covariance matrices, use the Cholesky decomposition: $\Sigma^{-1} = A^T A$, where $A$ is upper triangular with positive diagonal: $A_{ii} = \exp(\lambda_i) > 0$, $A_{ij} = \eta_{ij}$ for $j > i$, $A_{ij} = 0$ for $j < i$. The parameters $\gamma_i, \lambda_i, \eta_{ij} \in \mathbb{R}$ are unconstrained.
- Use the chain rule to compute $\partial \ell / \partial \gamma$ and $\partial \ell / \partial A$.

Identifiability
- A mixture model induces a multi-modal likelihood.
- Hence gradient ascent can only find a local maximum.
- Mixture models are unidentifiable, since we can always switch the hidden labels without affecting the likelihood.
- Hence we should be careful in trying to interpret the "meaning" of latent variables.
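Evaluating the mixture likelihood $\sum_k \pi_k N(x_n \mid \mu_k, \Sigma_k)$ and the per-component responsibilities $\pi_k N(x_n \mid \mu_k, \Sigma_k) / p(x_n)$ is the computational core of both the gradient expression above and the EM algorithm introduced next. The following is a minimal NumPy/SciPy sketch, not from the lecture; the function name, array layout, and shapes are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood_and_responsibilities(X, pis, mus, Sigmas):
    """X: (N, d) data; pis: (K,) mixture proportions;
    mus: (K, d) component means; Sigmas: (K, d, d) covariances."""
    K = len(pis)
    # log pi_k + log N(x_n | mu_k, Sigma_k) for every point n and component k
    log_weighted = np.stack(
        [np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
         for k in range(K)], axis=1)                     # shape (N, K)
    # log p(x_n) = log sum_k exp(...), computed stably in log space
    log_px = np.logaddexp.reduce(log_weighted, axis=1)   # shape (N,)
    # responsibilities: pi_k N(x_n | mu_k, Sigma_k) / p(x_n)
    resp = np.exp(log_weighted - log_px[:, None])        # shape (N, K)
    return log_px.sum(), resp
```

The responsibilities returned here are exactly the weights $\pi_k r_k$ appearing in the gradient above, so they can be fed either to a gradient/conjugate-gradient routine or to the E-step of EM.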
Toward the EM algorithm
- E.g., a mixture of K Gaussians.
- Z is a latent class indicator vector:
  $p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}$
- X is a conditional Gaussian variable with a class-specific mean/covariance:
  $p(x_n \mid z_n^k = 1) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right)$
- The likelihood of a sample:
  $p(x_n \mid \mu, \Sigma) = \sum_k p(z^k = 1 \mid \pi)\, p(x \mid z^k = 1, \mu, \Sigma) = \sum_{z_n} \prod_k \big( \pi_k\, N(x_n : \mu_k, \Sigma_k) \big)^{z_n^k} = \sum_k \pi_k\, N(x \mid \mu_k, \Sigma_k)$

Toward the EM algorithm
- Recall MLE for completely observed data.
- Data log-likelihood:
  $\ell(\theta; D) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)$
  $= \sum_n \log \prod_k \pi_k^{z_n^k} + \sum_n \log \prod_k N(x_n : \mu_k, \sigma)^{z_n^k}$
  $= \sum_n \sum_k z_n^k \log \pi_k - \sum_n \sum_k z_n^k \frac{1}{2\sigma^2} (x_n - \mu_k)^2 + C$
- MLE:
  $\hat{\pi}_{k,\mathrm{MLE}} = \arg\max_\pi \ell(\theta; D)$, $\hat{\mu}_{k,\mathrm{MLE}} = \arg\max_\mu \ell(\theta; D)$, $\hat{\sigma}_{k,\mathrm{MLE}} = \arg\max_\sigma \ell(\theta; D)$; e.g., $\hat{\mu}_{k,\mathrm{MLE}} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}$
- What if we do not know $z_n$?

Expectation-Maximization (EM) Algorithm
- EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data.
- It is much simpler than gradient methods:
  - No need to choose step size.
  - Enforces constraints automatically.
  - Calls inference and fully observed learning as subroutines.
- EM is an iterative algorithm with two linked steps:
  - E-step: fill in hidden values using inference, $p(z \mid x, \theta^t)$.
  - M-step: update parameters $\theta^{t+1}$ using standard MLE/MAP methods applied to the completed data.
- We will prove that this procedure monotonically improves $\ell$ (or leaves it unchanged); thus it always converges to a local optimum of the likelihood.

K-means
- Start: guess the centroid $\mu_k$ and covariance $\Sigma_k$ of each of the K clusters.
- Loop:
  - For each point n = 1 to N, compute its cluster label:
    $z_n^{(t)} = \arg\min_k\, (x_n - \mu_k^{(t)})^T \big(\Sigma_k^{(t)}\big)^{-1} (x_n - \mu_k^{(t)})$
  - For each cluster k = 1, ..., K:
    $\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}$, and $\Sigma_k^{(t+1)}$ is re-estimated from the points assigned to cluster k.

Expectation-Maximization
- Start: guess the centroid $\mu_k$ and covariance $\Sigma_k$ of each of the K clusters.
- Loop: alternate the E-step and M-step described above.

Example: Gaussian mixture model
- A mixture of K Gaussians.
- Z is a latent class indicator vector:
  $p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}$
- X is a conditional Gaussian variable with a class-specific mean/covariance …
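To make the E-step/M-step loop concrete for this Gaussian mixture example, here is a minimal sketch. It is not from the lecture: it reuses the hypothetical gmm_log_likelihood_and_responsibilities helper from the earlier sketch, initializes randomly, and omits numerical safeguards such as covariance regularization or convergence checks.

```python
import numpy as np

def em_gmm(X, K, n_iters=100, seed=0):
    """Minimal EM for a Gaussian mixture (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialize: uniform proportions, means at random data points, identity covariances
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)].copy()
    Sigmas = np.stack([np.eye(d) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: responsibilities p(z_n^k = 1 | x_n, theta^t)
        _, resp = gmm_log_likelihood_and_responsibilities(X, pis, mus, Sigmas)
        Nk = resp.sum(axis=0)                  # effective counts per component
        # M-step: MLE on the "completed" data, weighted by responsibilities
        pis = Nk / N
        mus = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
    return pis, mus, Sigmas
```

Replacing the soft responsibilities with hard 0/1 assignments turns this loop, up to the covariance handling, into the K-means procedure sketched above.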