School of Computer Science
Learning Partially Observed GM: the Expectation-Maximization Algorithm
Probabilistic Graphical Models (10-708)
Lecture 11, Oct 24, 2007
Eric Xing
Reading: J-Chap. 10, 11; KF-Chap. 17

[Figure: a partially observed graphical model of a signaling pathway (Receptor A/B, Kinase C/D/E, TF F, Gene G/H) with nodes X1-X8.]

Partially observed GMs
- Speech recognition: an HMM with hidden states Y_1, ..., Y_T emitting observations X_1, ..., X_T.
  [Figure: HMM state-transition and emission diagram.]

Partially observed GMs
- Biological evolution: sequences observed at the leaves (e.g., AGA, GAC); the ancestors (Q_h, Q_m), T years back, are unobserved.
  [Figure: phylogenetic tree with an unknown ancestor.]

Mixture Models

Mixture Models, cont'd
- A density model p(x) may be multi-modal.
- We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians).
- Each mode may correspond to a different sub-population (e.g., male and female).

Unobserved Variables
- A variable can be unobserved (latent) because:
  - it is an imaginary quantity meant to provide some simplified and abstractive view of the data generation process (e.g., speech recognition models, mixture models, ...);
  - it is a real-world object and/or phenomenon that is difficult or impossible to measure (e.g., the temperature of a star, causes of a disease, evolutionary ancestors, ...);
  - it is a real-world object and/or phenomenon that sometimes wasn't measured, because of faulty sensors, etc.
- Discrete latent variables can be used to partition/cluster data into sub-groups.
- Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc.).

Gaussian Mixture Models (GMMs)
- Consider a mixture of K Gaussian components:

    p(x_n \mid \mu, \Sigma) = \sum_k \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)

  where \pi_k is the mixture proportion and \mathcal{N}(x_n \mid \mu_k, \Sigma_k) is the mixture component.
- This model can be used for unsupervised clustering.
- This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.

Gaussian Mixture Models (GMMs)
- Consider a mixture of K Gaussian components:
- Z is a latent class indicator vector:

    p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}

- X is a
conditional Gaussian variable with a class-specific mean/covariance:

    p(x_n \mid z_n^k = 1, \mu, \Sigma) = (2\pi)^{-m/2} |\Sigma_k|^{-1/2} \exp\big\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \big\}

- The likelihood of a sample:

    p(x_n \mid \mu, \Sigma) = \sum_k p(z^k = 1 \mid \pi) \, p(x \mid z^k = 1, \mu, \Sigma)
                            = \sum_{z_n} \prod_k \big( \pi_k \, \mathcal{N}(x_n : \mu_k, \Sigma_k) \big)^{z_n^k}
                            = \sum_k \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)

  (mixture proportion and mixture component, as before)
  [Figure: graphical model Z -> X.]

Why is Learning Harder?
- In fully observed iid settings, the log likelihood decomposes into a sum of local terms (at least for directed models):

    \ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)

- With latent variables, all the parameters become coupled together via marginalization:

    \ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z) \, p(x \mid z, \theta_x)

Toward the EM algorithm
- Recall MLE for completely observed data.
- Data log-likelihood:

    \ell(\theta; D) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi) \, p(x_n \mid z_n, \mu, \sigma)
                    = \sum_n \sum_k z_n^k \log \pi_k - \frac{1}{2\sigma^2} \sum_n \sum_k z_n^k (x_n - \mu_k)^2 + C

- MLE:

    \hat{\pi}_{k,\mathrm{MLE}} = \arg\max_\pi \ell(\theta; D), \quad
    \hat{\mu}_{k,\mathrm{MLE}} = \arg\max_\mu \ell(\theta; D), \quad
    \hat{\sigma}_{\mathrm{MLE}} = \arg\max_\sigma \ell(\theta; D)

  e.g., \hat{\mu}_{k,\mathrm{MLE}} = \sum_n z_n^k x_n \big/ \sum_n z_n^k

- What if we do not know z_n?
  [Figure: plate model z_i -> x_i, repeated N times.]

Question
- "... We solve problem X using Expectation-Maximization ..." What does it mean?
- E:
  - What do we take the expectation with?
  - What do we take the expectation over?
- M:
  - What do we maximize?
  - What do we maximize with respect to?

Recall: K-means
- Hard assignment:

    z_n^{(t)} = \arg\min_k \, (x_n - \mu_k^{(t)})^T \Sigma_k^{-1} (x_n - \mu_k^{(t)})

- Mean update:

    \mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k) \, x_n}{\sum_n \delta(z_n^{(t)}, k)}

Expectation-Maximization
- Start: "guess" the centroid \mu_k and covariance \Sigma_k of each of the K clusters.
- Loop over the E-step and M-step below.

Example: Gaussian mixture model
- A mixture of K Gaussians:
  [Figure: plate model Z_n -> X_n, repeated N times.]
- Z is a latent class indicator vector:

    p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}

- X is a conditional Gaussian variable with class-specific mean/covariance:

    p(x_n \mid z_n^k = 1, \mu, \Sigma) = (2\pi)^{-m/2} |\Sigma_k|^{-1/2} \exp\big\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \big\}

- The likelihood of a sample:

    p(x_n \mid \mu, \Sigma) = \sum_k \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)

- The expected complete log likelihood:

    \langle \ell_c(\theta; x, z) \rangle = \sum_n \langle \log p(z_n \mid \pi) \rangle_{p(z \mid x)} + \sum_n \langle \log p(x_n \mid z_n, \mu, \Sigma) \rangle_{p(z \mid x)}
    = \sum_n \sum_k \langle z_n^k \rangle \log \pi_k - \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \big( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| + C \big)

E-step
- We maximize \langle \ell_c(\theta) \rangle iteratively using the following
iterative procedure:
- Expectation step: compute the expected value of the sufficient statistics of the hidden variables (i.e., z) given the current estimate of the parameters (i.e., \pi and \mu):

    \tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x, \mu^{(t)}, \Sigma^{(t)})
                  = \frac{\pi_k^{(t)} \, \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)} \, \mathcal{N}(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}

- Here we are essentially doing inference.

Recall our objective
- The expected complete log likelihood:

    \langle \ell_c(\theta; x, z) \rangle = \sum_n \langle \log p(z_n \mid \pi) \rangle_{p(z \mid x)} + \sum_n \langle \log p(x_n \mid z_n, \mu, \Sigma) \rangle_{p(z \mid x)}
    = \sum_n \sum_k \langle z_n^k \rangle \log \pi_k - \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \big( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| + C \big)

M-step
- We maximize \langle \ell_c(\theta) \rangle iteratively using the following iterative procedure:
- Maximization step: compute the parameters under the current results of the expected values of the hidden variables:

    \pi_k^* = \arg\max \langle \ell_c(\theta) \rangle
      \;\Rightarrow\; \frac{\partial}{\partial \pi_k} \langle \ell_c(\theta) \rangle = 0, \;\forall k, \;\text{s.t.}\; \sum_k \pi_k = 1
      \;\Rightarrow\; \pi_k^{(t+1)} = \frac{\sum_n \langle z_n^k \rangle_{q^{(t)}}}{N} = \frac{\sum_n \tau_n^{k(t)}}{N} = \frac{\langle n_k \rangle}{N}

    \mu_k^* = \arg\max \langle \ell_c(\theta) \rangle
      \;\Rightarrow\; \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}

    \Sigma_k^* = \arg\max \langle \ell_c(\theta) \rangle
      \;\Rightarrow\; \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}

- This is isomorphic to MLE, except that the variables that are hidden are replaced by their expectations (in general they will be replaced by their corresponding "sufficient statistics").
- Fact: \partial \log |A| / \partial A = A^{-T}, \quad \partial (x^T A x) / \partial A = x x^T

Compare: K-means and EM
- K-means:
  - In the K-means "E-step" we do hard assignment:

      z_n^{(t)} = \arg\min_k \, (x_n - \mu_k^{(t)})^T \Sigma_k^{-1} (x_n - \mu_k^{(t)})

  - In the K-means "M-step" we update the means as the weighted sum of the data, but now the weights are 0 or 1:

      \mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k) \, x_n}{\sum_n \delta(z_n^{(t)}, k)}

- EM:
  - E-step:

      \tau_n^{k(t)} = \frac{\pi_k^{(t)} \, \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)} \, \mathcal{N}(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}

  - M-step:

      \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}

- The EM algorithm for mixtures of Gaussians is like a "soft version" of the K-means algorithm.

Theory underlying EM
- What are we doing?
- Recall that according to MLE, we intend to learn the model parameters that would maximize the likelihood of the data.
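The E-step and M-step above can be made concrete in code. The following is a minimal NumPy sketch of one EM iteration for a mixture of K spherical Gaussians (an illustration under simplifying assumptions, not code from the lecture; the function and variable names are my own):

```python
import numpy as np

def em_step(X, pi, mu, sigma2):
    """One EM iteration for a mixture of K spherical Gaussians.

    X: (N, d) data; pi: (K,) mixture proportions;
    mu: (K, d) means; sigma2: (K,) spherical variances.
    """
    N, d = X.shape

    # E-step: responsibilities tau[n, k] = p(z_n^k = 1 | x_n, theta^(t)).
    # Work with log(pi_k) + log N(x_n | mu_k, sigma2_k I); the shared
    # (2*pi)^(-d/2) constant cancels in the normalization below.
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)        # (N, K)
    log_w = (np.log(pi)[None, :]
             - 0.5 * d * np.log(sigma2)[None, :]
             - 0.5 * sq / sigma2[None, :])
    log_w -= log_w.max(axis=1, keepdims=True)                   # stability
    tau = np.exp(log_w)
    tau /= tau.sum(axis=1, keepdims=True)                       # rows sum to 1

    # M-step: weighted MLE, with tau replacing the hard indicators z_n^k.
    Nk = tau.sum(axis=0)                                        # <n_k>
    pi_new = Nk / N
    mu_new = (tau.T @ X) / Nk[:, None]
    sq_new = ((X[:, None, :] - mu_new[None, :, :]) ** 2).sum(-1)
    sigma2_new = (tau * sq_new).sum(axis=0) / (d * Nk)
    return pi_new, mu_new, sigma2_new, tau
```

Hard-assigning each point to its argmax responsibility instead of keeping the soft `tau` recovers the K-means updates, illustrating the "soft version of K-means" point above.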
- But we do not observe z, so computing

    \ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z) \, p(x \mid z, \theta_x)

  is difficult!
- What shall we do?

Complete & Incomplete Log Likelihoods
- Complete log likelihood: let X denote the observable variable(s) and Z denote the latent variable(s). If Z could be observed, then

    \ell_c(\theta; x, z) = \log p(x, z \mid \theta)

- Usually,
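The complete vs. incomplete distinction can be checked numerically: with z observed, the log likelihood is a sum of local terms, while marginalizing z puts the log outside a sum, which is best computed with a log-sum-exp. A minimal NumPy sketch for 1-D Gaussian mixtures (illustrative helper names, not from the lecture):

```python
import numpy as np

def log_gauss(x, mu, sigma2):
    # log N(x | mu, sigma2) for scalar observations (broadcasts over arrays)
    return -0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

def incomplete_loglik(X, pi, mu, sigma2):
    """ell(theta; D) = sum_n log sum_k pi_k N(x_n | mu_k, sigma2_k).

    The log of a sum over z couples the parameters together;
    log-sum-exp keeps the computation numerically stable.
    """
    lw = np.log(pi)[None, :] + log_gauss(X[:, None], mu[None, :], sigma2[None, :])
    m = lw.max(axis=1, keepdims=True)
    return float((m.squeeze(1) + np.log(np.exp(lw - m).sum(axis=1))).sum())

def complete_loglik(X, z, pi, mu, sigma2):
    """ell_c(theta; x, z) = log p(x, z | theta): with z observed it
    decomposes into a sum of local terms, one per data point."""
    return float((np.log(pi)[z] + log_gauss(X, mu[z], sigma2[z])).sum())
```

Since p(x, z) <= p(x) = sum_z p(x, z), the complete log likelihood of any particular assignment z never exceeds the incomplete log likelihood.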