Latent Variable View of EM
Sargur Srihari

Examples of latent variables

1. Mixture Model
   – Joint distribution is p(x, z)
     • We do not have values for z
   – Latent variables are multinomial variables z_n that describe the component responsible for generating x_n

2. Hidden Markov Model
   – A single time slice is a mixture with components p(x | z)
   – An extension of the mixture model
     • The choice of mixture component depends on the choice of mixture component for the previous distribution

Another example of latent variables

3. Topic Models (Latent Dirichlet Allocation)
   – In NLP, unobserved groups explain why some observed data are similar
   – Each document is a mixture of various topics (latent variables)
   – Topics generate words
     • CAT-related: milk, meow, kitten
     • DOG-related: puppy, bark, bone
   – Multinomial distributions over words with Dirichlet priors

ML as a subfield of AI
[figure slide]

Main Idea of EM
• The goal of EM is to find maximum likelihood models for distributions p(x) that have latent (or missing) data
  – E.g., GMMs, HMMs
• In the case of Gaussian mixture models
  – We have a complex distribution of observed variables x
  – We wish to estimate its parameters
• Introduce latent variables z so that
  – the joint distribution p(x, z) of observed and latent variables is more tractable (since we know the forms of the components)
  – the complicated distribution is formed from simpler components
• The original distribution is recovered by marginalizing the joint distribution

Alternative View of EM
• This view recognizes the key role of latent variables
• Observed data X and latent variables Z:

  X = [x_1^T; x_2^T; ...; x_N^T],   Z = [z_1^T; z_2^T; ...; z_N^T]

  – where the nth row of X is x_n^T = [x_n1 x_n2 ... x_nD]
  – with corresponding row z_n^T = [z_n1 z_n2 ... z_nK]
• The goal of the EM algorithm is to find the maximum likelihood solution for p(X) given some X, when we do not have Z

Likelihood Function involving Latent Variables
• The joint likelihood function is p(X, Z | θ), where θ is the set of all model parameters
  – E.g., means, covariances, and mixing coefficients
• Marginal likelihood function of the observed data (from the sum rule):

  p(X | θ) = Σ_Z p(X, Z | θ)

• Log likelihood function:

  ln p(X | θ) = ln { Σ_Z p(X, Z | θ) }

Latent Variables in EM
• Log likelihood function is

  ln p(X | θ) = ln { Σ_Z p(X, Z | θ) }

• Key observation: the summation over the latent variables appears inside the logarithm
  – The summation inside the braces is due to marginalization, not to taking the log-likelihood
  – Even if the joint distribution belongs to the exponential family, the marginal distribution does not
    • Taking the log of a sum of Gaussians does not give a simple quadratic
  – This results in complicated expressions for the maximum likelihood solution, i.e., for the value of θ that maximizes the likelihood

Complete and Incomplete Data Sets

Complete data {X, Z}
• For each observation in X we know the corresponding value of the latent variable Z
• The log-likelihood has the form ln p(X, Z | θ)
  – maximization is straightforward

Incomplete data {X}
• The actual data set
• The log likelihood function is

  ln p(X | θ) = ln { Σ_Z p(X, Z | θ) }

• Maximization is difficult
  – summation inside the logarithm (a short code sketch contrasting the two cases follows this slide)
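To make the contrast between the complete-data and incomplete-data log-likelihoods concrete, here is a minimal sketch for a 1-D Gaussian mixture. The setup, the function names, and the use of scipy.stats.norm are illustrative assumptions, not part of the slides; the only point is that the sum over the latent assignment sits inside the log in the incomplete case, while the complete-data objective separates over components.

```python
import numpy as np
from scipy.stats import norm

def incomplete_log_likelihood(x, pi, mu, sigma):
    """ln p(X | theta) for a 1-D GMM: x is (N,), pi/mu/sigma are (K,)."""
    # p(x_n, z_n = k | theta) = pi_k * N(x_n | mu_k, sigma_k^2)
    joint = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (N, K)
    # The sum over the latent assignment sits INSIDE the log (marginalization)
    return np.sum(np.log(joint.sum(axis=1)))

def complete_log_likelihood(x, z, pi, mu, sigma):
    """ln p(X, Z | theta) when the component labels z (N,) are known."""
    # The log acts directly on the joint, so the objective separates by component
    return np.sum(np.log(pi[z]) + norm.logpdf(x, loc=mu[z], scale=sigma[z]))
```

With known assignments z, the complete-data objective splits into per-component terms whose maximizers (sample means, variances, and proportions) are available in closed form; this is exactly what makes the M step in the following slides easy.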
Expectation of the log-likelihood
• Since we do not have the complete data set {X, Z}, we evaluate the expected complete-data log-likelihood E[ln p(X, Z | θ)]
• Since we are given X, our state of knowledge of Z is given only by the posterior distribution of the latent variables, p(Z | X, θ)
• Thus the expected log-likelihood of the complete data is

  E[ln p(X, Z | θ)] = Σ_Z p(Z | X, θ) ln p(X, Z | θ)

  – The summation is due to the expectation, not the sum rule
  – This is the quantity we maximize
  – Note that the logarithm now acts on the joint distribution, which is tractable

E and M Steps
• E step: use the current parameter value θ^old to find the posterior distribution of the latent variables, p(Z | X, θ^old)
• Use this posterior to find the expectation of the complete-data log-likelihood for a general parameter value θ. This expectation is

  Q(θ, θ^old) = Σ_Z p(Z | X, θ^old) ln p(X, Z | θ)

  – summation due to the expectation
• M step: determine the revised parameter estimate θ^new by maximizing

  θ^new = argmax_θ Q(θ, θ^old)

General EM Algorithm
• Given a joint distribution p(X, Z | θ) over observed variables X and latent variables Z, governed by parameters θ, the goal is to maximize the likelihood function p(X | θ)
• Step 1: Choose an initial setting for the parameters θ^old
• Step 2 (E step): Evaluate p(Z | X, θ^old)
• Step 3 (M step): Evaluate θ^new given by

  θ^new = argmax_θ Q(θ, θ^old),  where  Q(θ, θ^old) = Σ_Z p(Z | X, θ^old) ln p(X, Z | θ)

• Step 4: Check for convergence of either the log-likelihood or the parameter values
  – If not satisfied, let θ^old ← θ^new and return to Step 2
  (a code sketch of this loop for a Gaussian mixture is given after the next slide)

Missing Variables
• EM has been described for maximizing the likelihood function when there are discrete latent variables
• It can also be applied when there are unobserved variables corresponding to missing values in the data set
  – Take the joint distribution of all variables and then marginalize over the missing ones
  – EM is then used to maximize the corresponding likelihood function (see the sketch at the end)
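As an illustration of the E and M steps and the general EM loop above, here is a minimal sketch for a K-component 1-D Gaussian mixture. The function name, the initialization scheme, and the tolerance are assumptions made for the example; for a GMM the M-step maximization of Q(θ, θ^old) has the closed-form updates used below.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K=2, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM sketch for a K-component 1-D Gaussian mixture."""
    rng = np.random.default_rng(seed)
    # Step 1: initial parameter setting theta_old
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std())

    prev_ll = -np.inf
    for _ in range(n_iter):
        # Step 2 (E step): posterior of the latent variables, p(z_n = k | x_n, theta_old)
        joint = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)   # (N, K)
        resp = joint / joint.sum(axis=1, keepdims=True)          # responsibilities

        # Step 3 (M step): theta_new = argmax_theta Q(theta, theta_old); closed form for a GMM
        Nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
        pi = Nk / x.size

        # Step 4: convergence check on the incomplete-data log-likelihood ln p(X | theta_old)
        ll = np.log(joint.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, sigma
```

On data drawn from two well-separated Gaussians this recovers the means, standard deviations, and mixing proportions; monitoring ln p(X | θ) from one pass to the next corresponds to the convergence check in Step 4 of the slide.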
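The missing-values case in the last slide can be illustrated in the same spirit. The sketch below assumes the simplest setting, a single multivariate Gaussian with entries missing completely at random and encoded as NaN, and is an illustration rather than a prescribed implementation: the E step fills in the conditional mean of each missing block and accumulates its conditional covariance, and the M step re-estimates μ and Σ from these expected sufficient statistics.

```python
import numpy as np

def em_gaussian_missing(X, n_iter=100):
    """EM sketch for the mean and covariance of a D-dim Gaussian with NaN-coded missing entries."""
    N, D = X.shape
    miss = np.isnan(X)
    # Initialize from the observed entries only
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))

    for _ in range(n_iter):
        X_hat = X.copy()
        C = np.zeros((D, D))                      # accumulated conditional covariances
        for n in range(N):
            m = miss[n]                           # missing coordinates of row n
            if not m.any():
                continue
            o = ~m                                # observed coordinates of row n
            # E step: condition the current Gaussian on the observed entries of row n
            reg = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            X_hat[n, m] = mu[m] + reg @ (X[n, o] - mu[o])
            C[np.ix_(m, m)] += sigma[np.ix_(m, m)] - reg @ sigma[np.ix_(o, m)]
        # M step: update mu and Sigma from the expected sufficient statistics
        mu = X_hat.mean(axis=0)
        diff = X_hat - mu
        sigma = (diff.T @ diff + C) / N
    return mu, sigma
```

Here the missing entries play the role of Z: the joint distribution over all variables is the full Gaussian, the marginalization is over the missing coordinates, and each pass increases the observed-data likelihood just as in the latent-variable case.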