Probabilistic Graphical Models, 10-708
Learning Partially Observed Graphical Models
Eric Xing
Lecture 13, Oct 26, 2005
Reading: MJ Chap. 5, 10, 11

Partially Observed GMs
- Speech recognition.
  [Figure: an HMM with hidden state chain Y_1 -> Y_2 -> Y_3 -> ... -> Y_T (transition matrix A) emitting observations X_1, X_2, X_3, ..., X_T]
- Biological evolution.
  [Figure: a phylogenetic tree with an unobserved ancestor at the root, observed leaf sequences AG, AG, AC, substitution models Q_h and Q_m, and branch length "T years?"]

Unobserved Variables
- A variable can be unobserved (latent) because:
  - it is an imaginary quantity meant to provide a simplified and abstractive view of the data generation process
    - e.g., speech recognition models, mixture models, ...
  - it is a real-world object and/or phenomenon that is difficult or impossible to measure
    - e.g., the temperature of a star, causes of a disease, evolutionary ancestors, ...
  - it is a real-world object and/or phenomenon that sometimes wasn't measured, because of faulty sensors, etc.
- Discrete latent variables can be used to partition/cluster data into sub-groups.
- Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc.).

Mixture Models
- A density model p(x) may be multi-modal.
- We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians).
- Each mode may correspond to a different sub-population (e.g., male and female).

Gaussian Mixture Models (GMMs)
- Consider a mixture of K Gaussian components.
- Z is a latent class-indicator vector:

    p(z_n) = \mathrm{Multi}(z_n : \pi) = \prod_k \pi_k^{z_n^k}

- X is a conditional Gaussian variable with a class-specific mean/covariance:

    p(x_n \mid z_n^k = 1, \mu, \Sigma) = (2\pi)^{-m/2} |\Sigma_k|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right\}

- The likelihood of a sample:

    p(x_n \mid \pi, \mu, \Sigma) = \sum_k p(z^k = 1 \mid \pi) \, p(x \mid z^k = 1, \mu, \Sigma)
                                 = \sum_{z_n} \prod_k \left( \pi_k \, N(x_n : \mu_k, \Sigma_k) \right)^{z_n^k}
                                 = \sum_k \pi_k \, N(x_n \mid \mu_k, \Sigma_k)

  where \pi_k is the mixture proportion and N(x_n \mid \mu_k, \Sigma_k) the mixture component.
- This model can be used for unsupervised clustering.
- This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.

Conditional Mixture Model: Mixture of Experts
- We will model p(Y|X) using different experts, each responsible for a different region of the input space.
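As a numerical sketch of the GMM likelihood p(x_n | π, μ, Σ) = Σ_k π_k N(x_n | μ_k, Σ_k) from the GMM slide above (a 1-D case; the component parameters below are made up purely for illustration):

```python
import math

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_likelihood(x, pis, mus, variances):
    """Mixture likelihood p(x) = sum_k pi_k * N(x | mu_k, var_k)."""
    return sum(p * gaussian_pdf(x, m, v)
               for p, m, v in zip(pis, mus, variances))

# Hypothetical 2-component mixture (illustrative values only).
pis = [0.3, 0.7]
mus = [-1.0, 2.0]
variances = [1.0, 0.5]

px = gmm_likelihood(0.0, pis, mus, variances)
```

Because the mixture proportions sum to one and each component is a density, the mixture itself integrates to one, which is easy to sanity-check with a Riemann sum.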
- The latent variable Z chooses an expert through a softmax gating function:

    P(z^k = 1 \mid x) = \mathrm{Softmax}(\xi_k^T x)

- Each expert can be a linear regression model:

    P(y \mid x, z^k = 1) = N(y;\, \theta_k^T x,\, \sigma_k^2)

- The posterior expert responsibilities are

    P(z^k = 1 \mid x, y) = \frac{ p(z^k = 1 \mid x) \, p_k(y \mid x, \theta_k, \sigma_k^2) }{ \sum_j p(z^j = 1 \mid x) \, p_j(y \mid x, \theta_j, \sigma_j^2) }

Hierarchical Mixture of Experts
- This is like a soft version of a depth-2 classification/regression tree.
- P(Y | X, G_1, G_2) can be modeled as a GLIM, with parameters dependent on the values of G_1 and G_2 (which specify a "conditional path" to a given leaf in the tree).

Mixture of Overlapping Experts
- By removing the X -> Z arc, we can make the partitions independent of the input, thus allowing overlap.
- This is a mixture of linear regressors; each subpopulation has a different conditional mean:

    P(z^k = 1 \mid x, y) = \frac{ p(z^k = 1) \, p_k(y \mid x, \theta_k, \sigma_k^2) }{ \sum_j p(z^j = 1) \, p_j(y \mid x, \theta_j, \sigma_j^2) }

Why Is Learning Harder?
- In fully observed iid settings, the log likelihood decomposes into a sum of local terms (at least for directed models):

    \ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)

- With latent variables, all the parameters become coupled together via marginalization:

    \ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z) \, p(x \mid z, \theta_x)

Gradient Learning for Mixture Models
- We can learn mixture densities using gradient descent on the log likelihood. The gradients are quite interesting:

    \ell(\theta) = \log p(x \mid \theta) = \log \sum_k \pi_k \, p_k(x \mid \theta_k)

    \frac{\partial \ell}{\partial \theta_k}
      = \frac{1}{p(x \mid \theta)} \, \pi_k \, \frac{\partial p_k(x \mid \theta_k)}{\partial \theta_k}
      = \frac{\pi_k}{p(x \mid \theta)} \, p_k(x \mid \theta_k) \, \frac{\partial \log p_k(x \mid \theta_k)}{\partial \theta_k}
      = r_k \frac{\partial \ell_k}{\partial \theta_k},
    \quad \text{where } r_k = \pi_k \, p_k(x \mid \theta_k) / p(x \mid \theta).

- In other words, the gradient is the responsibility-weighted sum of the individual log likelihood gradients.
- This can be passed to a conjugate gradient routine.

Parameter Constraints
- Often we have constraints on the parameters.
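The responsibility-weighted gradient on the gradient-learning slide can be checked against a finite-difference estimate; a minimal 1-D sketch with made-up mixture parameters, taking the gradient with respect to a component mean μ_k:

```python
import math

def gauss(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_lik(x, pis, mus, var):
    """log p(x | theta) = log sum_k pi_k N(x | mu_k, var)."""
    return math.log(sum(p * gauss(x, m, var) for p, m in zip(pis, mus)))

# Made-up parameters: 2-component mixture with a shared, fixed variance.
x, var = 0.5, 1.0
pis = [0.4, 0.6]
mus = [-1.0, 1.5]

# Analytic gradient: d ell / d mu_k = r_k * d log N(x | mu_k, var) / d mu_k,
# where r_k = pi_k N(x | mu_k, var) / p(x | theta) is the responsibility.
px = sum(p * gauss(x, m, var) for p, m in zip(pis, mus))
grads = [(p * gauss(x, m, var) / px) * ((x - m) / var)
         for p, m in zip(pis, mus)]

# Finite-difference check of the gradient with respect to mu_0.
eps = 1e-6
num = (log_lik(x, pis, [mus[0] + eps, mus[1]], var)
       - log_lik(x, pis, [mus[0] - eps, mus[1]], var)) / (2 * eps)
```

The analytic and numerical gradients agree, confirming that each component's log likelihood gradient is simply scaled by its responsibility r_k.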
  For example, \sum_k \pi_k = 1, and \Sigma must be symmetric positive definite (hence \Sigma_{ii} > 0).
- We can use constrained optimization, or we can reparameterize in terms of unconstrained values.
- For normalized weights, use the softmax transform:

    \pi_k = \frac{\exp(\gamma_k)}{\sum_j \exp(\gamma_j)}

- For covariance matrices, use the Cholesky decomposition:

    \Sigma^{-1} = A^T A

  where A is upper triangular with a positive diagonal:

    A_{ii} = \exp(\lambda_i) > 0, \qquad A_{ij} = \eta_{ij} \ (j > i), \qquad A_{ij} = 0 \ (j < i).

  The parameters \gamma_i, \lambda_i, \eta_{ij} \in \mathbb{R} are unconstrained.
- Use the chain rule to compute \partial \ell / \partial \pi and \partial \ell / \partial A.

Identifiability
- A mixture model induces a multi-modal likelihood.
- Hence gradient ascent can only find a local maximum.
- Mixture models are unidentifiable, since we can always switch the hidden labels without affecting the likelihood.
- Hence we should be careful in trying to interpret the "meaning" of latent variables.

Expectation-Maximization (EM) Algorithm
- EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data.
- It is much simpler than gradient methods:
  - No need to choose a step size.
  - Enforces constraints automatically.
  - Calls inference and fully observed learning as subroutines.
- EM is an iterative algorithm with two linked steps:
  - E-step: fill in hidden values using inference, p(z | x, \theta^t).
  - M-step: update parameters \theta^{t+1} using a standard MLE/MAP method applied to the completed data.
- We will prove that this procedure monotonically improves the likelihood (or leaves it unchanged). Thus it always converges to a local optimum of the likelihood.

Complete & Incomplete Log Likelihoods
- Complete log likelihood: let X denote the observable variable(s) and Z denote the latent variable(s). If Z could be observed, then

    \ell_c(\theta; x, z) \triangleq \log p(x, z \mid \theta)

- Usually, optimizing \ell_c() given both z and x is straightforward.
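The E-step/M-step loop described above can be sketched for a 1-D GMM with a shared, known variance. The toy data and initial parameters below are made up; this is an illustrative sketch, not the lecture's reference implementation:

```python
import math

def gauss(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_lik(data, pis, mus, var):
    """Incomplete log likelihood of the whole dataset."""
    return sum(math.log(sum(p * gauss(x, m, var) for p, m in zip(pis, mus)))
               for x in data)

def em_step(data, pis, mus, var):
    """One EM iteration for a 1-D GMM with fixed, shared variance."""
    # E-step: responsibilities r[n][k] = p(z_n^k = 1 | x_n, theta^t).
    r = []
    for x in data:
        w = [p * gauss(x, m, var) for p, m in zip(pis, mus)]
        s = sum(w)
        r.append([wk / s for wk in w])
    # M-step: MLE on the "completed" data.
    N = len(data)
    Nk = [sum(rn[k] for rn in r) for k in range(len(pis))]
    new_pis = [nk / N for nk in Nk]
    new_mus = [sum(rn[k] * x for rn, x in zip(r, data)) / Nk[k]
               for k in range(len(pis))]
    return new_pis, new_mus

# Toy data near two clusters; made-up initial parameters.
data = [-2.1, -1.9, -2.3, 1.8, 2.2, 2.0, 1.9]
pis, mus, var = [0.5, 0.5], [-1.0, 1.0], 1.0

ll_before = log_lik(data, pis, mus, var)
for _ in range(5):
    pis, mus = em_step(data, pis, mus, var)
ll_after = log_lik(data, pis, mus, var)
```

Note that the mixture-weight constraint is enforced automatically by the M-step (the responsibilities for each point sum to one), and no step size is needed, as the slide claims.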
  (Cf. MLE for fully observed models.)
- Recall that in this case the objective for, e.g., MLE decomposes into a sum of factors, and the parameter for each factor can be estimated separately.
- But given that Z is not observed, \ell_c() is a random quantity and cannot be maximized directly.
- Incomplete log likelihood: with z unobserved, our objective becomes the log of a marginal probability:

    \ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)

- This objective won't decouple.

Expected Complete Log Likelihood
- For any distribution q(z), define the expected complete log likelihood:

    \langle \ell_c(\theta; x, z) \rangle_q \triangleq \sum_z q(z \mid x) \log p(x, z \mid \theta)

- It is a deterministic function of \theta.
- It is linear in \ell_c() --- it inherits its factorizability.
- Does maximizing this surrogate yield a maximizer of the likelihood?
- Jensen's inequality:

    \ell(\theta; x) = \log p(x \mid \theta)
                    = \log \sum_z p(x, z \mid \theta)
                    = \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)}
                    \ge \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}

    \Rightarrow \quad \ell(\theta; x) \ge \langle \ell_c(\theta; x, z) \rangle_q + H_q

Lower Bounds and Free Energy
- For fixed data x, define a functional called the free energy:

    F(q, \theta) \triangleq \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \le \ell(\theta; x)

- The EM algorithm is coordinate ascent on F(q, \theta).
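The bound ℓ(θ; x) ≥ ⟨ℓ_c⟩_q + H_q from the Jensen's-inequality slide can be verified numerically for a tiny discrete model; the joint p(x, z) below is made up for illustration:

```python
import math

# Made-up joint p(x, z) for one fixed observation x and a binary latent z.
p_xz = [0.12, 0.28]          # p(x, z=0), p(x, z=1)
p_x = sum(p_xz)              # marginal p(x)
ell = math.log(p_x)          # incomplete log likelihood  l(theta; x)

def free_energy(q):
    """F(q, theta) = sum_z q(z) log p(x,z)/q(z) = <l_c>_q + H_q."""
    return sum(qz * math.log(pxz / qz)
               for qz, pxz in zip(q, p_xz) if qz > 0)

# Any q(z) gives a lower bound on ell ...
f_uniform = free_energy([0.5, 0.5])

# ... and q(z) = p(z | x) makes the bound tight (the E-step choice).
posterior = [pxz / p_x for pxz in p_xz]
f_tight = free_energy(posterior)
```

The tightness of the bound at q(z) = p(z | x) is exactly what justifies the E-step: it maximizes F over q for the current θ.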