CMU CS 10708 - Learning Partially Observed Graphical Models

Probabilistic Graphical Models, 10-708
Learning Partially Observed Graphical Models
Eric Xing, Lecture 13, Oct 26, 2005
Reading: MJ Chap. 5, 10, 11

Partially observed GMs
- Speech recognition.
  [Figure: hidden Markov model with state sequence Y1, ..., YT, transition matrices A, and observation sequence X1, ..., XT.]
- Biological evolution.
  [Figure: evolutionary tree with an unobserved ancestor sequence and observed descendant sequences (e.g., AG, AG, AC) separated by T years.]

Unobserved variables
- A variable can be unobserved (latent) because:
  - it is an imaginary quantity meant to provide a simplified, abstract view of the data-generation process (e.g., speech recognition models, mixture models, ...);
  - it is a real-world object or phenomenon that is difficult or impossible to measure (e.g., the temperature of a star, the causes of a disease, evolutionary ancestors, ...);
  - it is a real-world object or phenomenon that simply was not measured, e.g., because of faulty sensors.
- Discrete latent variables can be used to partition/cluster data into sub-groups.
- Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc.).

Mixture models
- A density model p(x) may be multi-modal.
- We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians).
- Each mode may correspond to a different sub-population (e.g., male and female).

Gaussian mixture models (GMMs)
- Consider a mixture of K Gaussian components. [Graphical model: Z -> X]
- Z is a latent class-indicator vector:
  $p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}$
- X is a conditional Gaussian variable with a class-specific mean and covariance:
  $p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right\}$
- The likelihood of a sample is a sum over components of mixture proportion times mixture component:
  $p(x_n \mid \mu, \Sigma) = \sum_{z_n} p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \Sigma) = \sum_{z_n} \prod_k \big( \pi_k\, \mathcal{N}(x_n : \mu_k, \Sigma_k) \big)^{z_n^k} = \sum_k \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$
- This model can be used for unsupervised clustering.
- This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.
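As a concrete illustration of the GMM likelihood above, here is a minimal NumPy/SciPy sketch (illustrative only; the function name `gmm_log_likelihood` and the parameter values are hypothetical, not from the lecture) that evaluates $\sum_k \pi_k\, \mathcal{N}(x_n; \mu_k, \Sigma_k)$ for each sample and sums the logs:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """Incomplete log-likelihood of data X under a Gaussian mixture:
    sum_n log sum_k pi_k N(x_n; mu_k, Sigma_k)."""
    # densities[n, k] = N(x_n; mu_k, Sigma_k)
    densities = np.stack(
        [multivariate_normal.pdf(X, mean=m, cov=S) for m, S in zip(mus, Sigmas)],
        axis=1)
    return np.sum(np.log(densities @ pis))

# Hypothetical 2-component, 2-D mixture (for illustration only)
pis = np.array([0.4, 0.6])
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
X = np.random.default_rng(0).normal(size=(5, 2))
print(gmm_log_likelihood(X, pis, mus, Sigmas))
```

Summed over samples, this is exactly the incomplete log-likelihood that the gradient and EM methods below try to maximize.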
Conditional mixture model: mixture of experts
- We model p(Y|X) using different experts, each responsible for a different region of the input space.
- A latent variable Z chooses the expert via a softmax gating function:
  $P(z^k = 1 \mid x) = \mathrm{Softmax}(\xi_k^T x)$
- Each expert can be a linear regression model:
  $P(y \mid x, z^k = 1) = \mathcal{N}(y;\, \theta_k^T x,\, \sigma_k^2)$
- The posterior expert responsibilities are
  $P(z^k = 1 \mid x, y) = \dfrac{p(z^k = 1 \mid x)\, p_k(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1 \mid x)\, p_j(y \mid x, \theta_j, \sigma_j^2)}$

Hierarchical mixture of experts
- This is like a soft version of a depth-2 classification/regression tree.
- P(Y | X, G1, G2) can be modeled as a GLIM, with parameters dependent on the values of G1 and G2 (which specify a "conditional path" to a given leaf in the tree).

Mixture of overlapping experts
- By removing the X -> Z arc, we can make the partitions independent of the input, thus allowing overlap.
- This is a mixture of linear regressors; each subpopulation has a different conditional mean:
  $P(z^k = 1 \mid x, y) = \dfrac{p(z^k = 1)\, p_k(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1)\, p_j(y \mid x, \theta_j, \sigma_j^2)}$

Why is learning harder?
- In fully observed i.i.d. settings, the log-likelihood decomposes into a sum of local terms (at least for directed models):
  $\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$
- With latent variables, all the parameters become coupled together via marginalization:
  $\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$

Gradient learning for mixture models
- We can learn mixture densities using gradient descent on the log-likelihood. The gradients are quite interesting:
  $\frac{\partial \ell}{\partial \theta} = \frac{\partial}{\partial \theta} \log p(x \mid \theta) = \frac{1}{p(x \mid \theta)} \frac{\partial}{\partial \theta} \sum_k \pi_k\, p_k(x \mid \theta_k) = \sum_k \frac{\pi_k}{p(x \mid \theta)} \frac{\partial p_k(x \mid \theta_k)}{\partial \theta_k} = \sum_k \frac{\pi_k\, p_k(x \mid \theta_k)}{p(x \mid \theta)} \frac{\partial \log p_k(x \mid \theta_k)}{\partial \theta_k} = \sum_k r_k \frac{\partial \ell_k}{\partial \theta_k}$
- In other words, the gradient is the responsibility-weighted sum of the individual log-likelihood gradients, with responsibilities $r_k = \pi_k\, p_k(x \mid \theta_k) / p(x \mid \theta)$.
- This can be passed to a conjugate gradient routine.

Parameter constraints
- Often we have constraints on the parameters, e.g. $\sum_k \pi_k = 1$ and $\Sigma$ symmetric positive definite (hence $\Sigma_{ii} > 0$).
- We can use constrained optimization, or we can reparameterize in terms of unconstrained values.
- For normalized weights, use the softmax transform:
  $\pi_k = \dfrac{\exp(\gamma_k)}{\sum_j \exp(\gamma_j)}$
- For covariance matrices, use the Cholesky decomposition
  $\Sigma^{-1} = A^T A$,
  where A is upper triangular with a positive diagonal:
  $A_{ii} = \exp(\lambda_i) > 0, \qquad A_{ij} = \eta_{ij}\ (j > i), \qquad A_{ij} = 0\ (j < i)$
  The parameters $\gamma_i$, $\lambda_i$, $\eta_{ij} \in \mathbb{R}$ are unconstrained.
- Use the chain rule to compute $\partial \ell / \partial \pi$ and $\partial \ell / \partial A$. (A small reparameterization sketch appears at the end of these notes.)

Identifiability
- A mixture model induces a multi-modal likelihood.
- Hence gradient ascent can only find a local maximum.
- Mixture models are unidentifiable, since we can always switch the hidden labels without affecting the likelihood.
- Hence we should be careful in trying to interpret the "meaning" of latent variables.

Expectation-Maximization (EM) algorithm
- EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data.
- It is much simpler than gradient methods:
  - no need to choose a step size;
  - constraints are enforced automatically;
  - it calls inference and fully observed learning as subroutines.
- EM is an iterative algorithm with two linked steps:
  - E-step: fill in hidden values using inference, $p(z \mid x, \theta^t)$;
  - M-step: update the parameters $\theta^{t+1}$ using a standard MLE/MAP method applied to the completed data.
- We will prove that this procedure monotonically improves the likelihood (or leaves it unchanged). Thus it always converges to a local optimum of the likelihood.

Complete & incomplete log-likelihoods
- Complete log-likelihood: let X denote the observable variable(s) and Z denote the latent variable(s). If Z could be observed, then
  $\ell_c(\theta; x, z) \stackrel{\mathrm{def}}{=} \log p(x, z \mid \theta)$
- Usually, optimizing $\ell_c(\cdot)$ given both z and x is straightforward (cf. MLE for fully observed models). Recall that in this case the objective for, e.g., MLE decomposes into a sum of factors, and the parameter for each factor can be estimated separately.
- But given that Z is not observed, $\ell_c(\cdot)$ is a random quantity and cannot be maximized directly.
- Incomplete log-likelihood: with z unobserved, our objective becomes the log of a marginal probability:
  $\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$
- This objective won't decouple.

Expected complete log-likelihood
- For any distribution q(z), define the expected complete log-likelihood:
  $\langle \ell_c(\theta; x, z) \rangle_q \stackrel{\mathrm{def}}{=} \sum_z q(z \mid x)\, \log p(x, z \mid \theta)$
- It is a deterministic function of $\theta$.
- It is linear in $\ell_c(\cdot)$, so it inherits its factorizability.
- Does maximizing this surrogate yield a maximizer of the likelihood? By Jensen's inequality,
  $\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)} \;\ge\; \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)}$
  $\Rightarrow \ \ell(\theta; x) \ \ge\ \langle \ell_c(\theta; x, z) \rangle_q + H_q$

Lower bounds and free energy
- For fixed data x, define a functional called the free energy, which is exactly the lower bound derived above:
  $F(q, \theta) = \sum_z q(z \mid x)\, \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \ \le\ \ell(\theta; x)$
- The EM algorithm is ...
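To make the E-step / M-step loop above concrete, here is a minimal EM sketch for a Gaussian mixture (an illustrative sketch only, not code from the lecture; the names `em_gmm` and `resp` are hypothetical). The E-step computes the responsibilities $p(z_n^k = 1 \mid x_n, \theta^t)$, and the M-step re-estimates $\pi, \mu, \Sigma$ by weighted MLE on the completed data:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a K-component Gaussian mixture (minimal sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]            # initialize means at random data points
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    for _ in range(n_iters):
        # E-step: resp[n, k] = p(z_n^k = 1 | x_n, theta^t)
        dens = np.stack(
            [multivariate_normal.pdf(X, mus[k], Sigmas[k]) for k in range(K)], axis=1)
        resp = pis * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: MLE on the completed data, weighted by the responsibilities
        Nk = resp.sum(axis=0)
        pis = Nk / N
        mus = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mus[k]
            Sigmas[k] = (resp[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
    return pis, mus, Sigmas
```

Each full E/M cycle never decreases the incomplete log-likelihood, consistent with the monotone-improvement claim above; like gradient ascent, it converges only to a local optimum.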


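Relatedly, the unconstrained reparameterizations from the "Parameter constraints" slide can be sketched as follows (illustrative only; the function names and the example values of $\gamma$, $\lambda$, $\eta$ are hypothetical):

```python
import numpy as np

def weights_from_gamma(gamma):
    """Softmax transform: pi_k = exp(gamma_k) / sum_j exp(gamma_j)."""
    e = np.exp(gamma - gamma.max())      # subtract the max for numerical stability
    return e / e.sum()

def precision_from_A(lam, eta):
    """Build Sigma^{-1} = A^T A with A upper triangular:
    A_ii = exp(lambda_i) > 0, A_ij = eta_ij for j > i, A_ij = 0 for j < i."""
    D = len(lam)
    A = np.zeros((D, D))
    A[np.triu_indices(D, k=1)] = eta     # strictly upper-triangular entries
    A[np.diag_indices(D)] = np.exp(lam)  # positive diagonal
    return A.T @ A                       # symmetric positive definite by construction

# Example: 3 mixture weights and a 3x3 precision matrix from unconstrained parameters
pi = weights_from_gamma(np.array([0.1, -0.3, 1.2]))
Lambda = precision_from_A(np.array([0.0, 0.2, -0.1]), np.array([0.5, -0.4, 0.3]))
```

Gradients with respect to $\gamma$, $\lambda$, and $\eta$ then follow from $\partial \ell / \partial \pi$ and $\partial \ell / \partial A$ via the chain rule, as noted on the slide.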