Undirected Graphical Models (finishing off)
Graphical Models – 10-708, Carlos Guestrin, Carnegie Mellon University, November 3rd, 2008
Readings: K&F: 4.1, 4.2, 4.3, 4.4, 4.5
©Carlos Guestrin 2006-2008

What you learned about so far
- Bayes nets
- Junction trees
- (General) Markov networks
- Pairwise Markov networks
- Factor graphs
How do we transform between them? More formally: given a graph in one representation, find an I-map in the other.

BNs → MNs: Moralization
Theorem: Given a BN G, the Markov net H formed by moralizing G is the minimal I-map for I(G).
Intuition: in a Markov net, each factor must correspond to a subset of a clique; the factors in a BN are the CPTs; a CPT is a factor over a node and its parents; thus a node and its parents must form a clique. (A short code sketch of this step follows this overview.)
Effect: some independencies that could be read from the BN graph become hidden.
[Figure: the student BN over Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy, and its moralized Markov net.]

From Markov nets to Bayes nets
[Figure: Markov net over Intelligence, Exam, Grade, SAT, Letter, Job.]

MNs → BNs: Triangulation
Theorem: Given a MN H, let G be a Bayes net that is a minimal I-map for I(H); then G must be chordal.
Intuition: v-structures in a BN introduce immoralities; these immoralities were not present in the Markov net; the triangulation eliminates them.
Effect: many independencies that could be read from the MN graph become hidden.
[Figure: the Markov net above and a chordal Bayes net that is a minimal I-map for it.]

Markov nets vs. pairwise MNs
Every Markov network can be transformed into a pairwise Markov net:
- introduce an extra "variable" for each factor over three or more variables
- the domain size of the extra variable is exponential in the number of variables in the factor
Effect: any local structure in the factor is lost, and a chordal MN doesn't look chordal anymore.
[Figure: a factor over A, B, C replaced by pairwise factors through an auxiliary variable.]

Overview of types of graphical models and transformations between them
[Figure: diagram relating BNs, MNs, pairwise MNs, and factor graphs.]

Mean Field and Variational Methods: First approximate inference
Graphical Models – 10-708, Carlos Guestrin, Carnegie Mellon University, November 3rd, 2008
Readings: K&F: 10.1, 10.5

Approximate inference overview
So far we have seen VE and junction trees: exact inference, but exponential in the tree-width. There are many, many approximate inference algorithms for PGMs. We will focus on three representative ones:
- sampling
- variational inference
- loopy belief propagation and generalized belief propagation

Approximating the posterior vs. approximating the prior
Prior model: it represents the entire world; the world is complicated, so the prior model can be very complicated.
Posterior: after making observations, we can sometimes become much more sure about the way things are, and the posterior can sometimes be approximated by a simple model.
First approach to approximate inference: find a simple model that is "close" to the posterior.
Fundamental problems: what does "close" mean? And the posterior is the intractable result of inference, so how can we approximate what we don't have?
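To make the moralization step concrete, here is a minimal Python sketch (not from the lecture): given a BN as a parent map, it connects each node to its parents, "marries" co-parents, and drops edge directions. The input fragment of the student network is only an illustrative assumption.

```python
from itertools import combinations

def moralize(parents):
    """Moralize a Bayes net given as {node: list_of_parents}.

    Returns the edge set of the moralized Markov net: every node is
    connected to its parents, all pairs of parents of a common child
    are married, and edge directions are dropped.
    """
    edges = set()
    for child, pa in parents.items():
        for p in pa:                        # child-parent edges, directions dropped
            edges.add(frozenset((child, p)))
        for p1, p2 in combinations(pa, 2):  # marry co-parents
            edges.add(frozenset((p1, p2)))
    return edges

# Hypothetical fragment of the student network used in the slides.
bn = {
    "Grade": ["Difficulty", "Intelligence"],
    "SAT": ["Intelligence"],
    "Letter": ["Grade"],
}
for e in sorted(tuple(sorted(x)) for x in moralize(bn)):
    print(e)
```

Running this prints, among the original edges, the moralizing edge between Difficulty and Intelligence, the edge that hides the marginal independence one could read off the BN.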
KL divergence: distance between distributions
Given two distributions p and q, the KL divergence is
  D(p||q) = Σ_x p(x) log [p(x) / q(x)].
D(p||q) = 0 iff p = q. It is not symmetric: p determines where the difference is important.
- Where p(x) = 0 and q(x) ≠ 0, the term contributes nothing (0 · log 0 = 0), so q is not penalized for putting mass where p has none.
- Where p(x) ≠ 0 and q(x) = 0, the divergence is infinite, so q must cover the support of p.

Find a simple approximate distribution
Suppose p is an intractable posterior; we want to find a simple q that approximates p. Since KL divergence is not symmetric, we must choose a direction:
- D(p||q): the true distribution p defines where the difference matters; this is the "correct" direction, but it will be intractable to compute.
- D(q||p): the approximate distribution q defines the support; it tends to give overconfident results, but it will be tractable.

Back to graphical models
Inference in a graphical model: the joint is P(x) = (1/Z) Π_j φ_j(D_j), and we want to compute P(X_i | e); this posterior is our p.
What is the simplest q? One in which every variable is independent:
  q(x) = Π_i q_i(x_i),
the mean field approximation, under which we can compute any marginal very efficiently.

D(p||q) for mean field: KL the right way
With p the true posterior and q fully factorized,
  D(p||q) = Σ_x p(x) log [p(x) / q(x)] = −H_p(X) − Σ_i E_p[log q_i(X_i)],
which requires expectations (marginals) under p, exactly the intractable quantities we were trying to avoid.

D(q||p) for mean field: KL the reverse direction
  D(q||p) = Σ_x q(x) log [q(x) / p(x)] = −H_q(X) − E_q[log p(X)].

D(q||p) for mean field: the entropy term
Because q factorizes, its entropy decomposes over variables:
  H_q(X) = Σ_i H(q_i) = −Σ_i Σ_{x_i} q_i(x_i) log q_i(x_i),
a sum of small, easy-to-compute terms.

D(q||p) for mean field: the cross-entropy term
With p(x) = (1/Z) Π_j φ_j(D_j),
  E_q[log p(X)] = Σ_j E_q[log φ_j(D_j)] − log Z,
and each E_q[log φ_j(D_j)] is an expectation over only the few variables in the scope D_j, so it is tractable under the factorized q.

What you need to know so far
- Goal: find an efficient distribution that is close to the posterior.
- Distance: measured in terms of KL divergence.
- Asymmetry of KL: D(p||q) ≠ D(q||p).
- Computing the "right" KL, D(p||q), is intractable, so we use the reverse KL, D(q||p).

Reverse KL and the partition function
Back to the general case: consider again the definition of D(q||p), where p is a Markov net with partition function Z.
Theorem: D(q||p) = log Z − F[P̃, q], where P̃ = Π_j φ_j(D_j) is the unnormalized measure and the energy functional is
  F[P̃, q] = Σ_j E_q[log φ_j(D_j)] + H_q(X).

Understanding reverse KL, the energy functional and the partition function
Maximizing the energy functional ⇔ minimizing the reverse KL.
Theorem: the energy functional is a lower bound on the log partition function, log Z ≥ F[P̃, q], since D(q||p) ≥ 0.
Maximizing the energy functional therefore corresponds to searching for a tight lower bound on the partition function.

Structured variational approximate inference
- Pick a family of distributions Q that allows for exact inference, e.g., fully factorized (mean field).
- Find the q ∈ Q that maximizes F[P̃, q].
- For mean field, q(x) = Π_i q_i(x_i), and the maximization can be carried out one variable at a time by coordinate ascent (see the sketch below).
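As a concrete illustration of that last point, here is a minimal naive mean-field sketch (not from the lecture): a coordinate-ascent update q_i(x_i) ∝ exp(Σ_j E_{q_j}[log φ_ij(x_i, X_j)]) on a tiny, made-up 3-variable pairwise binary MRF. The chain structure and potential tables are assumptions chosen only for illustration.

```python
import numpy as np

# Naive mean field on a made-up pairwise binary MRF: chain X0 - X1 - X2.
np.random.seed(0)
n_states = 2
edges = [(0, 1), (1, 2)]
# One log pairwise potential table (n_states x n_states) per edge.
log_phi = {e: np.log(np.array([[2.0, 1.0], [1.0, 2.0]])) for e in edges}

# Initialize each q_i to a random distribution over its states.
q = np.random.rand(3, n_states)
q /= q.sum(axis=1, keepdims=True)

for _ in range(50):                             # coordinate-ascent sweeps
    for i in range(3):
        # q_i(x_i) ∝ exp( sum over neighbors j of E_{q_j}[log phi_ij(x_i, X_j)] )
        log_q = np.zeros(n_states)
        for (a, b), table in log_phi.items():
            if a == i:
                log_q += table @ q[b]           # expectation over X_b
            elif b == i:
                log_q += table.T @ q[a]         # expectation over X_a
        log_q -= log_q.max()                    # numerical stability
        q[i] = np.exp(log_q)
        q[i] /= q[i].sum()

print(np.round(q, 3))                           # approximate marginals q_i
```

Each single-variable update maximizes the energy functional with respect to q_i while holding the other factors fixed, so every sweep can only increase F[P̃, q] (equivalently, decrease the reverse KL) until a local optimum is reached.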
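Finally, a tiny numeric check of the KL asymmetry claimed above, D(p||q) ≠ D(q||p); the two distributions here are made up purely for illustration.

```python
import numpy as np

# Two arbitrary made-up distributions over 3 states.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.4, 0.3])

def kl(a, b):
    """D(a||b) = sum_x a(x) log [a(x)/b(x)], with 0 log 0 treated as 0."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

print("D(p||q) =", round(kl(p, q), 4))
print("D(q||p) =", round(kl(q, p), 4))   # a different value: KL is not symmetric
```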