Mean Field and Variational Methods
Loopy Belief Propagation
Graphical Models – 10-708
Carlos Guestrin, Carnegie Mellon University
November 8th, 2006
Readings: K&F 11.3, 11.5; Yedidia et al. paper from the class website

(Most slides illustrate the ideas on the student network, with variables Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, and Happy.)

Understanding reverse KL, the energy functional, and the partition function
- Maximizing the energy functional ⇔ minimizing reverse KL
- Theorem: the energy functional is a lower bound on the log-partition function: F[P̃, Q] ≤ ln Z for any distribution Q
- So maximizing the energy functional corresponds to searching for a tight lower bound on the partition function

Structured variational approximate inference
- Pick a family of distributions Q that allows exact inference, e.g., fully factorized (mean field): Q(X) = Π_i Q_i(X_i)
- Find the Q ∈ Q that maximizes the energy functional F[P̃, Q] = E_Q[ln P̃(X)] + H_Q(X)
- For mean field, the entropy decomposes: H_Q(X) = Σ_i H_{Q_i}(X_i)

Optimization for mean field
- Constrained optimization (each Q_i must sum to one), solved via Lagrange multipliers
- ∃ multipliers λ_i such that the constrained problem is equivalent to an unconstrained one; take the derivative and set it to zero
- Theorem: Q is a stationary point of the mean field approximation iff for each i:
  Q_i(x_i) = (1/Z_i) exp{ E_{X_{-i} ~ Q}[ ln P̃(X_{-i}, x_i) ] }

Understanding the fixed point equation
- (Worked through on the student network.)

Simplifying the fixed point equation
- Theorem: the fixed point above is equivalent to
  Q_i(x_i) = (1/Z_i) exp{ Σ_{j : X_i ∈ Scope[φ_j]} E_{U_j ~ Q}[ ln φ_j(U_j, x_i) ] },  where Scope[φ_j] = U_j ∪ {X_i}
- That is, Q_i only needs to consider factors whose scope intersects X_i

There are many stationary points!

Very simple approach for finding one stationary point
- Initialize Q (e.g., randomly or smartly)
- Set all variables to unprocessed
- Pick an unprocessed variable X_i:
  - update Q_i using the fixed point equation
  - mark variable i as processed
  - if Q_i changed, mark the neighbors of X_i as unprocessed
- Guaranteed to converge

More general structured approximations
- Mean field is a very naïve approximation
- Consider a more general form for Q (assumption: exact inference is doable over Q)
- Theorem: the stationary points of the energy functional are again characterized by a fixed-point equation, now for each factor ψ_j of Q

Computing the update rule for the general case
- Consider one factor ψ of Q at a time and derive its update

Structured variational updates require inference
- Compute marginals w.r.t. Q of the cliques in the original graph and the cliques in the new graph, for all cliques
- What is a good way of computing all these marginals?
- Potential update schedules:
  - sequential: compute marginals, update one ψ_j, recompute marginals
  - parallel: compute marginals, update all ψ's, recompute marginals

What you need to know about variational methods
- Structured variational methods: select a form for the approximate distribution, then minimize reverse KL
- Equivalent to maximizing the energy functional, i.e., searching for a tight lower bound on the partition function
- Many possible models for Q: independent (mean field), structured as a Markov net, cluster variational
- Several subtleties are outlined in the book

Announcements
- Tomorrow's recitation: Ajit on loopy BP
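Before moving on to loopy BP, here is a minimal sketch of the mean field coordinate ascent procedure described above, assuming P̃ is given as a list of strictly positive table factors (numpy arrays over small discrete variables). The names mean_field, card, factors, and scopes are illustrative, not from the lecture, and for simplicity the sketch sweeps over all variables each pass rather than maintaining the "unprocessed" set from the slide.

```python
import numpy as np

def mean_field(card, factors, scopes, max_sweeps=100, tol=1e-6, seed=0):
    """Coordinate ascent for a fully factorized (mean field) Q.

    card    : list, card[i] = number of values of X_i
    factors : list of numpy arrays; factors[j] has one axis per variable in scopes[j]
              (entries must be strictly positive, since we take logs)
    scopes  : list of tuples of variable indices, scopes[j] = Scope[phi_j]
    Returns Q as a list of 1-D arrays, Q[i] a distribution over X_i.
    """
    rng = np.random.default_rng(seed)
    Q = [rng.random(k) + 0.1 for k in card]          # random positive initialization
    Q = [q / q.sum() for q in Q]

    for _ in range(max_sweeps):
        max_change = 0.0
        for i in range(len(card)):
            # ln Q_i(x_i) = sum over factors containing X_i of E_Q[ln phi_j | x_i] + const
            log_q = np.zeros(card[i])
            for phi, scope in zip(factors, scopes):
                if i not in scope:
                    continue                         # only factors intersecting X_i matter
                log_phi = np.log(phi)
                # expectation over every other variable in the factor, under Q;
                # contract axes from last to first so earlier axis indices stay valid
                for axis, v in reversed(list(enumerate(scope))):
                    if v != i:
                        log_phi = np.tensordot(log_phi, Q[v], axes=([axis], [0]))
                log_q += log_phi                     # the remaining axis is X_i
            new_q = np.exp(log_q - log_q.max())      # exponentiate and renormalize (Z_i)
            new_q /= new_q.sum()
            max_change = max(max_change, np.abs(new_q - Q[i]).max())
            Q[i] = new_q
        if max_change < tol:                         # Q stopped changing: converged
            break
    return Q
```

As the slides note, this converges to one of possibly many stationary points; which one it finds depends on the initialization.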
Recall: message passing over junction trees
- Exact inference: generate a junction tree, then pass messages between neighboring cliques
- Inference is exponential in the size of the largest clique
- (For the student network, the figure shows a junction tree with cliques C,D; D,I,G; G,S,I; G,J,S,L; and H,G,J.)

Belief propagation on tree pairwise Markov nets
- A tree pairwise Markov net is a tree!!! ☺ no need to create a junction tree
- Message passing between neighboring nodes; in general, with N(i) the neighbors of i in the pairwise MN:
  m_{i→j}(x_j) = Σ_{x_i} φ_i(x_i) φ_{ij}(x_i, x_j) Π_{k ∈ N(i) \ {j}} m_{k→i}(x_i)
- Theorem: on a tree, this converges to the true probabilities:
  P(x_i) ∝ φ_i(x_i) Π_{k ∈ N(i)} m_{k→i}(x_i)

Loopy belief propagation on pairwise Markov nets
- What if we apply BP in a graph with loops? Send messages between pairs of nodes in the graph, and hope for the best
- What happens?
  - evidence goes around the loops multiple times
  - it may not converge
  - if it converges, it is usually overconfident about probability values
- But it often gives reasonable, or at least useful, answers, especially if you only care about the MPE rather than the actual probabilities

More details on loopy BP
- Numerical problem: messages < 1 get multiplied together as we go around the loops, so the numbers can go to zero
- Fix: normalize each message to sum to one:
  m_{i→j}(x_j) = (1/Z_{i→j}) Σ_{x_i} φ_i(x_i) φ_{ij}(x_i, x_j) Π_{k ∈ N(i) \ {j}} m_{k→i}(x_i)
  Z_{i→j} doesn't depend on X_j, so normalizing doesn't change the answer
- Computing node "beliefs" (estimates of the marginals):
  b_i(x_i) ∝ φ_i(x_i) Π_{k ∈ N(i)} m_{k→i}(x_i)

An example of running loopy BP

Convergence
- If you have tried to send all messages and the beliefs haven't changed (by much) → converged

(Non-)convergence of loopy BP
- Loopy BP can oscillate!!! The oscillations can be small, or they can be really bad!
- Typically: if the factors are closer to uniform, loopy BP does well (converges); if the factors are closer to deterministic, loopy BP doesn't behave well
- One approach that helps: damping the messages, i.e., the new message is the average of the old message and the newly computed one:
  m_{i→j}^{new} = ½ m_{i→j}^{old} + ½ m_{i→j}^{BP}
- Damping often gives better convergence, but when damping is required to get convergence, the result is often bad
- (Graphs from Murphy et al. '99.)
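The normalized, damped pairwise updates above are easy to put together. Here is a minimal sketch, assuming node and edge potentials given as numpy tables; loopy_bp, node_pot, and edge_pot are illustrative names, and the schedule is a simple fixed sweep over directed edges rather than one of the smarter schedules mentioned in the book.

```python
import numpy as np

def loopy_bp(node_pot, edge_pot, max_iters=100, tol=1e-6, damping=0.5):
    """Damped, normalized loopy BP on a pairwise Markov net.

    node_pot : dict  i -> 1-D array phi_i(x_i)
    edge_pot : dict  (i, j) -> 2-D array phi_ij(x_i, x_j), one entry per undirected edge
    damping=0.5 gives the "average of old and new message" update from the lecture.
    Returns beliefs: dict i -> estimated marginal of X_i.
    """
    # neighbor lists and one message per directed edge, initialized uniform
    nbrs = {i: [] for i in node_pot}
    msgs = {}
    for (i, j) in edge_pot:
        nbrs[i].append(j)
        nbrs[j].append(i)
        msgs[(i, j)] = np.ones(len(node_pot[j])) / len(node_pot[j])
        msgs[(j, i)] = np.ones(len(node_pot[i])) / len(node_pot[i])

    def pairwise(i, j):
        # phi_ij with axis 0 indexing x_i and axis 1 indexing x_j
        return edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T

    for _ in range(max_iters):
        max_change = 0.0
        for (i, j) in list(msgs):
            # product of phi_i and all incoming messages except the one from j
            pre = node_pot[i].astype(float)
            for k in nbrs[i]:
                if k != j:
                    pre = pre * msgs[(k, i)]
            new = pairwise(i, j).T @ pre                         # sum over x_i
            new /= new.sum()                                     # Z_{i->j}: doesn't change beliefs
            new = damping * msgs[(i, j)] + (1 - damping) * new   # damped update
            max_change = max(max_change, np.abs(new - msgs[(i, j)]).max())
            msgs[(i, j)] = new
        if max_change < tol:                                     # messages stopped changing
            break

    beliefs = {}
    for i in node_pot:
        b = node_pot[i].astype(float)
        for k in nbrs[i]:
            b = b * msgs[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs
```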
Loopy BP in factor graphs
- What if we don't have a pairwise Markov net? Two options:
  1. Transform the model to a pairwise MN
  2. Use loopy BP on a factor graph
- Message example: from a node to a factor, and from a factor to a node
- (Factor graph example from the figure: variable nodes A, B, C, D, E; factor nodes over {A,B,C}, {A,B,D}, {B,D,E}, {C,D,E}.)

Loopy BP in factor graphs: the messages
- From node i to factor j, where F(i) is the set of factors whose scope includes X_i:
  m_{i→j}(x_i) = Π_{k ∈ F(i) \ {j}} m_{k→i}(x_i)
- From factor j to node i, where Scope[φ_j] = Y ∪ {X_i}:
  m_{j→i}(x_i) = Σ_Y φ_j(Y, x_i) Π_{X_k ∈ Y} m_{k→j}(x_k)

What you need to know about loopy BP
- It is the application of belief propagation in loopy graphs
- It doesn't always converge; damping can help, and good message schedules can help (see the book)
- If it converges, it often converges to incorrect, but useful, results
- It generalizes from pairwise Markov networks to general factorizations by using factor graphs
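As a companion to the two factor-graph message equations above, here is a minimal sketch of one synchronous round of messages, under the same table-factor assumptions as the earlier sketches; the function name, argument names, and the dictionary layout of the messages are all illustrative.

```python
import numpy as np

def factor_graph_round(var_card, factor_scopes, factor_tables, msgs):
    """One synchronous round of loopy BP messages on a factor graph.

    var_card      : dict  i -> number of values of X_i
    factor_scopes : dict  j -> tuple of variable ids in Scope[phi_j]
    factor_tables : dict  j -> numpy array with one axis per variable in factor_scopes[j]
    msgs          : dict with one entry ('v2f', i, j) and one entry ('f2v', j, i) for
                    every (variable, factor) adjacency; each value is a 1-D array over X_i
    Returns the updated (normalized) message dictionary.
    """
    new = {}

    # Variable-to-factor: product of messages from all *other* factors containing X_i.
    for (_, i, j) in [k for k in msgs if k[0] == 'v2f']:
        m = np.ones(var_card[i])
        for jj, scope in factor_scopes.items():
            if jj != j and i in scope:
                m = m * msgs[('f2v', jj, i)]
        new[('v2f', i, j)] = m / m.sum()

    # Factor-to-variable: multiply incoming variable messages into the factor table,
    # then sum out every variable except X_i.
    for (_, j, i) in [k for k in msgs if k[0] == 'f2v']:
        scope = factor_scopes[j]
        t = factor_tables[j].astype(float)
        for axis, v in enumerate(scope):
            if v != i:
                shape = [1] * len(scope)
                shape[axis] = var_card[v]
                t = t * msgs[('v2f', v, j)].reshape(shape)   # broadcast onto X_v's axis
        sum_axes = tuple(a for a, v in enumerate(scope) if v != i)
        m = t.sum(axis=sum_axes)
        new[('f2v', j, i)] = m / m.sum()

    return new
```

Iterating this round until the messages stop changing, and then combining incoming factor-to-variable messages at each node, gives the same kind of (approximate) beliefs as in the pairwise case.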