Mean Field and Variational Methods
First approximate inference

Graphical Models – 10708
Carlos Guestrin
Carnegie Mellon University
November 1st, 2006

Readings: K&F: 11.1, 11.5

Approximate inference overview
- So far: VE & junction trees
  - exact inference
  - exponential in tree-width
- There are many, many approximate inference algorithms for PGMs
- We will focus on three representative ones:
  - sampling
  - variational inference
  - loopy belief propagation and generalized belief propagation
- There will be a special recitation by Pradeep Ravikumar on more advanced methods

Approximating the posterior vs. approximating the prior
- The prior model represents the entire world:
  - the world is complicated
  - thus the prior model can be very complicated
- Posterior: after making observations
  - sometimes we become much more sure about the way things are
  - sometimes the posterior can be approximated by a simple model
- First approach to approximate inference: find a simple model that is "close" to the posterior
- Fundamental problems:
  - what does "close" mean?
  - the posterior is the intractable result of inference; how can we approximate what we don't have?

[Figure: running example network over Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy; the same network recurs on several later slides]

KL divergence: Distance between distributions
- Given two distributions p and q, the KL divergence is
    D(p||q) = \sum_x p(x) \ln \frac{p(x)}{q(x)}
- D(p||q) = 0 iff p = q
- Not symmetric: p determines where the difference is important
  - p(x) = 0 and q(x) ≠ 0: the term contributes nothing
  - p(x) ≠ 0 and q(x) = 0: the term makes D(p||q) infinite

Find simple approximate distribution
- Suppose p is an intractable posterior; we want a simple q that approximates p
- KL divergence is not symmetric (a small numerical illustration follows the next slide):
  - D(p||q):
    - the true distribution p defines the support of the difference: the "correct" direction
    - will be intractable to compute
  - D(q||p):
    - the approximate distribution q defines the support
    - tends to give overconfident results
    - will be tractable

Back to graphical models
- Inference in a graphical model:
    P(X) = \frac{1}{Z} \prod_j \phi_j(U_j), and we want to compute P(X_i | e)
- Our p: the posterior after conditioning on the evidence e
- What is the simplest q? One where every variable is independent:
    Q(X) = \prod_i Q_i(X_i)
  - the mean field approximation
  - can compute any probability very efficiently
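To make the asymmetry bullets above concrete, here is a minimal numerical sketch (not from the lecture; the distributions and the NumPy dependency are my own, for illustration). It shows the two directions disagreeing on the same pair of distributions, and reproduces the support behavior from the KL slide: D(p||q) becomes infinite as soon as q puts zero mass where p does not, while D(q||p) stays finite.

```python
import numpy as np

def kl(p, q):
    # D(p||q) = sum_x p(x) ln(p(x)/q(x)).
    # Convention: terms with p(x) = 0 contribute 0; if q(x) = 0 anywhere
    # that p(x) > 0, the divergence is infinite.
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A bimodal "true" p over four states and a unimodal approximation q
# (both made up for this illustration).
p = np.array([0.45, 0.05, 0.05, 0.45])
q = np.array([0.80, 0.10, 0.05, 0.05])

print("D(p||q) =", round(kl(p, q), 4))   # large: q misses p's second mode
print("D(q||p) =", round(kl(q, p), 4))   # smaller: judged only where q has mass

# Support behavior: q0 puts zero mass on states where p > 0.
q0 = np.array([0.9, 0.1, 0.0, 0.0])
print("D(q0||p) =", round(kl(q0, p), 4)) # finite: q0's support lies inside p's
print("D(p||q0) =", kl(p, q0))           # infinite
```

This is exactly why the reverse direction "tends to give overconfident results": nothing stops a q minimizing D(q||p) from collapsing onto one mode of p, since it is only penalized where it itself places probability.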
D(p||q) for mean field – KL the right way
- p: P(X | e); q: \prod_i Q_i(X_i)
- D(p||q) = \sum_x p(x) \ln \frac{p(x)}{q(x)}
- Minimizing D(p||q) over fully factorized q sets each Q_i(X_i) to the true marginal P(X_i | e), so even writing down the optimum requires the very marginals we set out to approximate: intractable

D(q||p) for mean field – KL the reverse direction
- p: P(X | e); q: \prod_i Q_i(X_i)
- D(q||p) = \sum_x q(x) \ln \frac{q(x)}{p(x)} = -H_q(X) - E_q[\ln p(X)]
- Under a factorized q, the entropy decomposes over variables and E_q[\ln p] decomposes over the factors of p, so each term is an expectation over a small scope (plus a constant \ln Z that does not depend on q): tractable to optimize

What you need to know so far
- Goal: find an efficient distribution that is close to the posterior
- Distance: measure distance in terms of KL divergence
- Asymmetry of KL: D(p||q) ≠ D(q||p)
- Computing the right KL is intractable, so we use the reverse KL

Announcements
- Tomorrow's recitation: Khalid on variational methods
- Monday's special recitation: Khalid on Dirichlet processes
  - an exciting way to deal with model selection using graphical models, e.g., selecting (or averaging over) the number of clusters in unsupervised learning
  - you even get to see a BN with infinitely many variables

Reverse KL & the partition function: back to the general case
- Consider again the definition of D(q||p)
- Here p is a Markov net P_\Phi:
    P_\Phi(X) = \frac{1}{Z} \prod_j \phi_j(U_j)
- Theorem:
    \ln Z = F[\tilde{P}, Q] + D(Q || P_\Phi)
  where the energy functional is
    F[\tilde{P}, Q] = \sum_j E_Q[\ln \phi_j(U_j)] + H_Q(X)

Understanding reverse KL, the energy functional & the partition function
- Since \ln Z is a constant, maximizing the energy functional is equivalent to minimizing the reverse KL
- Theorem: the energy functional is a lower bound on the log partition function,
    F[\tilde{P}, Q] ≤ \ln Z,
  because D(Q || P_\Phi) ≥ 0
- Maximizing the energy functional corresponds to searching for a tight lower bound on the partition function (a numerical check of this bound appears at the end of the section)

Structured variational approximate inference
- Pick a family of distributions Q that allows exact inference
  - e.g., fully factorized (mean field)
- Find the Q ∈ Q that maximizes F[\tilde{P}, Q]
- For mean field: Q(X) = \prod_i Q_i(X_i)

Optimization for mean field
- Constrained optimization: each Q_i must normalize; introduce a Lagrange multiplier λ_i enforcing \sum_{x_i} Q_i(x_i) = 1, making the problem equivalent to unconstrained optimization of the Lagrangian
- Take the derivative, set it to zero
- Theorem: Q is a stationary point of the mean field approximation iff for each i:
    Q_i(x_i) = \frac{1}{Z_i} \exp\{ E_Q[\ln \tilde{P}(X) \mid x_i] \}

Understanding the fixed point equation

Simplifying the fixed point equation

Q_i only needs to consider factors that intersect X_i
- Theorem: the fixed point
    Q_i(x_i) = \frac{1}{Z_i} \exp\{ E_Q[\ln \tilde{P}(X) \mid x_i] \}
  is equivalent to
    Q_i(x_i) = \frac{1}{Z_i} \exp\{ \sum_{j : X_i \in Scope[\phi_j]} E_Q[\ln \phi_j(U_j, x_i)] \}
  where Scope[\phi_j] = U_j ∪ {X_i}
- Factors that do not mention X_i contribute only a constant, which is absorbed into Z_i

There are many stationary points!

Very simple approach for finding one stationary point
- Initialize Q (e.g., randomly or smartly)
- Set all variables to unprocessed
- Pick an unprocessed variable X_i:
  - update Q_i via the fixed point equation above
  - set variable i as processed
  - if Q_i changed, set the neighbors of X_i to unprocessed
- Guaranteed to converge (a runnable sketch of this loop follows the last slide)

More general structured approximations
- Mean field is a very naïve approximation
- Consider a more general form for Q
  - assumption: exact inference is doable over Q
- Theorem: the stationary points of the energy functional satisfy an analogous fixed-point characterization, with the expectations taken under the structured Q

Computing update rule for general case
- Consider one factor at a time:

Structured variational update requires inference
- For a structured Q, evaluating the expectations in the update rule itself requires running inference in Q, which is tractable by the assumption above

What you need to know about variational methods
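As promised on the stationary-point slide, here is a minimal runnable sketch of the coordinate-ascent procedure, on a made-up pairwise Markov net (a cycle of four binary variables); the model, variable names, and NumPy dependency are my own illustration, not the lecture's. For simplicity it sweeps over all variables rather than maintaining the slide's unprocessed set; each update is the fixed point equation restricted to factors whose scope contains X_i.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny pairwise Markov net: 4 binary variables on a cycle, one strictly
# positive factor per edge (model is made up for illustration).
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
phi = {e: rng.uniform(0.5, 2.0, size=(2, 2)) for e in edges}

# Fully factorized Q(X) = prod_i Q_i(X_i), initialized randomly.
Q = rng.dirichlet(np.ones(2), size=n)

# Coordinate ascent on the energy functional: repeatedly apply
#   Q_i(x_i) <- (1/Z_i) exp{ sum over factors touching X_i of E_Q[ln phi | x_i] }
for sweep in range(200):
    max_change = 0.0
    for i in range(n):
        log_qi = np.zeros(2)
        for (a, b) in edges:
            if a == i:      # expectation over X_b: sum_xb Q_b(xb) ln phi(xi, xb)
                log_qi += np.log(phi[(a, b)]) @ Q[b]
            elif b == i:    # expectation over X_a: sum_xa Q_a(xa) ln phi(xa, xi)
                log_qi += np.log(phi[(a, b)]).T @ Q[a]
        new_qi = np.exp(log_qi - log_qi.max())   # subtract max for stability
        new_qi /= new_qi.sum()                   # the 1/Z_i normalization
        max_change = max(max_change, float(np.abs(new_qi - Q[i]).max()))
        Q[i] = new_qi
    if max_change < 1e-8:                        # reached a stationary point
        break

print("converged after", sweep + 1, "sweeps")
print("mean field marginals Q_i(X_i = 1):", Q[:, 1].round(4))
```

Because each update only reads the neighboring Q_j's, the scheduling on the slide (reactivating neighbors whenever Q_i changes) is an efficiency refinement of this same loop, not a different algorithm; both converge, though possibly to different stationary points depending on initialization.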
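Finally, a self-contained numerical check of the partition-function bound from the energy-functional slides, on a made-up three-variable model small enough to brute-force (again my own illustration): it verifies F[P~, Q] ≤ ln Z for a random factorized Q, and that ln Z = F[P~, Q] + D(Q || P_Φ) holds exactly, as the theorem states.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

# Tiny made-up Markov net over 3 binary variables with two pairwise factors.
phi01 = rng.uniform(0.5, 2.0, size=(2, 2))   # factor over (X0, X1)
phi12 = rng.uniform(0.5, 2.0, size=(2, 2))   # factor over (X1, X2)

def p_tilde(x):
    # Unnormalized measure P~(x) = prod of factors.
    return phi01[x[0], x[1]] * phi12[x[1], x[2]]

states = list(product([0, 1], repeat=3))
Z = sum(p_tilde(x) for x in states)          # brute-force partition function

# A random fully factorized Q(X) = prod_i Q_i(X_i).
Q = rng.dirichlet(np.ones(2), size=3)
def q_prob(x):
    return Q[0][x[0]] * Q[1][x[1]] * Q[2][x[2]]

# Energy functional F[P~, Q] = E_Q[ln P~(X)] + H_Q(X),
# and reverse KL against the normalized model P = P~/Z.
F = sum(q_prob(x) * (np.log(p_tilde(x)) - np.log(q_prob(x))) for x in states)
D = sum(q_prob(x) * np.log(q_prob(x) * Z / p_tilde(x)) for x in states)

print(f"ln Z        = {np.log(Z):.6f}")
print(f"F[P~, Q]    = {F:.6f}   (F <= ln Z, equality iff Q = P)")
print(f"F + D(Q||P) = {F + D:.6f}   (recovers ln Z exactly)")
```

The brute-force enumeration here is only for checking the identity; the point of restricting Q to a tractable family is precisely that the expectations inside F then decompose over factor scopes, so no enumeration over full assignments is ever needed.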