6.975 Week 11 Summary: Expectation Propagation
Erik Sudderth
November 20, 2002

1 Introduction

Expectation Propagation (EP) [2, 3] addresses the problem of constructing tractable approximations to complex probability distributions. Let x be a set of random variables of interest, and p(x) be a distribution formed from the product of several "compatibility functions":

    p(x) ∝ ∏_i ψ_i(x)                                                  (1)

EP is an iterative algorithm which attempts to choose the best approximation to p(x) from within some tractable class of distributions. To ensure that each iteration of EP is computationally feasible, we choose the approximating class to correspond to an exponential family q(x; θ).

Distributions p(x) of the form shown in equation (1) arise in a huge range of applications. For example, suppose that we have n independent observations y_i of an unobserved random variable x with prior distribution p_0(x). By Bayes' rule, the posterior distribution over x is then given by

    p(x | y_1, ..., y_n) ∝ p_0(x) ∏_{i=1}^{n} p_i(y_i | x)             (2)

Similarly, the prior or posterior distribution corresponding to any graphical model may be written as in equation (1). For example, a pairwise Markov random field with nodes V, edges E, hidden variables x_s, and local observations y_s has posterior distribution

    p(x | y) ∝ ∏_{s∈V} ψ_s(x_s, y_s) ∏_{(s,t)∈E} ψ_{s,t}(x_s, x_t)     (3)

In the remainder of this summary, we focus on inference problems defined by some fixed set of observations y. Thus, rather than explicitly specifying the observations, we assume that the target distribution is written as in equation (1).

2 Exponential Families

An exponential family of distributions q(x; θ) can be written as

    q(x; θ) = exp{ ∑_α θ_α φ_α(x) − Φ(θ) }                             (4)

The parameter vector θ indexes the distributions in the family, each of which corresponds to a different weighting of the potential functions φ_α(x). The log partition function Φ(θ) ensures that q(x; θ) is properly normalized for any choice of parameters θ. A wide range of classic distributions, including the Gaussian, Poisson, and discrete multinomial, may be written in exponential form. Note that when x is a continuous variable, normalization may only be possible for certain choices of θ (e.g. Gaussian distributions must have positive variance).

Exponential families have a number of properties which simplify standard computations. Expectation Propagation makes extensive use of two of these features. First, if we multiply or divide two exponential-family distributions, we produce a new distribution which is a member of the same exponential family (although normalizability may be lost). The coefficients of the product or quotient distribution are equal to the sum or difference of the input coefficients. Second, consider the problem of approximating some arbitrary distribution p(x) with a member of an exponential family. We choose the best approximation by minimizing the following Kullback–Leibler divergence:

    θ* = arg min_θ D( p(x) || q(x; θ) )                                (5)

The optimal solution to this problem is given by moment matching. In particular, θ* should be chosen so that

    ∫ q(x; θ*) φ_α(x) dx = ∫ p(x) φ_α(x) dx                            (6)

for all potential functions φ_α(x) in the exponential family. For many commonly used exponential families, the mapping between moment and exponential parameterizations is easily computed. Thus, in cases where it is tractable to compute moments of p(x), the minimization of equation (5) has a closed-form solution.
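As a concrete illustration of this closed-form solution, the sketch below (not part of the original notes) moment-matches a one-dimensional Gaussian q(x; θ) to an arbitrary target density. The two-component mixture target and the grid-based quadrature are assumptions chosen purely for illustration.

```python
# Moment matching (equations (5)-(6)) for a 1-D Gaussian approximating family.
# Illustrative sketch only: the mixture target p(x) and the grid quadrature
# below are assumptions, not part of the original notes.
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance, evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# An arbitrary (non-Gaussian) target p(x): a mixture of two Gaussians.
grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]
p = 0.7 * gaussian_pdf(grid, -1.0, 0.5) + 0.3 * gaussian_pdf(grid, 3.0, 1.0)
p /= p.sum() * dx  # renormalize to compensate for truncating the tails

# For the Gaussian family the sufficient statistics are phi(x) = (x, x^2),
# so minimizing D(p || q) reduces to matching the mean and variance of p.
mean = (grid * p).sum() * dx
var = ((grid - mean) ** 2 * p).sum() * dx

print(f"moment-matched Gaussian: mean = {mean:.3f}, variance = {var:.3f}")
```

The same recipe applies to any exponential family for which the expected sufficient statistics of p(x) can be computed; only the statistics being matched change.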
3 Assumed Density Filtering

Expectation Propagation can be seen as a method for iteratively refining the solution produced by classic Assumed Density Filtering (ADF) methods, such as the extended Kalman filter. Consider the factorized density p(x) of equation (1). ADF begins by choosing q(x; θ_1) to best approximate the first compatibility function ψ_1(x) according to equation (5). We then proceed through the remaining compatibility functions in order, updating the approximation to p(x) as

    θ_i = arg min_θ D( ψ_i(x) q(x; θ_{i−1}) || q(x; θ) )               (7)

At each ADF iteration, the current best estimate q(x; θ_{i−1}) of the product distribution is used to guide the incorporation of the next compatibility function ψ_i(x). While this is preferable to constructing independent approximations to each term, it has the undesirable property that the ADF estimate is sensitive to the order in which the compatibility terms are processed. In particular, if the first few terms are "misleading", so that their product is very different from the true product p(x), ADF may produce extremely poor approximations.

In the previous paragraph, we described ADF as iteratively refining the approximate posterior distribution q(x; θ). Alternatively, however, each ADF iteration can be seen as first approximating the compatibility function ψ_i(x) by a member of the exponential family m_i(x), and then exactly updating the posterior distribution:

    q(x; θ_i) = m_i(x) q(x; θ_{i−1}),    m_i(x) ∝ q(x; θ_i) / q(x; θ_{i−1})    (8)

Note that each m_i(x) is in the same exponential family as q(x; θ). In the following section, we show how this alternate interpretation naturally leads to the EP algorithm.

4 Expectation Propagation

Consider the ADF algorithm of the previous section. The best approximation m_i(x) to a particular compatibility function ψ_i(x) would be produced by directly minimizing

    D( p(x) || m_i(x) ∏_{j≠i} ψ_j(x) )                                 (9)

However, because direct computations with p(x) are intractable, ADF must neglect most of the product terms in computing its approximations to the first processed compatibility functions. In standard ADF, these initial approximations are never revisited, and therefore errors may significantly bias the final approximation.

The EP algorithm exploits the interpretation of ADF as approximating compatibility functions (as in equation (8)) to revisit each term approximation multiple times. At later iterations, EP uses its current best estimates of all but one compatibility function to improve the exponential approximation to the remaining term. One hopes that by iterating this procedure, EP will converge to a fixed point which approximates p(x) better than the results of any particular ADF ordering.

We initialize the EP algorithm by setting the compatibility approximations to some default values, typically m_i(x) = 1. The posterior approximation to p(x) is initialized as q(x; θ) ∝ ∏_i m_i(x). Each iteration of EP then proceeds as follows:

1. Choose some m_i(x) to refine.

2. Remove the effects of m_i(x) from the current posterior estimate q(x; θ) by dividing and normalizing:

    q(x; θ\i) ∝ q(x; θ) / m_i(x)                                       (10)

3. Update the
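The preview ends in the middle of step 3. The sketch below is an assumption rather than a quotation of these notes: it carries the loop through the remaining refinement in the standard EP way [2, 3], moment-matching the product ψ_i(x) q(x; θ\i) back into the exponential family and then recovering the new term approximation m_i(x) by division. It uses a one-dimensional Gaussian family in natural-parameter form, so that multiplying or dividing terms simply adds or subtracts coefficients, as noted in Section 2; the toy compatibility functions are made up for illustration.

```python
# A minimal sketch of the EP iteration described above, for a 1-D toy problem.
# Gaussian approximations are stored as natural parameters
# (eta[0] = mean/var, eta[1] = -1/(2*var)), so products and quotients of terms
# are just sums and differences of coefficients (Section 2). Steps 1-2 follow
# the text; the refinement after equation (10) is the standard EP update
# (moment matching, then division) and is assumed, since the preview is cut off.
import numpy as np

grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]

def natural_to_moments(eta):
    var = -0.5 / eta[1]
    return eta[0] * var, var

def moments_to_natural(mean, var):
    return np.array([mean / var, -0.5 / var])

def gaussian_on_grid(eta):
    """Unnormalized Gaussian with natural parameters eta, evaluated on the grid."""
    mean, var = natural_to_moments(eta)
    return np.exp(-0.5 * (grid - mean) ** 2 / var)

def project(unnormalized):
    """Moment-match an unnormalized density on the grid to a Gaussian (eq. (6))."""
    p = unnormalized / (unnormalized.sum() * dx)
    mean = (grid * p).sum() * dx
    var = ((grid - mean) ** 2 * p).sum() * dx
    return moments_to_natural(mean, var)

# Made-up compatibility functions psi_i(x): a broad Gaussian "prior" and two
# heavy-tailed observation terms, chosen only to make the product non-Gaussian.
psis = [lambda x: np.exp(-0.5 * x ** 2 / 9.0),
        lambda x: (1.0 + (x - 1.0) ** 2 / 4.0) ** -2.5,
        lambda x: (1.0 + (x - 2.5) ** 2 / 4.0) ** -2.5]

# Initialization: every term approximation m_i(x) = 1 (zero coefficients), so
# q(x; theta) = prod_i m_i(x) starts flat; a tiny precision keeps it proper.
sites = [np.zeros(2) for _ in psis]
theta = np.array([0.0, -1e-6])

for sweep in range(10):
    for i, psi in enumerate(psis):
        cavity = theta - sites[i]            # step 2: q(x; theta\i) = q / m_i
        if cavity[1] >= 0.0:
            continue                         # skip an improper cavity distribution
        hybrid = psi(grid) * gaussian_on_grid(cavity)
        theta = project(hybrid)              # assumed step 3: moment match
        sites[i] = theta - cavity            # new m_i = q_new / q(x; theta\i)

mean, var = natural_to_moments(theta)
print(f"EP approximation of p(x): mean = {mean:.3f}, variance = {var:.3f}")
```

In a realistic application the grid-based projection would be replaced by closed-form moment computations for the chosen family, but the structure of the loop, cavity by division, projection, and recovery of the refined term, stays the same.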

