UW-Madison CS 731 - Exponential Families and Graphical Models - D2777609

Home> Schools> University of Wisconsin, Madison> (CS) > CS 731> Exponential Families and Graphical Models

DOC PREVIEW

UW-Madison CS 731 - Exponential Families and Graphical Models

School name University of Wisconsin, Madison

Course Cs 731- Advanced Artificial Intelligence (

Pages 4

This preview shows page 1 out of 4 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 4 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 4 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

The Maximum Entropy PrincipleExponential FamiliesMean Parametrization and Marginal PolytopesThe Log Partition Function AConjugate DualityCS731 Spring 2011 Advanced Artificial IntelligenceExponential Families and Graphical ModelsLecturer: Xiaojin Zhu [email protected] The Maximum Entropy PrincipleConsider the maximum entropy problem: Given iid items x1, . . . , xn∼ p∗, letˆµj≡1nnXi=1φj(xi), (1)where φj: X 7→ R is some function for j = 1 . . . d. The value ˆµjis the empirical expectation of φj.Can we recover p∗from ˆµ1. . . ˆµd? Consider density p absolutely continuous w.r.t. a base measure ν,which usually is the counting measure for probability mass function p, or the Lebesgue measure on R forcontinuous p. We say p is consistent with the data, ifEp[φj(x)] ≡ZXφj(x)p(x)ν(dx) = ˆµjfor j ∈ {1 . . . d}. (2)In general (when d is small), there will be many distributions that are consistent with the data. Themaximum entropy principle is to pick the distributionˆp = argmaxpZX−p(x) log p(x)ν(dx) (3)s.t. Ep[φj(x)] = ˆµjfor all j = 1 . . . d. (4)On one hand, there is no guarantee that ˆp = p∗. On the other hand, ˆp has a very interesting form:ˆp(x) ∝ expdXj=1θjφj(x)(5)with parameters θj∈ R.2 Exponential FamiliesThe solution above is in the so called exponential families. Formally, let φ = (φ1, . . . , φd)>be d sufficientstatistics, where φi: X 7→ R. Note X here in general is a high dimensional object itself, correspondingto all the nodes in a Graphical model. Let θ = (θ1, . . . , θd)>be an associated canonical parameters. Theexponential family is a family of probability densities:pθ(x) = expθ>φ(x) − A(θ)(6)F The key is the linear interaction (inner product) between parameters θ and sufficient statistics φ.A is the log partition functionA(θ) = logZexpθ>φ(x)ν(d x). (7)F A = log Z.1Exponential Families and Graphical Models 2DefineΩ = {θ ∈ Rd| A(θ) < ∞} (8)i.e., those parameters for which the density is normalizable. A regular exponential family is where Ω is anopen set. A minimal exponential family is where the φ’s are linearly independent, namely there does notexist a nonzero α ∈ Rdsuch that α>φ(x) = constant for all x. If an exponential family is not minimal, it iscalled overcomplete. Both minimal and overcomplete representations are useful.Example 1 (Bernoulli) Let p(x) = βx(1 − β)1−xfor x ∈ {0, 1} and β ∈ (0, 1). Although it does not looklike an exponential family, one can equivalently express the density asp(x) = exp (x log β + (1 − x) log(1 − β)) (9)= exp (xθ − log(1 + exp(θ))) , (10)where θ = logβ1−β. Note (9) is in exponential family form with φ1(x) = x, φ2(x) = 1 − x, θ1= log β, θ2=log(1 − β), and A(θ) = 0. Further note that α1= α2= 1 makes α>φ(x) = 1 for all x, thus (9) isan overcomplete representation. In contrast, (10) is a minimal exponential family with φ(x) = x, θ =logβ1−β, A(θ) = log(1 + exp(θ)).Many standard distributions (e.g., Gaussian, exponential, Poisson, Beta) are in the exp onential family,see standard textbooks for details. Not all familiar distributions are in the exponential family. For example,the Laplace distribution cannot be written as an exponential family (thanks Vincent Tan for supplying aproof).Example 2 (Ising Model) Let G = (V, E) be an undirected graph. Each node s ∈ V is associated with abinary random variable xs∈ {0, 1}. The Ising model is defined to bepθ(x) = expXs∈Vθsxs+X(s,t)∈Eθstxsxt− A(θ). (11)The sufficient statistics are d = |V |+|E| dimensional: φ(x) = (. . . xs. . . xst. . .)>. This is a regular (Ω = Rd),minimal exponential family.Example 3 (Potts Model) Similar to Ising model but generalizing xs∈ {0, . . . , r − 1}. Let indicatorfunctions fsj(x) = 1 if xs= j and 0 otherwise, and fstjk(x) = 1 if xs= j ∧ xt= k, and 0 otherwise. ThePotts model is defined to bepθ(x) = expXsjθsjfsj(x) +Xstjkθstjkfstjk(x) − A(θ). (12)Now d = r|V | + r2|E|. This is regular but overcomplete, becausePr−1j=0θsj(x) = 1 for any s ∈ V and all x.The Potts model is a special case where the parameters are tied: θstkk= α, and θstjk= β for j 6= k.Gaussian MRFs, Gaussian mixture models, and Latent Dirichlet Allocation can all be expressed inexponential family form.3 Mean Parametrization and Marginal PolytopesLet p be any density (not necessarily in exponential family). Given sufficient statistics φ, the mean parametersµ = (µ1, . . . , µd)>is defined asµi= Ep[φi(x)] =Zφi(x)p(x)ν(dx). (13)Consider the set of mean parameters realized by any distribution (not necessarily exponential family):M = {µ ∈ Rd| ∃p s.t. Ep[φ(x)] = µ}. (14)Exponential Families and Graphical Models 3Example 4 (The First Two Moments) Let φ1(x) = x, φ2(x) = x2. For any p (NB not necessarilyGaussian), the mean parameters µ = (µ1, µ2) = (E(x), E(x2))>. Since V(x) = E(x2) − E2(x) = µ2− µ21≥ 0for any p, we see that M is not R2but rather the subset defined by µ1∈ R, µ2≥ µ21.F It is helpful to think about p = unif orm(−a, a).M is a convex subset of Rd, since if µ(1), µ(2)∈ M there must exist p(1), p(2), and the convex combinationsof p(1), p(2)realizes the convex combinations of µ(1), µ(2).The marginal polytope is defined for any graphical model with multinomial random variables xs∈{0, 1, . . . , r − 1} (this can be generalizes so different xshave different rs). Consider the indicator func-tions defined in Example 3, which we call the standard overcomplete representation. The mean parametershas an intuitive interpretation:µsj= Ep[fsj(x)] = p(xs= j) (node marginals) (15)µstjk= Ep[fstjk(x)] = p(xs= j, xt= k) (edge marginals) (16)Furthermore, becauseM = {µ ∈ Rd| µ =Xxφ(x)p(x) for some p} (17)it is easy to see that M = conv{φ(x), ∀x}, i.e., the convex hull of point mass distributions that put mass ona single x. This convex hall is c alled the marginal polytope.Example 5 (The Marginal Polytope of a Tiny Ising Model) Consider a tiny Ising model with twonodes x1, x2∈ {0, 1} and an edge between them. The Ising model minimal representation is φ(x1, x2) =(x1, x2, x1x2)>. Note that there are only 4 different x = (x1, x2). Therefore, the marginal polytope is definedby the convex hull ofM = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}, (18)which is a polytope inside the unit cube. The three coordinates can be interpreted as µ1≡ Ep[x1=

View Full Document