1Probabilistic Graphical Models10-708Markov Chain Monte Carlo and Markov Chain Monte Carlo and Belief Propagation Belief Propagation Eric Xing Eric Xing Lecture 17, Nov 9, 2005Reading: MJ-Chap. 21, KF-Chap. 9Markov chain Monte Carlo (MCMC)z Importance sampling does not scale well to high dimensions.z Rao-Blackwellisation not always possible.z MCMC is an alternative.z Construct a Markov chain whose stationary distribution is the target density = P(X|e).z Run for Tsamples (burn-in time) until the chain converges/mixes/reaches stationary distribution.z Then collect M(correlated) samples xm.z Key issues:z Designing proposals so that the chain mixes rapidly.z Diagnosing convergence.2Markov Chainsz Definition:z Given an n-dimensional state spacez Random vector X = (x1,…,xn)z x(t)= x at time-step tz x(t)transitions to x(t+1)with probP(x(t)| x(t-1),…,x(1)) = T(x(t)| x(t-1)) = T(x(t-1)Æ x(t)) z Homogenous: chain determined by state x(0), fixed transition kernel T (rows sum to 1)z Equilibrium:π(x) is a stationary (equilibrium) distribution if π(x') = Σxπ(x) T(xÆx'). i.e., is a left eigenvector of the transition matrix πTT= πTT.()()⎟⎟⎟⎠⎞⎜⎜⎜⎝⎛=05050307007500250305020305020............X1X2X30.25 0.70.50.50.75 0.3Markov Chainsz An MC is irreducible if transition graph connectedz An MC is aperiodic if it is not trapped in cyclesz An MC is ergodic (regular) if you can get from state x to x ' in a finite number of steps.z Detailed balance: prob(x(t)Æx(i-1)) = prob(x(t-1)Æx(t))summing over x(t-1)z Detailed bal Æ stationary dist exists)|()()|()()()()()()()( 111 −−−=ttttttTpTpxxxxxx∑−−−=)()|()()()()()()(111tttttTppxxxxx3Metropolis-Hastingsz Treat the target distribution as stationary distributionz Sample from an easier proposal distribution, followed by an acceptance testz This induces a transition matrix that satisfies detailed balancez MH proposes moves according to Q(x'|x) and accepts samples with probability A(x'|x).z The induced transition matrix isz Detailed balance meansz Hence the acceptance ratio is)|'()|'()'(xxAxxQxxT=→)'|()'|()'()|'()|'()(xxAxxQxxxAxxQxππ=⎟⎟⎠⎞⎜⎜⎝⎛=)|'()()'|()'(,min)|'(xxQxxxQxxxAππ1Metropolis-Hastings1. Initialize x(0)2. While not mixing // burn-inzx=x(t)zt += 1,z sample u ~ Unif(0,1)z sample x* ~ Q(x*|x)-ifzx(t)= x* // transition-elsezx(t)= x// stay in current state z Reset t=0, for t =1:Nz x(t+1)) Å Draw sample (x(t))⎟⎟⎠⎞⎜⎜⎝⎛=<)|*()(*)|(*)(,min)|*(xxQxxxQxxxAuππ1Function Draw sample (x(t))4Mixing timez The εmixing time Tεis the minimal number of steps (from any starting distribution) until Dvar(P(T), π) ≤ε, where Dvaris the variational distance between the two distance:z Chains with low bandwidth (conductance) regions of space take a long time to mix.z This arises for GMs with deterministic or highly skewed potentials.X1X2X3X4X5X7X6)()(sup),(defvarAASA2121µµµµ−=⊂DMCMC example q(x*|x) ~ N(x(i),100)p(x) ~ 0.3 exp(-0.2x2) + 0.7 exp(-0.2(x-10)2)5Summary of MHz Random walk through state spacez Can simulate multiple chains in parallelz Much hinges on proposal distribution Qz Want to visit state space where p(X) puts massz Want A(x*|x) high in modes of p(X) z Chain mixes wellz Convergence diagnosisz How can we tell when burn-in is over?z Run multiple chains from different starting conditions, wait until they start “behaving similarly”.z Various heuristics have been proposed.Gibbs samplingz Gibbs sampling is an MCMC algorithm that is especially appropriate for inference in graphical models.z The proceduez we have variable set X={x1, x2, x3,... xN} for a GMz at each step one of the variables Xiis selected (at random or according to some fixed sequences), denote the remaining variables as X-i , and its current value as x-i(t-1)z Using the "alarm network" as an example, say at time t we choose XE, and we denote the current value assignments of the remaining variables,X-E, obtained from previous samples, as z the conditonal distributionp(Xi| x-i(t-1)) is computedz a value xi(t)is sampled from this distributionz the sample xi(t)replaces the previous sampled value of Xiin X.z i.e., {})()()()()(,,,11111 −−−−−−=tMtJtAtBtExxxxx)()()(tEtEtxxx∪=−−16Markov Blanketz Markov Blanket in BNz A variable is independent from others, given its parents, children and children‘s parents (d-separation).z MB in MRFz A variable is independent all its non-neighbors, given all its direct neighbors.⇒ p(Xi| X-i)= p(Xi| MB(Xi))z Gibbs samplingz Every step, choose one variable and sample it by P(X|MB(X)) based on previous sample.Gibbs sampling of the alarm networkz To calculate P(J|B1,M1)z Choose (B1,E0,A1,M1,J1) as a startz Evidences are B1, M1, variables are A, E, J.z Choose next variable as Az Sample A by P(A|MB(A))=P(A|B1, E0, M1, J1) suppose to be false.z (B1, E0, A0, M1, J1)z Choose next random variable as E, sample E~P(E|B1,A0) z ...MB(A)={B, E, J, M}MB(E)={A, B}7z Gibbs sampling is a special case of MHz The transition matrix updates each node one at a time using the following proposal: z This is efficient since for two reasonsz It leads to samples that is always accepted Thus z It is efficient since only depends on the values in Xi’s Markov blanketGibbs sampling())|'(),'(),(iiiiiixpxx−−−=→ xxxQ()()()()1111,min)|'()|()|()|'(,min),'(),()|(),(),'()|'(,min),(),('=⎟⎟⎠⎞⎜⎜⎝⎛=⎟⎟⎠⎞⎜⎜⎝⎛→→=→−−−−−−−−−−−−iiiiiiiiiiiiiiiiiiiiiiiixpxpxpxpxxxpxxxpxxxxxxxxxxxxxxQQA())|'(),'(),(iiiiiixpxx−−−=→ xxxT)|('iixp−xz Scheduling and ordering: z Sequential sweeping: in each "epoch" t, touch every r.v. in some order and yield an new sample, , after every r.v. is resampledz Randomly pick an r.v. at each time step z Blocking:z Large state space: state vector X comprised of many components (high dimension)z Some components can be correlated and we can sample components (i.e., subsets of r.v.,) one at a timez Gibbs sampling can fail if there are deterministic constraintGibbs sampling)(txX YZZ is xor• Suppose we observe Z= 1. The posterior has 2 modes: P(X= 1,Y= 0|Z= 1)and P(X= 0,Y= 1|Z= 1). if we start in mode 1, P(X|Y= 0, Z= 1) leaves X= 1, so we can’t move to mode 2 (Reducible Markov chain).• If all states have non-zero probability, the MC is guaranteed to be regular.• Sampling blocks of variables at a time can help improve mixing.8ChainsChains9The of
View Full Document