11School of Computer ScienceApproximate Inference:Markov Chain Monte CarloProbabilistic Graphical Models (10Probabilistic Graphical Models (10--708)708)Lecture 19, Nov 26, 2007Eric XingEric XingReceptor AKinase CTF FGene GGene HKinaseEKinase DReceptor BX1X2X3X4X5X6X7X8Receptor AKinase CTF FGene GGene HKinaseEKinase DReceptor BX1X2X3X4X5X6X7X8X1X2X3X4X5X6X7X8Reading: J-Chap. 1, KF-Chap. 11Eric Xing 2Summary: Monte Carlo Methodsz Direct Sampling z Very difficult to populate a high-dimensional state space z Rejection Samplingz Create samples like direct sampling, only count samples which isconsistent with given evidences.z Likelihood weighting, ...z Sample variables and calculate evidence weight. Only create the samples which support the evidences.z Markov chain Monte Carlo (MCMC)z Metropolis-Hastingz Gibbs2Eric Xing 3Rao-Blackwellised samplingz Sampling in high dimensional spaces causes high variance in the estimate.z RB idea: sample some variables Xp, and conditional on that, compute expected value of rest Xdanalytically:z This has lower variance, because of the identity:z Hence , so is a lower variance estimator.[][][])|(~ ,),(),()|(),(),|()|(),()|,()(),|(),|()|(expxXxfEdxXxfEexpdxdxxxfexxpexpdxdxxxfexxpXfEpmpmdmpexXpMxpdpexXppxpxddppdpdpdpdpeXpmpdppdpd∑∫∫∫∫==⎟⎟⎠⎞⎜⎜⎝⎛==1[][][][][]pdppdpdpXXXEXXXEXX |),(var|),(var),(varτττ+=[][][]),(var|),(vardppdpXXXXXEττ≤[]pdpdpXXXfEXX |),(),( =τEric Xing 4Markov chain Monte Carlo (MCMC)z Importance sampling does not scale well to high dimensions.z Rao-Blackwellisation not always possible.z MCMC is an alternative.z Construct a Markov chain whose stationary distribution is the target density = P(X|e).z Run for Tsamples (burn-in time) until the chain converges/mixes/reaches stationary distribution.z Then collect M(correlated) samples xm.z Key issues:z Designing proposals so that the chain mixes rapidly.z Diagnosing convergence.3Eric Xing 5Markov Chainsz Definition:z Given an n-dimensional state spacez Random vector X = (x1,…,xn)z x(t)= x at time-step tz x(t)transitions to x(t+1)with probP(x(t+1)| x(t),…,x(1)) = T(x(t+1)| x(t)) = T(x(t)Æ x(t+1)) z Homogenous: chain determined by state x(0), fixed transition kernel T (rows sum to 1)z Equilibrium:π(x) is a stationary (equilibrium) distribution if π(x') = Σxπ(x) T(xÆx'). i.e., is a left eigenvector of the transition matrix πTT= πTT.()()⎟⎟⎟⎠⎞⎜⎜⎜⎝⎛=05050307007500250305020305020............X1X2X30.25 0.70.50.50.75 0.3Eric Xing 6Markov Chainsz An MC is irreducible if transition graph connectedz An MC is aperiodic if it is not trapped in cyclesz An MC is ergodic (regular) if you can get from state x to x ' in a finite number of steps.z Detailed balance: prob(x(t)Æx(i-1)) = prob(x(t-1)Æx(t))summing over x(t-1)z Detailed bal Æ stationary dist exists)|()()|()()()()()()()( 111 −−−=ttttttTpTpxxxxxx∑−−−=)()|()()()()()()(111tttttTppxxxxx4Eric Xing 7Metropolis-Hastingsz Treat the target distribution as stationary distributionz Sample from an easier proposal distribution, followed by an acceptance testz This induces a transition matrix that satisfies detailed balancez MH proposes moves according to Q(x'|x) and accepts samples with probability A(x'|x).z The induced transition matrix isz Detailed balance meansz Hence the acceptance ratio is)|'()|'()'(xxAxxQxxT=→)'|()'|()'()|'()|'()(xxAxxQxxxAxxQxππ=⎟⎟⎠⎞⎜⎜⎝⎛=)|'()()'|()'(,min)|'(xxQxxxQxxxAππ1Eric Xing 8Metropolis-Hastings1. Initialize x(0)2. While not mixing // burn-inzx=x(t)zt += 1,z sample u ~ Unif(0,1)z sample x* ~ Q(x*|x)-ifzx(t)= x* // transition-elsezx(t)= x// stay in current state z Reset t=0, for t =1:Nz x(t+1)) Å Draw sample (x(t))⎟⎟⎠⎞⎜⎜⎝⎛=<)|*()(*)|(*)(,min)|*(xxQxxxQxxxAuππ1Function Draw sample (x(t))5Eric Xing 9Mixing timez The εmixing time Tεis the minimal number of steps (from any starting distribution) until Dvar(P(T), π) ≤ε, where Dvaris the variational distance between the two distributions:z Chains with low bandwidth (conductance) regions of space take a long time to mix.z This arises for GMs with deterministic or highly skewed potentials.X1X2X3X4X5X7X6)()(sup),(defvarAASA2121µµµµ−=⊂DEric Xing 10MCMC example q(x*|x) ~ N(x(i),100)p(x) ~ 0.3 exp(-0.2x2) + 0.7 exp(-0.2(x-10)2)6Eric Xing 11Summary of MHz Random walk through state spacez Can simulate multiple chains in parallelz Much hinges on proposal distribution Qz Want to visit state space where p(X) puts massz Want A(x*|x) high in modes of p(X) z Chain mixes wellz Convergence diagnosisz How can we tell when burn-in is over?z Run multiple chains from different starting conditions, wait until they start “behaving similarly”.z Various heuristics have been proposed.Eric Xing 127Eric Xing 13Gibbs samplingz Gibbs sampling is an MCMC algorithm that is especially appropriate for inference in graphical models.z The proceduez we have variable set X={x1, x2, x3,... xN} for a GMz at each step one of the variables Xiis selected (at random or according to some fixed sequences), denote the remaining variables as X-i , and its current value as x-i(t-1)z Using the "alarm network" as an example, say at time t we choose XE, and we denote the current value assignments of the remaining variables,X-E, obtained from previous samples, as z the conditonal distributionp(Xi| x-i(t-1)) is computedz a value xi(t)is sampled from this distributionz the sample xi(t)replaces the previous sampled value of Xiin X.z i.e., {})()()()()(,,,11111 −−−−−−=tMtJtAtBtExxxxx)()()(tEtEtxxx∪=−−1Eric Xing 14Markov Blanketz Markov Blanket in BNz A variable is independent from others, given its parents, children and children‘s parents (d-separation).z MB in MRFz A variable is independent all its non-neighbors, given all its direct neighbors.⇒ p(Xi| X-i)= p(Xi| MB(Xi))z Gibbs samplingz Every step, choose one variable and sample it by P(X|MB(X)) based on previous sample.8Eric Xing 15Gibbs sampling of the alarm networkz To calculate P(J|B1,M1)z Choose (B1,E0,A1,M1,J1) as a startz Evidences are B1, M1, variables are A, E, J.z Choose next variable as Az Sample A by P(A|MB(A))=P(A|B1, E0, M1, J1) suppose to be false.z (B1, E0, A0, M1, J1)z Choose next random variable as E, sample E~P(E|B1,A0) z ...MB(A)={B, E, J, M}MB(E)={A, B}Eric Xing 16z Gibbs sampling is a special case of MHz The transition matrix updates each node one at a time using the
View Full Document