Stat260: Bayesian Modeling and Inference                         Lecture Date: April 14, 2010

Lecture 20

Lecturer: Michael I. Jordan                                      Scribe: Venkatesan Ekambaram

1 MCMC Foundations

Definition 1 (Reversibility). Let (X_n) be positive (i.e., ψ-irreducible with an invariant probability measure). Then (X_n) is said to be reversible if

    (X_{n+1} | X_{n+2} = x)  =_d  (X_{n+1} | X_n = x).                                (1)

Definition 2. (X_n) is said to satisfy the detailed balance condition if there exists a function f such that

    k(y, x) f(y) = k(x, y) f(x),                                                      (2)

where k(x, ·) is the density of the transition kernel K(x, ·).

Theorem 3. Let (X_n) satisfy the detailed balance condition, where f is the density of a probability measure π. Then

1. π is the invariant measure of (X_n).
2. (X_n) is reversible.

Theorem 4. Let (X_n) be Harris positive and suppose there exists an ergodic atom α. Then

    lim_{n→∞} || K^n(x, ·) − π(·) ||_TV = 0,                                          (3)

where || · ||_TV denotes the total variation norm.

Theorem 5 (Ergodicity). Let π be a σ-finite invariant measure for (X_n), and let S_n(h) = (1/n) Σ_{i=1}^n h(X_i) for h ∈ L^∞(π). Then

    lim_{n→∞} S_n(h) = ∫ h(x) π(dx)                                                   (4)

if and only if (X_n) is Harris recurrent.

Ergodicity says that, even though there is memory in the system, the sample average converges to the expected value under the invariant measure.

2 Invariant measure for the Metropolis-Hastings algorithm

Let us now analyze the Metropolis-Hastings algorithm, an MCMC sampling algorithm, and check whether the samples it produces are indeed drawn from the desired density. Let f(·) be the density of interest from which we intend to sample and let q(·|·) be the proposal density. The acceptance probability is

    ρ(x, y) = min{ 1, f(y) q(x|y) / ( f(x) q(y|x) ) }.                                (5)

The corresponding kernel is

    k(x, y) = q(y|x) ρ(x, y) + δ_x(y) (1 − r(x)),                                     (6)

where r(x) = ∫ ρ(x, y) q(y|x) dy and δ_x(y) is the Dirac delta function. Let us see how this kernel arises. We have

    K(x^(t), A) = P( X^(t+1) ∈ A | X^(t) = x^(t) ).                                   (7)

Let

    z_t = 1 if the proposal at step t is accepted, and z_t = 0 otherwise.             (8)

Splitting on z_t, we can express K(x^(t), A) as

    K(x^(t), A) = P( X^(t+1) ∈ A, z_t = 1 | X^(t) = x^(t) ) + P( X^(t+1) ∈ A, z_t = 0 | X^(t) = x^(t) )
                = ∫_A q(y | x^(t)) ρ(x^(t), y) dy + δ_A(x^(t)) ∫ (1 − ρ(x^(t), y)) q(y | x^(t)) dy,

since the first event requires proposing a point y ∈ A and accepting it, while the second requires rejecting the proposal, in which case X^(t+1) = x^(t), which lies in A only if x^(t) ∈ A. Thus we obtain k(x, y) as defined above.

Let us now check that the detailed balance condition is satisfied. Consider the first term of k(x, y) f(x). We have

    q(y|x) ρ(x, y) f(x) = f(x) min{ 1, f(y) q(x|y) / ( f(x) q(y|x) ) } q(y|x)
                        = min{ f(x) q(y|x), f(y) q(x|y) }
                        = min{ f(x) q(y|x) / ( f(y) q(x|y) ), 1 } f(y) q(x|y)
                        = q(x|y) ρ(y, x) f(y).

The second term of k(x, y) f(x) satisfies

    f(x) δ_x(y) (1 − r(x)) = f(y) δ_y(x) (1 − r(y)),                                  (9)

since the delta function is nonzero only when x = y. From these two equations we see that the detailed balance condition holds. It remains to show π-irreducibility, where π is the measure with density f; this, however, is problem specific.
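To make the algorithm concrete, here is a minimal sketch of a generic Metropolis-Hastings sampler implementing the acceptance probability (5). It is not part of the lecture; the function names, the log-density parameterization, and the standard-normal random-walk example at the end are illustrative assumptions.

import numpy as np

def metropolis_hastings(log_f, propose, log_q, x0, n_iter, rng):
    # log_f(x): log of the target density f, up to an additive constant
    # propose(x): draws a proposal y ~ q(. | x)
    # log_q(a, b): log q(a | b)
    x = x0
    samples = np.empty(n_iter)
    for t in range(n_iter):
        y = propose(x)
        # log of rho(x, y) = min{1, f(y) q(x|y) / (f(x) q(y|x))}, cf. Eq. (5)
        log_rho = min(0.0, log_f(y) + log_q(x, y) - log_f(x) - log_q(y, x))
        if np.log(rng.uniform()) < log_rho:
            x = y   # accept the proposal
        # on rejection the chain stays at x, matching the point-mass term in Eq. (6)
        samples[t] = x
    return samples

# Toy usage: unnormalized standard-normal target with a Gaussian random-walk proposal.
rng = np.random.default_rng(0)
log_f = lambda x: -0.5 * x ** 2                   # N(0, 1) log-density, up to a constant
propose = lambda x: x + rng.normal(scale=1.0)     # y ~ N(x, 1)
log_q = lambda a, b: -0.5 * (a - b) ** 2          # log q(a | b), up to a constant
draws = metropolis_hastings(log_f, propose, log_q, x0=0.0, n_iter=5000, rng=rng)
print(draws.mean(), draws.var())                  # roughly 0 and 1

Note that this random-walk proposal is symmetric, so the q terms cancel and (5) reduces to min{1, f(y)/f(x)}; they are kept in the code so that asymmetric proposals, such as the independence sampler below, fit the same template.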
2.1 Independence sampler

A special case of the Metropolis-Hastings sampler is the independence sampler, for which q(y|x) = q(y), with q typically chosen to be heavy tailed. This sampler is close in spirit to rejection sampling. One key difference is that in rejection sampling a rejected sample is discarded, whereas here a rejection means the chain repeats its current state, so a sample is still retained. In rejection sampling the proposal distribution q(x) is chosen so that f(x) ≤ M q(x), and the probability of acceptance is 1/M. For the independence sampler, the probability of acceptance depends on the current sample and is therefore a random variable; it can be shown that E( P(acceptance) ) ≥ 1/M. Hence the independence sampler cannot lose more samples than rejection sampling does.

2.2 More general kernels

In general, one can build chains from cycles and mixtures of kernels. Let K_1, K_2, ..., K_r be r kernels. Their composition gives

    K̃ = K_1 ∘ K_2 ∘ ... ∘ K_r.                                                       (10)

Under some weak conditions, if any one of these kernels is irreducible, then a Metropolis algorithm using the composite kernel can be shown to converge. One can also show that the algorithm converges when the kernel is a mixture,

    K* = d_1 K_1 + d_2 K_2 + ... + d_r K_r,   with  Σ_i d_i = 1, d_i > 0.             (11)

2.3 Random Gibbs

This algorithm is similar to the basic Gibbs sampler, but instead of updating all components at step (t+1), we select a component i at random and update only that component, X_i^(t+1) ∼ p( X_i | X_{j≠i}^(t) ). This is a Metropolis algorithm with acceptance ratio

    ρ = min( 1,  [ p(X_i^(t+1), X_{j≠i}^(t)) / p(X_i^(t), X_{j≠i}^(t)) ] · [ p(X_i^(t) | X_{j≠i}^(t)) / p(X_i^(t+1) | X_{j≠i}^(t)) ] ) = 1.   (12)

Thus the Gibbs kernel gives an acceptance rate of one.

One should be aware, however, that the basic Gibbs sampler is not itself a Metropolis-Hastings algorithm: by imposing a fixed order on the component updates we lose the irreducibility argument used above. It can nevertheless be shown to have an invariant distribution when a deterministic sequence of conditional distributions is used. Consider the kernel

    K(x, x′) = p(x′_1 | x_2, ..., x_p) p(x′_2 | x′_1, x_3, ..., x_p) ⋯ p(x′_p | x′_1, x′_2, ..., x′_{p−1}).   (13)

We need to check that this sequence of updates yields samples x′ from the true distribution, i.e., that the probability of X′ ∈ A is correct. Consider

    ∫ 1_A(x′) K(x, x′) p(x) dx dx′
      = ∫ 1_A(x′) p(x′_1 | x_2, ..., x_p) p(x′_2 | x′_1, x_3, ..., x_p) ⋯ p(x′_p | x′_1, ..., x′_{p−1}) p(x_1 | x_2, ..., x_p) p(x_2, ..., x_p) dx_1 ⋯ dx_p dx′
      = ∫ 1_A(x′) p(x′_1 | x_2, ..., x_p) p(x′_2 | x′_1, x_3, ..., x_p) ⋯ p(x′_p | x′_1, ..., x′_{p−1}) p(x_2, ..., x_p) dx_2 ⋯ dx_p dx′        (integrating out x_1)
      = ∫ 1_A(x′) p(x′_1, x_2, ..., x_p) p(x′_2 | x′_1, x_3, ..., x_p) ⋯ p(x′_p | x′_1, ..., x′_{p−1}) dx_2 ⋯ dx_p dx′
      = ∫ 1_A(x′) p(x_2 | x′_1, x_3, ..., x_p) ⋯ dx_2 ⋯ dx_p dx′                                                                                (rewriting the joint and iterating for x_2, ..., x_p)
      = ∫ 1_A(x′) p(x′) dx′
      = P(X′ ∈ A).

Thus p is invariant for the deterministic-scan Gibbs kernel.

3 Rao-Blackwellization

Suppose the samples can be grouped as (X^(t), Y^(t)), where only the X^(t) are of interest, and we intend to compute the mean of some function h(X). A naive estimator draws samples of (X^(t), Y^(t)) and computes (1/T) Σ_{t=1}^T h(X^(t)). A Rao-Blackwellized estimator instead computes (1/T) Σ_{t=1}^T E( h(X) | Y^(t) ). To compare the variance of the samples, the law of total variance gives Var(h(X)) ≥ Var( E(h(X) | Y) ). However, this inequality is only for a single pair of random variables (X, Y) and does not extend to the vector of correlated samples produced by the chain. (A small numerical comparison of the two estimators is sketched at the end of these notes.)

4 Improper priors

One should be careful that the theory above holds for general measures, not only for probability measures. Hence it is possible that we converge to an improper posterior. For example, ...
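Returning to the Rao-Blackwellization of Section 3, the sketch below compares the naive and Rao-Blackwellized estimators of E[X] using a two-component Gibbs sampler for a bivariate normal with correlation rho, where E[X | Y = y] = rho·y is available in closed form. The model, the value of rho, and the run lengths are illustrative assumptions rather than the lecture's example; in this two-component setting the conditional-expectation estimator typically shows the smaller variance, which is the situation where the Var(E(h(X)|Y)) ≤ Var(h(X)) intuition applies.

import numpy as np

def gibbs_bivariate_normal(rho, n_iter, rng):
    # Two-component Gibbs sampler whose stationary law is the standard
    # bivariate normal with correlation rho:
    #   X | Y = y ~ N(rho*y, 1 - rho^2),   Y | X = x ~ N(rho*x, 1 - rho^2)
    sd = np.sqrt(1.0 - rho ** 2)
    x = y = 0.0
    xs, ys = np.empty(n_iter), np.empty(n_iter)
    for t in range(n_iter):
        x = rng.normal(rho * y, sd)
        y = rng.normal(rho * x, sd)
        xs[t], ys[t] = x, y
    return xs, ys

rng = np.random.default_rng(0)
rho, T, reps = 0.5, 2000, 200
naive = np.empty(reps)   # (1/T) sum_t h(X^(t)) with h(x) = x
rb = np.empty(reps)      # (1/T) sum_t E[h(X) | Y^(t)] = (1/T) sum_t rho * Y^(t)
for r in range(reps):
    xs, ys = gibbs_bivariate_normal(rho, T, rng)
    naive[r] = xs.mean()
    rb[r] = (rho * ys).mean()
print("variance of naive estimator:            ", naive.var())
print("variance of Rao-Blackwellized estimator:", rb.var())

With more components, or strong dependence within the chain, this ordering need not hold, which is the caveat noted in Section 3.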

