Stat260: Bayesian Modeling and Inference                Lecture Date: March 31, 2010

Lecture 16

Lecturer: Michael I. Jordan                             Scribe: Samitha Samaranayake

1 Laplace approximation review

In the previous lecture we discussed the Laplace approximation as a general way to approach marginalization problems. The basic idea was to approximate an integral of the form

    I(N) = \int e^{-N h(x)} \, dx,    (1)

where N is typically the number of data points. After performing a Taylor series expansion of both h(x) and the exponential function and evaluating some elementary integrals, we showed that the following approximation of I(N) can be derived:

    I(N) = e^{-N h(\hat{x})} \sqrt{2\pi} \, \sigma N^{-1/2} \left( 1 - \frac{h_4(\hat{x}) \sigma^4}{8N} + \frac{5 h_3^2(\hat{x}) \sigma^6}{24N} + O(1/N^2) \right),    (2)

where h_k denotes the k-th derivative of h, \sigma^2 = 1/h_2(\hat{x}), and \hat{x} = \arg\min_x h(x). If \hat{x} cannot be determined analytically, it is typically approximated with some value \tilde{x} such that the error \hat{x} - \tilde{x} is of order O(1/N). For example, in a Bayesian application we can take -N h(x) to be the log of the likelihood times the prior, in which case \hat{x} is the maximum a posteriori (MAP) estimate. As N gets large, the MAP estimate approaches the maximum likelihood (ML) estimate, so we can approximate \hat{x} with the MLE and still obtain a rigorous accuracy bound.

2 Multivariate Laplace approximation

The multivariate case is derived in exactly the same way as the univariate case in Lecture 15, the only difference being that we perform a multivariate Taylor series expansion and obtain a multivariate Gaussian integral. Letting x denote a d-dimensional vector and h(x) a scalar function of x, we obtain

    \int e^{-N h(x)} \, dx \approx e^{-N h(\hat{x})} (2\pi)^{d/2} |\Sigma|^{1/2} N^{-d/2},    (3)

where \Sigma = (D^2 h(\hat{x}))^{-1} is the inverse of the Hessian of h evaluated at \hat{x}. This expansion is accurate to order O(1/N), since we only retain the first-order terms of the Laplace approximation. However, as in Eq. (2), the expansion can be continued to obtain an accuracy of order O(1/N^2).

3 Marginal likelihood

One application of the Laplace approximation is computing the marginal likelihood. Letting M be the marginal likelihood, we have

    M = \int P(X \mid \theta) \, \pi(\theta) \, d\theta = \int \exp\left\{ -N \left[ -\tfrac{1}{N} \log P(X \mid \theta) - \tfrac{1}{N} \log \pi(\theta) \right] \right\} d\theta,    (4)

where h(\theta) = -\tfrac{1}{N} \log P(X \mid \theta) - \tfrac{1}{N} \log \pi(\theta). Using the Laplace approximation up to first order as in Eq. (3), we get

    M \approx P(X \mid \hat{\theta}) \, \pi(\hat{\theta}) \, (2\pi)^{d/2} |\Sigma|^{1/2} N^{-d/2}.    (5)

This approximation is used, for example, in model selection, where computing the marginal likelihood analytically can be hard unless there is conjugacy. Computing the Laplace approximation requires finding the maximum a posteriori estimate \hat{\theta} = \arg\max_\theta -h(\theta), which can be done using a standard method such as gradient search. It also requires computing the second-derivative matrix and inverting it to obtain \Sigma; this is usually the harder quantity to calculate.

4 Bayesian information criterion (BIC) score

The Bayesian information criterion (BIC) score tries to minimize the impact of the prior as much as possible. Therefore, \hat{\theta}_{MAP} is replaced with the maximum likelihood estimate \hat{\theta}_{ML}, a reasonable approximation if the prior does not dominate. Taking the log of Eq. (5) we obtain the log marginal likelihood

    \log M \approx \log P(X \mid \hat{\theta}_{ML}) + \log \pi(\hat{\theta}) + (d/2) \log(2\pi) + (1/2) \log |\Sigma| - (d/2) \log N.    (6)

The BIC score retains only the terms that vary with N, since asymptotically the terms that are constant in N do not matter. Dropping the constant terms we get

    \log M \approx \log P(X \mid \hat{\theta}_{ML}) - (d/2) \log N.    (7)

In the model selection problem, we pick the model with the highest BIC score. Frequentist analysis shows that the BIC score is an asymptotically consistent model selection procedure under weak conditions.
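As a concrete illustration of Eqs. (5) and (7), the following minimal Python sketch computes the Laplace approximation to the marginal likelihood and the BIC score for a Bernoulli model with a Beta prior, a case where the exact marginal likelihood is available in closed form for comparison. The choice of model, the hyperparameters, and all names here are assumptions made for this example, not something given in the lecture.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import betaln

# Hypothetical toy model (illustrative, not from the lecture):
# X_1, ..., X_N ~ Bernoulli(theta) with prior theta ~ Beta(a, b).
# The exact marginal likelihood is B(a + s, b + N - s) / B(a, b),
# so the Laplace approximation of Eq. (5) can be checked directly.
rng = np.random.default_rng(0)
N = 200
x = rng.binomial(1, 0.3, size=N)
s = x.sum()
a, b = 2.0, 2.0  # Beta prior hyperparameters

def neg_log_post(theta):
    # f(theta) = N*h(theta) = -log P(X|theta) - log pi(theta)
    log_lik = s * np.log(theta) + (N - s) * np.log(1.0 - theta)
    log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1.0 - theta) - betaln(a, b)
    return -(log_lik + log_prior)

# MAP estimate theta_hat = argmax -h(theta); a closed form exists here,
# but a numerical search is what one would use in general.
res = minimize_scalar(neg_log_post, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_hat = res.x

# Second derivative of f at the mode by central finite differences.
eps = 1e-5
f2 = (neg_log_post(theta_hat + eps) - 2.0 * neg_log_post(theta_hat)
      + neg_log_post(theta_hat - eps)) / eps ** 2

# Laplace approximation, Eq. (5), written in terms of f = N*h.
log_M_laplace = -neg_log_post(theta_hat) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(f2)
log_M_exact = betaln(a + s, b + N - s) - betaln(a, b)

# BIC score, Eq. (7): log P(X | theta_ML) - (d/2) log N with d = 1.
theta_ml = s / N
bic = s * np.log(theta_ml) + (N - s) * np.log(1.0 - theta_ml) - 0.5 * np.log(N)

print(f"Laplace log M: {log_M_laplace:.4f}")
print(f"exact   log M: {log_M_exact:.4f}")
print(f"BIC score    : {bic:.4f}")

Since f(\theta) = N h(\theta), the factor (2\pi)^{d/2} |\Sigma|^{1/2} N^{-d/2} in Eq. (5) equals (2\pi)^{d/2} |f''(\hat{\theta})|^{-1/2}, which is what the sketch computes. The Laplace value should track the exact log marginal closely at this sample size, while the BIC score differs from both by the constant-order terms dropped in Eq. (7).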
Note that there is no prior \pi(\theta) in the BIC score (7), so it is clearly not a Bayesian procedure. The BIC score is part of a family of competing penalized likelihood scores that also includes the AIC, DIC and TIC scores. These scores differ mostly in the model complexity term -(d/2) \log N, where d is the dimensionality of \theta, which penalizes models with higher complexity. The AIC score does not have a \log N term; its fixed dimensionality penalty allows for more complex models. The goal of the AIC score is to predict the next value, and it can be shown to be optimal in the sense of minimizing the Kullback-Leibler (KL) divergence. However, when it comes to model selection, unlike the BIC score, the AIC score is not asymptotically consistent. It should be noted that this method does not provide a particularly good approximation of the marginal likelihood, but is presented as an example of using the Laplace approximation.

[Figure 1: The standard random effects graphical model]

5 Full Bayes versus empirical Bayes

Using the standard random effects model from Figure 1, we are now interested in inference for some function of \theta. For generality, let g(\theta_i) be an arbitrary function of \theta_i. Since \theta_i is independent of Y_{j \neq i} given \lambda, we have

    E(g(\theta_i) \mid Y) = E_\lambda(E(g(\theta_i) \mid Y_i, \lambda))
    Var(g(\theta_i) \mid Y) = E_\lambda(Var(g(\theta_i) \mid Y_i, \lambda)) + Var_\lambda(E(g(\theta_i) \mid Y_i, \lambda))    (8)

We are interested in comparing the magnitudes of the asymptotic expansions of the expected value and the variance (Bayesian error bars) under empirical Bayes and under full Bayes. Every expectation in these two terms involves computing an integral, which we will approximate using the Laplace method.

5.1 Full Bayes

Consider E[G(\lambda) \mid Y] for some arbitrary G. For example, G is the identity function when computing the mean and the squaring function when computing the second moment:

    E[G(\lambda) \mid Y] = \frac{\int G(\lambda) L(\lambda) \pi(\lambda) \, d\lambda}{\int L(\lambda) \pi(\lambda) \, d\lambda},    (9)

where the likelihood L(\lambda) = \prod_{i=1}^K L_i(\lambda) and L_i(\lambda) = P(Y_i \mid \lambda) = \int P(Y_i \mid \theta_i, \lambda) \, P(\theta_i \mid \lambda) \, d\theta_i.

Computing this ratio of integrals is a major application of the Laplace method in Bayesian statistics. The denominator has the form of a likelihood term times a prior term, which is identical to what we have already seen in the marginal likelihood case and can be handled with the standard Laplace approximation. However, the numerator has an extra term. One way to proceed would be to fold G(\lambda) into h(\lambda) and use the standard approach, but another approach is to use a more general Laplace formulation. Consistent with the notation of Kass and Steffey (1989), we define the generalized Laplace integral as \int b(\lambda) \exp(-K h(\lambda)) \, d\lambda. By performing a Taylor series expansion of b(\lambda), in addition to the expansions of h(\lambda) and the exponential function, we obtain the following approximation to first order in the decaying terms:

    \int b(\lambda) e^{-K h(\lambda)} \, d\lambda \approx b(\hat{\lambda}) \, e^{-K h(\hat{\lambda})} \, (2\pi)^{d/2} |\Sigma|^{1/2} K^{-d/2},

where \hat{\lambda} = \arg\min_\lambda h(\lambda) and \Sigma = (D^2 h(\hat{\lambda}))^{-1}, in direct analogy with Eq. (3).
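To make the ratio of integrals in Eq. (9) concrete, here is a minimal Python sketch of the first-order generalized Laplace approximation for a Gaussian instance of the random effects model. Folding the prior into a common h(\lambda) for both integrals makes the Gaussian factors cancel, leaving E[G(\lambda) \mid Y] \approx G(\hat{\lambda}) at first order. The specific Gaussian model, the hyperparameters, and all names are assumptions made for this illustration, not taken from the lecture.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Hypothetical Gaussian instance of the model in Figure 1 (illustrative
# choices): theta_i ~ N(lambda, tau2), Y_i ~ N(theta_i, 1), and prior
# lambda ~ N(0, 10^2). Marginalizing theta_i gives
# L_i(lambda) = N(Y_i; lambda, 1 + tau2) in closed form.
rng = np.random.default_rng(1)
K, tau2 = 30, 0.5
theta = rng.normal(1.5, np.sqrt(tau2), size=K)
y = rng.normal(theta, 1.0)

def log_lik(lam):
    return norm.logpdf(y, loc=lam, scale=np.sqrt(1.0 + tau2)).sum()

def log_prior(lam):
    return norm.logpdf(lam, loc=0.0, scale=10.0)

def G(lam):
    # Square function: targets the posterior second moment E[lambda^2 | Y].
    return lam ** 2

# Posterior mode lambda_hat, i.e. the minimizer of h, where
# K*h(lambda) = -(log L(lambda) + log pi(lambda)).
res = minimize_scalar(lambda lam: -(log_lik(lam) + log_prior(lam)))
lam_hat = res.x

# First-order generalized Laplace: numerator (b = G) and denominator (b = 1)
# share the same mode and Gaussian factor, so the ratio collapses to G(lam_hat).
approx = G(lam_hat)

# Direct numerical quadrature of Eq. (9) for comparison (a constant is
# subtracted inside the exponential for numerical stability; it cancels
# in the ratio).
c = log_lik(lam_hat) + log_prior(lam_hat)
post = lambda lam: np.exp(log_lik(lam) + log_prior(lam) - c)
num, _ = quad(lambda lam: G(lam) * post(lam), lam_hat - 10.0, lam_hat + 10.0)
den, _ = quad(post, lam_hat - 10.0, lam_hat + 10.0)

print(f"first-order generalized Laplace: {approx:.5f}")
print(f"numerical quadrature           : {num / den:.5f}")

With G(\lambda) = \lambda^2 the first-order approximation misses the posterior-variance contribution, which is O(1/K); this gap is precisely what the higher-order correction terms of Kass and Steffey (1989) are designed to capture.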

