Gibbs Sampling for the Uninitiated

Philip Resnik
Department of Linguistics and Institute for Advanced Computer Studies
University of Maryland
College Park, MD 20742
[email protected]

Eric Hardisty
Department of Computer Science and Institute for Advanced Computer Studies
University of Maryland
College Park, MD 20742
[email protected]

20, 2009
VERSION 0.3

Abstract

This document is intended for computer scientists who would like to try out a Markov Chain Monte Carlo (MCMC) technique, particularly in order to do inference with Bayesian models on problems related to text processing. We try to keep theory to the absolute minimum needed, and we work through the details much more explicitly than you usually see even in "introductory" explanations. That means we've attempted to be ridiculously explicit in our exposition and notation.

After providing the reasons and reasoning behind Gibbs sampling (and at least nodding our heads in the direction of theory), we work through two applications in detail. The first is the derivation of a Gibbs sampler for Naive Bayes models, which illustrates a simple case where the math works out very cleanly and it's possible to "integrate out" the model's continuous parameters to build a more efficient algorithm. The second application derives the Gibbs sampler for a model that is similar to Naive Bayes, but which adds an additional latent variable. Having gone through the two examples, we discuss some practical implementation issues. We conclude with some pointers to literature that we've found to be somewhat more friendly to uninitiated readers.

1 Introduction

Markov Chain Monte Carlo (MCMC) techniques like Gibbs sampling provide a principled way to approximate the value of an integral.

1.1 Why integrals?

Ok, stop right there. Many computer scientists, including a lot of us who focus in natural language processing, don't spend a lot of time with integrals.
We spend most of our time and energy in a world of discrete events. (The word bank can mean (1) a financial institution, (2) the side of a river, or (3) tilting an airplane. Which meaning was intended, based on the words that appear nearby?) Take a look at Manning and Schuetze [16], and you'll see that the probabilistic models we use tend to involve sums, not integrals (the Baum-Welch algorithm for HMMs, for example). So we have to start by asking: why and when do we care about integrals?

One good answer has to do with probability estimation.[1] Numerous computational methods involve estimating the probabilities of alternative discrete choices, often in order to pick the single most probable choice. As one example, the language model in an automatic speech recognition (ASR) system estimates the probability of the next word given the previous context. As another example, many spam blockers use features of the e-mail message (like the word Viagra, or the phrase send this message to all your friends) to predict the probability that the message is spam.

Sometimes we estimate probabilities by using maximum likelihood estimation (MLE). To use a standard example, if we are told a coin may be unfair, and we flip it 10 times and see HHHHTTTTTT (H=heads, T=tails), it's conventional to estimate the probability of heads for the next flip as 0.4.

[Figure 1: Probability of generating the coin-flip sequence HHHHTTTTTT, using different values for Pr(heads) on the x-axis (a plot of p^4 (1-p)^6 for p from 0 to 1, generated by Wolfram|Alpha). The value that maximizes the probability of the observed sequence, 0.4, is the maximum likelihood estimate (MLE).]

[1] This subsection is built around the very nice explication of Bayesian probability estimation by Heinrich [8].
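As a quick sanity check (this is not from the paper itself), a few lines of Python, assuming numpy is available, can reproduce the shape of Figure 1 numerically: evaluate the likelihood of the observed sequence on a grid of candidate values for Pr(heads) and find where it peaks.

```python
import numpy as np

# Likelihood of the observed flip sequence HHHHTTTTTT as a function of
# p = Pr(heads): ten independent Bernoulli trials, four heads and six
# tails, so Pr(X | p) = p^4 * (1 - p)^6 (the curve plotted in Figure 1).
def likelihood(p):
    return p**4 * (1 - p)**6

# Evaluate on a fine grid over [0, 1] and take the maximizing p.
grid = np.linspace(0.0, 1.0, 100001)
p_mle = grid[np.argmax(likelihood(grid))]
print(round(p_mle, 6))  # 0.4, matching count(H) / (count(H) + count(T))
```

The grid search is crude but makes the point: the curve tops out exactly at the counting-based estimate 0.4.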
In practical terms, MLE amounts to counting and then normalizing so that the probabilities sum to 1.

    count(H) / (count(H) + count(T)) = 4/10 = 0.4    (1)

Formally, MLE produces the choice most likely to have generated the observed data. In this case, the most natural model µ has just a single parameter, π, namely the probability of heads (see Figure 1).[2] Letting X = HHHHTTTTTT represent the observed data, and y the outcome of the next coin flip, we estimate

    π̃_MLE = argmax_π Pr(X | π)    (2)

    Pr(y | X) ≈ Pr(y | π̃_MLE)    (3)

On the other hand, sometimes we estimate probabilities using maximum a posteriori (MAP) estimation. A MAP estimate is the choice that is most likely given the observed data. In this case,

    π̃_MAP = argmax_π Pr(π | X)
           = argmax_π Pr(X | π) Pr(π) / Pr(X)
           = argmax_π Pr(X | π) Pr(π)    (4)

    Pr(y | X) ≈ Pr(y | π̃_MAP)    (5)

In contrast to MLE, MAP estimation applies Bayes's Rule, so that our estimate (4) can take into account prior knowledge about what we expect π to be in the form of a prior probability distribution Pr(π).[3] So, for example, we might believe that the coin flipper is a scrupulously honest person, and choose a prior distribution that is biased in favor of π = 0.5.

[2] Specifically, µ models each choice as a Bernoulli trial, and the probability of generating exactly this heads-tails sequence for a given π is π^4 (1−π)^6. If you type Plot[p^4(1-p)^6,{p,0,1}] into Wolfram Alpha, you get Figure 1, and you can immediately see that the curve tops out, i.e. the probability of generating the sequence is highest, exactly when p = 0.4. Confirm this by entering derivative of p^4(1-p)^6 and you'll get 2/5 = 0.4 as the maximum. Thanks to Kevin Knight for pointing out how easy all this is using Wolfram Alpha. Also see discussion in Heinrich [8], Section 2.1.
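The effect of the prior in equation (4) is easy to see numerically. As a sketch (the specific prior here is a hypothetical choice, not one specified above), take a symmetric Beta(a, a) density, which is peaked at 0.5, and maximize the unnormalized posterior Pr(X | p) Pr(p) on a grid, assuming numpy:

```python
import numpy as np

# MAP estimation for the same coin: maximize Pr(X | p) * Pr(p).
# As a concrete (hypothetical) prior, use a symmetric Beta(a, a)
# density, proportional to p^(a-1) * (1 - p)^(a-1) and peaked at 0.5;
# larger a encodes a stronger belief that the coin is fair.
def posterior_unnorm(p, a):
    return p**4 * (1 - p)**6 * p**(a - 1) * (1 - p)**(a - 1)

grid = np.linspace(0.0, 1.0, 100001)
for a in (1, 5, 50):
    p_map = grid[np.argmax(posterior_unnorm(grid, a))]
    # With a = 1 the prior is flat and the MAP estimate equals the MLE,
    # 0.4; as a grows, the estimate is pulled toward the prior's 0.5.
    print(a, round(p_map, 3))
```

This is the "more heavily biased prior" intuition made concrete: the stronger the Beta prior, the more the four-heads-in-ten evidence is discounted.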
The more heavily biased that prior distribution is, the more evidence it will take to shake our pre-existing belief that the coin is fair.[4]

Now, MLE and MAP estimates are both giving us the best estimate, according to their respective definitions of "best." But notice that using a single estimate, whether it's π̃_MLE or π̃_MAP, throws away information. In principle, π could have any value between 0 and 1; mightn't we get better estimates if we took the whole distribution p(π | X) into account, rather than just a single estimated value for π? If we do that, we're making use of all the information about π that we can wring from the observed X.

The way to take advantage of all that information is to calculate an expected value rather than an estimate using the single best guess for π. Recall that the expected value of a function f(z), when