15-780: Grad AI
Lecture 20: Monte Carlo methods, Bayesian learning
Geoff Gordon (this lecture), Tuomas Sandholm
TAs: Erik Zawadzki, Abe Othman

Admin
Reminder: midterm March 29
‣ Tuomas's review session tomorrow, mine yesterday
Reminder: project milestone reports due March 31

Review: factor graphs
Undirected, bipartite graph
‣ one set of nodes represents variables
‣ the other set represents factors in the probability distribution (tables of nonnegative numbers)
‣ need to compute the normalizer in order to do anything useful
Can convert back and forth to Bayes nets
Hard vs. soft constraints
Graphical test for independence
‣ different results from a Bayes net, even if we are representing the same distribution
Inference by dynamic programming
‣ instantiate evidence, eliminate nuisance nodes, normalize, answer the query
‣ elimination order matters
‣ treewidth
Relation to logic

Review: HMMs, DBNs
Inference over time
‣ same graphical template repeated once for each time step (conceptually infinite)
Inference: forward-backward algorithm (a special case of belief propagation)

Review: numerical integration
Integrate a difficult function over a high-dimensional volume
‣ narrow, tall peaks contribute most of the integral, so this is a difficult search problem
Central problem for approximate inference
‣ e.g., computing the normalizing constant in a factor graph

Uniform sampling
[Figure: uniform samples of f(x) on [-1, 1]]
Estimate: $\frac{2}{N}\sum_i f(x_i)$

Importance sampling
[Figure: an importance-sampling proposal for the same f(x) on [-1, 1]]

Variance
How does this help us control variance?
Suppose f big ==> Q big, and Q small ==> f small
Then h = f/Q never gets too big
Variance of each sample is lower ==> need fewer samples
A good Q makes a good IS

Importance sampling, part II
Suppose $f(x) = R(x)\,g(x)$, where R is a probability density; then
$\int f(x)\,dx = \int R(x)\,g(x)\,dx = E_R[g(X)]$
Use importance sampling with proposal Q(X):
‣ pick N samples $x_i$ from Q(X)
‣ average $w_i\,g(x_i)$, where $w_i = R(x_i)/Q(x_i)$ is the importance weight
$E_Q[W g(X)] = \int Q(x)\,\frac{R(x)}{Q(x)}\,g(x)\,dx = \int R(x)\,g(x)\,dx = \int f(x)\,dx$

Parallel IS
Now suppose R(x) is known only up to normalization (e.g., represented by a factor graph): we can evaluate only $Z R(x)$
Pick N samples $x_i$ from the proposal Q(X)
If we knew $w_i = R(x_i)/Q(x_i)$, we could do IS
Instead, set $\hat{w}_i = Z R(x_i)/Q(x_i)$
So $\bar{w} = \frac{1}{N}\sum_i \hat{w}_i$ is an unbiased estimate of Z:
$E[\hat{W}] = \int Q(x)\,\frac{Z R(x)}{Q(x)}\,dx = \int Z R(x)\,dx = Z$
So $\hat{w}_i/\bar{w}$ is an estimate of $w_i$, computed without knowing Z
Final estimate: $\int f(x)\,dx \approx \frac{1}{N}\sum_i \frac{\hat{w}_i}{\bar{w}}\,g(x_i)$

Parallel IS is biased
[Figure: mean(weights), 1/mean(weights), and E(mean(weights))]
$E[\bar{W}] = Z$, but $E[1/\bar{W}] \neq 1/Z$ in general

Example
[Figure: proposal samples (x, y)]
$Q: (X, Y) \sim N(1, 1)$, $\theta \sim U(-\pi, \pi)$
$f(x, y, \theta) = Q(x, y, \theta)\,P(o = 0.8 \mid x, y, \theta)/Z$

Posterior
[Figure: posterior samples]
$E(X, Y, \theta) = (0.496, 0.350, 0.084)$
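Before the notes turn to MCMC, here is a minimal numpy sketch of the parallel IS estimator above. It is illustrative only: the unnormalized target unnorm_R (standing in for $Z R(x)$), the Gaussian proposal, and the choice $g(x) = x^2$ are all assumptions made here, since the lecture's factor-graph target is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def unnorm_R(x):
    # Plays the role of Z * R(x): a density we can evaluate only up to
    # normalization (an arbitrary two-bump function, for illustration).
    return np.exp(-0.5 * (x - 1.0) ** 2) + 0.5 * np.exp(-0.5 * ((x + 2.0) / 0.5) ** 2)

def g(x):
    return x ** 2                              # we want E_R[g(X)] = integral of R(x) g(x) dx

def Q_pdf(x):
    # Proposal Q = N(0, 3^2), wide enough to cover both bumps of the target.
    return np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))

N = 100_000
x = rng.normal(0.0, 3.0, size=N)               # x_i ~ Q
w_hat = unnorm_R(x) / Q_pdf(x)                 # \hat{w}_i = Z R(x_i) / Q(x_i)
w_bar = w_hat.mean()                           # unbiased estimate of Z
estimate = np.mean((w_hat / w_bar) * g(x))     # approx. of integral f(x) dx = E_R[g(X)]

print("estimate of Z:         ", w_bar)
print("estimate of E_R[g(X)]: ", estimate)
```

As the slides above note, the final ratio estimate is biased for finite N (since $E[1/\bar{W}] \neq 1/Z$), but it converges to the right value as N grows.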
MCMC

Integration problem
Recall: we wanted $\int f(x)\,dx = \int R(x)\,g(x)\,dx$
and therefore wanted a good importance distribution Q(x) (close to R)

Back to high dimensions
Picking a good importance distribution is hard in high dimensions
Major contributions to the integral can be hidden in small areas
‣ recall, we want (R big ==> Q big)
Would like to search for areas of high R(x)
But searching could bias our estimates

Markov-Chain Monte Carlo
Design a randomized search procedure M over values of x, which tends to increase R(x) if it is small
Run M for a while; take the resulting x as a sample
Importance distribution Q(x)?
‣ Q = stationary distribution of M

Stationary distribution
Run an HMM or DBN for a long time; stop at a random point
Do this again and again
The resulting samples are from the stationary distribution

Designing a search chain
Would like Q(x) = R(x)
‣ makes the importance weight = 1
Turns out we can get this exactly, using Metropolis-Hastings
$\int f(x)\,dx = \int R(x)\,g(x)\,dx$

Metropolis-Hastings
A way of designing a chain with Q(x) = R(x)
Basic strategy: start from an arbitrary x
Repeatedly tweak x to get x'
If R(x') ≥ R(x), move to x'
If R(x') << R(x), stay at x
In intermediate cases, randomize

Proposal distribution
Left open: what does "tweak" mean?
Parameter of MH: Q(x' | x)
‣ the one-step proposal distribution
Good proposals explore quickly, but remain in regions of high R(x)
Optimal proposal?

MH algorithm
Sample x' ~ Q(x' | x)
Compute $p = \frac{R(x')}{R(x)}\,\frac{Q(x \mid x')}{Q(x' \mid x)}$
With probability min(1, p), set x := x'
Repeat for T steps; the sample is $x_1, \ldots, x_T$ (it will usually contain duplicates)
Note: we don't need to know Z

MH example
[Figure: trajectory of an MH chain]

Acceptance rate
Moving to a new x' is accepting
Want the acceptance rate (average p) to be large, so we don't get long runs of the same x
Want Q(x' | x) to move long distances (to explore quickly)
Tension between Q and P(accept): $p = \frac{R(x')}{R(x)}\,\frac{Q(x \mid x')}{Q(x' \mid x)}$

Mixing rate, mixing time
If we pick a good proposal, we will move rapidly around the domain of R(x)
After a short time, we won't be able to tell where we started: we have reached the stationary distribution
This is a short mixing time (= # of steps until we can't tell which starting point we used)
Mixing rate = 1 / (mixing time)

MH estimate
Once we have our samples $x_1, x_2, \ldots$
Optional: discard an initial "burn-in" range
‣ allows time to reach the stationary distribution
Estimated integral: $\frac{1}{N}\sum_{i=1}^{N} g(x_i)$

In example
$g(x) = x^2$, true $E(g(X)) = 0.28\ldots$
Proposal: $Q(x' \mid x) = N(x' \mid x,\, 0.25^2 I)$
Acceptance rate 55–60%
After 1000 samples, minus a burn-in of 100:
final estimate 0.282361
final estimate 0.271167
final estimate 0.322270
final estimate 0.306541
final estimate 0.308716

Gibbs sampler
A special case of MH
Divide X into blocks of r.v.s B(1), B(2), …
Proposal Q:
‣ pick a block i uniformly (or round robin, or any other schedule)
‣ sample $X_{B(i)} \sim P(X_{B(i)} \mid X_{\neg B(i)})$

Gibbs example
[Figures: two example Gibbs-sampling trajectories]

Why is Gibbs useful?
For Gibbs, $p = \frac{P(x'_i, x'_{\neg i})}{P(x_i, x_{\neg i})}\,\frac{P(x_i \mid x'_{\neg i})}{P(x'_i \mid x_{\neg i})}$

Gibbs derivation
Since the proposal leaves $x_{\neg i}$ unchanged, $x'_{\neg i} = x_{\neg i}$, so
$\frac{P(x'_i, x'_{\neg i})}{P(x_i, x_{\neg i})}\,\frac{P(x_i \mid x'_{\neg i})}{P(x'_i \mid x_{\neg i})} = \frac{P(x'_i, x_{\neg i})}{P(x_i, x_{\neg i})}\,\frac{P(x_i \mid x_{\neg i})}{P(x'_i \mid x_{\neg i})} = \frac{P(x'_i, x_{\neg i})}{P(x_i, x_{\neg i})}\,\frac{P(x_i, x_{\neg i})/P(x_{\neg i})}{P(x'_i, x_{\neg i})/P(x_{\neg i})} = 1$

Gibbs in practice
The proof that p = 1 means Gibbs is often easy to implement
Often works well
‣ if we choose good blocks (but there may be no good blocking!)

Fancier
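Going back to the MH algorithm above, the sketch below codes the basic accept/reject loop, using the Gaussian random-walk proposal $N(x' \mid x, 0.25^2 I)$, 1000 steps, and a burn-in of 100 as in the example slide. The target unnorm_R is a placeholder 2-D Gaussian bump chosen here for illustration (the lecture's actual target is not given in the text), so the printed numbers will not match the 0.28 figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def unnorm_R(x):
    # Placeholder unnormalized target Z*R(x): a single 2-D Gaussian bump.
    # The lecture's actual target density is not spelled out in the notes.
    return np.exp(-0.5 * np.sum((x / 0.5) ** 2))

def g(x):
    return x[0] ** 2                   # integrand from the example slide: g(x) = x^2

def metropolis_hastings(target, x0, n_steps, step_sd=0.25):
    """Random-walk MH with proposal Q(x' | x) = N(x' | x, step_sd^2 I).
    This Q is symmetric, so Q(x | x') / Q(x' | x) = 1 and the acceptance
    ratio reduces to p = R(x') / R(x); the unknown normalizer Z cancels."""
    x = np.asarray(x0, dtype=float)
    samples, n_accept = [], 0
    for _ in range(n_steps):
        x_prop = x + rng.normal(0.0, step_sd, size=x.shape)   # "tweak" x
        p = target(x_prop) / target(x)
        if rng.random() < min(1.0, p):                        # accept with prob min(1, p)
            x, n_accept = x_prop, n_accept + 1
        samples.append(x.copy())                              # duplicates are kept
    return np.array(samples), n_accept / n_steps

samples, acc_rate = metropolis_hastings(unnorm_R, x0=[0.0, 0.0], n_steps=1000)
kept = samples[100:]                                          # discard burn-in of 100
print("acceptance rate:", acc_rate)
print("estimate of E[g(X)]:", np.mean([g(s) for s in kept]))
```

Shrinking step_sd raises the acceptance rate but slows exploration, and widening it does the opposite, which is the tension described on the acceptance-rate slide.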

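And a similar sketch of the Gibbs special case. The target here is an assumed standard bivariate normal with correlation rho (chosen only because its conditionals are easy to write down; it is not the lecture's example). The blocks are the two individual coordinates, updated round robin, and because each update samples exactly from $P(x_i \mid x_{\neg i})$ the acceptance probability is 1, as in the derivation above, so no accept/reject test appears in the loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_normal(rho, n_steps, x0=(0.0, 0.0)):
    """Gibbs sampling for an assumed standard bivariate normal with
    correlation rho. Blocks = the two coordinates, swept round robin;
    each step samples exactly from the conditional, so p = 1 always."""
    x1, x2 = x0
    cond_sd = np.sqrt(1.0 - rho ** 2)        # sd of each Gaussian conditional
    samples = []
    for _ in range(n_steps):
        x1 = rng.normal(rho * x2, cond_sd)   # x1 ~ P(x1 | x2) = N(rho*x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, cond_sd)   # x2 ~ P(x2 | x1) = N(rho*x1, 1 - rho^2)
        samples.append((x1, x2))
    return np.array(samples)

samples = gibbs_bivariate_normal(rho=0.9, n_steps=5000)
kept = samples[500:]                          # discard an initial burn-in
print("sample correlation:", np.corrcoef(kept.T)[0, 1])   # should come out near 0.9
```

When rho is close to 1, the coordinate-wise blocks are a poor choice and the chain mixes slowly, which relates to the blocking caveat on the Gibbs-in-practice slide.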
