15-780: Grad AI
Lecture 19: Graphical models, Monte Carlo methods
Geoff Gordon (this lecture), Tuomas Sandholm
TAs: Erik Zawadzki, Abe Othman

Admin
Reminder: midterm March 29
Reminder: project milestone reports due March 31

Review: scenarios
Converting QBF+ to PBI/MILP by scenarios
‣ Replicate decision variables for each scenario
‣ Replicate clauses: share first-stage variables; set scenario variables by scenario index; replace decision variables by their replicates
‣ Sample random scenarios
Example: PSTRIPS

Review: dynamic programming
Solving #SAT by dynamic programming (variable elimination)
‣ repeatedly move sums inward, combine tables, sum out
‣ treewidth governs runtime and space

Review: graphical models
Bayes net = DAG + CPTs
‣ For each RV (say X), there is one CPT specifying P(X | pa(X))
‣ Can simulate with propositional logic + random causes
Inference: similar to the #SAT dynamic program, move sums inward
‣ Can do it partly analytically
‣ Allows us to prove independences and conditional independences from the DAG alone

Review: graphical models
Blocking, explaining away
Markov blanket
Learning: counting, Laplace smoothing
‣ if there are hidden variables: take 10-708 or use a toolbox

Markov blanket
Markov blanket of C = minimal set of observations that makes C independent of the rest of the graph

Factor graphs
Another common type of graphical model
Uses an undirected, bipartite graph instead of a DAG

Rusty robot: factor graph
P(M) P(Ra) P(O) P(W | Ra, O) P(Ru | M, W)

Convention
Don't need to show unary factors
Why? They don't affect the algorithms below.

Non-CPT factors
Just saw: it is easy to convert a Bayes net → factor graph
In general, factors need not be CPTs: any nonnegative numbers are allowed
In general, P(A, B, …) = (1/Z) ∏_j φ_j(its arguments), where the normalizing constant is
Z = Σ over all assignments of A, B, … of ∏_j φ_j

Hard vs. soft factors
A 0/1 ("hard") factor acts like a constraint; a soft factor can hold any nonnegative values:

Hard φ(X, Y)              Soft φ(X, Y)
      Y=0  Y=1  Y=2             Y=0  Y=1  Y=2
X=0    0    0    0        X=0    1    1    1
X=1    0    0    1        X=1    1    1    3
X=2    0    1    1        X=2    1    3    3

Factor graph → Bayes net
Conversion is possible, but more involved
‣ Each representation can handle any distribution
‣ But the size/complexity of the graph may differ
Two cases for conversion:
‣ without adding nodes
‣ adding nodes

Independence
Just like Bayes nets, there are graphical tests for independence and conditional independence
Simpler, though:
‣ Cover up all observed nodes
‣ Look for a path

Independence example

Modeling independence
Take a Bayes net and list the (conditional) independences
Convert it to a factor graph and list the (conditional) independences
Are they the same list? What happened?

Inference
Inference: prior + evidence → posterior
We gave examples of inference in a Bayes net, but not a general algorithm
Reason: the general algorithm uses the factor-graph representation
Steps: instantiate evidence, eliminate nuisance nodes, normalize, answer the query

Inference
Typical question: given Ra = F, Ru = T, what is P(W)?

Incorporate evidence
Condition on Ra = F, Ru = T

Eliminate nuisance nodes
Remaining nodes: M, O, W
Query: P(W)
So O and M are nuisance variables: marginalize them away
Marginal: P(W, Ra=F, Ru=T) = Σ_M Σ_O P(M) P(Ra=F) P(O) P(W | Ra=F, O) P(Ru=T | M, W)

Elimination order
Sum out the nuisance variables in turn
Can do it in any order, but some orders may be easier than others
Let's do O, then M

One last elimination

Checking our work
http://www.aispace.org/bayes/version5.1.6/bayes.jnlp

Discussion
Steps: instantiate evidence, eliminate nuisance nodes, normalize, answer the query
‣ each elimination introduces a new table and makes some old tables irrelevant
(A short code sketch of these steps on the rusty-robot example appears at the end of this section.)

Normalization

Each elimination order introduces different tables
‣ some tables are bigger than others
FLOP count; treewidth

Treewidth examples
Chain, tree

Treewidth examples
Parallel chains, cycle

Discussion
Several relationships between GMs and logic: similar DP algorithm, use of independent choices + logical consequences to represent a GM, factor graph with 0-1 potentials = CSP, MAP assignment = ILP
Directed vs. undirected: advantages to both
Lifted reasoning
‣ Propositional logic + objects = FOL
‣ First-order GMs are a current hot topic of research (plate models, MLNs, ICL) and are not solved yet!
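The inference procedure above (instantiate evidence, eliminate nuisance variables one at a time, multiply what remains, normalize) is mechanical enough to sketch in a few lines of code. The sketch below runs the rusty-robot query P(W | Ra = F, Ru = T). The CPT numbers are invented for illustration, since the slides do not list them, and the factor representation (a tuple of variable names plus a table of values) is just one convenient choice, not the lecture's.

```python
import itertools
from collections import defaultdict

T, F = True, False

# Each factor is (tuple of variable names, table mapping assignments -> nonnegative value).
factors = [
    (("M",),  {(T,): 0.9, (F,): 0.1}),   # P(M): hypothetical numbers
    (("Ra",), {(T,): 0.3, (F,): 0.7}),   # P(Ra): hypothetical numbers
    (("O",),  {(T,): 0.5, (F,): 0.5}),   # P(O): hypothetical numbers
    # P(W | Ra, O): hypothetical; wet with prob 0.95 if raining and outside, else 0.1
    (("W", "Ra", "O"),
     {(w, ra, o): ((0.95 if (ra and o) else 0.1) if w else (0.05 if (ra and o) else 0.9))
      for w in (T, F) for ra in (T, F) for o in (T, F)}),
    # P(Ru | M, W): hypothetical; rusty with prob 0.8 if metal and wet, else 0.05
    (("Ru", "M", "W"),
     {(ru, m, w): ((0.8 if (m and w) else 0.05) if ru else (0.2 if (m and w) else 0.95))
      for ru in (T, F) for m in (T, F) for w in (T, F)}),
]

def condition(factor, evidence):
    """Keep only rows consistent with the evidence and drop the observed variables."""
    vars_, table = factor
    keep = [i for i, v in enumerate(vars_) if v not in evidence]
    new_table = {}
    for assign, val in table.items():
        if all(assign[i] == evidence[v] for i, v in enumerate(vars_) if v in evidence):
            new_table[tuple(assign[i] for i in keep)] = val
    return (tuple(vars_[i] for i in keep), new_table)

def multiply(f1, f2):
    """Pointwise product of two factors over the union of their (Boolean) variables."""
    (v1, t1), (v2, t2) = f1, f2
    vars_ = tuple(dict.fromkeys(v1 + v2))          # union, preserving order
    table = {}
    for assign in itertools.product((T, F), repeat=len(vars_)):
        a = dict(zip(vars_, assign))
        table[assign] = t1[tuple(a[v] for v in v1)] * t2[tuple(a[v] for v in v2)]
    return (vars_, table)

def sum_out(factor, var):
    """Marginalize one variable out of a factor (sum its rows together)."""
    vars_, table = factor
    i = vars_.index(var)
    new_table = defaultdict(float)
    for assign, val in table.items():
        new_table[assign[:i] + assign[i + 1:]] += val
    return (vars_[:i] + vars_[i + 1:], dict(new_table))

def eliminate(factor_list, var):
    """Multiply every factor that mentions var, sum var out, return the new factor list."""
    touching = [f for f in factor_list if var in f[0]]
    rest = [f for f in factor_list if var not in f[0]]
    prod = touching[0]
    for f in touching[1:]:
        prod = multiply(prod, f)
    return rest + [sum_out(prod, var)]

# Step 1: instantiate the evidence Ra = F, Ru = T.
fs = [condition(f, {"Ra": F, "Ru": T}) for f in factors]
# Step 2: eliminate the nuisance variables, O then M.
for nuisance in ("O", "M"):
    fs = eliminate(fs, nuisance)
# Step 3: multiply what is left and normalize to get P(W | Ra = F, Ru = T).
answer = fs[0]
for f in fs[1:]:
    answer = multiply(answer, f)
vars_, table = answer
Z = sum(table.values())
print({w: val / Z for (w,), val in table.items()})   # keys: W = True / False
```

Swapping the order of the two eliminations gives the same final answer; only the sizes of the intermediate tables (and hence the FLOP count) change, which is exactly the elimination-order issue discussed above.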
Discussion: belief propagation
Suppose we want all of the 1-variable marginals
Could do N runs of variable elimination
Or: the BP algorithm simulates N runs for the price of 2
For details: Kschischang et al. reading

HMMs and DBNs

Inference over time
Consider a robot:
‣ true state (x, y, θ)
‣ controls (v, w)
‣ N range sensors (here N = 2: r, s)

Model
x_{t+1} = x_t + v_t cos θ_t + noise
y_{t+1} = y_t + v_t sin θ_t + noise
θ_{t+1} = θ_t + w_t + noise
r_t = √((x_t − x_R)² + (y_t − y_R)²) + noise
s_t = √((x_t − x_S)² + (y_t − y_S)²) + noise

Model of x, y, θ (r, s unobserved)

Goal: inference over time
N = 1 sensor, repeatedly observe range = 1 m + noise

Factor graph

Dynamic Bayes Network
DBN: a factor graph composed of a single structural unit repeated over time
‣ conceptually infinite to the right, but in practice cut off at some maximum T
Factors must be conditional distributions:
∀ x_t, y_t, θ_t, u_t, v_t:  Σ_{x_{t+1}, y_{t+1}, θ_{t+1}} φ(x_t, y_t, θ_t, u_t, v_t, x_{t+1}, y_{t+1}, θ_{t+1}) = 1
∀ x_t, y_t, θ_t:  Σ_{r_t, s_t} φ(x_t, y_t, θ_t, r_t, s_t) = 1

Three kinds of variable
Control (connects only forward)
State
Observation

Condition on observations, do(control)

Simplified version
State: x_t ∈ {1, 2, 3}
Observation: y_t ∈ {L, H}
Control: just one (i.e., no choice): "keep going"

Hidden Markov Models
This is an HMM, a DBN with:
‣ one state variable
‣ one observation variable

Potentials
Transition φ(X_t, X_{t+1}):
           X_{t+1}=1  X_{t+1}=2  X_{t+1}=3
X_t = 1       0.7        0.3        0
X_t = 2       0.3        0.3        0.3
X_t = 3       0          0.3        0.7

Observation φ(X_t, Y_t):
           Y_t = L   Y_t = H
X_t = 1      0.67      0.33
X_t = 2      0.5       0.5
X_t = 3      0.33      0.67

HMM inference
Condition on y_1 = H, y_2 = H, y_3 = L
What is P(X_2 | H, H, L)?

HMM factors after conditioning

Eliminate x_1 and x_3
Multiply the remaining potentials and renormalize

Forward-backward
You may recognize the above as the forward-backward algorithm
Special case of dynamic programming / variable elimination / belief propagation

Approximate inference

Most of the time…
Treewidth is big
Variables are high-arity or continuous
Can't afford exact inference
Need numerical integration (and/or summation)
We'll look at randomized algorithms

Numerical integration
[figure: 3-D surface plot of f(X, Y) for X, Y ∈ [−1, 1]]

Integration in 1000s of dims
Eliazar and Parr, IJCAI-03

Simple 1D problem
[figure: a 1-D integrand f(x) on x ∈ [−1, 1]]

Uniform sampling
[figure: the same integrand on x ∈ [−1, 1]]
Estimate: (2/N) Σ_i f(x_i)

Uniform sampling
E(f(X)) = ∫ P(x) f(x) dx = (1/V) ∫ f(x) dx
So V · E(f(X)) is the desired integral
But the standard deviation can be big
Can reduce it by averaging many samples, but only at rate 1/√N

Importance sampling
Instead of x ~ uniform, use x ~ Q(x)
Q = importance distribution
Should have Q(x) large where f(x) is large
Problem: E_Q(f(X)) = ∫ Q(x) f(x) dx, which is no longer the integral we want

Importance sampling
Define h(x) ≡ f(x) / Q(x)
E_Q(h(X)) = ∫ Q(x) h(x) dx = ∫ Q(x) f(x) / Q(x) dx = ∫ f(x) dx

Importance sampling
So, take samples of h(X) instead of f(X)
w_i = 1 / Q(x_i) is the importance weight
Q = 1/V yields uniform sampling

Importance sampling
[figure: the same integrand on x ∈ [−1, 1]]

Variance
How does this help us control variance?
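The variance question can be made concrete with a short numerical sketch. Everything specific below (the bump-shaped integrand f, the Gaussian importance distribution Q, and the sample size) is made up for illustration rather than taken from the lecture; the sketch simply compares the uniform-sampling estimator (V/N) Σ_i f(x_i) with the importance-sampling estimator (1/N) Σ_i f(x_i)/Q(x_i) on the interval [−1, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical integrand on [-1, 1]: a narrow bump, so most uniform samples are ~0.
def f(x):
    return 60.0 * np.exp(-200.0 * (x - 0.5) ** 2)

a, b = -1.0, 1.0
V = b - a            # "volume" of the region (length 2 here)
N = 10_000

# Uniform sampling: integral ~ (V / N) * sum_i f(x_i) with x_i ~ Uniform[a, b].
x_unif = rng.uniform(a, b, size=N)
samples_unif = V * f(x_unif)           # each term is one estimate of the integral
est_unif = samples_unif.mean()

# Importance sampling: draw x_i ~ Q and average h(x_i) = f(x_i) / Q(x_i).
# Q is chosen large where f is large: a Gaussian centered on the bump. Its tails
# outside [-1, 1] are negligible for these parameters, so truncation is ignored.
mu, sigma = 0.5, 0.1
def Q(x):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x_imp = rng.normal(mu, sigma, size=N)
samples_imp = f(x_imp) / Q(x_imp)      # importance weight w_i = 1 / Q(x_i)
est_imp = samples_imp.mean()

# True value for reference: 60 * sqrt(pi / 200), about 7.52 (a Gaussian integral).
print(f"uniform:    estimate {est_unif:.2f}, per-sample std {samples_unif.std():.1f}")
print(f"importance: estimate {est_imp:.2f}, per-sample std {samples_imp.std():.1f}")
```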
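With a Q that roughly matches the shape of f, the per-sample values f(x_i)/Q(x_i) are nearly constant, so the printed sample standard deviation for the importance estimator should come out much smaller than for the uniform one; both estimators converge at the same 1/√N rate, but with very different constants. A poorly chosen Q (small where f is large) has the opposite effect and can make the variance worse than uniform sampling.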