Machine learning: lecture 23
Tommi S. Jaakkola, MIT CSAIL ([email protected])

Outline
• Exact inference (quickly)
  – message passing in junction trees
• Approximate inference
  – belief propagation
  – sampling
• Review for the final
  – what is important, what is not

Exact inference: key steps
• Bayesian network, moralization, triangulation
  (figures: the original graph over x1, …, x5; its moral graph; the moral graph is already triangulated)
• Cliques, clique graph, and junction tree
  – cliques: c1 = {x1, x2, x3}, c2 = {x2, x3, x4}, c3 = {x3, x4, x5}
  – junction tree: c1 — s12 — c2 — s23 — c3, with separators s12 = {x2, x3} and s23 = {x3, x4}

Exact inference: potentials
• Associating graphs and potentials: the conditional probabilities P(x1), P(x2), P(x3|x1, x2), P(x4|x2), P(x5|x3, x4) of the original graph are assigned to cliques containing their variables:
  ψ_{c1}(x1, x2, x3) = P(x1) P(x2) P(x3|x1, x2)
  ψ_{c2}(x2, x3, x4) = P(x4|x2)
  ψ_{c3}(x3, x4, x5) = P(x5|x3, x4)
  with separator potentials initialized to ψ_{s12}(x2, x3) = 1 and ψ_{s23}(x3, x4) = 1, so that

  P(x1, …, x5) = ψ_{c1}(x1, x2, x3) ψ_{c2}(x2, x3, x4) ψ_{c3}(x3, x4, x5) / [ ψ_{s12}(x2, x3) ψ_{s23}(x3, x4) ]

Exact inference: message passing
• Select a root clique (here c2)
• Collect evidence: messages m_{1→2}(x2, x3) and m_{3→2}(x3, x4) flow toward the root
• Distribute evidence: messages m_{2→1}(x2, x3) and m_{2→3}(x3, x4) flow away from the root

• Collect evidence. Evaluate new separators:
  ψ'_{s12}(x2, x3) = Σ_{x1} ψ_{c1}(x1, x2, x3) = P(x2, x3)
  ψ'_{s23}(x3, x4) = Σ_{x5} ψ_{c3}(x3, x4, x5) = 1

  Messages (not explicitly used in the algorithm):
  m_{1→2}(x2, x3) = ψ'_{s12}(x2, x3) / ψ_{s12}(x2, x3) = P(x2, x3) / 1
  m_{3→2}(x3, x4) = ψ'_{s23}(x3, x4) / ψ_{s23}(x3, x4) = 1 / 1

  Update the root clique potential (based on the messages):
  ψ_{c2}(x2, x3, x4) ← m_{1→2}(x2, x3) · m_{3→2}(x3, x4) · ψ_{c2}(x2, x3, x4)
                     = P(x2, x3) · 1 · P(x4|x2) = P(x2, x3, x4)
  followed by ψ_{s12} ← ψ'_{s12} and ψ_{s23} ← ψ'_{s23}

• Distribute evidence. Evaluate new separators:
  ψ'_{s12}(x2, x3) = Σ_{x4} ψ_{c2}(x2, x3, x4) = P(x2, x3)
  ψ'_{s23}(x3, x4) = Σ_{x2} ψ_{c2}(x2, x3, x4) = P(x3, x4)

  Messages (not explicitly used in the algorithm):
  m_{2→1}(x2, x3) = ψ'_{s12}(x2, x3) / ψ_{s12}(x2, x3) = P(x2, x3) / P(x2, x3) = 1
  m_{2→3}(x3, x4) = ψ'_{s23}(x3, x4) / ψ_{s23}(x3, x4) = P(x3, x4) / 1

  Update the remaining clique potentials (based on the messages):
  ψ_{c1}(x1, x2, x3) ← [ψ'_{s12}(x2, x3) / ψ_{s12}(x2, x3)] · ψ_{c1}(x1, x2, x3) = P(x1, x2, x3)
  ψ_{c3}(x3, x4, x5) ← [ψ'_{s23}(x3, x4) / ψ_{s23}(x3, x4)] · ψ_{c3}(x3, x4, x5) = P(x3, x4, x5)
  followed by ψ_{s12} ← ψ'_{s12} and ψ_{s23} ← ψ'_{s23}

Exact inference
• After the collect and distribute steps the marginal probabilities are stored locally at the clique potentials (and the separators):

  P(x1, …, x5) = P(x1, x2, x3) P(x2, x3, x4) P(x3, x4, x5) / [ P(x2, x3) P(x3, x4) ]

  More generally, the resulting potentials would be proportional to the posterior marginals, e.g., P(x1, x2, x3, data), which can be easily normalized.

Approximate inference: motivation
• We cannot solve the medical diagnosis problem(s) with exact inference algorithms
  (figure: bipartite network with Diseases d as parents of Findings f)
  – the largest clique has over 100 variables (the corresponding potential function or table would involve more than 2^100 elements)

Approximate inference: belief propagation
• The message passing algorithm is appropriate when the model is a (clique) tree
  – we need a unique path of influence between any (sets of) variables
• We can still apply the message passing algorithm even if the model is not a tree (message passing operations are defined locally)
  (figure: loopy graph over x1, x2, x3, x4 with messages m_{1→2}(x2) and m_{4→2}(x2))
  – convergence?
  – accuracy?
• Properties of loopy belief propagation:
  – a set of locally consistent messages (a fixed point of the algorithm) always exists
  – the accuracy of the resulting marginals is related to the length of the shortest cycle
  – stronger guarantees exist for finding most likely configurations of variables
• Works well in many large-scale applications
  – decoding turbo (and other) codes, image processing, molecular networks, protein structure, etc.

Approximate inference: sampling
• If we could draw samples x^t = {x1^t, x2^t, x3^t, x4^t} from P(x), we could easily and accurately evaluate any marginal:

  P(x1 = 0) ≈ (1/T) Σ_{t=1}^{T} δ(x1^t, 0)

  where δ(x1^t, 0) = 1 whenever x1^t = 0 and zero otherwise.
• But it is hard to draw samples...

Simple remedy: importance sampling
• We can instead draw samples from a much simpler distribution Q(x) (e.g., one where the variables are independent) and evaluate marginals according to

  P(x1 = 0) = Σ_x P(x) δ(x1, 0)
            = Σ_x Q(x) [P(x) / Q(x)] δ(x1, 0)
            ≈ (1/T) Σ_{t=1}^{T} [P(x^t) / Q(x^t)] δ(x1^t, 0)

  where the samples x^t are now drawn from Q(x).
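The importance sampling estimator is easy to try on a toy model. Below is a minimal sketch, not from the lecture: it assumes a made-up two-variable target P (a 2×2 probability table) and a uniform, independent proposal Q; the names P, Q, and estimate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical target P(x) over two binary variables, with x1 and x2 correlated;
# the proposal Q(x) makes the variables independent and uniform.
P = np.array([[0.35, 0.05],
              [0.15, 0.45]])        # P[x1, x2], sums to 1
Q = np.full((2, 2), 0.25)

T = 200_000
# Draw T samples from Q: pick one of the 4 cells with probability Q[x1, x2]
idx = rng.choice(4, size=T, p=Q.ravel())
x1, x2 = np.divmod(idx, 2)

# Importance-weighted estimate of P(x1 = 0): average w * delta(x1, 0)
w = P[x1, x2] / Q[x1, x2]
estimate = np.mean(w * (x1 == 0))

exact = P[0].sum()
print(f"estimate {estimate:.3f} vs exact {exact:.3f}")
```

Since Q is uniform here, the weights P/Q simply rescale each cell's count; the estimator is unbiased as long as Q(x) > 0 wherever P(x) > 0.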
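The collect/distribute computations on the junction-tree slides can also be checked numerically. The following sketch assumes binary variables and randomly generated CPTs for the five-variable network; helper names like cpt and psi_c1 are ours, not the lecture's. The final asserts compare each updated clique potential against brute-force marginals of the joint.

```python
import numpy as np

rng = np.random.default_rng(0)

def cpt(*shape):
    """Random conditional table, normalized over the last (child) axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# CPTs for the network: x1 -> x3 <- x2 -> x4, with (x3, x4) -> x5 (all binary)
P1, P2 = cpt(2), cpt(2)       # P(x1), P(x2)
P3 = cpt(2, 2, 2)             # P(x3 | x1, x2), indexed [x1, x2, x3]
P4 = cpt(2, 2)                # P(x4 | x2),     indexed [x2, x4]
P5 = cpt(2, 2, 2)             # P(x5 | x3, x4), indexed [x3, x4, x5]

# Initial clique and separator potentials, as on the slides
psi_c1 = P1[:, None, None] * P2[None, :, None] * P3         # [x1, x2, x3]
psi_c2 = np.broadcast_to(P4[:, None, :], (2, 2, 2)).copy()  # [x2, x3, x4]
psi_c3 = P5.copy()                                          # [x3, x4, x5]
s12 = np.ones((2, 2))   # over (x2, x3)
s23 = np.ones((2, 2))   # over (x3, x4)

# Collect evidence toward the root c2
new_s12 = psi_c1.sum(axis=0)        # sum out x1 -> P(x2, x3)
new_s23 = psi_c3.sum(axis=2)        # sum out x5 -> all ones
psi_c2 = psi_c2 * (new_s12 / s12)[:, :, None] * (new_s23 / s23)[None, :, :]
s12, s23 = new_s12, new_s23

# Distribute evidence from the root
new_s12 = psi_c2.sum(axis=2)        # sum out x4
new_s23 = psi_c2.sum(axis=0)        # sum out x2
psi_c1 = psi_c1 * (new_s12 / s12)[None, :, :]
psi_c3 = psi_c3 * (new_s23 / s23)[:, :, None]

# Brute-force joint for comparison
joint = np.einsum('a,b,abc,bd,cde->abcde', P1, P2, P3, P4, P5)
assert np.allclose(psi_c1, joint.sum(axis=(3, 4)))   # P(x1, x2, x3)
assert np.allclose(psi_c2, joint.sum(axis=(0, 4)))   # P(x2, x3, x4)
assert np.allclose(psi_c3, joint.sum(axis=(0, 1)))   # P(x3, x4, x5)
print("clique potentials equal the exact marginals")
```

Note how each message is just a ratio of new to old separator, exactly as in the slides' update rule.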
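Loopy belief propagation can likewise be sketched on the smallest loopy case, a single cycle of four binary variables. This is an illustrative implementation, not the lecture's code: the pairwise potentials are random, messages are normalized each sweep for numerical stability, and the resulting beliefs are printed next to the exact marginals (on a single loop with strictly positive potentials the messages provably converge, but the beliefs remain approximate).

```python
import numpy as np

rng = np.random.default_rng(2)

# Single-cycle pairwise MRF over four binary variables:
# edges (0,1), (1,2), (2,3), (3,0), each with a random positive potential.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
psi = {e: rng.random((2, 2)) + 0.1 for e in edges}
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

def pot(i, j):
    """Edge potential oriented as a table indexed [x_i, x_j]."""
    return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

def sweep(msg):
    """One parallel update of all directed messages m_{i->j}(x_j)."""
    new = {}
    for (i, j) in msg:
        incoming = np.prod([msg[(k, i)] for k in neighbors[i] if k != j], axis=0)
        m = pot(i, j).T @ incoming       # sum over x_i
        new[(i, j)] = m / m.sum()        # normalize for numerical stability
    return new

msg = {(i, j): np.ones(2) for i in range(4) for j in neighbors[i]}
for _ in range(200):
    msg = sweep(msg)
delta = max(np.abs(sweep(msg)[e] - msg[e]).max() for e in msg)  # convergence gap

# Beliefs from the converged messages vs. exact marginals of the joint
belief = {}
for i in range(4):
    b = np.prod([msg[(k, i)] for k in neighbors[i]], axis=0)
    belief[i] = b / b.sum()

joint = np.einsum('ab,bc,cd,da->abcd', *(pot(i, j) for (i, j) in edges))
joint /= joint.sum()
for i in range(4):
    exact = joint.sum(axis=tuple(a for a in range(4) if a != i))
    print(f"x{i + 1}: belief {belief[i].round(3)}  exact {exact.round(3)}")
```

The same local update would run unchanged on a graph with many cycles; convergence and accuracy are then no longer guaranteed, which is exactly the caveat raised on the slides.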