Bayes Networks
6.872/HST.950

What Probabilistic Models Should We Use?
• Full joint distribution
  • Completely expressive
  • Hugely data-hungry
  • Exponential computational complexity
• Naive Bayes (full conditional independence)
  • Relatively concise
  • Needs data ~ (#hypotheses) × (#features) × (#feature-vals)
  • Fast ~ (#features)
  • Cannot express dependencies among features or among hypotheses
  • Cannot consider the possibility of multiple hypotheses co-occurring

Bayesian Networks (aka Belief Networks)
• Graphical representation of dependencies among a set of random variables
  • Nodes: variables
  • Directed links to a node from its parents: direct probabilistic dependencies
• Each Xi has a conditional probability distribution, P(Xi | Parents(Xi)), showing the effects of the parents on the node
• The graph is directed and acyclic (a DAG); hence, no cycles
• This is a language that can express dependency structures between Naive Bayes and the full joint distribution, more concisely
• Given some new evidence, how does this affect the probability of some other node(s)? P(X|E): belief propagation/updating
• Given some evidence, what are the most likely values of other variables? MAP explanation

Burglary Network (due to J. Pearl)
• Structure: Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and of MaryCalls

P(B) = 0.001    P(E) = 0.002

B  E | P(A|B,E)
t  t | 0.95
t  f | 0.94
f  t | 0.29
f  f | 0.001

A | P(J|A)      A | P(M|A)
t | 0.90        t | 0.70
f | 0.05        f | 0.01

If Everything Depends on Everything
• If the five nodes were fully connected (each variable conditioned on all of its predecessors), the tables would need 2^4 + 2^3 + 2^2 + 2^1 + 2^0 = 31 parameters
• This model requires just as many parameters as the full joint distribution!

Computing the Joint Distribution from a Bayes Network
• As usual, we abuse notation:
  P(x1, ..., xn) = ∏i P(xi | Parents(Xi))
• E.g., what's the probability that an alarm has sounded, there was neither an earthquake nor a burglary, and both John and Mary called? (See the Python sketch below.)
  P(a, ¬e, ¬b, j, m) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
                     = 0.90 × 0.70 × 0.001 × 0.999 × 0.998 ≈ 0.000628

Requirements for Constructing a BN
• Recall that the definition of conditional probability was
  P(a|b) = P(a,b) / P(b)
• and thus we get the chain rule,
  P(a,b) = P(a|b) P(b)
• Generalizing to n variables,
  P(x1, ..., xn) = P(xn | xn-1, ..., x1) P(xn-1, ..., x1)
• and repeatedly applying this idea,
  P(x1, ..., xn) = P(xn | xn-1, ..., x1) × P(xn-1 | xn-2, ..., x1) × ... × P(x1)
• This "works" just in case we can define a partial order so that each variable's parents screen it off from its other predecessors:
  P(xi | xi-1, ..., x1) = P(xi | Parents(Xi)), with Parents(Xi) ⊆ {x1, ..., xi-1}

Topological Interpretations
• A node, X, is conditionally independent of its non-descendants, Zi, given its parents, Ui
• A node, X, is conditionally independent of all other nodes in the network given its Markov blanket: its parents, Ui, its children, Yi, and its children's parents, Zi

BNs Can Be Compact
• For a network of 40 binary variables, the full joint distribution has 2^40 entries (> 1,000,000,000,000)
• If |Par(xi)| ≤ 5, however, then the 40 (conditional) probability tables each have ≤ 32 entries, so the total number of parameters is ≤ 1,280
• The largest medical BN I know of (Pathfinder) had 109 variables! 2^109 ≈ 10^33

How Not to Build BNs
• With the wrong ordering of nodes, the network becomes more complicated and requires more (and more difficult) conditional probability assessments
• E.g., the orderings M, J, A, B, E and M, J, E, B, A both yield denser networks than the causal ordering used above

Simplifying Conditional Probability Tables
• Do we know any structure in the way that Par(x) "cause" x?
• If each destroyer can sink the ship with probability P(s|di), what is the probability that the ship will sink if it's attacked by both? Treating the causes as acting independently (a noisy-OR):
  P(s|d1,d2) = 1 − (1 − P(s|d1))(1 − P(s|d2))
• For |Par(x)| = n, this requires O(n) parameters, not O(k^n)
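
To make the parameter saving concrete, here is a minimal Python sketch of a noisy-OR combination under the independence assumption stated above; the function name and the example probabilities are illustrative, not from the slides.

    # Noisy-OR: each active cause independently fails to produce the
    # effect with probability (1 - p_i); the effect is absent only if
    # every active cause fails.
    def noisy_or(cause_probs, active):
        """P(effect | causes): cause_probs[i] = P(s|d_i); active[i]
        says whether cause i is present."""
        p_all_fail = 1.0
        for p, present in zip(cause_probs, active):
            if present:
                p_all_fail *= (1.0 - p)
        return 1.0 - p_all_fail

    # Two destroyers with illustrative P(s|d1) = 0.6, P(s|d2) = 0.7:
    print(noisy_or([0.6, 0.7], [True, True]))   # 1 - 0.4*0.3 = 0.88

Only the n per-cause probabilities need to be assessed, which is where the O(n) parameter count comes from.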
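
The joint-distribution example above can also be checked directly. Below is a small Python sketch using the CPTs of the burglary network; the dictionary encoding and function name are mine, not the lecture's.

    # Burglary network CPTs, transcribed from the tables above.
    P_B = {True: 0.001, False: 0.999}                    # P(B)
    P_E = {True: 0.002, False: 0.998}                    # P(E)
    P_A = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}   # P(A=t | B, E)
    P_J = {True: 0.90, False: 0.05}                      # P(J=t | A)
    P_M = {True: 0.70, False: 0.01}                      # P(M=t | A)

    def joint(b, e, a, j, m):
        """P(b,e,a,j,m) = P(b) P(e) P(a|b,e) P(j|a) P(m|a)."""
        pa = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
        pj = P_J[a] if j else 1.0 - P_J[a]
        pm = P_M[a] if m else 1.0 - P_M[a]
        return P_B[b] * P_E[e] * pa * pj * pm

    # Alarm sounded, neither burglary nor earthquake, both called:
    print(joint(False, False, True, True, True))  # ≈ 0.000628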
Inference
• Recall the two basic inference problems: belief propagation and MAP explanation
• Trivially, we can enumerate all "matching" rows of the joint probability distribution
• For poly-trees (no undirected loops, i.e., only one connection between any pair of nodes, like our Burglary example), there are efficient linear algorithms, similar to constraint propagation
• For arbitrary BNs, all inference is NP-hard!
  • Exact solutions
  • Approximation

Exact Solution of BNs (Burglary example)
  P(b | j, m) ∝ ∑e ∑a P(b) P(e) P(a|b,e) P(j|a) P(m|a)
              = P(b) ∑e P(e) ∑a P(a|b,e) P(j|a) P(m|a)
• Notes:
  • Sum over all "don't care" variables
  • Factor common terms out of the summations
  • The calculation becomes a sum of products of sums of products ...

Poly-trees Are Easy
• Singly-connected structures allow propagation of observations via single paths
• "Down" is just use of conditional probability
• "Up" is just Bayes' rule
• Formulated as message-propagation rules
• Linear time (in network diameter)
• Fails on general networks!
(Figures: the singly-connected Burglary network vs. a multiply-connected network over TVshow, Drunk, Bottles, Escape)

Exact Solution of BNs (non-poly-trees)
(Figure: a five-node network over A, B, C, D, E containing an undirected loop)
• What is the probability of a specific state, say A=t, B=f, C=t, D=t, E=f?
• What is the probability that E=t given B=t? Consider the term P(e,b)
• Alas, optimal factoring is NP-hard
• A good factoring needs 12 instead of 32 multiplications (even in this small example)

Other Exact Methods
• Join-tree: merge variables into (small!) sets of variables to make the graph into a poly-tree. Most commonly used (aka Clustering, Junction-tree, Potential methods). In the example above, merging B and C yields the chain A, (B,C), D, E
• Cutset-conditioning: instantiate a (small!) set of variables, then solve each residual problem, and add the solutions weighted by the probabilities of the instantiated variables having those values
• ...
• All these methods are essentially equivalent, with some time-space tradeoffs

Approximate Inference in BNs
• Direct Sampling: samples the joint distribution
• Rejection Sampling: computes P(X|e), uses ancestor evidence nodes in sampling
• Likelihood Weighting: like Rejection Sampling, but weights each sample by the probability of the descendant evidence nodes
• Markov chain Monte Carlo: Gibbs and other similar sampling methods

Direct Sampling
function Prior-Sample(bn) returns an event sampled from bn
  inputs: bn, a Bayes net specifying the joint distribution P(X1, ..., Xn)
  x := an event with n elements
  for i = 1 to n do
    xi := a random sample from P(Xi | Par(Xi))
  return x
• From a large number of samples, we can estimate all joint probabilities
• The probability of an event is the fraction of all complete events generated by Prior-Sample that match the partially specified event
• Hence we can compute all conditionals, etc.

Rejection Sampling
function Rejection-Sample(X, e, bn, N) returns an estimate of P(X|e)
  inputs: X, the query variable
          e, observed values for the evidence variables E
          bn, a Bayes net
          N, the total number of samples to be generated
  local variables: C, a vector of counts for each value of X, initially zero
  for j = 1 to N do
    x := Prior-Sample(bn)
    if x is consistent with e then
      C[v] := C[v] + 1, where v is the value of X in x
  return Normalize(C)
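
As a rough illustration of the two samplers, here is a Python sketch specialized to the burglary network. It reuses the CPT dictionaries P_B, P_E, P_A, P_J, P_M from the joint-distribution sketch earlier; the variable ordering and helper names are mine.

    import random

    def prior_sample():
        """Sample each variable in topological order from its CPT,
        conditioned on the already-sampled parents."""
        b = random.random() < P_B[True]
        e = random.random() < P_E[True]
        a = random.random() < P_A[(b, e)]
        j = random.random() < P_J[a]
        m = random.random() < P_M[a]
        return {'B': b, 'E': e, 'A': a, 'J': j, 'M': m}

    def rejection_sample(query, evidence, n):
        """Estimate P(query = True | evidence) by discarding samples
        inconsistent with the evidence."""
        counts = {True: 0, False: 0}
        for _ in range(n):
            s = prior_sample()
            if all(s[var] == val for var, val in evidence.items()):
                counts[s[query]] += 1
        accepted = counts[True] + counts[False]
        return counts[True] / accepted if accepted else float('nan')

    # P(Burglary | JohnCalls = t, MaryCalls = t): with these CPTs the
    # exact answer is about 0.284, but P(j, m) is only about 0.002, so
    # most samples are rejected and many are needed for a stable estimate.
    print(rejection_sample('B', {'J': True, 'M': True}, 1_000_000))

The low acceptance rate in this run is exactly the weakness of rejection sampling that likelihood weighting is designed to address.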

