Bayesian Networks – Representation (cont.), Inference
Machine Learning – 10-701/15-781
Carlos Guestrin
Carnegie Mellon University
March 22nd, 2006

Required readings from Koller & Friedman:
• Representation: 2.1, 2.2
• Inference: 5.1, 6.1, 6.2, 6.7.1
• Optional: 2.3, 5.2, 5.3, 6.3, 6.7.2

Announcements
• One-page project proposal due now
• We'll go over the midterm in this week's recitation
• Homework 4 out later today, due April 5th (two weeks from today)

Handwriting recognition
• Character recognition, e.g., kernel SVMs
[Figure: handwritten character images with candidate labels]

Handwriting recognition 2
[Figure]

Car starts BN
• 18 binary attributes
• Inference: P(BatteryAge | Starts = f)
• 2^18 terms, why so fast?
• Not impressed? The HailFinder BN has more than 3^54 = 58,149,737,003,040,059,690,390,169 terms

Factored joint distribution – Preview
[Figure: BN with edges Flu → Sinus, Allergy → Sinus, Sinus → Headache, Sinus → Nose]

The independence assumption
• Local Markov Assumption: a variable X is independent of its non-descendants given its parents
[Figure: same Flu/Allergy/Sinus/Headache/Nose BN]

Explaining away
[Figure: same BN; Local Markov Assumption repeated]

Chain rule & Joint distribution
• Chain rule: P(X1,…,Xn) = ∏i P(Xi | X1,…,Xi-1)
• With the local Markov assumption, the BN factors the joint as P(X1,…,Xn) = ∏i P(Xi | Pa(Xi))
• For this example: P(F, A, S, H, N) = P(F) P(A) P(S | F, A) P(H | S) P(N | S)

Two (trivial) special cases
• Edgeless graph: no parents, so P(X1,…,Xn) = ∏i P(Xi), i.e., all variables independent
• Fully-connected graph: no independence assumptions at all; the factorization is just the chain rule

The Representation Theorem – Joint Distribution to BN
• Start from a joint probability distribution P; obtain a BN that encodes independence assumptions
• If the conditional independencies in the BN are a subset of the conditional independencies in P, then P can be written as the product of the BN's CPTs

Real Bayesian networks applications
• Diagnosis of lymph node disease
• Speech recognition
• Microsoft Office and Windows (http://www.research.microsoft.com/research/dtg/)
• Study of the human genome
• Robot mapping
• Robots to identify meteorites to study
• Modeling fMRI data
• Anomaly detection
• Fault diagnosis
• Modeling sensor network data

A general Bayes net
• Set of random variables
• Directed acyclic graph
• Encodes independence assumptions
• CPTs
• Joint distribution: P(X1,…,Xn) = ∏i P(Xi | Pa(Xi))

Another example
• Variables: B – Burglar, E – Earthquake, A – Burglar alarm, N – Neighbor calls, R – Radio report
• Both burglars and earthquakes can set off the alarm
• If the alarm sounds, a neighbor may call
• An earthquake may be announced on the radio

Another example – Building the BN
• B – Burglar, E – Earthquake, A – Burglar alarm, N – Neighbor calls, R – Radio report
• The statements above give the structure B → A ← E, A → N, E → R

Independencies encoded in BN
• We said: all you need is the local Markov assumption (Xi ⊥ NonDescendants(Xi) | Pa(Xi))
• But then we talked about other (in)dependencies, e.g., explaining away
• What are the independencies encoded by a BN?
  – The only assumption is local Markov, but many others can be derived using the algebra of conditional independencies!

Understanding independencies in BNs – BNs with 3 nodes
• Local Markov Assumption: a variable X is independent of its non-descendants given its parents
• Indirect causal effect: X → Z → Y
• Indirect evidential effect: X ← Z ← Y
• Common cause: X ← Z → Y
• Common effect: X → Z ← Y

Understanding independencies in BNs – Some examples
[Figure: example BN over nodes A, B, C, D, E, F, G, H, I, J, K]

An active trail – Example
[Figure: example BN over nodes A–H, with additional nodes F′ and F″]
• When are A and H independent?

Active trails formalized
• A path X1 – X2 – ⋯ – Xk is an active trail when variables O ⊆ {X1,…,Xn} are observed if for each consecutive triplet in the trail:
  – Xi-1 → Xi → Xi+1, and Xi is not observed (Xi ∉ O)
  – Xi-1 ← Xi ← Xi+1, and Xi is not observed (Xi ∉ O)
  – Xi-1 ← Xi → Xi+1, and Xi is not observed (Xi ∉ O)
  – Xi-1 → Xi ← Xi+1, and Xi is observed (Xi ∈ O), or one of its descendants is
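These four triplet rules translate directly into code. Below is a minimal sketch (not from the lecture) of an active-trail test in Python; the encoding of a DAG as a dict of node → children lists and all function names are illustrative assumptions.

```python
# Minimal sketch of the active-trail test above (illustrative, not from
# the lecture). A DAG is a dict mapping each node to its list of children.

def descendants(graph, node):
    """All descendants of `node` in the DAG."""
    seen, stack = set(), list(graph.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return seen

def is_active_trail(graph, trail, observed):
    """Check whether `trail` (a list of nodes forming a path, edge
    directions ignored) is active given the `observed` variables,
    applying the four triplet rules from the slide."""
    observed = set(observed)
    for i in range(1, len(trail) - 1):
        prev, mid, nxt = trail[i - 1], trail[i], trail[i + 1]
        into_mid_from_prev = mid in graph.get(prev, [])  # edge prev -> mid?
        into_mid_from_nxt = mid in graph.get(nxt, [])    # edge nxt -> mid?
        if into_mid_from_prev and into_mid_from_nxt:
            # v-structure prev -> mid <- nxt: active only if mid or one
            # of its descendants is observed
            if mid not in observed and not (descendants(graph, mid) & observed):
                return False
        elif mid in observed:
            # causal chain or common cause: blocked if mid is observed
            return False
    return True

# On the Flu/Allergy/Sinus network this reproduces explaining away:
g = {"Flu": ["Sinus"], "Allergy": ["Sinus"], "Sinus": ["Headache", "Nose"]}
print(is_active_trail(g, ["Flu", "Sinus", "Allergy"], observed=[]))            # False
print(is_active_trail(g, ["Flu", "Sinus", "Allergy"], observed=["Headache"]))  # True
```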
Active trails and independence?
• Theorem: Variables Xi and Xj are independent given Z ⊆ {X1,…,Xn} if there is no active trail between Xi and Xj when the variables in Z are observed
[Figure: the example BN over nodes A–K]

The BN Representation Theorem
• If the joint probability distribution factors according to the BN, i.e., P(X1,…,Xn) = ∏i P(Xi | Pa(Xi)), then the conditional independencies in the BN are a subset of the conditional independencies in P
  – Important because: we can read independencies of P from the BN structure G
• If the conditional independencies in the BN are a subset of the conditional independencies in P, then the joint distribution can be obtained as the product of the CPTs
  – Important because: every P has at least one BN structure G

"Simpler" BNs
• A distribution can be represented by many BNs
• A simpler BN requires fewer parameters

Learning Bayes nets
• Two axes: known vs. unknown structure, and fully observable vs. missing data
• Data: x(1), …, x(m)
• From data, learn the structure and the parameters: the CPTs P(Xi | Pa(Xi))

Learning the CPTs
• Data: x(1), …, x(m)
• For each discrete variable Xi, estimate P(Xi = x | Pa(Xi) = u) from counts: Count(x, u) / Count(u) (a counting sketch appears at the end of this section)

What you need to know
• Bayesian networks
  – A compact representation for large probability distributions
  – Not an algorithm
• Semantics of a BN: conditional independence assumptions
• Representation: variables, graph, CPTs
• Why BNs are useful
• Learning CPTs from fully observable data
• Play with the applet! ☺

General probabilistic inference
• Query: P(X | e)
• Using Bayes rule: P(X | e) = P(X, e) / P(e)
• Normalization: P(X | e) ∝ P(X, e)
[Figure: the Flu/Allergy/Sinus/Headache/Nose BN]

Marginalization
• Example with Flu, Sinus, and evidence Allergy = t (sum out Flu)

Probabilistic inference example
[Figure: the same BN with evidence Nose = t]
• Inference seems exponential in the number of variables!

Inference is NP-hard
• (Actually #P-complete)
• Reduction from 3-SAT: e.g., (X1 ∨ X2 ∨ X3) ∧ (X2 ∨ X3 ∨ X4) ∧ …
• Inference unlikely to be efficient in general, but…

Fast probabilistic inference example – Variable elimination
[Figure: the same BN with evidence Nose = t]
• (Potential for) exponential reduction in computation!

Understanding variable elimination – Exploiting distributivity
• Chain Flu → Sinus → Nose = t: push sums inside products, e.g.
  Σs Σf P(f) P(s | f) P(n = t | s) = Σs P(n = t | s) Σf P(f) P(s | f)

Understanding variable elimination – Order can make a HUGE difference
[Figure: the same BN with evidence Nose = t]

Understanding variable elimination – Intermediate results
[Figure: the same BN with evidence Nose = t]
• Intermediate results are probability distributions

Understanding variable elimination – Another example
[Figure: BN with nodes Pharmacy, Sinus, Headache, Nose = t]

Pruning irrelevant variables
[Figure: the same BN with evidence Nose = t]
• Prune all non-ancestors of the query variables

Variable elimination algorithm
• Given a BN and a query P(X | e) ∝ P(X, e)   (IMPORTANT!!!)
• Instantiate evidence e
• Prune non-ancestors of {X, e}
• Choose an ordering on the variables, e.g., X1, …, Xn
• For i = 1 to n: if Xi ∉ {X, e}
  – Collect factors f1, …, fk that include Xi
  – Generate a new factor by eliminating Xi from these factors
  – Variable Xi has been eliminated!
• Normalize P(X, e) to obtain P(X | e)
(A code sketch of this algorithm appears at the end of this section.)

Complexity of variable elimination – (Poly)-tree graphs
• Variable elimination order: start from the "leaves" up – find a topological order, eliminate variables in reverse order
• Linear in the number of variables! (versus exponential)

Complexity of variable elimination – Graphs with
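Referring back to the "Variable elimination algorithm" slide, here is a minimal runnable sketch in Python. The Factor class, the function names, and all CPT numbers are illustrative assumptions, not the lecture's; the elimination loop follows the slide's steps (instantiate evidence, eliminate each non-query variable by multiplying and summing out, normalize).

```python
from itertools import product

class Factor:
    """A factor over binary variables: an ordered list of variable names
    plus a table mapping each 0/1 assignment tuple to a number."""
    def __init__(self, variables, table):
        self.variables = list(variables)
        self.table = dict(table)

def multiply(f, g):
    # Pointwise product over the union of the two factors' variables.
    variables = f.variables + [v for v in g.variables if v not in f.variables]
    table = {}
    for assign in product((0, 1), repeat=len(variables)):
        a = dict(zip(variables, assign))
        table[assign] = (f.table[tuple(a[v] for v in f.variables)] *
                         g.table[tuple(a[v] for v in g.variables)])
    return Factor(variables, table)

def sum_out(f, var):
    # Eliminate var from f by summing over its two values.
    i = f.variables.index(var)
    table = {}
    for assign, val in f.table.items():
        key = assign[:i] + assign[i + 1:]
        table[key] = table.get(key, 0.0) + val
    return Factor(f.variables[:i] + f.variables[i + 1:], table)

def restrict(f, var, value):
    # Instantiate evidence var = value (drop the variable from the factor).
    if var not in f.variables:
        return f
    i = f.variables.index(var)
    table = {a[:i] + a[i + 1:]: v for a, v in f.table.items() if a[i] == value}
    return Factor(f.variables[:i] + f.variables[i + 1:], table)

def variable_elimination(factors, query, evidence, order):
    """P(query | evidence). `order` must list each hidden (non-query,
    non-evidence) variable exactly once."""
    for var, val in evidence.items():
        factors = [restrict(f, var, val) for f in factors]
    for var in order:
        involved = [f for f in factors if var in f.variables]
        rest = [f for f in factors if var not in f.variables]
        prod = involved[0]
        for f in involved[1:]:
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)  # final factor is over `query` only
    z = sum(result.table.values())    # normalize P(X, e) into P(X | e)
    return {a[0]: v / z for a, v in result.table.items()}

# The Flu/Allergy/Sinus/Headache/Nose network with made-up CPT numbers.
p_flu = Factor(["Flu"], {(0,): 0.9, (1,): 0.1})
p_allergy = Factor(["Allergy"], {(0,): 0.8, (1,): 0.2})
p_sinus = Factor(["Sinus", "Flu", "Allergy"], {
    (1, 1, 1): 0.9, (1, 1, 0): 0.7, (1, 0, 1): 0.6, (1, 0, 0): 0.05,
    (0, 1, 1): 0.1, (0, 1, 0): 0.3, (0, 0, 1): 0.4, (0, 0, 0): 0.95})
p_headache = Factor(["Headache", "Sinus"], {(1, 1): 0.8, (0, 1): 0.2,
                                            (1, 0): 0.1, (0, 0): 0.9})
p_nose = Factor(["Nose", "Sinus"], {(1, 1): 0.7, (0, 1): 0.3,
                                    (1, 0): 0.05, (0, 0): 0.95})

# P(Flu | Nose = t). Headache is a non-ancestor of {Flu, Nose} and could be
# pruned; left in, it simply sums out to 1.
posterior = variable_elimination(
    [p_flu, p_allergy, p_sinus, p_headache, p_nose],
    query="Flu", evidence={"Nose": 1}, order=["Headache", "Allergy", "Sinus"])
print(posterior)  # {0: P(Flu=f | Nose=t), 1: P(Flu=t | Nose=t)}
```

With this ordering no intermediate factor involves more than three variables; a bad ordering can create exponentially larger intermediate factors, which is the point of the "order can make a HUGE difference" slide.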
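Returning to the "Learning the CPTs" slide: with fully observed data, the maximum-likelihood estimate of each CPT entry is a ratio of counts, P(Xi = x | Pa(Xi) = u) = Count(x, u) / Count(u). A minimal sketch, assuming data arrives as a list of dicts (an illustrative format, not the lecture's):

```python
from collections import Counter

def learn_cpt(data, var, parents):
    """Estimate P(var | parents) by counting in fully observed data."""
    joint = Counter((tuple(x[p] for p in parents), x[var]) for x in data)
    parent_counts = Counter(tuple(x[p] for p in parents) for x in data)
    return {(u, x): c / parent_counts[u] for (u, x), c in joint.items()}

# Example: estimate P(Sinus | Flu, Allergy) from four made-up records.
data = [
    {"Flu": 1, "Allergy": 0, "Sinus": 1},
    {"Flu": 1, "Allergy": 0, "Sinus": 0},
    {"Flu": 0, "Allergy": 1, "Sinus": 1},
    {"Flu": 0, "Allergy": 0, "Sinus": 0},
]
cpt = learn_cpt(data, "Sinus", ["Flu", "Allergy"])
# cpt[((1, 0), 1)] == 0.5, i.e., P(Sinus=t | Flu=t, Allergy=f) = 1/2
```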