Learning Bayesian Networks
Harald Steck, MIT AI Lab
[email protected]
3, 2002

Outline
• Bayesian networks (BN)
• Learning discrete Bayesian Networks
  – Scoring Functions
  – Search Strategies
• Applications

How to obtain Bayesian Networks?
• construct them manually: experts / domain knowledge needed
• learn them from data
• combine prior knowledge and data

Properties of Bayesian Networks
• qualitative:
  – the graph structure visualizes the relevant (in)dependencies among the random variables in a domain
  – an interpretation in a causal manner requires additional assumptions
• quantitative: make predictions (inference)
• Example: the "Visit to Asia" network (figure)

Bayesian Networks (more formal)
• network structure: directed acyclic graph (DAG)
  – (figure: a DAG over the variables A, B, C, D)
  – a directed edge expresses an asymmetric relation (but not necessarily a causal one)
  – missing edges represent conditional independences (d-separation criterion)
• parameters: conditional probabilities; in the example, the chain-rule factorization
    p(A, B, C, D) = p(D|C, B, A) · p(C|B, A) · p(B|A) · p(A)
  simplifies to p(D|C, A) · p(C|B) · p(B|A) · p(A), since the missing edges allow B to be dropped from the first factor and A from the second
• a BN describes a probability distribution over n variables in a modular way:
    p(X) = ∏_{i=1}^{n} p(X_i | Π_i)
  where Π_i denotes the parents of X_i

How to model conditional probability distributions?
• discrete variables: tables
• continuous variables:
  – multivariate Gaussian (linear regression), e.g. for A → B:
      p(A): A ∼ N(μ_A, σ²_A)
      p(B|A): B ∼ N(μ_B + θ_{A,B} · A, σ²_B)
  – nonlinear relations:
    ∗ nonlinear regression, e.g.
        p(B|A): B ∼ N(μ_B + θ_{A,B,1} · A + θ_{A,B,2} · A², σ²_B)
    ∗ other models: neural networks + noise, ...
• both discrete and continuous variables

Markov-Equivalence
• applies to discrete BNs and continuous Gaussian BNs
• Example: four DAGs over A, B, C:
    m0: A ← C ← B    m1: A ← C → B    m2: A → C → B    m3: A → C ← B (v-structure)
  – in m0, m1, m2: A and B are conditionally independent given C, but dependent marginally
  – in m3 (v-structure): A and B are independent marginally, but dependent given C
  – m0, m1, m2 represent the same distribution:
      p(A, B, C) = p(A|C) p(C|B) p(B)   [m0]
                 = p(A|C) p(B|C) p(C)   [m1]
                 = p(C|A) p(B|C) p(A)   [m2]
• DAGs can be partitioned into equivalence classes that represent the same conditional independences

Markov-Equivalence (cont'd)
• two DAGs are Markov-equivalent iff they have
  – the same edges when ignoring their orientations
  – and the same v-structures (→ · ←)
• Example: 2 Markov-equivalent DAGs (figure: two DAGs over the variables A–G that differ only in the orientation of some edges; neither orientation contains a v-structure)

Outline
• Bayesian networks
• Learning discrete Bayesian Networks
  – Scoring Functions
  – Search Strategies
• Applications
• assumed in the following:
  – no hidden variables
  – no missing data
  – discrete BNs

Scoring Functions
• Maximum Likelihood:
    θ̂_{x_i|π_i} = N_{x_i,π_i} / N_{π_i}
    l(θ̂_m) = log L(θ̂_m) = Σ_i Σ_{x_i,π_i} N_{x_i,π_i} · log(N_{x_i,π_i} / N_{π_i})
• not useful for model selection: over-fitting

Scoring Functions (cont'd)
• BIC (Bayesian Information Criterion, a.k.a. the (Jeffreys-)Schwarz Criterion):
    f_BIC(m) = l(θ̂_m) − (1/2) · |θ̂_m| · log N
  – trade-off between goodness of fit and model complexity
  – BIC coincides with MDL (Minimum Description Length)
• AIC (Akaike Information Criterion):
    f_AIC(m) = l(θ̂_m) − |θ̂_m|
• the number of independent parameters is
    |θ̂_m| = Σ_i (|X_i| − 1) · |Π_i|,   where |Π_i| = ∏_{X ∈ Π_i} |X|

Score Difference
• compare two graphs m+ and m− that differ in one edge only (figure: m+ contains an edge between A and B; m− is identical except that this edge is absent; Π denotes the remaining parents)
• Example: BIC for discrete variables:
    Δf(m+, m−) = f(m+) − f(m−)
               = Σ_{a,b,π} N_{a,b,π} · log( (N_{a,b,π} · N_π) / (N_{a,π} · N_{b,π}) ) − (1/2) · d_DF · log N
               = g(A, B|Π)   ... independent of the remaining variables
  where d_DF are the degrees of freedom:
    d_DF = |θ_{m+}| − |θ_{m−}| = (|A| − 1) · (|B| − 1) · |Π|,   where |Π| = ∏_{X ∈ Π} |X|

Score Difference (cont'd)
• conditional independences (which are represented by BNs):
  – g(A, B|Π) < 0: absence of the edge A ← B is favored given Π
      ... A independent of B given Π
  – g(A, B|Π) > 0: presence of the edge A ← B is favored given Π
      ... A dependent on B given Π
• Markov equivalence:
  – data cannot help distinguish among Markov-equivalent DAGs
  – a "local" property of equivalent DAGs: an edge A ← B can be reversed if Π_A \ {B} = Π_B \ {A}
  – for BIC: g(A, B|Π) = g(B, A|Π)
  – hence BIC assigns the same score to equivalent DAGs

Search Strategies
• in discrete or continuous Gaussian BNs:
  – data can help distinguish only among equivalence classes
  – a search in the space of equivalence classes is thus most appropriate, but very involved
• alternative: search in the space of DAGs
• number of DAGs over n variables: 2^(n(n−1)/2) < #DAGs ≤ 3^(n(n−1)/2)
• finding the optimal DAG w.r.t. a scoring function f is NP-hard
• resort to approximate search strategies

Local Search
• a general-purpose search strategy
• choose a starting graph
• proceed through the search space along a sequence of neighboring DAGs, guided by the scoring function
• DAGs differing in a single edge may be defined as neighbors
• hence there are 3 possible transitions in local search:
  – add an edge (if permissible)
  – delete an edge
  – reverse the orientation of an edge (if permissible)
• score difference due to a transition: Δ = f(m_new) − f(m_old)

Local Search and Greedy Hill Climbing
• choose the transition that maximizes Δ
• repeat until Δ < 0 for all permissible steps
• result: a graph that is a local optimum
• Example: (figure: a sequence of DAGs over SEX, SES, IQ, PE, CP, connected by "reverse", "delete", and "add" transitions)

Local Search and Simulated Annealing
• a general-purpose optimization procedure to avoid local optima
• inspired by cooling down an ensemble of particles (statistical physics)
• temperature of the system: T
• procedure:
  – start with a high temperature and lower it slowly over time
  – randomly choose a transition
  – make the transition with probability p(Δ) = min{1, exp(Δ/T)}
• theory: finds the global minimum of −f with probability 1 if the starting temperature is sufficiently high and is lowered sufficiently slowly
• practice: with limited computation time, it may only find a local optimum

Applications: Analysis of Questionnaires
• find relevant conditional dependencies
• e.g., Wisconsin high-school students (Sewell and Shah, 1968):
  – survey among 10,318 students
  – learn a BN from these data (figure: learned DAG over PE, CP, IQ, SEX, SES)
  – variables:
      SEX: gender of student
      CP: college plans
      PE: parental encouragement
      SES: socioeconomic status
      IQ: intelligence quotient

Applications: Analysis of Noisy Measurements
• e.g., gene expression data from bio-tech labs
  – graph: recovery of regulatory networks (Hartemink et al., 2002)
  – prediction: what is the most informative next experiment to be conducted (active learning)
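To make the score difference concrete, here is a minimal Python sketch (not part of the original slides) that computes g(A, B|Π) from raw counts. The representation of a data set as a list of dicts, and the function name `bic_edge_gain`, are assumptions made for illustration only:

```python
import math
from collections import Counter

def bic_edge_gain(data, a, b, parents):
    """g(A, B | Pi): BIC score difference for the edge between a and b,
    given the common parent set `parents`.
    data: list of dicts mapping variable name -> discrete value (assumed format)."""
    N = len(data)
    key = lambda row: tuple(row[p] for p in parents)
    n_abp = Counter((row[a], row[b], key(row)) for row in data)
    n_ap = Counter((row[a], key(row)) for row in data)
    n_bp = Counter((row[b], key(row)) for row in data)
    n_p = Counter(key(row) for row in data)
    # first term: sum over a,b,pi of N_{a,b,pi} * log(N_{a,b,pi} N_pi / (N_{a,pi} N_{b,pi}))
    ll = sum(n * math.log(n * n_p[vp] / (n_ap[(va, vp)] * n_bp[(vb, vp)]))
             for (va, vb, vp), n in n_abp.items())
    # degrees of freedom: d_DF = (|A|-1) * (|B|-1) * |Pi|
    card = lambda v: len({row[v] for row in data})
    d_df = (card(a) - 1) * (card(b) - 1)
    for p in parents:
        d_df *= card(p)
    return ll - 0.5 * d_df * math.log(N)
```

Note that the first term is N times the empirical conditional mutual information of A and B given Π, and that the expression is symmetric in a and b, matching the slide's claim g(A, B|Π) = g(B, A|Π).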
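The greedy hill-climbing local search described above can be sketched as follows — a toy implementation, not from the slides, under an assumed list-of-dicts data representation, using the decomposable per-node BIC score; edge reversal is omitted for brevity:

```python
import math
from collections import Counter

def family_bic(data, child, parents):
    """BIC contribution of one node given its parent set (decomposable score)."""
    N = len(data)
    key = lambda row: tuple(row[p] for p in sorted(parents))
    n_xp = Counter((row[child], key(row)) for row in data)
    n_p = Counter(key(row) for row in data)
    ll = sum(n * math.log(n / n_p[vp]) for (vx, vp), n in n_xp.items())
    card = lambda v: len({row[v] for row in data})
    k = card(child) - 1                  # independent parameters per parent config
    for p in parents:
        k *= card(p)
    return ll - 0.5 * k * math.log(N)

def creates_cycle(parents, child, new_parent):
    """Would adding the edge new_parent -> child create a directed cycle?"""
    stack, seen = [new_parent], set()
    while stack:
        v = stack.pop()
        if v == child:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def hill_climb(data, variables):
    """Greedy local search over DAGs: repeatedly take the single edge
    addition or deletion that most improves the total BIC score."""
    parents = {v: set() for v in variables}
    while True:
        best_delta, best_move = 0.0, None
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                old = family_bic(data, y, parents[y])
                if x in parents[y]:                      # try deleting x -> y
                    delta = family_bic(data, y, parents[y] - {x}) - old
                    move = ('del', x, y)
                elif not creates_cycle(parents, y, x):   # try adding x -> y
                    delta = family_bic(data, y, parents[y] | {x}) - old
                    move = ('add', x, y)
                else:
                    continue
                if delta > best_delta:
                    best_delta, best_move = delta, move
        if best_move is None:                            # local optimum reached
            return parents
        op, x, y = best_move
        (parents[y].discard if op == 'del' else parents[y].add)(x)
```

Because the score is decomposable, each candidate move only requires rescoring the one affected family — this is exactly why the score difference g(A, B|Π) is independent of the remaining variables.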