MIT 6.867 - Learning Bayesian Networks

Learning Bayesian Networks
Harald Steck, MIT AI Lab, 2002

Outline
• Bayesian networks (BN)
• Learning discrete Bayesian Networks
  – Scoring Functions
  – Search Strategies
• Applications

How to obtain Bayesian Networks?
• construct them manually: experts / knowledge needed
• learn them from data
• combine prior knowledge and data

Properties of Bayesian Networks
• qualitative:
  – the graph structure visualizes relevant (in)dependencies among random variables in a domain
  – interpretation in a causal manner: requires additional assumptions
• quantitative: make predictions (inference)
• Example: the "Visit to Asia" network [figure]

Bayesian Networks (more formal)
• network structure: directed acyclic graph (DAG) [figure: DAG over A, B, C, D]
  – directed edge: asymmetric relation (but not necessarily causal)
  – missing edges represent conditional independences (d-separation criterion)
• parameters: conditional probabilities, e.g.
  p(A, B, C, D) = p(D \mid C, \cancel{B}, \cancel{A}) \cdot p(C \mid B, \cancel{A}) \cdot p(B \mid A) \cdot p(A)
  (crossed-out parents drop out because the corresponding edges are missing)
• a BN describes a probability distribution over n variables in a modular way:
  p(X) = \prod_{i=1}^{n} p(X_i \mid \Pi_i)

How to model conditional probability distributions?
• discrete variables: tables
• continuous variables:
  – multivariate Gaussian (linear regression), e.g. for A → B:
    p(A): A \sim N(\mu_A, \sigma_A^2)
    p(B \mid A): B \sim N(\mu_B + \theta_{A,B} \cdot A, \sigma_B^2)
  – nonlinear relations:
    ∗ nonlinear regression: p(B \mid A): B \sim N(\mu_B + \theta_{A,B,1} \cdot A + \theta_{A,B,2} \cdot A^2, \sigma_B^2)
    ∗ other models: neural networks + noise, ...
• both discrete and continuous variables

Markov-Equivalence
• applies to discrete BNs and continuous Gaussian BNs
• Example: four DAGs m0, ..., m3 over A, B, C [figure]
  – m0: B → C → A,  m1: A ← C → B,  m2: A → C → B
    A and B conditionally independent given C; A and B dependent (marginally)
    p(A, B, C) = p(A \mid C) \, p(C \mid B) \, p(B)   [m0]
               = p(A \mid C) \, p(B \mid C) \, p(C)   [m1]
               = p(C \mid A) \, p(B \mid C) \, p(A)   [m2]
  – m3: A → C ← B (v-structure)
    A and B (marginally) independent; A and B dependent given C
• DAGs can be partitioned into equivalence classes that represent the same conditional independences

Markov-Equivalence (cont'd)
• Two DAGs are Markov-equivalent iff they have
  – the same edges when ignoring their orientations
  – and the same v-structures (patterns A → C ← B)
• Example: 2 Markov-equivalent DAGs over A, ..., G [figure: the two DAGs differ only in the orientation of one edge, and neither orientation creates a v-structure]

Outline
• Bayesian networks
• Learning discrete Bayesian Networks
  – Scoring Functions
  – Search Strategies
• Applications
• in the following it is assumed:
  – no hidden variables
  – no missing data
  – discrete BNs

Scoring Functions
• Maximum Likelihood
  \hat{\theta}_{x_i \mid \pi_i} = \frac{N_{x_i, \pi_i}}{N_{\pi_i}}
  l(\hat{\theta}_m) = \log L(\hat{\theta}_m) = \sum_i \sum_{x_i, \pi_i} N_{x_i, \pi_i} \log \frac{N_{x_i, \pi_i}}{N_{\pi_i}}
• not useful for model selection: over-fitting

Scoring Functions (cont'd)
• BIC (Bayesian Information Criterion, aka (Jeffreys-)Schwarz Criterion)
  f_{BIC}(m) = l(\hat{\theta}_m) - \frac{1}{2} |\hat{\theta}_m| \log N
  – trade-off between goodness of fit and model complexity
  – BIC coincides with MDL (Minimum Description Length)
• AIC (Akaike Information Criterion)
  f_{AIC}(m) = l(\hat{\theta}_m) - |\hat{\theta}_m|
• the number of independent parameters is
  |\hat{\theta}_m| = \sum_i (|X_i| - 1) \cdot |\Pi_i|,  where |\Pi_i| = \prod_{X \in \Pi_i} |X|

Score Difference
• compare two graphs m^+ and m^- that differ in one edge only [figure: m^+ contains the edge between A and B, m^- does not; Π is the remaining parent set]
• Example: BIC for discrete variables
  \Delta f(m^+, m^-) = f(m^+) - f(m^-)
                     = \sum_{a,b,\pi} N_{a,b,\pi} \log \frac{N_{a,b,\pi} \, N_{\pi}}{N_{a,\pi} \, N_{b,\pi}} - \frac{1}{2} d_{DF} \log N
                     = g(A, B \mid \Pi)   ... independent of the remaining variables
  where d_{DF} are the degrees of freedom:
  d_{DF} = |\theta_{m^+}| - |\theta_{m^-}| = (|A| - 1) \cdot (|B| - 1) \cdot |\Pi|,  where |\Pi| = \prod_{X \in \Pi} |X|

Score Difference (cont'd)
• Conditional independences (which are represented by BNs):
  g(A, B \mid \Pi) < 0 ... absence of edge A ← B favored given Π ... A independent of B given Π
  g(A, B \mid \Pi) > 0 ... presence of edge A ← B favored given Π ... A dependent on B given Π
• Markov equivalence:
  – data cannot help distinguish among Markov-equivalent DAGs
  – a "local" property of equivalent DAGs: an edge A ← B can be reversed if \Pi_A \setminus \{B\} = \Pi_B \setminus \{A\}
  – for BIC: g(A, B \mid \Pi) = g(B, A \mid \Pi)
  – hence BIC assigns the same score to equivalent DAGs
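
To make the scoring formulas above concrete, here is a minimal Python sketch; it is not part of the original slides, and all function names, the data layout, and the `arity` argument are illustrative assumptions. `family_score_bic` computes the BIC contribution of a single family (X_i, Π_i) from counts, `bic_score` sums these contributions over a DAG, and `score_difference` computes g(A, B | Π) as the change in A's family score when the edge B → A is added. Data is assumed to be a list of dicts mapping variable names to discrete values, and `arity` gives the number of states of each variable.

import math
from collections import Counter

def family_score_bic(data, child, parents, arity):
    """BIC contribution of one family (child, parents):
    sum_{x,pi} N_{x,pi} log(N_{x,pi}/N_pi)  -  0.5 * (|X|-1) * prod_{P in parents} |P| * log N."""
    N = len(data)
    n_xp = Counter((row[child], tuple(row[p] for p in parents)) for row in data)
    n_p = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(n * math.log(n / n_p[pi]) for (x, pi), n in n_xp.items())
    n_params = (arity[child] - 1) * math.prod(arity[p] for p in parents)
    return loglik - 0.5 * n_params * math.log(N)

def bic_score(data, dag, arity):
    """BIC of a DAG given as {node: list of parent names}; the score decomposes over families."""
    return sum(family_score_bic(data, x, dag[x], arity) for x in dag)

def score_difference(data, a, b, pi, arity):
    """g(A, B | Pi): change in BIC when the edge B -> A is added, Pi being A's other parents.
    Positive => presence of the edge favored; negative => absence favored."""
    return (family_score_bic(data, a, list(pi) + [b], arity)
            - family_score_bic(data, a, list(pi), arity))

Because the score decomposes over families, g(A, B | Π) depends only on the counts over A, B, and Π, which is what makes the local search described next cheap to evaluate.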

Outline
• Bayesian networks
• Learning discrete Bayesian Networks
  – Scoring Functions
  – Search Strategies
• Applications

Search Strategies
• in discrete or continuous Gaussian BNs:
  – data can help distinguish only among equivalence classes
  – search in the space of equivalence classes is thus most appropriate, but very involved
• search in the space of DAGs
• number of DAGs with n variables: 2^{\binom{n}{2}} < \#DAGs \le 3^{\binom{n}{2}}
• finding the optimal DAG w.r.t. a scoring function f is NP-hard
• resort to approximate search strategies

Local Search
• general-purpose search strategy
• choose a starting graph
• proceed through the search space along a sequence of neighboring DAGs, guided by the scoring function
• DAGs differing in a single edge may be defined as neighbors
• hence, 3 possible transitions in local search:
  – add an edge (if permissible)
  – delete an edge
  – reverse the orientation of an edge (if permissible)
• score difference due to a transition: \Delta = f(m_{new}) - f(m_{old})

Local Search and Greedy Hill Climbing
• choose the transition that maximizes \Delta
• repeat until \Delta < 0 for all permissible steps
• result: a graph that is a local optimum
• Example: [figure: a sequence of reverse, delete, and add steps on a network over SEX, SES, IQ, PE, CP]
  (a Python sketch of this greedy search, and of the simulated-annealing variant below, follows after the Applications slides)

Local Search and Simulated Annealing
• general-purpose optimization procedure to avoid local optima
• inspired by cooling down an ensemble of particles (statistical physics)
• temperature of the system: T
• procedure:
  – start with a high temperature and lower it slowly over time
  – randomly choose a transition
  – make the transition with probability p(\Delta) = \min\{1, \exp(\Delta / T)\}
• theory: finds the global minimum of -f with probability 1 if the starting temperature is sufficiently high and is lowered sufficiently slowly
• practice: with limited computation time, may only find a local optimum

Outline
• Bayesian networks
• Learning discrete Bayesian Networks
  – Scoring Functions
  – Search Strategies
• Applications

Analysis of Questionnaires
• find relevant conditional dependencies
• e.g., Wisconsin high-school students (Sewell and Shah, 1968):
  – survey among 10,318 students
  – learn a BN from that data [figure: learned network over SEX, SES, IQ, PE, CP]
    SEX: gender of student
    CP: college plans
    PE: parental encouragement
    SES: socioeconomic status
    IQ: intelligence quotient

Analysis of Noisy Measurements
• e.g., gene expression data from bio-tech labs
  – graph: recovery of regulatory networks (Hartemink et al., 2002)
  – prediction: what is the most informative next experiment to be conducted (active learning)
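
To tie the scoring and search slides together, here is a sketch of the greedy hill-climbing procedure from the Local Search slides: starting from the empty graph, repeatedly apply the add / delete / reverse transition with the largest positive Δ until no transition improves the score. It is again not from the original slides; it reuses `bic_score` from the sketch above, and all other names (and the toy data at the end) are illustrative.

from itertools import permutations
import random

def is_acyclic(dag):
    """Cycle check for a {node: [parents]} graph: repeatedly strip nodes with no remaining parents."""
    parents = {x: set(ps) for x, ps in dag.items()}
    while parents:
        roots = [x for x, ps in parents.items() if not ps]
        if not roots:
            return False                    # every remaining node has a parent -> cycle
        for x in roots:
            del parents[x]
        for ps in parents.values():
            ps.difference_update(roots)
    return True

def neighbors(dag):
    """All graphs that differ from `dag` in a single edge: add, delete, or reverse (if still a DAG)."""
    for a, b in permutations(dag, 2):       # consider the directed edge b -> a
        new = {x: list(ps) for x, ps in dag.items()}
        if b in dag[a]:
            new[a].remove(b)                # delete b -> a
            yield new
            rev = {x: list(ps) for x, ps in new.items()}
            rev[b].append(a)                # reverse b -> a into a -> b
            if is_acyclic(rev):
                yield rev
        else:
            new[a].append(b)                # add b -> a
            if is_acyclic(new):
                yield new

def hill_climb(data, arity, max_steps=1000):
    """Greedy hill climbing on the BIC score, starting from the empty graph."""
    dag = {x: [] for x in arity}
    score = bic_score(data, dag, arity)
    for _ in range(max_steps):
        best, best_score = None, score
        for cand in neighbors(dag):
            s = bic_score(data, cand, arity)    # could be computed locally via score_difference
            if s > best_score:
                best, best_score = cand, s
        if best is None:                        # no permissible transition with Delta > 0 left
            break
        dag, score = best, best_score
    return dag, score

# Toy usage on hypothetical data: B is a noisy copy of A, C is an independent coin flip.
random.seed(0)
data = [dict(A=a, B=a if random.random() < 0.9 else 1 - a, C=random.randint(0, 1))
        for a in (0, 1) for _ in range(200)]
print(hill_climb(data, {"A": 2, "B": 2, "C": 2}))

In practice one exploits decomposability: only the families whose parent sets change need to be rescored, rather than recomputing the BIC of the whole candidate graph as done here for brevity.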

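For the simulated-annealing variant, only the acceptance rule changes: a random single-edge transition is proposed and accepted with probability p(Δ) = min{1, exp(Δ/T)} while the temperature T is slowly lowered. A minimal sketch, reusing `neighbors` and `bic_score` from above; the cooling schedule and all parameter values are illustrative assumptions, not from the slides.

import math
import random

def simulated_annealing(data, arity, T0=10.0, cooling=0.995, steps=5000):
    """Propose random single-edge moves; accept with probability min(1, exp(Delta/T)), cooling T slowly."""
    dag = {x: [] for x in arity}
    score = bic_score(data, dag, arity)
    T = T0
    for _ in range(steps):
        cands = list(neighbors(dag))
        if not cands:
            break
        cand = random.choice(cands)
        s = bic_score(data, cand, arity)
        delta = s - score
        if delta >= 0 or random.random() < math.exp(delta / T):
            dag, score = cand, s            # accept the move (always if Delta >= 0)
        T *= cooling                        # lower the temperature
    return dag, score

As the slides note, the guarantee of reaching the global optimum requires an impractically slow cooling schedule, so with limited computation time this may still return only a local optimum.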
