Learning Bayesian Networks
Harald Steck, MIT AI Lab
[email protected]
3, 2002

Outline
• Bayesian networks (BN)
• Learning discrete Bayesian Networks
  – Scoring Functions
  – Search Strategies
• Applications

How to obtain Bayesian Networks?
• construct them manually: experts / domain knowledge needed
• learn them from data
• combine prior knowledge and data

Properties of Bayesian Networks
• qualitative:
  – the graph structure visualizes the relevant (in)dependencies among the random variables in a domain
  – an interpretation in a causal manner requires additional assumptions
• quantitative: make predictions (inference)
• Example: the "Visit to Asia" network (figure)

Bayesian Networks (more formal)
• network structure: directed acyclic graph (DAG)
  – (figure: a DAG over the variables A, B, C, D)
  – a directed edge expresses an asymmetric relation (but not necessarily a causal one)
  – missing edges represent conditional independences (d-separation criterion)
• parameters: conditional probabilities; in the example, the chain-rule factorization
    p(A, B, C, D) = p(D|C, B, A) · p(C|B, A) · p(B|A) · p(A)
  simplifies to p(D|C, A) · p(C|B) · p(B|A) · p(A), since the missing edges allow B to be dropped from the first factor and A from the second
• a BN describes a probability distribution over n variables in a modular way:
    p(X) = ∏_{i=1}^{n} p(X_i | Π_i)
  where Π_i denotes the parents of X_i

How to model conditional probability distributions?
• discrete variables: tables
• continuous variables:
  – multivariate Gaussian (linear regression), e.g. for A → B:
      p(A): A ∼ N(μ_A, σ²_A)
      p(B|A): B ∼ N(μ_B + θ_{A,B} · A, σ²_B)
  – nonlinear relations:
    ∗ nonlinear regression, e.g.
        p(B|A): B ∼ N(μ_B + θ_{A,B,1} · A + θ_{A,B,2} · A², σ²_B)
    ∗ other models: neural networks + noise, ...
• both discrete and continuous variables

Markov-Equivalence
• applies to discrete BNs and continuous Gaussian BNs
• Example: four DAGs over A, B, C:
    m0: A ← C ← B    m1: A ← C → B    m2: A → C → B    m3: A → C ← B (v-structure)
  – in m0, m1, m2: A and B are conditionally independent given C, but dependent marginally
  – in m3 (v-structure): A and B are independent marginally, but dependent given C
  – m0, m1, m2 represent the same distribution:
      p(A, B, C) = p(A|C) p(C|B) p(B)   [m0]
                 = p(A|C) p(B|C) p(C)   [m1]
                 = p(C|A) p(B|C) p(A)   [m2]
• DAGs can be partitioned into equivalence classes that represent the same conditional independences

Markov-Equivalence (cont'd)
• two DAGs are Markov-equivalent iff they have
  – the same edges when ignoring their orientations
  – and the same v-structures (→ · ←)
• Example: 2 Markov-equivalent DAGs (figure: two DAGs over the variables A–G that differ only in the orientation of some edges; neither orientation contains a v-structure)

Outline
• Bayesian networks
• Learning discrete Bayesian Networks
  – Scoring Functions
  – Search Strategies
• Applications
• assumed in the following:
  – no hidden variables
  – no missing data
  – discrete BNs

Scoring Functions
• Maximum Likelihood:
    θ̂_{x_i|π_i} = N_{x_i,π_i} / N_{π_i}
    l(θ̂_m) = log L(θ̂_m) = Σ_i Σ_{x_i,π_i} N_{x_i,π_i} · log(N_{x_i,π_i} / N_{π_i})
• not useful for model selection: over-fitting

Scoring Functions (cont'd)
• BIC (Bayesian Information Criterion, a.k.a. the (Jeffreys-)Schwarz Criterion):
    f_BIC(m) = l(θ̂_m) − (1/2) · |θ̂_m| · log N
  – trade-off between goodness of fit and model complexity
  – BIC coincides with MDL (Minimum Description Length)
• AIC (Akaike Information Criterion):
    f_AIC(m) = l(θ̂_m) − |θ̂_m|
• the number of independent parameters is
    |θ̂_m| = Σ_i (|X_i| − 1) · |Π_i|,   where |Π_i| = ∏_{X ∈ Π_i} |X|

Score Difference
• compare two graphs m+ and m− that differ in one edge only (figure: m+ contains an edge between A and B; m− is identical except that this edge is absent; Π denotes the remaining parents)
• Example: BIC for discrete variables:
    Δf(m+, m−) = f(m+) − f(m−)
               = Σ_{a,b,π} N_{a,b,π} · log( (N_{a,b,π} · N_π) / (N_{a,π} · N_{b,π}) ) − (1/2) · d_DF · log N
               = g(A, B|Π)   ... independent of the remaining variables
  where d_DF are the degrees of freedom:
    d_DF = |θ_{m+}| − |θ_{m−}| = (|A| − 1) · (|B| − 1) · |Π|,   where |Π| = ∏_{X ∈ Π} |X|

Score Difference (cont'd)
• conditional independences (which are represented by BNs):
  – g(A, B|Π) < 0: absence of the edge A ← B is favored given Π
      ... A independent of B given Π
  – g(A, B|Π) > 0: presence of the edge A ← B is favored given Π
      ... A dependent on B given Π
• Markov equivalence:
  – data cannot help distinguish among Markov-equivalent DAGs
  – a "local" property of equivalent DAGs: an edge A ← B can be reversed if Π_A \ {B} = Π_B \ {A}
  – for BIC: g(A, B|Π) = g(B, A|Π)
  – hence BIC assigns the same score to equivalent DAGs

Search Strategies
• in discrete or continuous Gaussian BNs:
  – data can help distinguish only among equivalence classes
  – a search in the space of equivalence classes is thus most appropriate, but very involved
• alternative: search in the space of DAGs
• number of DAGs over n variables: 2^(n(n−1)/2) < #DAGs ≤ 3^(n(n−1)/2)
• finding the optimal DAG w.r.t. a scoring function f is NP-hard
• resort to approximate search strategies

Local Search
• a general-purpose search strategy
• choose a starting graph
• proceed through the search space along a sequence of neighboring DAGs, guided by the scoring function
• DAGs differing in a single edge may be defined as neighbors
• hence there are 3 possible transitions in local search:
  – add an edge (if permissible)
  – delete an edge
  – reverse the orientation of an edge (if permissible)
• score difference due to a transition: Δ = f(m_new) − f(m_old)

Local Search and Greedy Hill Climbing
• choose the transition that maximizes Δ
• repeat until Δ < 0 for all permissible steps
• result: a graph that is a local optimum
• Example: (figure: a sequence of DAGs over SEX, SES, IQ, PE, CP, connected by "reverse", "delete", and "add" transitions)

Local Search and Simulated Annealing
• a general-purpose optimization procedure to avoid local optima
• inspired by cooling down an ensemble of particles (statistical physics)
• temperature of the system: T
• procedure:
  – start with a high temperature and lower it slowly over time
  – randomly choose a transition
  – make the transition with probability p(Δ) = min{1, exp(Δ/T)}
• theory: finds the global minimum of −f with probability 1 if the starting temperature is sufficiently high and is lowered sufficiently slowly
• practice: with limited computation time, it may only find a local optimum

Applications: Analysis of Questionnaires
• find relevant conditional dependencies
• e.g., Wisconsin high-school students (Sewell and Shah, 1968):
  – survey among 10,318 students
  – learn a BN from these data (figure: learned DAG over PE, CP, IQ, SEX, SES)
  – variables:
      SEX: gender of student
      CP: college plans
      PE: parental encouragement
      SES: socioeconomic status
      IQ: intelligence quotient

Applications: Analysis of Noisy Measurements
• e.g., gene expression data from bio-tech labs
  – graph: recovery of regulatory networks (Hartemink et al., 2002)
  – prediction: what is the most informative next experiment to be conducted (active learning)
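To make the score difference concrete, here is a minimal Python sketch (not part of the original slides) that computes g(A, B|Π) from raw counts. The representation of a data set as a list of dicts, and the function name `bic_edge_gain`, are assumptions made for illustration only:

```python
import math
from collections import Counter

def bic_edge_gain(data, a, b, parents):
    """g(A, B | Pi): BIC score difference for the edge between a and b,
    given the common parent set `parents`.
    data: list of dicts mapping variable name -> discrete value (assumed format)."""
    N = len(data)
    key = lambda row: tuple(row[p] for p in parents)
    n_abp = Counter((row[a], row[b], key(row)) for row in data)
    n_ap = Counter((row[a], key(row)) for row in data)
    n_bp = Counter((row[b], key(row)) for row in data)
    n_p = Counter(key(row) for row in data)
    # first term: sum over a,b,pi of N_{a,b,pi} * log(N_{a,b,pi} N_pi / (N_{a,pi} N_{b,pi}))
    ll = sum(n * math.log(n * n_p[vp] / (n_ap[(va, vp)] * n_bp[(vb, vp)]))
             for (va, vb, vp), n in n_abp.items())
    # degrees of freedom: d_DF = (|A|-1) * (|B|-1) * |Pi|
    card = lambda v: len({row[v] for row in data})
    d_df = (card(a) - 1) * (card(b) - 1)
    for p in parents:
        d_df *= card(p)
    return ll - 0.5 * d_df * math.log(N)
```

Note that the first term is N times the empirical conditional mutual information of A and B given Π, and that the expression is symmetric in a and b, matching the slide's claim g(A, B|Π) = g(B, A|Π).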
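The greedy hill-climbing local search described above can be sketched as follows — a toy implementation, not from the slides, under an assumed list-of-dicts data representation, using the decomposable per-node BIC score; edge reversal is omitted for brevity:

```python
import math
from collections import Counter

def family_bic(data, child, parents):
    """BIC contribution of one node given its parent set (decomposable score)."""
    N = len(data)
    key = lambda row: tuple(row[p] for p in sorted(parents))
    n_xp = Counter((row[child], key(row)) for row in data)
    n_p = Counter(key(row) for row in data)
    ll = sum(n * math.log(n / n_p[vp]) for (vx, vp), n in n_xp.items())
    card = lambda v: len({row[v] for row in data})
    k = card(child) - 1                  # independent parameters per parent config
    for p in parents:
        k *= card(p)
    return ll - 0.5 * k * math.log(N)

def creates_cycle(parents, child, new_parent):
    """Would adding the edge new_parent -> child create a directed cycle?"""
    stack, seen = [new_parent], set()
    while stack:
        v = stack.pop()
        if v == child:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def hill_climb(data, variables):
    """Greedy local search over DAGs: repeatedly take the single edge
    addition or deletion that most improves the total BIC score."""
    parents = {v: set() for v in variables}
    while True:
        best_delta, best_move = 0.0, None
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                old = family_bic(data, y, parents[y])
                if x in parents[y]:                      # try deleting x -> y
                    delta = family_bic(data, y, parents[y] - {x}) - old
                    move = ('del', x, y)
                elif not creates_cycle(parents, y, x):   # try adding x -> y
                    delta = family_bic(data, y, parents[y] | {x}) - old
                    move = ('add', x, y)
                else:
                    continue
                if delta > best_delta:
                    best_delta, best_move = delta, move
        if best_move is None:                            # local optimum reached
            return parents
        op, x, y = best_move
        (parents[y].discard if op == 'del' else parents[y].add)(x)
```

Because the score is decomposable, each candidate move only requires rescoring the one affected family — this is exactly why the score difference g(A, B|Π) is independent of the remaining variables.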