Learning Bayesian Networks Harald Steck MIT AI Lab harald ai mit edu December 3 2002 Outline Bayesian networks BN Learning discrete Bayesian Networks Scoring Functions Search Strategies Applications How to obtain Bayesian Networks construct them manually experts knowledge needed learn them from data combine prior knowledge and data Properties of Bayesian Networks qualitative graph structure visualizes relevant in dependencies among random variables in a domain interpretation in causal manner requires add l assumptions quantitative make predictions inference Example Visit to Asia Bayesian Networks more formal network structure directed acyclic graph DAG B A C D directed edge asymmetric relations but not necessarily causal missing edges represent conditional independences d separation criterion parameters conditional probabilities p A B C D p D C B X A p C B A X p B A p A BN describes probability distribution over n variables in a modular way p X n Y i 1 p Xi i How to model conditional probability distributions discrete variables tables continuous variables multivariate Gaussian linear regression A B 2 p A A N A A 2 p B A B N B A B A B nonlinear relations nonlinear regression 2 p B A B N B A B 1 A A B 2 A2 B other models neural networks noise both discrete and continuous variables Markov Equivalence applies to discrete BNs and continuous Gaussian BNs Example A A B B A A B B v structure C C C C m0 m1 m2 m3 A and B dependent marginally A and B conditionally independent given C A and B independent A and B dependent given C p A B C p A C p C B p B p A C p B C p C p C A p B C p A z m0 z m1 z m2 DAGs can be partitioned into equivalence classes that represent the same conditional independences Markov Equivalence cont d Two DAGs are Markov equivalent iff they have the same edges when ignoring their orientations and the same v structures Example 2 Markov equivalent DAGs A C B no E D F A B G v structure C E D F G Outline Bayesian networks Learning discrete Bayesian Networks Scoring Functions Search Strategies Applications in the following assumed no hidden variables no missing data discrete BNs blue discrete Scoring Functions Maximum Likelihood Nxi i xi i N i Nxi i l m log L m Nxi i log N i i xi i X X not useful for model selection over fitting Scoring Functions cont d BIC Bayesian Information Criterion aka Jeffreys Schwarz Criterion 1 fBIC m l m m log N 2 trade off between goodness of fit and model complexity BIC coincides with MDL Minimum Description Length AIC Akaike Information Criterion fAIC m l m m where the number of independent parameters is m X i Xi 1 i z Q X X i Score Difference b a m b a m compare two graphs m and m that differ in one edge only Example BIC for discrete variables f m m f m f m X Na b N 1 Na b log dDF log N Na Nb 2 a b g A B independent of remaining variables where dDF are the degrees of freedom dDF m m A 1 B 1 Q z X X Score Difference cont d Conditional Independences which are represented by BNs g A B 0 absence of egde A B favored given A independent of B given g A B 0 presence of egde A B favored given A dependent on B given Markov equivalence data cannot help distinguish among Markov equivalent DAGs a local property of equivalent DAGs an edge A B can be reversed if A B B A for BIC g A B g B A hence BIC assigns the same score to equivalent DAGs Outline Bayesian networks Learning discrete Bayesian Networks Scoring Functions Search Strategies Applications Search Strategies in discrete or continuous Gaussian BNs data can help distinguish only among equivalence classes search in space of equivalence classes is thus most appropriate but very involved search in space of DAGs n n number of DAGs with n variables 2 2 DAGs 3 2 finding optimal DAG w r t a scoring function f is NP hard resort to approximate search strategies Local Search general purpose search strategy choose a starting graph proceed through search space along a sequence of neighboring DAGs guided by scoring function DAGs differing in a single edge may be defined as neighbors hence 3 possible transitions in local search add an edge if permissible delete an edge reverse the orientation of an edge if permissible score difference due to transition f mnew f mold Local Search and Greedy Hill Climbing choose transition that maximizes repeat until 0 for all permissible steps result graph that is a local optimum Example SEX SEX SEX SES SES PE SES PE IQ PE IQ CP delete SES PE IQ CP SEX IQ CP reverse CP add Local Search and Simulated Annealing general purpose optimization procedure to avoid local optima inspired by cooling down an ensemble of particles statistical physics temperature of system T procedure start with high temperature and lower it slowly over time randomly choose a transition make transition with probability p min 1 exp T theory finds global minimum of f with probability 1 if starting temperature is sufficiently high and is lowered sufficiently slowly practice limited computation time may only find a local optimum Outline Bayesian networks Learning discrete Bayesian Networks Scoring Functions Search Strategies Applications Analysis of Questionnaires find relevant conditional dependencies e g Wisconsin High School Students Sewell and Shah 1968 survey among 10 318 students learn BN from that data SEX SES PE IQ CP SES socioeconomic status SEX gender of student PE parental encouragement CP college plans IQ intelligence quotient Analysis of Noisy Measurements e g gene expression data from bio tech labs graph recovery of regulatory networks Hartemink et al 2002 prediction what is the most informative next experiment to be conducted active learning
View Full Document
Unlocking...