Structure Learning in BNs 2: (the good,) the bad, the ugly

Contents:
- Maximum likelihood score overfits!
- Bayesian score
- Bayesian score and model complexity
- Bayesian, a decomposable score
- BIC approximation of Bayesian score
- BIC approximation, a decomposable score
- Consistency of BIC and Bayesian scores
- Priors for general graphs
- BDe prior
- Announcements
- Score equivalence
- Chow-Liu for Bayesian score
- Structure learning for general graphs
- Understanding score decomposition
- Fixed variable order 1
- Fixed variable order 2
- Learn BN structure using local search
- Exploit score decomposition in local search
- Some experiments
- Order search versus graph search
- Bayesian model averaging
- What you need to know about learning BN structures
- Inference in graphical models: Typical queries 1
- Inference in graphical models: Typical queries 2 – Maximization
- Are MPE and MAP Consistent?
- Complexity of conditional probability queries 1
- Complexity of conditional probability queries 2
- Inference is #P-hard, hopeless?
- What about the maximization problems? First, most probable explanation (MPE)
- What about maximum a posteriori?
- Can we exploit structure for maximization?
- Exact inference is hard, what about approximate inference?
- Hardness of approximate inference
- What you need to know about inference

Graphical Models – 10708
Carlos Guestrin, Carnegie Mellon University
October 4th, 2006
Readings: K&F: 15.1, 15.2, 15.3, 15.4, 15.5
10-708 – Carlos Guestrin 2006

Maximum likelihood score overfits!
- Information never hurts: adding a parent never lowers the likelihood, so the maximum likelihood score (weakly) increases with every added parent. Left unchecked, it favors the densest structures.

Bayesian score
- Prior distributions: over structures, P(G), and over the parameters of each structure, P(θ_G | G).
- Posterior over structures given data: P(G | D) ∝ P(D | G) P(G), where the marginal likelihood integrates out the parameters: P(D | G) = ∫ P(D | θ_G, G) P(θ_G | G) dθ_G.

Bayesian score and model complexity
- True model: X → Y, with P(Y=t | X=t) = 0.5 + α and P(Y=t | X=f) = 0.5 − α.
- Structure 1: X and Y independent. Its score doesn't depend on α.
- Structure 2: X → Y. The data points are split between estimating P(Y=t | X=t) and P(Y=t | X=f).
- For fixed M, the extra edge is only worth it for large α, because the posterior over each conditional's parameters is more diffuse when it sees less data.

Bayesian, a decomposable score
- As in the last lecture, assume local and global parameter independence.
- Also assume the prior satisfies parameter modularity: if Xi has the same parents in G and G', then its parameters have the same prior.
- Finally, assume the structure prior P(G) satisfies structure modularity: it is a product of terms over families, e.g., P(G) ∝ c^|G|.
- Then the Bayesian score decomposes along families!

BIC approximation of Bayesian score
- The Bayesian score involves difficult integrals.
- For a Dirichlet prior, we can use the simple Bayesian information criterion (BIC) approximation.
- In the limit, we can forget the prior!
- Theorem: for a Dirichlet prior and a BN with Dim(G) independent parameters, as M → ∞:
  log P(D | G) = log P(D | θ̂_G, G) − (log M)/2 · Dim(G) + O(1)

BIC approximation, a decomposable score
- BIC: score_BIC(G : D) = log P(D | θ̂_G, G) − (log M)/2 · Dim(G)
- Using the information-theoretic formulation of the log-likelihood:
  score_BIC(G : D) = M Σ_i Î(Xi ; Pa_Xi) − M Σ_i Ĥ(Xi) − (log M)/2 · Dim(G)
  so the score is a sum of per-family terms (the entropy terms don't depend on the structure).

Consistency of BIC and Bayesian scores
- A scoring function is consistent if, for the true model G*, as M → ∞, with probability 1: G* maximizes the score, and every structure not I-equivalent to G* has a strictly lower score.
- Theorem: the BIC score is consistent.
- Corollary: the Bayesian score is consistent.
- What about the maximum likelihood score? (It is not: denser I-maps of G* score at least as well.)
- Consistency is a limiting property; it says nothing about finite sample sizes!

Priors for general graphs
- For finite datasets, the prior is important!
- Use a prior over structures satisfying structure modularity.
- What about the prior over parameters; how do we represent it?
- K2 prior: fix an α and set P(θ_{Xi | Pa_Xi}) = Dirichlet(α, …, α).
- K2 is "inconsistent": it can assign different scores to I-equivalent structures.

BDe prior
- Remember that Dirichlet hyperparameters are analogous to "fictitious samples".
- Pick a fictitious (equivalent) sample size m'.
- For each possible family, define a prior distribution P(Xi, Pa_Xi); represent it with a BN, usually a product of marginals.
- BDe prior: α_{xi, pa_Xi} = m' · P(xi, pa_Xi).
- Has the "consistency property": I-equivalent structures receive the same score.
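Since the BIC score decomposes along families, it can be computed directly from counts, one family at a time. Below is a minimal Python sketch (not from the lecture; the function names `family_bic`/`bic_score` and the list-of-dicts data layout are illustrative assumptions):

```python
import math
from collections import Counter

def family_bic(data, child, parents, arity):
    """BIC contribution of one family (child, parents): maximized
    log-likelihood of the family minus (log M)/2 times its number of
    independent parameters. data is a list of dicts, arity maps each
    variable to the number of values it can take."""
    M = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    par = Counter(tuple(row[p] for p in parents) for row in data)
    # MLE log-likelihood: sum over counts N(pa, x) * log(N(pa, x) / N(pa))
    ll = sum(n * math.log(n / par[pa]) for (pa, _), n in joint.items())
    # independent parameters: (|Val(child)| - 1) * prod of parent arities
    dim = (arity[child] - 1) * math.prod(arity[p] for p in parents)
    return ll - 0.5 * math.log(M) * dim

def bic_score(data, structure, arity):
    """Decomposable BIC: sum of family scores.
    structure maps each variable to a tuple of its parents."""
    return sum(family_bic(data, x, pa, arity) for x, pa in structure.items())
```

Because the score is a sum over families, a search procedure only needs to re-evaluate the families whose parent sets changed.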
Announcements
- Project description is out on the class website:
  - Individual or groups of two only.
  - Suggested projects are on the class website, or do something related to your research (preferable). Must be something you started this semester.
  - The semester goes really quickly, so be realistic (and ambitious).
- Project deliverables:
  - One-page proposal due next week (10/11).
  - 5-page milestone report due Nov. 1st.
  - Poster presentation on Dec. 1st, 3-6pm.
  - Write-up, 8 pages, due Dec. 8th.
  - All write-ups in NIPS format (see class website); page limits are strict.
- Objective: explore and apply concepts in probabilistic graphical models while doing a fun project!

Score equivalence
- If G and G' are I-equivalent, then they have the same score.
- Theorem 1: the maximum likelihood score and the BIC score satisfy score equivalence.
- Theorem 2: if P(G) assigns the same prior to I-equivalent structures (e.g., by edge counting) and the parameter prior is Dirichlet, then the Bayesian score satisfies score equivalence if and only if the prior over parameters can be represented as a BDe prior!

Chow-Liu for Bayesian score
- Edge weight w_{Xj → Xi} is the advantage of adding Xj as a parent of Xi.
- This gives a directed graph, so we need a directed spanning forest.
- Note that adding an edge can hurt the Bayesian score: choose a forest, not a tree.
- But if the score satisfies score equivalence, then w_{Xj → Xi} = w_{Xi → Xj}!
- So a simple (undirected) maximum spanning forest algorithm works.

Structure learning for general graphs
- In a tree, a node has at most one parent.
- Theorem: the problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d ≥ 2.
- Most structure learning approaches use heuristics that exploit score decomposition.
- Next: (quickly) describe two heuristics that exploit decomposition in different ways.

Understanding score decomposition
- (Figure: the running example network over Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy.)

Fixed variable order 1
- Pick a variable order ≺, e.g., X1, …, Xn.
- Xi can only pick parents from {X1, …, Xi−1}: any subset, and acyclicity is guaranteed!
- Total score = sum of the score of each node.

Fixed variable order 2
- Fix the maximum number of parents to k.
- For each i in order ≺: pick Pa_Xi ⊆ {X1, …, Xi−1} by exhaustively searching through all candidate parent subsets of size at most k.
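The fixed-order procedure can be sketched compactly: because the total score is a sum of per-family scores and parents are restricted to predecessors in the order, each variable's parent set can be optimized independently. The sketch below assumes `score(child, parents)` is some decomposable family score supplied by the caller (e.g., a family BIC or BDe score); the interface is illustrative, not from the lecture:

```python
from itertools import combinations

def learn_fixed_order(order, k, score):
    """Given a variable order, choose for each variable the best-scoring
    parent set among its predecessors, with at most k parents. Because
    the score decomposes over families, each choice is independent, and
    the resulting graph is acyclic by construction."""
    structure = {}
    for i, x in enumerate(order):
        candidates = order[:i]
        # exhaustively score every predecessor subset of size <= k
        structure[x] = max(
            (subset
             for size in range(min(k, len(candidates)) + 1)
             for subset in combinations(candidates, size)),
            key=lambda pa: score(x, pa),
        )
    return structure
```

The cost per variable is O(n^k) family evaluations, which is why the max-parents bound k matters.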
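When the score satisfies score equivalence, the Chow-Liu edge weights are symmetric, so the directed problem reduces to an undirected maximum spanning forest; the forest can afterwards be oriented by picking an arbitrary root in each component. A Kruskal-style sketch (illustrative, not the lecture's code); edges with non-positive weight are skipped, which is why the result is a forest rather than a tree:

```python
def max_spanning_forest(n, weights):
    """Kruskal-style maximum spanning forest over nodes 0..n-1.
    weights maps an undirected edge (i, j) to the score gain of
    connecting i and j (e.g., a family-score difference).
    Returns the list of chosen edges; orientation is left to the caller."""
    parent = list(range(n))

    def find(u):  # union-find with path halving
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    edges = []
    for (i, j), w in sorted(weights.items(), key=lambda e: -e[1]):
        if w <= 0:
            continue  # this edge would lower the Bayesian score: drop it
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
    return edges
```

Orienting the returned edges away from a chosen root in each connected component then yields the directed tree-structured BN.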