Structure Learning (The Good), The Bad, The Ugly
(…a little inference too)
Graphical Models – 10-708
Carlos Guestrin, Carnegie Mellon University
October 8th, 2008
Readings: K&F: 17.3, 17.4, 17.5.1, 8.1, 12.1

Decomposable score
- The score is the log data likelihood.
- A decomposable score decomposes over the families of the BN (each node and its parents):
    Score(G : D) = Σ_i FamScore(X_i | Pa_{X_i} : D)
- This will lead to significant computational efficiency!

Chow-Liu tree learning algorithm (step 1)
- For each pair of variables X_i, X_j:
  - Compute the empirical distribution: P̂(x_i, x_j) = Count(x_i, x_j) / m
  - Compute the mutual information: Î(X_i, X_j) = Σ_{x_i, x_j} P̂(x_i, x_j) log [ P̂(x_i, x_j) / (P̂(x_i) P̂(x_j)) ]
- Define a graph with nodes X_1, …, X_n, where edge (i, j) gets weight Î(X_i, X_j).

Maximum likelihood score overfits!
- "Information never hurts": adding a parent always increases the score, so the maximum likelihood score always prefers denser structures.

Bayesian score
- Prior distributions: over structures, and over the parameters of each structure.
- Posterior over structures given data: P(G | D) ∝ P(D | G) P(G).

Bayesian learning for multinomials
- What if you have a k-sided coin?
- Multinomial likelihood: P(D | θ) = Π_i θ_i^{m_i}
- The conjugate prior for the multinomial is the Dirichlet: P(θ) ∝ Π_i θ_i^{α_i − 1}
- Observe m data points, m_i from assignment i; the posterior is Dirichlet(α_1 + m_1, …, α_k + m_k).
- Prediction: P(x = i | D) = (α_i + m_i) / (Σ_j α_j + m)

Global parameter independence, d-separation and local prediction
- Running example BN: Flu and Allergy are parents of Sinus; Sinus is a parent of Headache and Nose.
- Read the independencies off the meta-BN (the BN over parameters and data).
- Proposition: for fully observable data D, if the prior satisfies global parameter independence, then the posterior factorizes: P(θ | D) = Π_i P(θ_{X_i | Pa_{X_i}} | D).

Priors for BN CPTs (more when we talk about structure learning)
- Consider each CPT P(X | U = u) separately.
- Conjugate prior: Dirichlet(α_{X=1|U=u}, …, α_{X=k|U=u}).
- More intuitive view: a "prior data set" D′ with equivalent sample size m′, giving the "prior counts".
- Prediction: P(X = x | U = u, D) = (α_{x|u} + Count(x, u)) / (Σ_{x′} α_{x′|u} + Count(u))

An example
- (Worked example; shown as a figure in the original slides.)

What you need to know about parameter learning
- Bayesian parameter learning: motivation for the Bayesian approach.
- Bayesian prediction: conjugate priors, equivalent sample size.
- Bayesian learning ⇒ smoothing.
- Bayesian learning for BN parameters: global parameter independence, decomposition of prediction according to CPTs, and decomposition within a CPT.

Bayesian score and model complexity
- True model: X → Y, with P(Y=t | X=t) = 0.5 + α and P(Y=t | X=f) = 0.5 − α.
- Structure 1: X and Y independent. The score doesn't depend on α.
- Structure 2: X → Y. The data points are split between estimating P(Y=t | X=t) and P(Y=t | X=f).
- For a fixed number of data points M, the extra edge is only worth it for large α, because the posterior over each parameter is more diffuse when estimated from less data.

Bayesian score is decomposable
- As in the last lecture, assume parameter independence.
- Also assume the prior satisfies parameter modularity: if X_i has the same parents in G and G′, then its parameters have the same prior.
- Finally, assume the structure prior P(G) satisfies structure modularity: it is a product of terms over families, e.g., P(G) ∝ c^|G|.
- Then the Bayesian score decomposes along families!
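To make the decomposition concrete, here is a minimal Python sketch of the Bayesian score under a BDeu-style uniform Dirichlet prior (one common instance of the Dirichlet priors above), assuming complete discrete data. The function names, the m_prime parameter, and the toy data set are illustrative, not from the lecture.

```python
# Minimal sketch: decomposable Bayesian score with a BDeu-style prior.
# Assumes complete data: each row is a tuple of integer values.
from math import lgamma
from collections import Counter
from itertools import product

def bdeu_family_score(data, i, parents, card, m_prime=1.0):
    """FamScore(X_i | Pa_i : D): log marginal likelihood of X_i's CPT."""
    r = card[i]                              # number of states of X_i
    q = 1
    for p in parents:
        q *= card[p]                         # number of parent configurations
    a_xu = m_prime / (r * q)                 # per-cell Dirichlet pseudo-count
    a_u = m_prime / q                        # pseudo-count per parent config
    n_xu, n_u = Counter(), Counter()
    for row in data:                         # count N(x_i, u) and N(u)
        u = tuple(row[p] for p in parents)
        n_u[u] += 1
        n_xu[(row[i], u)] += 1
    score = 0.0
    for u in product(*(range(card[p]) for p in parents)):
        score += lgamma(a_u) - lgamma(a_u + n_u[u])
        for x in range(r):
            score += lgamma(a_xu + n_xu[(x, u)]) - lgamma(a_xu)
    return score

def bayesian_score(data, structure, card, m_prime=1.0):
    """Score(G : D) = sum_i FamScore(X_i | Pa_i : D).
    A structure-modular log P(G) term could be added per family."""
    return sum(bdeu_family_score(data, i, pa, card, m_prime)
               for i, pa in structure.items())

# Toy usage: two binary variables; compare X0 -> X1 against independence.
data = [(0, 0), (0, 0), (1, 1), (1, 1), (1, 0)]
card = {0: 2, 1: 2}
print(bayesian_score(data, {0: (), 1: (0,)}, card))  # with edge X0 -> X1
print(bayesian_score(data, {0: (), 1: ()}, card))    # independent
```

Because the score decomposes over families, comparing the two structures only requires recomputing X_1's family score; X_0's term is shared.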
BIC approximation of the Bayesian score
- The Bayesian score involves difficult integrals.
- For a Dirichlet prior, we can use the simple Bayesian information criterion (BIC) approximation.
- In the limit, we can forget the prior!
- Theorem: for a Dirichlet prior, and a BN with Dim(G) independent parameters, as m → ∞:
    log P(D | G) = ℓ(θ̂_G : D) − (log m / 2) Dim(G) + O(1)

BIC approximation is a decomposable score
- BIC: Score_BIC(G : D) = ℓ(θ̂_G : D) − (log m / 2) Dim(G)
- Information-theoretic formulation:
    Score_BIC(G : D) = m Σ_i Î(X_i; Pa_{X_i}) − m Σ_i Ĥ(X_i) − (log m / 2) Dim(G)

Consistency of BIC and Bayesian scores
- A scoring function is consistent if, for the true model G*, as m → ∞, with probability 1: G* maximizes the score, and all structures not I-equivalent to G* have strictly lower score.
- Theorem: the BIC score is consistent.
- Corollary: the Bayesian score is consistent.
- What about the maximum likelihood score? (It is not: with no penalty on extra edges, denser structures always score at least as well as G*.)
- Consistency is limiting behavior; it says nothing about finite sample sizes!

Priors for general graphs
- For finite datasets, the prior is important!
- Use a prior over structures satisfying structure modularity.
- What about the prior over parameters: how do we represent it?
- K2 prior: fix an α and set P(θ_{X_i | Pa_{X_i}}) = Dirichlet(α, …, α).
- K2 is "inconsistent": it can assign different scores to I-equivalent structures.

BDe prior
- Remember that Dirichlet parameters are analogous to "fictitious samples".
- Pick a fictitious sample size m′.
- For each possible family, define a prior distribution P(X_i, Pa_{X_i}); represent it with a BN, usually with independent variables (a product of marginals).
- BDe prior: α_{x_i | pa_i} = m′ · P(x_i, pa_i).
- Has the "consistency property" (score equivalence): I-equivalent structures get the same score.

Score equivalence
- If G and G′ are I-equivalent, then they should have the same score.
- Theorem 1: the maximum likelihood score and the BIC score satisfy score equivalence.
- Theorem 2: if P(G) assigns the same prior to I-equivalent structures (e.g., by edge counting) and the parameter prior is Dirichlet, then the Bayesian score satisfies score equivalence if and only if the prior over parameters can be represented as a BDe prior!

Chow-Liu for the Bayesian score
- Edge weight w_{X_j → X_i} is the advantage of adding X_j as a parent of X_i:
    w_{X_j → X_i} = FamScore(X_i | X_j : D) − FamScore(X_i : D)
- We now have a directed graph and need a directed spanning forest.
- Note that adding an edge can hurt the Bayesian score, so choose a forest, not a tree.
- A maximum spanning forest algorithm works (see the sketch after the next slide).

Structure learning for general graphs
- In a tree, a node has at most one parent.
- Theorem: the problem of learning a BN structure with at most d parents per node is NP-hard for any fixed d ≥ 2.
- Most structure learning approaches therefore use heuristics.
- These exploit score decomposition.
- Next: (quickly) describe two heuristics that exploit decomposition in different ways.
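Here is a minimal sketch of the Chow-Liu construction from earlier, covering the forest variant needed for the Bayesian score: when edge weights can be negative, we simply stop adding edges once they stop helping. The helper names and the toy data are illustrative, not lecture code; it assumes discrete data given as rows of integers.

```python
# Minimal sketch: Chow-Liu tree/forest learning via Kruskal's algorithm.
from math import log
from collections import Counter

def mutual_information(data, i, j):
    """Empirical mutual information I(X_i, X_j) in nats."""
    m = len(data)
    cij = Counter((row[i], row[j]) for row in data)
    ci = Counter(row[i] for row in data)
    cj = Counter(row[j] for row in data)
    mi = 0.0
    for (xi, xj), n in cij.items():
        mi += (n / m) * log(n * m / (ci[xi] * cj[xj]))
    return mi

def chow_liu_forest(data, n_vars, weight=None):
    """Maximum spanning forest on pairwise weights; keeps positive edges only."""
    weight = weight or (lambda i, j: mutual_information(data, i, j))
    edges = sorted(((weight(i, j), i, j)
                    for i in range(n_vars) for j in range(i + 1, n_vars)),
                   reverse=True)
    parent = list(range(n_vars))          # union-find for cycle detection
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    forest = []
    for w, i, j in edges:
        if w <= 0:                        # an edge can hurt the score: stop
            break
        ri, rj = find(i), find(j)
        if ri != rj:                      # no cycle: accept the edge
            parent[ri] = rj
            forest.append((i, j))
    return forest                         # orient away from any root per component

# Toy usage: three binary variables where X2 copies X0.
data = [(0, 1, 0), (1, 0, 1), (1, 1, 1), (0, 0, 0), (1, 0, 1)]
print(chow_liu_forest(data, 3))
```

With mutual-information weights every edge weight is nonnegative and the result is a spanning tree, recovering the maximum likelihood Chow-Liu algorithm. To use the Bayesian score instead, pass a weight function computing the family-score advantage w(i, j) = FamScore(X_i | X_j : D) − FamScore(X_i : D); for a score-equivalent prior such as BDe that weight is symmetric, so the undirected construction above still applies.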
Announcements
- Recitation tomorrow. Don't miss it!