Bayesian Parameter Learning / Bayesian Structure Learning
Graphical Models – 10-708
Carlos Guestrin, Carnegie Mellon University
October 6th, 2008
Readings: K&F: 16.3, 16.4, 17.3

Slide 2: Decomposable score
- Log data likelihood.
- Decomposable score: decomposes over families in the BN (a node and its parents). This will lead to significant computational efficiency!
- $\mathrm{Score}(G : D) = \sum_i \mathrm{FamScore}(X_i \mid \mathrm{Pa}_{X_i} : D)$

Slide 3: Chow-Liu tree learning algorithm 1
- For each pair of variables $X_i, X_j$:
  - Compute the empirical distribution: $\hat{P}(x_i, x_j) = \mathrm{Count}(x_i, x_j) / m$
  - Compute the mutual information: $\hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{P}(x_i, x_j) \log \frac{\hat{P}(x_i, x_j)}{\hat{P}(x_i)\, \hat{P}(x_j)}$
- Define a graph with nodes $X_1, \ldots, X_n$, where edge $(i, j)$ gets weight $\hat{I}(X_i, X_j)$.

Slide 4: Chow-Liu tree learning algorithm 2
- Optimal tree BN: compute the maximum-weight spanning tree.
- Directions in the BN: pick any node as root; breadth-first search defines the edge directions.

Slide 5: Can we extend Chow-Liu? 1
- Tree-augmented naïve Bayes (TAN) [Friedman et al. '97].
- The naïve Bayes model overcounts, because correlation between the features is not considered.
- Same as Chow-Liu, but score edges with the conditional mutual information given the class variable: $\hat{I}(X_i, X_j \mid C)$.

Slide 6: Can we extend Chow-Liu? 2
- (Approximately) learning models with tree-width up to $k$ [Chechetka & Guestrin '07].
- But: $O(n^{2k+6})$.

Slide 7: What you need to know about learning BN structures so far
- Decomposable scores:
  - Maximum likelihood
  - Information-theoretic interpretation
- Best tree (Chow-Liu)
- Best TAN
- Nearly best $k$-treewidth (in $O(n^{2k+6})$)

Slide 8: Maximum likelihood score overfits!
- Information never hurts: adding a parent always increases the score!

Slide 9: Bayesian score
- Prior distributions:
  - Over structures
  - Over parameters of a structure
- Posterior over structures given data: $P(G \mid D) \propto P(D \mid G)\, P(G)$

Slide 10: Can we really trust the MLE?
- What is better?
  - 3 heads, 2 tails
  - 30 heads, 20 tails
  - $3 \times 10^{23}$ heads, $2 \times 10^{23}$ tails
- Many possible answers; we need distributions over possible parameters.

Slide 11: Bayesian learning
- Use Bayes' rule: $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$
- Or equivalently: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$

Slide 12: Bayesian learning for the thumbtack
- The likelihood function is simply binomial: $P(D \mid \theta) = \theta^{m_H} (1 - \theta)^{m_T}$
- What about the prior?
  - Represents expert knowledge
  - Simple posterior form
- Conjugate priors: closed-form representation of the posterior (more details soon).
- For the binomial, the conjugate prior is the Beta distribution.

Slide 13: Beta prior distribution – $P(\theta)$
- Prior: $P(\theta) = \frac{\theta^{\alpha_H - 1} (1 - \theta)^{\alpha_T - 1}}{B(\alpha_H, \alpha_T)} \sim \mathrm{Beta}(\alpha_H, \alpha_T)$
- Likelihood function: $P(D \mid \theta) = \theta^{m_H} (1 - \theta)^{m_T}$
- Posterior: $P(\theta \mid D) \propto \theta^{\alpha_H + m_H - 1} (1 - \theta)^{\alpha_T + m_T - 1}$

Slide 14: Posterior distribution
- Prior: $\mathrm{Beta}(\alpha_H, \alpha_T)$
- Data: $m_H$ heads and $m_T$ tails
- Posterior distribution: $P(\theta \mid D) \sim \mathrm{Beta}(\alpha_H + m_H, \alpha_T + m_T)$

Slide 15: Conjugate prior
- Given a likelihood function $P(D \mid \theta)$, a (parametric) prior of the form $P(\theta \mid \alpha)$ is conjugate to the likelihood function if the posterior is of the same parametric family, and can be written as $P(\theta \mid \alpha')$ for some new set of parameters $\alpha'$.
- Prior: $\mathrm{Beta}(\alpha_H, \alpha_T)$
- Data: $m_H$ heads and $m_T$ tails (binomial likelihood)
- Posterior distribution: $\mathrm{Beta}(\alpha_H + m_H, \alpha_T + m_T)$

Slide 16: Using the Bayesian posterior
- Posterior distribution: $P(\theta \mid D)$
- Bayesian inference: no longer a single parameter; predictions average over the posterior: $P(x \mid D) = \int_0^1 P(x \mid \theta)\, P(\theta \mid D)\, d\theta$
- The integral is often hard to compute.

Slide 17: Bayesian prediction of a new coin flip
- Prior: $\mathrm{Beta}(\alpha_H, \alpha_T)$. Having observed $m_H$ heads and $m_T$ tails, what is the probability that flip $m+1$ is heads?
- $P(x_{m+1} = H \mid D) = \int_0^1 \theta\, P(\theta \mid D)\, d\theta = \frac{\alpha_H + m_H}{\alpha_H + \alpha_T + m_H + m_T}$ (a small numerical sketch of this formula follows below)
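To make the prediction formula concrete, here is a minimal Python sketch comparing the MLE with the Bayesian posterior-predictive estimate on the counts from slide 10. The function names and the default Beta(1, 1) prior are illustrative assumptions, not from the slides.

```python
# Minimal sketch: MLE vs. Bayesian posterior-predictive estimate for a
# binary coin/thumbtack. The Beta(1, 1) default prior is an assumption
# made for illustration; the slides keep alpha_H, alpha_T generic.

def mle(m_h, m_t):
    """Maximum likelihood estimate of P(heads)."""
    return m_h / (m_h + m_t)

def bayes_predict(m_h, m_t, alpha_h=1.0, alpha_t=1.0):
    """P(x_{m+1} = heads | D) under a Beta(alpha_h, alpha_t) prior.

    The posterior is Beta(alpha_h + m_h, alpha_t + m_t); the predictive
    integral has the closed form below (the posterior mean).
    """
    return (alpha_h + m_h) / (alpha_h + alpha_t + m_h + m_t)

if __name__ == "__main__":
    for m_h, m_t in [(3, 2), (30, 20), (3e23, 2e23)]:
        print(f"{m_h:g}H / {m_t:g}T   MLE = {mle(m_h, m_t):.4f}   "
              f"Bayes = {bayes_predict(m_h, m_t):.4f}")
```

On 3H/2T the two estimates differ noticeably (0.600 vs. 0.571); by $3 \times 10^{23}$ flips they coincide, which is exactly the asymptotic behavior discussed on the next slide.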
Slide 18: Asymptotic behavior and equivalent sample size
- The Beta prior is equivalent to extra thumbtack flips: $P(x_{m+1} = H \mid D) = \frac{\alpha_H + m_H}{\alpha_H + \alpha_T + m}$
- As $m \to \infty$, the prior is "forgotten".
- But for small sample sizes, the prior is important!
- Equivalent sample size: the prior can be parameterized by $\alpha_H, \alpha_T$, or by an equivalent sample size $m' = \alpha_H + \alpha_T$ together with the prior mean $\tilde{\alpha} = \alpha_H / m'$.
- Two knobs: fix $m'$ and change $\tilde{\alpha}$, or fix $\tilde{\alpha}$ and change $m'$.

Slide 19: Bayesian learning corresponds to smoothing
- $m = 0 \Rightarrow$ prediction given by the prior parameters; $m \to \infty \Rightarrow$ prediction converges to the MLE.

Slide 20: Bayesian learning for the multinomial
- What if you have a $k$-sided coin?
- The likelihood function is multinomial: $P(D \mid \theta) = \prod_i \theta_i^{m_i}$
- The conjugate prior for the multinomial is the Dirichlet: $P(\theta) \propto \prod_i \theta_i^{\alpha_i - 1}$
- Observe $m$ data points, $m_i$ from assignment $i$; posterior: $\mathrm{Dirichlet}(\alpha_1 + m_1, \ldots, \alpha_k + m_k)$
- Prediction: $P(x_{m+1} = i \mid D) = \frac{\alpha_i + m_i}{\sum_j (\alpha_j + m_j)}$

Slide 21: Bayesian learning for a two-node BN
- Parameters $\theta_X$, $\theta_{Y \mid X}$
- Priors: $P(\theta_X)$ and $P(\theta_{Y \mid X})$

Slide 22: Very important assumption on the prior: global parameter independence
- Global parameter independence: the prior over parameters is a product of priors over the CPTs, $P(\theta_G) = \prod_i P(\theta_{X_i \mid \mathrm{Pa}_{X_i}})$

Slide 23: Global parameter independence, d-separation and local prediction
- (Figure: the meta-BN over the Flu, Allergy, Sinus, Headache, Nose network, with a parameter node attached to each CPT.)
- Independencies in the meta-BN.
- Proposition: for fully observable data $D$, if the prior satisfies global parameter independence, then $P(\theta_G \mid D) = \prod_i P(\theta_{X_i \mid \mathrm{Pa}_{X_i}} \mid D)$

Slide 24: Within a CPT
- Meta-BN including the CPT parameters.
- Are $\theta_{Y \mid X=t}$ and $\theta_{Y \mid X=f}$ d-separated given $D$? Are they independent given $D$?
- They are not d-separated (each observed $Y$ is a common child), yet with fully observed data each data case depends on only one of them: context-specific independence!
- The posterior decomposes: $P(\theta_{Y \mid X} \mid D) = P(\theta_{Y \mid X=t} \mid D)\, P(\theta_{Y \mid X=f} \mid D)$

Slide 25: Priors for BN CPTs (more when we talk about structure learning)
- Consider each CPT: $P(X \mid \mathbf{U} = \mathbf{u})$
- Conjugate prior: $\mathrm{Dirichlet}(\alpha_{X=1 \mid \mathbf{u}}, \ldots, \alpha_{X=k \mid \mathbf{u}})$
- More intuitive: a "prior data set" $D'$ with equivalent sample size $m'$; "prior counts" $\alpha_{x \mid \mathbf{u}} = m'\, \hat{P}'(x, \mathbf{u})$
- Prediction: $P(x \mid \mathbf{u}, D) = \frac{\alpha_{x \mid \mathbf{u}} + \mathrm{Count}_D(x, \mathbf{u})}{\sum_{x'} \left( \alpha_{x' \mid \mathbf{u}} + \mathrm{Count}_D(x', \mathbf{u}) \right)}$ (a code sketch of this update follows the summary slide below)

Slide 26: An example
- (Worked example shown on the slide.)

Slide 27: What you need to know about parameter learning
- Bayesian parameter learning:
  - Motivation for the Bayesian approach
  - Bayesian prediction
  - Conjugate priors, equivalent sample size
  - Bayesian learning $\Rightarrow$ smoothing
- Bayesian learning for BN parameters:
  - Global parameter independence
  - Decomposition of prediction according to CPTs
  - Decomposition within a CPT
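To tie the parameter-learning pieces together, here is a minimal Python sketch of Bayesian CPT learning for the two-node network $X \to Y$ from slide 21, with a uniform BDe-style Dirichlet prior. The toy data set, the variable encoding, and the equivalent sample size m' = 2 are illustrative assumptions; the slides keep these quantities generic.

```python
from collections import Counter
from itertools import product

# Minimal sketch: Bayesian CPT learning for the binary BN X -> Y with a
# uniform BDe-style prior. Data set and m_prime are illustrative choices.

def learn_cpts(data, m_prime=2.0):
    """Posterior-predictive CPT estimates for X -> Y.

    Under global parameter independence the posterior decomposes, so each
    CPT, and each parent context within a CPT, is a separate Dirichlet
    update over its own counts:
        P(y | x, D) = (alpha_{y|x} + N(x, y)) / sum_y' (alpha_{y'|x} + N(x, y')).
    With a uniform prior distribution P' over the 4 joint assignments,
    alpha_{y|x} = m_prime * P'(x, y) = m_prime / 4, and for P(X) the
    hyperparameters are alpha_x = m_prime / 2.
    """
    n = len(data)
    cx = Counter(x for x, _ in data)   # counts N(x)
    cxy = Counter(data)                # counts N(x, y)
    # P(X): Dirichlet(m'/2, m'/2) prior, so predictive = smoothed frequency.
    p_x = {x: (m_prime / 2 + cx[x]) / (m_prime + n) for x in (0, 1)}
    # P(Y | X): one independent Dirichlet posterior per parent context x.
    p_y_given_x = {
        (x, y): (m_prime / 4 + cxy[(x, y)]) / (m_prime / 2 + cx[x])
        for x, y in product((0, 1), repeat=2)
    }
    return p_x, p_y_given_x

if __name__ == "__main__":
    data = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0)]  # toy (x, y) samples
    p_x, p_y_given_x = learn_cpts(data)
    print("P(X):", p_x)
    print("P(Y|X):", p_y_given_x)
```

Because the posterior decomposes across CPTs and across parent contexts within a CPT (slides 22-24), each conditional distribution reduces to an independent count update, which is all the function above does.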