Unifying Variational and GBP
Learning Parameters of MNs
EM for BNs

Graphical Models – 10708
Carlos Guestrin
Carnegie Mellon University
November 15th, 2006

Readings:
- K&F: 11.3, 11.5
- Yedidia et al. paper from the class website
- Chapters 9, 20 – Jordan
- K&F: 16.2

Revisiting Mean-Fields
- Choice of Q: fully factorized, Q(X) = ∏i Qi(Xi)
- Optimization problem: among distributions of this form, find the Q that minimizes D(Q‖P), or equivalently maximizes the energy functional F[P,Q]

Interpretation of energy functional
- Energy functional: F[P,Q] = EQ[log P̃(X)] + HQ(X), the expected unnormalized log probability under Q plus the entropy of Q
- Exact if P = Q: then F[P,Q] = log Z; in general log Z = F[P,Q] + D(Q‖P), so F is a lower bound on log Z
- View the problem as an approximation of the entropy term: the expectation decomposes over factors, so the hard part is HQ(X)

Entropy of a tree distribution
- Entropy term: HQ(X)
- Joint distribution of a tree-structured model: P(X) = ∏(i,j)∈T πij(Xi,Xj) / ∏i πi(Xi)^(di−1), where di is the number of neighbors of Xi
- Decomposing the entropy term: H(X) = ∑(i,j)∈T H(Xi,Xj) − ∑i (di − 1) H(Xi), i.e., edge entropies minus the over-counted node entropies (see the tree-entropy sketch after these slides)
- More generally, the entropy of any tree distribution is exact in terms of its edge and node marginals
(Figure: student network – Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy)

Loopy BP & Bethe approximation
- Energy functional: F[P,Q] = EQ[log P̃(X)] + HQ(X)
- Bethe approximation of the free energy: use the tree-entropy formula even when the graph is loopy:
  HBethe = ∑(i,j) H(Xi,Xj) − ∑i (di − 1) H(Xi)
- Theorem: if loopy BP converges, the resulting beliefs πij and πi are a stationary point (usually a local maximum) of the Bethe free energy!

GBP & Kikuchi approximation
- Exact free energy: junction tree (entropy written over clique and separator beliefs)
- Bethe free energy: edge and node beliefs only
- Kikuchi approximation: a generalized cluster graph gives a spectrum from Bethe to the exact entropy, with entropy terms weighted by counting numbers (see Yedidia et al.)
- Theorem: if GBP converges, the resulting cluster beliefs πCi are a stationary point (usually a local maximum) of the Kikuchi free energy!
(Figure: generalized cluster graph over the student network, with clusters such as {C,D}, {D,I,G}, {G,S,I}, {G,J,S,L}, {H,G,J})

What you need to know about GBP
- Spectrum between loopy BP and junction trees: more computation, but typically better answers
- If the cluster graph satisfies RIP, the equations are very simple
- General setting: slightly trickier equations, but not hard
- Relation to variational methods: GBP fixed points correspond to local optima of an approximate version of the energy functional

Announcements
- Tomorrow's recitation: Khalid on learning Markov networks

Learning Parameters of a BN
- Log likelihood decomposes: log P(D | θ) = ∑j ∑i log P(xi(j) | pai(j), θ)
- Learn each CPT independently
- Use counts: for tabular CPTs, θxi|pai = Count(xi, pai) / Count(pai) (see the counting sketch after these slides)

Log Likelihood for MN
- Log likelihood of the data: log P(D | Φ) = ∑j [ ∑c log φc(xc(j)) − log Z ] = ∑c ∑j log φc(xc(j)) − m log Z

Log Likelihood doesn't decompose for MNs
- Log likelihood: ∑c ∑j log φc(xc(j)) − m log Z
- A concave problem: can find the global optimum!
- But the term log Z couples all the potentials, so the likelihood doesn't decompose! (see the log Z sketch after these slides)
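As a quick numerical check of the tree-entropy decomposition above, here is a minimal Python sketch. The chain A - B - C, its marginals, and all numbers are made up for illustration; the only claim is that for a tree-factorized joint, the exact entropy equals the edge/node decomposition.

```python
# Minimal sketch: verify H(X) = sum_edges H(Xi, Xj) - sum_i (d_i - 1) H(Xi)
# on a made-up chain A - B - C over binary variables.
import itertools
import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Made-up, mutually consistent edge and node marginals for the chain.
p_ab = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_bc = {(0, 0): 0.35, (0, 1): 0.15, (1, 0): 0.2, (1, 1): 0.3}
p_b = {0: 0.5, 1: 0.5}

# Tree factorization of the joint: P(a,b,c) = P(a,b) P(b,c) / P(b).
joint = {(a, b, c): p_ab[(a, b)] * p_bc[(b, c)] / p_b[b]
         for a, b, c in itertools.product((0, 1), repeat=3)}

exact = entropy(joint)
# Decomposition: edge entropies minus (d_i - 1) node entropies; only B has d_B = 2.
decomposed = entropy(p_ab) + entropy(p_bc) - entropy(p_b)
print(f"exact H = {exact:.4f}, decomposed H = {decomposed:.4f}")
```

The two printed values agree, which is exactly the property the Bethe approximation borrows for loopy graphs, where it is no longer exact.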
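To make the "learn each CPT independently from counts" point concrete, here is a minimal sketch. The network fragment (a Grade node with parents Difficulty and Intelligence) and the toy dataset are hypothetical.

```python
# Minimal sketch: maximum-likelihood estimation of one tabular CPT by counting.
from collections import Counter

# Fully observed samples of (Difficulty, Intelligence, Grade), all binary; made up.
data = [(0, 1, 1), (0, 0, 0), (1, 1, 1), (1, 0, 0), (0, 1, 1), (1, 1, 0)]

# The BN log likelihood decomposes per CPT, so each CPT is estimated on its own
# from counts: theta_{g | d, i} = Count(d, i, g) / Count(d, i).
family_counts = Counter(data)
parent_counts = Counter((d, i) for d, i, _ in data)

cpt_grade = {(d, i, g): family_counts[(d, i, g)] / parent_counts[(d, i)]
             for (d, i, g) in family_counts}

for (d, i, g), prob in sorted(cpt_grade.items()):
    print(f"P(G={g} | D={d}, I={i}) = {prob:.2f}")
```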
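To see why log Z blocks the same decomposition for Markov networks, this sketch computes the MN log likelihood for a toy pairwise chain by brute force: every potential entry enters log Z, so no clique can be fit in isolation. Potentials and data are made up.

```python
# Minimal sketch: the MN log likelihood needs log Z, which sums over ALL joint
# assignments, so it does not decompose over cliques.
import itertools
import math

# Toy pairwise chain A - B - C with made-up tabular potentials (all binary).
phi_ab = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}
phi_bc = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def unnormalized(a, b, c):
    return phi_ab[(a, b)] * phi_bc[(b, c)]

# log Z couples phi_ab and phi_bc: changing one entry of either table changes
# the normalizer, and hence the likelihood contribution of every clique.
Z = sum(unnormalized(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3))

data = [(0, 0, 0), (1, 1, 1), (0, 0, 1)]
log_lik = sum(math.log(unnormalized(a, b, c)) - math.log(Z) for a, b, c in data)
print(f"log Z = {math.log(Z):.3f}, log likelihood = {log_lik:.3f}")
```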
Derivative of Log Likelihood for MNs
- Derivative with respect to the log-potential θc(xc) = log φc(xc): ∂ℓ/∂θc(xc) = m [ P̂(xc) − P(xc | θ) ], the gap between the empirical and the model clique marginal
- Setting the derivative to zero: the optimal parameters make the model clique marginals match the empirical clique marginals
- Can optimize using gradient ascent
- Let's look at a simpler solution

Iterative Proportional Fitting (IPF)
- Setting the derivative to zero: P(xc | θ) = P̂(xc) for every clique c
- Fixed point equation: φc(t+1)(xc) = φc(t)(xc) · P̂(xc) / P(t)(xc)
- Iterate and converge to the optimal parameters
- Each iteration must compute the model marginals P(t)(xc), i.e., run inference (see the IPF sketch after these slides)

Log-linear Markov network (most common representation)
- A feature is some function φ[D] of a subset of variables D, e.g., an indicator function
- Log-linear model over a Markov network H:
  - a set of features φ1[D1], …, φk[Dk]; each Di is a subset of a clique in H; two φ's can be over the same variables
  - a set of weights w1, …, wk, usually learned from data
  - P(x) = (1/Z(w)) exp( ∑i wi φi(Di) )

Learning params for log-linear models 1 – Generalized Iterative Scaling
- IPF generalizes easily if: φi(x) ≥ 0 and ∑i φi(x) = 1
- Update rule: wi(t+1) = wi(t) + log( ÊD[φi] / EP(t)[φi] )
- Must compute the expected feature values EP(t)[φi] under the current model (inference)
- If the conditions are violated, the equations are not so simple; cf. Improved Iterative Scaling [Berger '97]

Learning params for log-linear models 2 – Gradient Ascent
- Log-likelihood of the data: ℓ(w : D) = ∑j ∑i wi φi(x(j)) − m log Z(w)
- Compute the derivative, ∂ℓ/∂wi = ∑j φi(x(j)) − m EPw[φi], and optimize, usually with conjugate gradient ascent (see the gradient sketch after these slides)
- You will do an example in your homework! ☺

What you need to know about learning MN parameters
- BN parameter learning is easy
- MN parameter learning doesn't decompose! Learning requires inference!
- Apply gradient ascent or IPF iterations to obtain the optimal parameters; this applies to both tabular representations and log-linear models

Thus far, fully supervised learning
- We have assumed fully supervised learning: every variable is observed in every data point
- Many real problems have missing data: some variables are hidden or unobserved in some data points

The general learning problem with missing data
- Marginal likelihood, where x is observed and z is missing:
  ℓ(θ : D) = ∑j log P(xj | θ) = ∑j log ∑z P(xj, z | θ)

E-step
- x is observed, z is missing
- Compute the probability of the missing data given the current choice of θ: Q(t+1)(z | xj) = P(z | xj, θ(t)) for each xj
- e.g., the probability computed during the classification step; corresponds to the "classification step" in k-means

Jensen's inequality
- Theorem: log ∑z P(z) f(z) ≥ ∑z P(z) log f(z)

Applying Jensen's inequality
- Use log ∑z P(z) f(z) ≥ ∑z P(z) log f(z) with P(z) = Q(z|xj) and f(z) = P(xj, z | θ) / Q(z|xj):
  ℓ(θ : D) = ∑j log ∑z Q(z|xj) [ P(xj, z | θ) / Q(z|xj) ] ≥ ∑j ∑z Q(z|xj) log [ P(xj, z | θ) / Q(z|xj) ]

The M-step maximizes a lower bound on weighted data
- Lower bound from Jensen's: ℓ(θ : D) ≥ ∑j ∑z Q(t+1)(z|xj) log P(xj, z | θ) + terms that do not depend on θ
- Corresponds to a weighted dataset:
  - <x1, z=1> with weight Q(t+1)(z=1|x1)
  - <x1, z=2> with weight Q(t+1)(z=2|x1)
  - <x1, z=3> with weight Q(t+1)(z=3|x1)
  - <x2, z=1> with weight Q(t+1)(z=1|x2)
  - <x2, z=2> with weight Q(t+1)(z=2|x2)
  - <x2, z=3> with weight Q(t+1)(z=3|x2)
  - …

The M-step
- Maximization step: θ(t+1) = argmaxθ ∑j ∑z Q(t+1)(z|xj) log P(xj, z | θ)
- Use expected counts instead of counts: if learning requires Count(x, z), use EQ(t+1)[Count(x, z)] (see the EM sketch after these slides)

Convergence of EM
- Define the potential function F(θ, Q) = ∑j ∑z Q(z|xj) log [ P(xj, z | θ) / Q(z|xj) ]
- EM corresponds to coordinate ascent on F: the E-step maximizes over Q, the M-step over θ
- Thus EM maximizes a lower bound on the marginal log likelihood
- As seen in the Machine Learning class last semester

Data likelihood for BNs
- Given the structure, the log likelihood of fully observed data decomposes per CPT, exactly as before
(Figure: Flu, Allergy, Sinus, Headache, Nose network)

Marginal likelihood
- What if S (Sinus) is hidden?
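For the IPF slide above, a minimal sketch on a toy pairwise chain A - B - C. Model marginals are obtained by brute-force enumeration, standing in for the inference step that each IPF iteration requires; the data and initial potentials are made up.

```python
# Minimal sketch of Iterative Proportional Fitting on a toy pairwise MN A - B - C.
import itertools
from collections import Counter

states = list(itertools.product((0, 1), repeat=3))
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
m = len(data)

# Empirical clique marginals P_hat over {A,B} and {B,C}.
emp_ab = {k: v / m for k, v in Counter((a, b) for a, b, _ in data).items()}
emp_bc = {k: v / m for k, v in Counter((b, c) for _, b, c in data).items()}

# Tabular potentials, initialized uniformly.
phi_ab = dict.fromkeys(itertools.product((0, 1), repeat=2), 1.0)
phi_bc = dict.fromkeys(itertools.product((0, 1), repeat=2), 1.0)

def model_marginals():
    # Brute-force inference: enumerate the joint to get the clique marginals.
    joint = {(a, b, c): phi_ab[(a, b)] * phi_bc[(b, c)] for a, b, c in states}
    Z = sum(joint.values())
    p_ab, p_bc = Counter(), Counter()
    for (a, b, c), v in joint.items():
        p_ab[(a, b)] += v / Z
        p_bc[(b, c)] += v / Z
    return p_ab, p_bc

for _ in range(20):
    # Fixed-point update per clique: phi_c <- phi_c * P_hat(x_c) / P_model(x_c).
    p_ab, _ = model_marginals()
    phi_ab = {k: phi_ab[k] * emp_ab.get(k, 0.0) / p_ab[k] for k in phi_ab}
    _, p_bc = model_marginals()
    phi_bc = {k: phi_bc[k] * emp_bc.get(k, 0.0) / p_bc[k] for k in phi_bc}

p_ab, _ = model_marginals()
print("model  P(A,B):", {k: round(p_ab[k], 3) for k in sorted(p_ab)})
print("target P(A,B):", {k: round(emp_ab.get(k, 0.0), 3) for k in sorted(phi_ab)})
```

In a real network the enumeration inside model_marginals would be replaced by junction-tree or approximate inference; the fixed-point update itself is unchanged.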
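For the gradient-ascent slide above, a minimal sketch for a log-linear model. It uses plain gradient ascent rather than conjugate gradient, brute-force enumeration for the model expectations, and two hypothetical indicator features; the gradient is exactly the "empirical minus expected feature counts" expression on the slide.

```python
# Minimal sketch: gradient ascent for a log-linear MN P(x) ∝ exp(sum_i w_i f_i(x)).
import itertools
import math

# Two binary variables; two hypothetical indicator features.
features = [lambda x: float(x[0] == x[1]),  # f0: X1 and X2 agree
            lambda x: float(x[0] == 1)]     # f1: X1 is on
data = [(0, 0), (1, 1), (1, 1), (1, 0)]
states = list(itertools.product((0, 1), repeat=2))
w = [0.0, 0.0]

def model_feature_expectations(w):
    # Brute-force E_model[f_i]: enumerate states, normalize, average each feature.
    scores = {x: math.exp(sum(wi * f(x) for wi, f in zip(w, features))) for x in states}
    Z = sum(scores.values())
    return [sum(scores[x] / Z * f(x) for x in states) for f in features]

# Empirical feature expectations E_data[f_i].
emp = [sum(f(x) for x in data) / len(data) for f in features]

for _ in range(500):
    model = model_feature_expectations(w)
    # Gradient of the average log likelihood: E_data[f_i] - E_model[f_i].
    w = [wi + 0.5 * (e - mo) for wi, e, mo in zip(w, emp, model)]

print("learned weights:", [round(wi, 3) for wi in w])
print("moment match:", [round(v, 3) for v in model_feature_expectations(w)], "vs", emp)
```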
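For the E-step / M-step slides above, a minimal EM sketch for a BN with a hidden class variable Z and two observed binary children: responsibilities Q(z|xj) in the E-step, expected counts (the weighted dataset) in the M-step. Structure, starting parameters, and data are all made up.

```python
# Minimal EM sketch for a naive-Bayes-style BN  Z -> X1, Z -> X2  with Z hidden.

# Hypothetical data: two clusters of observations plus a couple of odd points.
data = [(0, 0)] * 6 + [(1, 1)] * 6 + [(0, 1), (1, 0)]

# Parameters: pi = P(Z=1), p[z] = P(X1=1 | Z=z), q[z] = P(X2=1 | Z=z).
pi, p, q = 0.5, [0.3, 0.7], [0.4, 0.6]

def joint(x1, x2, z):
    # P(z, x1, x2) under the current parameters.
    pz = pi if z == 1 else 1 - pi
    px1 = p[z] if x1 == 1 else 1 - p[z]
    px2 = q[z] if x2 == 1 else 1 - q[z]
    return pz * px1 * px2

for _ in range(50):
    # E-step: responsibilities Q(z=1 | x_j) = P(z=1 | x_j, theta).
    resp = [joint(x1, x2, 1) / (joint(x1, x2, 0) + joint(x1, x2, 1)) for x1, x2 in data]
    # M-step: maximum likelihood using expected counts (the weighted dataset).
    n1 = sum(resp)
    n0 = len(data) - n1
    pi = n1 / len(data)
    p = [sum((1 - r) * x1 for r, (x1, _) in zip(resp, data)) / n0,
         sum(r * x1 for r, (x1, _) in zip(resp, data)) / n1]
    q = [sum((1 - r) * x2 for r, (_, x2) in zip(resp, data)) / n0,
         sum(r * x2 for r, (_, x2) in zip(resp, data)) / n1]

print(f"P(Z=1) = {pi:.2f}, P(X1=1|Z) = {[round(v, 2) for v in p]}, "
      f"P(X2=1|Z) = {[round(v, 2) for v in q]}")
```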
Log likelihood for BNs with hidden data
- Marginal likelihood, where O is observed and H is hidden:
  ℓ(θ : D) = ∑j log P(oj | θ) = ∑j log ∑h P(oj, h | θ)
  (see the sketch below)
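A minimal sketch of the marginal log likelihood when one variable is hidden, using a toy Flu → Sinus → Headache chain with Sinus unobserved; the CPT values and the observations are made up for illustration.

```python
# Minimal sketch: marginal log likelihood with a hidden variable summed out.
import math

p_flu = {1: 0.2, 0: 0.8}
p_sinus_given_flu = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.3, (0, 0): 0.7}   # (s, f): P(s | f)
p_head_given_sinus = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}  # (h, s): P(h | s)

def marginal_prob(f, h):
    # Sum out the hidden Sinus variable: P(f, h) = sum_s P(f) P(s|f) P(h|s).
    return sum(p_flu[f] * p_sinus_given_flu[(s, f)] * p_head_given_sinus[(h, s)]
               for s in (0, 1))

observed = [(1, 1), (0, 0), (0, 0), (1, 0)]  # (Flu, Headache) pairs
log_lik = sum(math.log(marginal_prob(f, h)) for f, h in observed)
print(f"marginal log likelihood = {log_lik:.3f}")
```

The log of a sum inside each term is what stops the likelihood from decomposing per CPT, which is exactly the gap the E-step/M-step machinery above is designed to handle.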