Unifying Variational and GBP
Learning Parameters of MNs
EM for BNs

Graphical Models – 10708
Carlos Guestrin
Carnegie Mellon University
November 15th, 2006

Readings:
- K&F: 11.3, 11.5
- Yedidia et al. paper from the class website
- Chapters 9, 20 – Jordan
- K&F: 16.2

Revisiting Mean-Fields
- Choice of Q: fully factorized, Q(X) = ∏i Qi(Xi)
- Optimization problem: among distributions of this form, find the Q that minimizes D(Q‖P), or equivalently maximizes the energy functional F[P,Q]

Interpretation of energy functional
- Energy functional: F[P,Q] = EQ[log P̃(X)] + HQ(X), the expected unnormalized log probability under Q plus the entropy of Q
- Exact if P = Q: then F[P,Q] = log Z; in general log Z = F[P,Q] + D(Q‖P), so F is a lower bound on log Z
- View the problem as an approximation of the entropy term: the expectation decomposes over factors, so the hard part is HQ(X)

Entropy of a tree distribution
- Entropy term: HQ(X)
- Joint distribution of a tree-structured model: P(X) = ∏(i,j)∈T πij(Xi,Xj) / ∏i πi(Xi)^(di−1), where di is the number of neighbors of Xi
- Decomposing the entropy term: H(X) = ∑(i,j)∈T H(Xi,Xj) − ∑i (di − 1) H(Xi), i.e., edge entropies minus the over-counted node entropies (see the tree-entropy sketch after these slides)
- More generally, the entropy of any tree distribution is exact in terms of its edge and node marginals
(Figure: student network – Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy)

Loopy BP & Bethe approximation
- Energy functional: F[P,Q] = EQ[log P̃(X)] + HQ(X)
- Bethe approximation of the free energy: use the tree-entropy formula even when the graph is loopy:
  HBethe = ∑(i,j) H(Xi,Xj) − ∑i (di − 1) H(Xi)
- Theorem: if loopy BP converges, the resulting beliefs πij and πi are a stationary point (usually a local maximum) of the Bethe free energy!

GBP & Kikuchi approximation
- Exact free energy: junction tree (entropy written over clique and separator beliefs)
- Bethe free energy: edge and node beliefs only
- Kikuchi approximation: a generalized cluster graph gives a spectrum from Bethe to the exact entropy, with entropy terms weighted by counting numbers (see Yedidia et al.)
- Theorem: if GBP converges, the resulting cluster beliefs πCi are a stationary point (usually a local maximum) of the Kikuchi free energy!
(Figure: generalized cluster graph over the student network, with clusters such as {C,D}, {D,I,G}, {G,S,I}, {G,J,S,L}, {H,G,J})

What you need to know about GBP
- Spectrum between loopy BP and junction trees: more computation, but typically better answers
- If the cluster graph satisfies RIP, the equations are very simple
- General setting: slightly trickier equations, but not hard
- Relation to variational methods: GBP fixed points correspond to local optima of an approximate version of the energy functional

Announcements
- Tomorrow's recitation: Khalid on learning Markov networks

Learning Parameters of a BN
- Log likelihood decomposes: log P(D | θ) = ∑j ∑i log P(xi(j) | pai(j), θ)
- Learn each CPT independently
- Use counts: for tabular CPTs, θxi|pai = Count(xi, pai) / Count(pai) (see the counting sketch after these slides)

Log Likelihood for MN
- Log likelihood of the data: log P(D | Φ) = ∑j [ ∑c log φc(xc(j)) − log Z ] = ∑c ∑j log φc(xc(j)) − m log Z

Log Likelihood doesn't decompose for MNs
- Log likelihood: ∑c ∑j log φc(xc(j)) − m log Z
- A concave problem: can find the global optimum!
- But the term log Z couples all the potentials, so the likelihood doesn't decompose! (see the log Z sketch after these slides)
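As a quick numerical check of the tree-entropy decomposition above, here is a minimal Python sketch. The chain A - B - C, its marginals, and all numbers are made up for illustration; the only claim is that for a tree-factorized joint, the exact entropy equals the edge/node decomposition.

```python
# Minimal sketch: verify H(X) = sum_edges H(Xi, Xj) - sum_i (d_i - 1) H(Xi)
# on a made-up chain A - B - C over binary variables.
import itertools
import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Made-up, mutually consistent edge and node marginals for the chain.
p_ab = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_bc = {(0, 0): 0.35, (0, 1): 0.15, (1, 0): 0.2, (1, 1): 0.3}
p_b = {0: 0.5, 1: 0.5}

# Tree factorization of the joint: P(a,b,c) = P(a,b) P(b,c) / P(b).
joint = {(a, b, c): p_ab[(a, b)] * p_bc[(b, c)] / p_b[b]
         for a, b, c in itertools.product((0, 1), repeat=3)}

exact = entropy(joint)
# Decomposition: edge entropies minus (d_i - 1) node entropies; only B has d_B = 2.
decomposed = entropy(p_ab) + entropy(p_bc) - entropy(p_b)
print(f"exact H = {exact:.4f}, decomposed H = {decomposed:.4f}")
```

The two printed values agree, which is exactly the property the Bethe approximation borrows for loopy graphs, where it is no longer exact.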
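To make the "learn each CPT independently from counts" point concrete, here is a minimal sketch. The network fragment (a Grade node with parents Difficulty and Intelligence) and the toy dataset are hypothetical.

```python
# Minimal sketch: maximum-likelihood estimation of one tabular CPT by counting.
from collections import Counter

# Fully observed samples of (Difficulty, Intelligence, Grade), all binary; made up.
data = [(0, 1, 1), (0, 0, 0), (1, 1, 1), (1, 0, 0), (0, 1, 1), (1, 1, 0)]

# The BN log likelihood decomposes per CPT, so each CPT is estimated on its own
# from counts: theta_{g | d, i} = Count(d, i, g) / Count(d, i).
family_counts = Counter(data)
parent_counts = Counter((d, i) for d, i, _ in data)

cpt_grade = {(d, i, g): family_counts[(d, i, g)] / parent_counts[(d, i)]
             for (d, i, g) in family_counts}

for (d, i, g), prob in sorted(cpt_grade.items()):
    print(f"P(G={g} | D={d}, I={i}) = {prob:.2f}")
```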
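To see why log Z blocks the same decomposition for Markov networks, this sketch computes the MN log likelihood for a toy pairwise chain by brute force: every potential entry enters log Z, so no clique can be fit in isolation. Potentials and data are made up.

```python
# Minimal sketch: the MN log likelihood needs log Z, which sums over ALL joint
# assignments, so it does not decompose over cliques.
import itertools
import math

# Toy pairwise chain A - B - C with made-up tabular potentials (all binary).
phi_ab = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}
phi_bc = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def unnormalized(a, b, c):
    return phi_ab[(a, b)] * phi_bc[(b, c)]

# log Z couples phi_ab and phi_bc: changing one entry of either table changes
# the normalizer, and hence the likelihood contribution of every clique.
Z = sum(unnormalized(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3))

data = [(0, 0, 0), (1, 1, 1), (0, 0, 1)]
log_lik = sum(math.log(unnormalized(a, b, c)) - math.log(Z) for a, b, c in data)
print(f"log Z = {math.log(Z):.3f}, log likelihood = {log_lik:.3f}")
```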
Derivative of Log Likelihood for MNs
- Derivative with respect to the log-potential θc(xc) = log φc(xc): ∂ℓ/∂θc(xc) = m [ P̂(xc) − P(xc | θ) ], the gap between the empirical and the model clique marginal
- Setting the derivative to zero: the optimal parameters make the model clique marginals match the empirical clique marginals
- Can optimize using gradient ascent
- Let's look at a simpler solution

Iterative Proportional Fitting (IPF)
- Setting the derivative to zero: P(xc | θ) = P̂(xc) for every clique c
- Fixed point equation: φc(t+1)(xc) = φc(t)(xc) · P̂(xc) / P(t)(xc)
- Iterate and converge to the optimal parameters
- Each iteration must compute the model marginals P(t)(xc), i.e., run inference (see the IPF sketch after these slides)

Log-linear Markov network (most common representation)
- A feature is some function φ[D] of a subset of variables D, e.g., an indicator function
- Log-linear model over a Markov network H:
  - a set of features φ1[D1], …, φk[Dk]; each Di is a subset of a clique in H; two φ's can be over the same variables
  - a set of weights w1, …, wk, usually learned from data
  - P(x) = (1/Z(w)) exp( ∑i wi φi(Di) )

Learning params for log-linear models 1 – Generalized Iterative Scaling
- IPF generalizes easily if: φi(x) ≥ 0 and ∑i φi(x) = 1
- Update rule: wi(t+1) = wi(t) + log( ÊD[φi] / EP(t)[φi] )
- Must compute the expected feature values EP(t)[φi] under the current model (inference)
- If the conditions are violated, the equations are not so simple; cf. Improved Iterative Scaling [Berger '97]

Learning params for log-linear models 2 – Gradient Ascent
- Log-likelihood of the data: ℓ(w : D) = ∑j ∑i wi φi(x(j)) − m log Z(w)
- Compute the derivative, ∂ℓ/∂wi = ∑j φi(x(j)) − m EPw[φi], and optimize, usually with conjugate gradient ascent (see the gradient sketch after these slides)
- You will do an example in your homework! ☺

What you need to know about learning MN parameters
- BN parameter learning is easy
- MN parameter learning doesn't decompose! Learning requires inference!
- Apply gradient ascent or IPF iterations to obtain the optimal parameters; this applies to both tabular representations and log-linear models

Thus far, fully supervised learning
- We have assumed fully supervised learning: every variable is observed in every data point
- Many real problems have missing data: some variables are hidden or unobserved in some data points

The general learning problem with missing data
- Marginal likelihood, where x is observed and z is missing:
  ℓ(θ : D) = ∑j log P(xj | θ) = ∑j log ∑z P(xj, z | θ)

E-step
- x is observed, z is missing
- Compute the probability of the missing data given the current choice of θ: Q(t+1)(z | xj) = P(z | xj, θ(t)) for each xj
- e.g., the probability computed during the classification step; corresponds to the "classification step" in k-means

Jensen's inequality
- Theorem: log ∑z P(z) f(z) ≥ ∑z P(z) log f(z)

Applying Jensen's inequality
- Use log ∑z P(z) f(z) ≥ ∑z P(z) log f(z) with P(z) = Q(z|xj) and f(z) = P(xj, z | θ) / Q(z|xj):
  ℓ(θ : D) = ∑j log ∑z Q(z|xj) [ P(xj, z | θ) / Q(z|xj) ] ≥ ∑j ∑z Q(z|xj) log [ P(xj, z | θ) / Q(z|xj) ]

The M-step maximizes a lower bound on weighted data
- Lower bound from Jensen's: ℓ(θ : D) ≥ ∑j ∑z Q(t+1)(z|xj) log P(xj, z | θ) + terms that do not depend on θ
- Corresponds to a weighted dataset:
  - <x1, z=1> with weight Q(t+1)(z=1|x1)
  - <x1, z=2> with weight Q(t+1)(z=2|x1)
  - <x1, z=3> with weight Q(t+1)(z=3|x1)
  - <x2, z=1> with weight Q(t+1)(z=1|x2)
  - <x2, z=2> with weight Q(t+1)(z=2|x2)
  - <x2, z=3> with weight Q(t+1)(z=3|x2)
  - …

The M-step
- Maximization step: θ(t+1) = argmaxθ ∑j ∑z Q(t+1)(z|xj) log P(xj, z | θ)
- Use expected counts instead of counts: if learning requires Count(x, z), use EQ(t+1)[Count(x, z)] (see the EM sketch after these slides)

Convergence of EM
- Define the potential function F(θ, Q) = ∑j ∑z Q(z|xj) log [ P(xj, z | θ) / Q(z|xj) ]
- EM corresponds to coordinate ascent on F: the E-step maximizes over Q, the M-step over θ
- Thus EM maximizes a lower bound on the marginal log likelihood
- As seen in the Machine Learning class last semester

Data likelihood for BNs
- Given the structure, the log likelihood of fully observed data decomposes per CPT, exactly as before
(Figure: Flu, Allergy, Sinus, Headache, Nose network)

Marginal likelihood
- What if S (Sinus) is hidden?
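For the IPF slide above, a minimal sketch on a toy pairwise chain A - B - C. Model marginals are obtained by brute-force enumeration, standing in for the inference step that each IPF iteration requires; the data and initial potentials are made up.

```python
# Minimal sketch of Iterative Proportional Fitting on a toy pairwise MN A - B - C.
import itertools
from collections import Counter

states = list(itertools.product((0, 1), repeat=3))
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
m = len(data)

# Empirical clique marginals P_hat over {A,B} and {B,C}.
emp_ab = {k: v / m for k, v in Counter((a, b) for a, b, _ in data).items()}
emp_bc = {k: v / m for k, v in Counter((b, c) for _, b, c in data).items()}

# Tabular potentials, initialized uniformly.
phi_ab = dict.fromkeys(itertools.product((0, 1), repeat=2), 1.0)
phi_bc = dict.fromkeys(itertools.product((0, 1), repeat=2), 1.0)

def model_marginals():
    # Brute-force inference: enumerate the joint to get the clique marginals.
    joint = {(a, b, c): phi_ab[(a, b)] * phi_bc[(b, c)] for a, b, c in states}
    Z = sum(joint.values())
    p_ab, p_bc = Counter(), Counter()
    for (a, b, c), v in joint.items():
        p_ab[(a, b)] += v / Z
        p_bc[(b, c)] += v / Z
    return p_ab, p_bc

for _ in range(20):
    # Fixed-point update per clique: phi_c <- phi_c * P_hat(x_c) / P_model(x_c).
    p_ab, _ = model_marginals()
    phi_ab = {k: phi_ab[k] * emp_ab.get(k, 0.0) / p_ab[k] for k in phi_ab}
    _, p_bc = model_marginals()
    phi_bc = {k: phi_bc[k] * emp_bc.get(k, 0.0) / p_bc[k] for k in phi_bc}

p_ab, _ = model_marginals()
print("model  P(A,B):", {k: round(p_ab[k], 3) for k in sorted(p_ab)})
print("target P(A,B):", {k: round(emp_ab.get(k, 0.0), 3) for k in sorted(phi_ab)})
```

In a real network the enumeration inside model_marginals would be replaced by junction-tree or approximate inference; the fixed-point update itself is unchanged.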
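For the gradient-ascent slide above, a minimal sketch for a log-linear model. It uses plain gradient ascent rather than conjugate gradient, brute-force enumeration for the model expectations, and two hypothetical indicator features; the gradient is exactly the "empirical minus expected feature counts" expression on the slide.

```python
# Minimal sketch: gradient ascent for a log-linear MN P(x) ∝ exp(sum_i w_i f_i(x)).
import itertools
import math

# Two binary variables; two hypothetical indicator features.
features = [lambda x: float(x[0] == x[1]),  # f0: X1 and X2 agree
            lambda x: float(x[0] == 1)]     # f1: X1 is on
data = [(0, 0), (1, 1), (1, 1), (1, 0)]
states = list(itertools.product((0, 1), repeat=2))
w = [0.0, 0.0]

def model_feature_expectations(w):
    # Brute-force E_model[f_i]: enumerate states, normalize, average each feature.
    scores = {x: math.exp(sum(wi * f(x) for wi, f in zip(w, features))) for x in states}
    Z = sum(scores.values())
    return [sum(scores[x] / Z * f(x) for x in states) for f in features]

# Empirical feature expectations E_data[f_i].
emp = [sum(f(x) for x in data) / len(data) for f in features]

for _ in range(500):
    model = model_feature_expectations(w)
    # Gradient of the average log likelihood: E_data[f_i] - E_model[f_i].
    w = [wi + 0.5 * (e - mo) for wi, e, mo in zip(w, emp, model)]

print("learned weights:", [round(wi, 3) for wi in w])
print("moment match:", [round(v, 3) for v in model_feature_expectations(w)], "vs", emp)
```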
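For the E-step / M-step slides above, a minimal EM sketch for a BN with a hidden class variable Z and two observed binary children: responsibilities Q(z|xj) in the E-step, expected counts (the weighted dataset) in the M-step. Structure, starting parameters, and data are all made up.

```python
# Minimal EM sketch for a naive-Bayes-style BN  Z -> X1, Z -> X2  with Z hidden.

# Hypothetical data: two clusters of observations plus a couple of odd points.
data = [(0, 0)] * 6 + [(1, 1)] * 6 + [(0, 1), (1, 0)]

# Parameters: pi = P(Z=1), p[z] = P(X1=1 | Z=z), q[z] = P(X2=1 | Z=z).
pi, p, q = 0.5, [0.3, 0.7], [0.4, 0.6]

def joint(x1, x2, z):
    # P(z, x1, x2) under the current parameters.
    pz = pi if z == 1 else 1 - pi
    px1 = p[z] if x1 == 1 else 1 - p[z]
    px2 = q[z] if x2 == 1 else 1 - q[z]
    return pz * px1 * px2

for _ in range(50):
    # E-step: responsibilities Q(z=1 | x_j) = P(z=1 | x_j, theta).
    resp = [joint(x1, x2, 1) / (joint(x1, x2, 0) + joint(x1, x2, 1)) for x1, x2 in data]
    # M-step: maximum likelihood using expected counts (the weighted dataset).
    n1 = sum(resp)
    n0 = len(data) - n1
    pi = n1 / len(data)
    p = [sum((1 - r) * x1 for r, (x1, _) in zip(resp, data)) / n0,
         sum(r * x1 for r, (x1, _) in zip(resp, data)) / n1]
    q = [sum((1 - r) * x2 for r, (_, x2) in zip(resp, data)) / n0,
         sum(r * x2 for r, (_, x2) in zip(resp, data)) / n1]

print(f"P(Z=1) = {pi:.2f}, P(X1=1|Z) = {[round(v, 2) for v in p]}, "
      f"P(X2=1|Z) = {[round(v, 2) for v in q]}")
```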
Log likelihood for BNs with hidden data
- Marginal likelihood, where O is observed and H is hidden:
  ℓ(θ : D) = ∑j log P(oj | θ) = ∑j log ∑h P(oj, h | θ)
  (see the sketch below)
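A minimal sketch of the marginal log likelihood when one variable is hidden, using a toy Flu → Sinus → Headache chain with Sinus unobserved; the CPT values and the observations are made up for illustration.

```python
# Minimal sketch: marginal log likelihood with a hidden variable summed out.
import math

p_flu = {1: 0.2, 0: 0.8}
p_sinus_given_flu = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.3, (0, 0): 0.7}   # (s, f): P(s | f)
p_head_given_sinus = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1, (0, 0): 0.9}  # (h, s): P(h | s)

def marginal_prob(f, h):
    # Sum out the hidden Sinus variable: P(f, h) = sum_s P(f) P(s|f) P(h|s).
    return sum(p_flu[f] * p_sinus_given_flu[(s, f)] * p_head_given_sinus[(h, s)]
               for s in (0, 1))

observed = [(1, 1), (0, 0), (0, 0), (1, 0)]  # (Flu, Headache) pairs
log_lik = sum(math.log(marginal_prob(f, h)) for f, h in observed)
print(f"marginal log likelihood = {log_lik:.3f}")
```

The log of a sum inside each term is what stops the likelihood from decomposing per CPT, which is exactly the gap the E-step/M-step machinery above is designed to handle.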