Contents:
- Unifying Variational and GBP / Learning Parameters of MNs / EM for BNs (title slide)
- Revisiting Mean-Fields
- Interpretation of energy functional
- Entropy of a tree distribution
- Loopy BP & Bethe approximation
- GBP & Kikuchi approximation
- What you need to know about GBP
- Announcements
- Learning Parameters of a BN
- Log Likelihood for MN
- Log Likelihood doesn't decompose for MNs
- Derivative of Log Likelihood for MNs
- Derivative of Log Likelihood for MNs (cont.)
- Iterative Proportional Fitting (IPF)
- Log-linear Markov network (most common representation)
- Learning params for log linear models 1 – Generalized Iterative Scaling
- Learning params for log linear models 2 – Gradient Ascent
- What you need to know about learning MN parameters
- Thus far, fully supervised learning
- The general learning problem with missing data
- E-step
- Jensen's inequality
- Applying Jensen's inequality
- The M-step maximizes lower bound on weighted data
- The M-step
- Convergence of EM
- Data likelihood for BNs
- Marginal likelihood
- Log likelihood for BNs with hidden data
- E-step for BNs
- The M-step for BNs
- M-step for each CPT
- Computing expected counts
- Data need not be hidden in the same way
- What you need to know

Slide 1: Unifying Variational and GBP; Learning Parameters of MNs; EM for BNs
Graphical Models – 10-708
Carlos Guestrin, Carnegie Mellon University
November 15th, 2006
Readings: K&F 11.3, 11.5; Yedidia et al. paper from the class website; Chapter 9 – Jordan; K&F 16.2

Slide 2: Revisiting Mean-Fields
- Choice of Q: fully factored, Q(X) = Π_i Q_i(X_i)
- Optimization problem: maximize the energy functional F[P, Q] over all fully factored Q

Slide 3: Interpretation of energy functional
- Energy functional: F[P, Q] = E_Q[log P̃(X)] + H_Q(X)
- Exact if P = Q: then F[P, Q] = log Z, so in general F[P, Q] is a lower bound on log Z
- View the problem as an approximation of the entropy term: the expected-log-potential part decomposes over factors; the hard part is the entropy H_Q

Slide 4: Entropy of a tree distribution
- Entropy term: H_Q(X)
- Joint distribution of a tree: P(X) = Π_{(i,j)∈T} P(X_i, X_j) · Π_i P(X_i)^(1 − d_i)
- Decomposing the entropy term: H(X) = Σ_{(i,j)∈T} H(X_i, X_j) − Σ_i (d_i − 1) H(X_i)
- More generally: d_i is the number of neighbors of X_i
- [Figure: the running example network over Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy]

Slide 5: Loopy BP & Bethe approximation
- Energy functional: F[P, Q] = E_Q[log P̃(X)] + H_Q(X)
- Bethe approximation of the free energy: use the tree-entropy decomposition even on loopy graphs:
  H_Bethe = Σ_{(i,j)} H(π_ij) − Σ_i (d_i − 1) H(π_i)
- Theorem: If loopy BP converges, the resulting beliefs π_ij and π_i are a stationary point (usually a local maximum) of the Bethe free energy!

Slide 6: GBP & Kikuchi approximation
- Exact free energy: junction tree
- Bethe free energy: loopy BP
- Kikuchi approximation: generalized cluster graph – a spectrum from Bethe to exact
  - entropy terms weighted by counting numbers
  - see Yedidia et al.
- Theorem: If GBP converges, the resulting cluster beliefs π_Ci are a stationary point (usually a local maximum) of the Kikuchi free energy!
- [Figure: generalized cluster graph over the example network, with clusters such as D-I-G, G-J-S-L, H-G-J, C-D, and G-S-I]
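To make the Bethe construction on slides 4–6 concrete, here is a minimal sketch (not from the lecture) that evaluates the Bethe free energy of a pairwise MRF from node and edge pseudo-marginals. The function and variable names (bethe_free_energy, node_beliefs, edge_beliefs, and the potential dictionaries) are illustrative assumptions, not notation from the slides.

```python
import numpy as np

def bethe_free_energy(node_potentials, edge_potentials, node_beliefs, edge_beliefs):
    """Bethe free energy F = U - H_Bethe for a pairwise MRF.

    node_potentials:  {i: psi_i},        psi_i a 1-D array over the states of X_i
    edge_potentials:  {(i, j): psi_ij},  psi_ij a 2-D array (states_i x states_j)
    node_beliefs:     {i: b_i},          pseudo-marginals, same shapes as the potentials
    edge_beliefs:     {(i, j): b_ij}
    """
    eps = 1e-12  # avoid log(0)

    # Average energy: U = -sum_ij E_b[log psi_ij] - sum_i E_b[log psi_i]
    energy = 0.0
    for (i, j), psi in edge_potentials.items():
        energy -= np.sum(edge_beliefs[(i, j)] * np.log(psi + eps))
    for i, psi in node_potentials.items():
        energy -= np.sum(node_beliefs[i] * np.log(psi + eps))

    # Bethe entropy: H = sum_ij H(b_ij) - sum_i (d_i - 1) H(b_i),
    # exact on trees, an approximation on loopy graphs.
    degree = {i: 0 for i in node_beliefs}
    entropy = 0.0
    for (i, j), b in edge_beliefs.items():
        degree[i] += 1
        degree[j] += 1
        entropy -= np.sum(b * np.log(b + eps))
    for i, b in node_beliefs.items():
        entropy += (degree[i] - 1) * np.sum(b * np.log(b + eps))

    return energy - entropy
```

On a tree whose beliefs equal the true marginals, −F equals log Z; on loopy graphs, fixed points of loopy BP are stationary points of this objective, which is the content of the theorem on slide 5.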
Slide 7: What you need to know about GBP
- A spectrum between loopy BP and junction trees: more computation, but typically better answers
- If the cluster graph satisfies RIP, the equations are very simple
- General setting: slightly trickier equations, but not hard
- Relates to variational methods: corresponds to local optima of an approximate version of the energy functional

Slide 8: Announcements
- Tomorrow's recitation: Khalid on learning Markov networks

Slide 9: Learning Parameters of a BN
- Log likelihood decomposes: learn each CPT independently
- Use counts
- [Figure: the example Bayesian network]

Slide 10: Log Likelihood for MN
- Log likelihood of the data: log P(D | θ) = Σ_j [ Σ_c log ψ_c(x_c[j]) ] − m log Z

Slide 11: Log Likelihood doesn't decompose for MNs
- Log likelihood: log P(D | θ) = Σ_j Σ_c log ψ_c(x_c[j]) − m log Z
- A concave problem: can find the global optimum!!
- But the log Z term doesn't decompose!!

Slides 12–13: Derivative of Log Likelihood for MNs
- Derivative: ∂ℓ/∂ log ψ_c(x_c) = m [ P̂(x_c) − P_θ(x_c) ], the empirical clique count minus the count expected under the current model
- Setting the derivative to zero: at the optimum, the model's clique marginals match the empirical marginals, P_θ(x_c) = P̂(x_c)
- Can optimize using gradient ascent
- Let's look at a simpler solution

Slide 14: Iterative Proportional Fitting (IPF)
- Setting the derivative to zero gives a fixed-point equation:
  ψ_c(x_c) ← ψ_c(x_c) · P̂(x_c) / P_θ(x_c)
- Iterate (one clique at a time) and converge to the optimal parameters
- Each iteration must compute the model marginals P_θ(x_c), i.e., run inference

Slide 15: Log-linear Markov network (most common representation)
- A feature φ[D] is some function of a subset of variables D, e.g., an indicator function
- Log-linear model over a Markov network H:
  - a set of features φ_1[D_1], …, φ_k[D_k]; each D_i is a subset of a clique in H; two φ's can be over the same variables
  - a set of weights w_1, …, w_k, usually learned from data
  - P(X) ∝ exp( Σ_i w_i φ_i[D_i] )

Slide 16: Learning params for log linear models 1 – Generalized Iterative Scaling
- IPF generalizes easily if: φ_i(x) ≥ 0 and Σ_i φ_i(x) = 1
- Update rule: multiplicative, based on the ratio of empirical to expected feature counts
- Must compute: E_θ[φ_i(X)], the expected feature counts under the current model (inference)
- If the conditions are violated, the equations are not so simple; cf. Improved Iterative Scaling [Berger '97]

Slide 17: Learning params for log linear models 2 – Gradient Ascent
- Log-likelihood of the data: ℓ(w; D) = Σ_j Σ_i w_i φ_i(x[j]) − m log Z(w)
- Derivative: ∂ℓ/∂w_i = Σ_j φ_i(x[j]) − m E_w[φ_i(X)] (empirical minus expected feature counts)
- Compute the derivative & optimize, usually with conjugate gradient ascent
- You will do an example in your homework!
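As a concrete illustration of the gradient-ascent recipe on slides 12–17, here is a minimal sketch (not from the lecture) that learns the weights of a tiny log-linear Markov network. Because log Z does not decompose, the expected feature counts are computed by brute-force enumeration over all joint assignments, which is feasible only for toy models; in practice this step is replaced by (approximate) inference. The function name and the representation of features as Python callables over full assignments are assumptions made for the example.

```python
import itertools
import numpy as np

def learn_log_linear_mn(data, features, num_states, num_vars, lr=0.1, iters=200):
    """Gradient ascent on the average log-likelihood of a log-linear MN.

    data:       array of shape (m, num_vars), each row a complete assignment
    features:   list of functions f_i(x) -> float over a full assignment x
    num_states: number of states per variable (assumed equal for all variables)

    Gradient w.r.t. w_i = empirical E[f_i] - model E[f_i]; the second term is
    the inference step, done here by enumeration.
    """
    w = np.zeros(len(features))
    all_x = list(itertools.product(range(num_states), repeat=num_vars))
    F_all = np.array([[f(x) for f in features] for x in all_x])          # |X| x k
    emp = np.array([[f(x) for f in features] for x in data]).mean(axis=0)

    for _ in range(iters):
        scores = F_all @ w                       # sum_i w_i f_i(x) for every assignment
        log_Z = np.logaddexp.reduce(scores)      # log Z(w), by brute force
        probs = np.exp(scores - log_Z)           # P_w(x) for every assignment
        model = probs @ F_all                    # E_w[f_i(X)], expected feature counts
        w += lr * (emp - model)                  # ascend the average log-likelihood
    return w

# Toy usage: 3 binary variables, "agreement" features on the pairs (0,1) and (1,2).
features = [lambda x: float(x[0] == x[1]), lambda x: float(x[1] == x[2])]
data = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0]])
print(learn_log_linear_mn(data, features, num_states=2, num_vars=3))
```

Swapping the final weight update for the IPF or GIS fixed-point updates described above changes only the last line of the loop; the expensive part, computing expectations under the current model, is the same in all three schemes.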
Slide 18: What you need to know about learning MN parameters
- BN parameter learning is easy; MN parameter learning doesn't decompose!
- Learning requires inference!
- Apply gradient ascent or IPF iterations to obtain the optimal parameters
- Applies to both tabular representations and log-linear models

Slide 19: Thus far, fully supervised learning
- We have assumed fully supervised learning: every variable is observed in every training example
- Many real problems have missing data

Slide 20: The general learning problem with missing data
- Marginal likelihood – x is observed, z is missing:
  log P(D | θ) = Σ_j log P(x_j | θ) = Σ_j log Σ_z P(x_j, z | θ)

Slide 21: E-step
- x is observed, z is missing
- Compute the probability of the missing data given the current choice of θ^(t): Q^(t+1)(z | x_j) = P(z | x_j, θ^(t)) for each x_j
- e.g., the probability computed during the classification step
- Corresponds to the "classification step" in k-means

Slide 22: Jensen's inequality
- Theorem: log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z)

Slide 23: Applying Jensen's inequality
- Use log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z) with P(z) = Q(z | x_j):
  log P(x_j | θ) = log Σ_z Q(z | x_j) · P(x_j, z | θ) / Q(z | x_j) ≥ Σ_z Q(z | x_j) log [ P(x_j, z | θ) / Q(z | x_j) ]

Slide 24: The M-step maximizes lower bound on weighted data
- Lower bound from Jensen's: ℓ(θ : D) ≥ Σ_j Σ_z Q(z | x_j) log P(x_j, z | θ) + terms that do not depend on θ
- Corresponds to weighted …
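To make the E-step / M-step loop concrete, here is a minimal sketch (not from the slides) of EM for the simplest BN with a hidden variable: a naive Bayes model whose class variable z is never observed, with binary features. The E-step fills in Q(z | x_j) = P(z | x_j, θ^(t)); the M-step maximizes the Jensen lower bound, which amounts to re-estimating each CPT from expected (Q-weighted) counts. All names (em_naive_bayes, n_classes, priors, cpts) are illustrative.

```python
import numpy as np

def em_naive_bayes(X, n_classes, iters=50, seed=0):
    """EM for a naive Bayes BN whose class variable z is hidden; X is binary, shape (m, n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    priors = np.full(n_classes, 1.0 / n_classes)           # CPT for P(z)
    cpts = rng.uniform(0.25, 0.75, size=(n_classes, n))    # CPTs for P(x_i = 1 | z)

    for _ in range(iters):
        # E-step: Q(z | x_j) = P(z | x_j, theta^(t)), the posterior over the hidden class.
        log_joint = (X @ np.log(cpts).T + (1 - X) @ np.log(1 - cpts).T
                     + np.log(priors))                      # shape (m, n_classes)
        log_Q = log_joint - np.logaddexp.reduce(log_joint, axis=1)[:, None]
        Q = np.exp(log_Q)

        # M-step: maximize the Jensen lower bound = MLE of each CPT on data
        # weighted by Q, i.e., expected counts instead of observed counts.
        Nz = Q.sum(axis=0)                                  # expected count of each z value
        priors = Nz / m
        cpts = (Q.T @ X) / Nz[:, None]                      # expected count of (x_i = 1, z) / Nz
        cpts = np.clip(cpts, 1e-6, 1 - 1e-6)                # keep probabilities away from 0/1
    return priors, cpts
```

Each iteration never decreases the marginal log-likelihood of the observed data, which is the convergence property the remainder of the deck (per the contents list above) goes on to discuss.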