Contents:
- Unifying Variational and GBP / Learning Parameters of MNs / EM for BNs (title slide)
- Revisiting Mean-Fields
- Interpretation of energy functional
- Entropy of a tree distribution
- Loopy BP & Bethe approximation
- GBP & Kikuchi approximation
- What you need to know about GBP
- Announcements
- Learning Parameters of a BN
- Log Likelihood for MN
- Log Likelihood doesn't decompose for MNs
- Derivative of Log Likelihood for MNs
- Derivative of Log Likelihood for MNs (cont.)
- Iterative Proportional Fitting (IPF)
- Log-linear Markov network (most common representation)
- Learning params for log linear models 1 – Generalized Iterative Scaling
- Learning params for log linear models 2 – Gradient Ascent
- What you need to know about learning MN parameters
- Thus far, fully supervised learning
- The general learning problem with missing data
- E-step
- Jensen's inequality
- Applying Jensen's inequality
- The M-step maximizes lower bound on weighted data
- The M-step
- Convergence of EM
- Data likelihood for BNs
- Marginal likelihood
- Log likelihood for BNs with hidden data
- E-step for BNs
- The M-step for BNs
- M-step for each CPT
- Computing expected counts
- Data need not be hidden in the same way
- What you need to know

Slide 1: Unifying Variational and GBP; Learning Parameters of MNs; EM for BNs
Graphical Models – 10-708
Carlos Guestrin, Carnegie Mellon University
November 15th, 2006
Readings: K&F 11.3, 11.5; Yedidia et al. paper from the class website; Chapter 9 – Jordan; K&F 16.2

Slide 2: Revisiting Mean-Fields
- Choice of Q: fully factored, Q(X) = Π_i Q_i(X_i)
- Optimization problem: maximize the energy functional F[P, Q] over all fully factored Q

Slide 3: Interpretation of energy functional
- Energy functional: F[P, Q] = E_Q[log P̃(X)] + H_Q(X)
- Exact if P = Q: then F[P, Q] = log Z, so in general F[P, Q] is a lower bound on log Z
- View the problem as an approximation of the entropy term: the expected-log-potential part decomposes over factors; the hard part is the entropy H_Q

Slide 4: Entropy of a tree distribution
- Entropy term: H_Q(X)
- Joint distribution of a tree: P(X) = Π_{(i,j)∈T} P(X_i, X_j) · Π_i P(X_i)^(1 − d_i)
- Decomposing the entropy term: H(X) = Σ_{(i,j)∈T} H(X_i, X_j) − Σ_i (d_i − 1) H(X_i)
- More generally: d_i is the number of neighbors of X_i
- [Figure: the running example network over Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy]

Slide 5: Loopy BP & Bethe approximation
- Energy functional: F[P, Q] = E_Q[log P̃(X)] + H_Q(X)
- Bethe approximation of the free energy: use the tree-entropy decomposition even on loopy graphs:
  H_Bethe = Σ_{(i,j)} H(π_ij) − Σ_i (d_i − 1) H(π_i)
- Theorem: If loopy BP converges, the resulting beliefs π_ij and π_i are a stationary point (usually a local maximum) of the Bethe free energy!

Slide 6: GBP & Kikuchi approximation
- Exact free energy: junction tree
- Bethe free energy: loopy BP
- Kikuchi approximation: generalized cluster graph – a spectrum from Bethe to exact
  - entropy terms weighted by counting numbers
  - see Yedidia et al.
- Theorem: If GBP converges, the resulting cluster beliefs π_Ci are a stationary point (usually a local maximum) of the Kikuchi free energy!
- [Figure: generalized cluster graph over the example network, with clusters such as D-I-G, G-J-S-L, H-G-J, C-D, and G-S-I]
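To make the Bethe construction on slides 4–6 concrete, here is a minimal sketch (not from the lecture) that evaluates the Bethe free energy of a pairwise MRF from node and edge pseudo-marginals. The function and variable names (bethe_free_energy, node_beliefs, edge_beliefs, and the potential dictionaries) are illustrative assumptions, not notation from the slides.

```python
import numpy as np

def bethe_free_energy(node_potentials, edge_potentials, node_beliefs, edge_beliefs):
    """Bethe free energy F = U - H_Bethe for a pairwise MRF.

    node_potentials:  {i: psi_i},        psi_i a 1-D array over the states of X_i
    edge_potentials:  {(i, j): psi_ij},  psi_ij a 2-D array (states_i x states_j)
    node_beliefs:     {i: b_i},          pseudo-marginals, same shapes as the potentials
    edge_beliefs:     {(i, j): b_ij}
    """
    eps = 1e-12  # avoid log(0)

    # Average energy: U = -sum_ij E_b[log psi_ij] - sum_i E_b[log psi_i]
    energy = 0.0
    for (i, j), psi in edge_potentials.items():
        energy -= np.sum(edge_beliefs[(i, j)] * np.log(psi + eps))
    for i, psi in node_potentials.items():
        energy -= np.sum(node_beliefs[i] * np.log(psi + eps))

    # Bethe entropy: H = sum_ij H(b_ij) - sum_i (d_i - 1) H(b_i),
    # exact on trees, an approximation on loopy graphs.
    degree = {i: 0 for i in node_beliefs}
    entropy = 0.0
    for (i, j), b in edge_beliefs.items():
        degree[i] += 1
        degree[j] += 1
        entropy -= np.sum(b * np.log(b + eps))
    for i, b in node_beliefs.items():
        entropy += (degree[i] - 1) * np.sum(b * np.log(b + eps))

    return energy - entropy
```

On a tree whose beliefs equal the true marginals, −F equals log Z; on loopy graphs, fixed points of loopy BP are stationary points of this objective, which is the content of the theorem on slide 5.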
Slide 7: What you need to know about GBP
- A spectrum between loopy BP and junction trees: more computation, but typically better answers
- If the cluster graph satisfies RIP, the equations are very simple
- General setting: slightly trickier equations, but not hard
- Relates to variational methods: corresponds to local optima of an approximate version of the energy functional

Slide 8: Announcements
- Tomorrow's recitation: Khalid on learning Markov networks

Slide 9: Learning Parameters of a BN
- Log likelihood decomposes: learn each CPT independently
- Use counts
- [Figure: the example Bayesian network]

Slide 10: Log Likelihood for MN
- Log likelihood of the data: log P(D | θ) = Σ_j [ Σ_c log ψ_c(x_c[j]) ] − m log Z

Slide 11: Log Likelihood doesn't decompose for MNs
- Log likelihood: log P(D | θ) = Σ_j Σ_c log ψ_c(x_c[j]) − m log Z
- A concave problem: can find the global optimum!!
- But the log Z term doesn't decompose!!

Slides 12–13: Derivative of Log Likelihood for MNs
- Derivative: ∂ℓ/∂ log ψ_c(x_c) = m [ P̂(x_c) − P_θ(x_c) ], the empirical clique count minus the count expected under the current model
- Setting the derivative to zero: at the optimum, the model's clique marginals match the empirical marginals, P_θ(x_c) = P̂(x_c)
- Can optimize using gradient ascent
- Let's look at a simpler solution

Slide 14: Iterative Proportional Fitting (IPF)
- Setting the derivative to zero gives a fixed-point equation:
  ψ_c(x_c) ← ψ_c(x_c) · P̂(x_c) / P_θ(x_c)
- Iterate (one clique at a time) and converge to the optimal parameters
- Each iteration must compute the model marginals P_θ(x_c), i.e., run inference

Slide 15: Log-linear Markov network (most common representation)
- A feature φ[D] is some function of a subset of variables D, e.g., an indicator function
- Log-linear model over a Markov network H:
  - a set of features φ_1[D_1], …, φ_k[D_k]; each D_i is a subset of a clique in H; two φ's can be over the same variables
  - a set of weights w_1, …, w_k, usually learned from data
  - P(X) ∝ exp( Σ_i w_i φ_i[D_i] )

Slide 16: Learning params for log linear models 1 – Generalized Iterative Scaling
- IPF generalizes easily if: φ_i(x) ≥ 0 and Σ_i φ_i(x) = 1
- Update rule: multiplicative, based on the ratio of empirical to expected feature counts
- Must compute: E_θ[φ_i(X)], the expected feature counts under the current model (inference)
- If the conditions are violated, the equations are not so simple; cf. Improved Iterative Scaling [Berger '97]

Slide 17: Learning params for log linear models 2 – Gradient Ascent
- Log-likelihood of the data: ℓ(w; D) = Σ_j Σ_i w_i φ_i(x[j]) − m log Z(w)
- Derivative: ∂ℓ/∂w_i = Σ_j φ_i(x[j]) − m E_w[φ_i(X)] (empirical minus expected feature counts)
- Compute the derivative & optimize, usually with conjugate gradient ascent
- You will do an example in your homework!
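As a concrete illustration of the gradient-ascent recipe on slides 12–17, here is a minimal sketch (not from the lecture) that learns the weights of a tiny log-linear Markov network. Because log Z does not decompose, the expected feature counts are computed by brute-force enumeration over all joint assignments, which is feasible only for toy models; in practice this step is replaced by (approximate) inference. The function name and the representation of features as Python callables over full assignments are assumptions made for the example.

```python
import itertools
import numpy as np

def learn_log_linear_mn(data, features, num_states, num_vars, lr=0.1, iters=200):
    """Gradient ascent on the average log-likelihood of a log-linear MN.

    data:       array of shape (m, num_vars), each row a complete assignment
    features:   list of functions f_i(x) -> float over a full assignment x
    num_states: number of states per variable (assumed equal for all variables)

    Gradient w.r.t. w_i = empirical E[f_i] - model E[f_i]; the second term is
    the inference step, done here by enumeration.
    """
    w = np.zeros(len(features))
    all_x = list(itertools.product(range(num_states), repeat=num_vars))
    F_all = np.array([[f(x) for f in features] for x in all_x])          # |X| x k
    emp = np.array([[f(x) for f in features] for x in data]).mean(axis=0)

    for _ in range(iters):
        scores = F_all @ w                       # sum_i w_i f_i(x) for every assignment
        log_Z = np.logaddexp.reduce(scores)      # log Z(w), by brute force
        probs = np.exp(scores - log_Z)           # P_w(x) for every assignment
        model = probs @ F_all                    # E_w[f_i(X)], expected feature counts
        w += lr * (emp - model)                  # ascend the average log-likelihood
    return w

# Toy usage: 3 binary variables, "agreement" features on the pairs (0,1) and (1,2).
features = [lambda x: float(x[0] == x[1]), lambda x: float(x[1] == x[2])]
data = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0]])
print(learn_log_linear_mn(data, features, num_states=2, num_vars=3))
```

Swapping the final weight update for the IPF or GIS fixed-point updates described above changes only the last line of the loop; the expensive part, computing expectations under the current model, is the same in all three schemes.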
Slide 18: What you need to know about learning MN parameters
- BN parameter learning is easy; MN parameter learning doesn't decompose!
- Learning requires inference!
- Apply gradient ascent or IPF iterations to obtain the optimal parameters
- Applies to both tabular representations and log-linear models

Slide 19: Thus far, fully supervised learning
- We have assumed fully supervised learning: every variable is observed in every training example
- Many real problems have missing data

Slide 20: The general learning problem with missing data
- Marginal likelihood – x is observed, z is missing:
  log P(D | θ) = Σ_j log P(x_j | θ) = Σ_j log Σ_z P(x_j, z | θ)

Slide 21: E-step
- x is observed, z is missing
- Compute the probability of the missing data given the current choice of θ^(t): Q^(t+1)(z | x_j) = P(z | x_j, θ^(t)) for each x_j
- e.g., the probability computed during the classification step
- Corresponds to the "classification step" in k-means

Slide 22: Jensen's inequality
- Theorem: log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z)

Slide 23: Applying Jensen's inequality
- Use log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z) with P(z) = Q(z | x_j):
  log P(x_j | θ) = log Σ_z Q(z | x_j) · P(x_j, z | θ) / Q(z | x_j) ≥ Σ_z Q(z | x_j) log [ P(x_j, z | θ) / Q(z | x_j) ]

Slide 24: The M-step maximizes lower bound on weighted data
- Lower bound from Jensen's: ℓ(θ : D) ≥ Σ_j Σ_z Q(z | x_j) log P(x_j, z | θ) + terms that do not depend on θ
- Corresponds to weighted …
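To make the E-step / M-step loop concrete, here is a minimal sketch (not from the slides) of EM for the simplest BN with a hidden variable: a naive Bayes model whose class variable z is never observed, with binary features. The E-step fills in Q(z | x_j) = P(z | x_j, θ^(t)); the M-step maximizes the Jensen lower bound, which amounts to re-estimating each CPT from expected (Q-weighted) counts. All names (em_naive_bayes, n_classes, priors, cpts) are illustrative.

```python
import numpy as np

def em_naive_bayes(X, n_classes, iters=50, seed=0):
    """EM for a naive Bayes BN whose class variable z is hidden; X is binary, shape (m, n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    priors = np.full(n_classes, 1.0 / n_classes)           # CPT for P(z)
    cpts = rng.uniform(0.25, 0.75, size=(n_classes, n))    # CPTs for P(x_i = 1 | z)

    for _ in range(iters):
        # E-step: Q(z | x_j) = P(z | x_j, theta^(t)), the posterior over the hidden class.
        log_joint = (X @ np.log(cpts).T + (1 - X) @ np.log(1 - cpts).T
                     + np.log(priors))                      # shape (m, n_classes)
        log_Q = log_joint - np.logaddexp.reduce(log_joint, axis=1)[:, None]
        Q = np.exp(log_Q)

        # M-step: maximize the Jensen lower bound = MLE of each CPT on data
        # weighted by Q, i.e., expected counts instead of observed counts.
        Nz = Q.sum(axis=0)                                  # expected count of each z value
        priors = Nz / m
        cpts = (Q.T @ X) / Nz[:, None]                      # expected count of (x_i = 1, z) / Nz
        cpts = np.clip(cpts, 1e-6, 1 - 1e-6)                # keep probabilities away from 0/1
    return priors, cpts
```

Each iteration never decreases the marginal log-likelihood of the observed data, which is the convergence property the remainder of the deck (per the contents list above) goes on to discuss.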