EM for BNs
Graphical Models – 10708
Carlos Guestrin
Carnegie Mellon University
November 24th, 2008
Readings: 18.1, 18.2, 18.3

Contents: EM for BNs; Thus far, fully supervised learning; The general learning problem with missing data; E-step; Jensen's inequality; Applying Jensen's inequality; The M-step maximizes lower bound on weighted data; The M-step; Convergence of EM; Data likelihood for BNs; Marginal likelihood; Log likelihood for BNs with hidden data; E-step for BNs; The M-step for BNs; M-step for each CPT; Computing expected counts; Data need not be hidden in the same way; Poster printing; EM for BNs & identifiability: a superficial discussion; Learning structure with missing data [K&F 18.4]; What you need to know about learning BNs with missing data; MNs & CRFs with missing data; Kalman Filters (Gaussian BNs); Adventures of our BN hero; The Kalman Filter; Example of KF – SLAT, Simultaneous Localization and Tracking (two slides); Multivariate Gaussian; Conditioning a Gaussian; Gaussian is a "Linear Model"; Conditional Linear Gaussian (CLG) – general case; Understanding a linear Gaussian – the 2d case; Tracking with a Gaussian 1; Tracking with Gaussians 2 – Making observations

Thus far, fully supervised learning
- We have assumed fully supervised learning: every variable is observed in every training example.
- Many real problems have missing data.

The general learning problem with missing data
- Marginal likelihood, where x is observed and z is missing:
  log P(D | θ) = Σ_j log P(x^(j) | θ) = Σ_j log Σ_z P(x^(j), z | θ)

E-step
- x is observed, z is missing.
- Compute the probability of the missing data given the current choice of θ:
  Q^(t+1)(z | x^(j)) = P(z | x^(j), θ^(t)) for each x^(j)
- This is the probability computed during the classification step, i.e., it corresponds to the "classification step" in k-means.

Jensen's inequality
- Theorem: log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z)

Applying Jensen's inequality
- Use log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z) on the marginal log likelihood:
  log P(x^(j) | θ) = log Σ_z Q(z | x^(j)) [P(x^(j), z | θ) / Q(z | x^(j))]
                   ≥ Σ_z Q(z | x^(j)) log [P(x^(j), z | θ) / Q(z | x^(j))]

The M-step maximizes lower bound on weighted data
- Lower bound from Jensen's:
  Σ_j Σ_z Q^(t+1)(z | x^(j)) log [P(x^(j), z | θ) / Q^(t+1)(z | x^(j))]
- Maximizing this over θ corresponds to MLE on a weighted dataset:
  <x^(1), z=1> with weight Q^(t+1)(z=1 | x^(1))
  <x^(1), z=2> with weight Q^(t+1)(z=2 | x^(1))
  <x^(1), z=3> with weight Q^(t+1)(z=3 | x^(1))
  <x^(2), z=1> with weight Q^(t+1)(z=1 | x^(2))
  <x^(2), z=2> with weight Q^(t+1)(z=2 | x^(2))
  <x^(2), z=3> with weight Q^(t+1)(z=3 | x^(2))
  …

The M-step
- Maximization step:
  θ^(t+1) = argmax_θ Σ_j Σ_z Q^(t+1)(z | x^(j)) log P(x^(j), z | θ)
- Use expected counts instead of counts: if learning with complete data requires Count(x, z), use E_{Q^(t+1)}[Count(x, z)].

Convergence of EM
- Define the potential function F(θ, Q) as the Jensen lower bound above:
  F(θ, Q) = Σ_j Σ_z Q(z | x^(j)) log [P(x^(j), z | θ) / Q(z | x^(j))]
- EM corresponds to coordinate ascent on F: the E-step maximizes F over Q, the M-step maximizes F over θ.
- Thus EM maximizes a lower bound on the marginal log likelihood (as seen in the Machine Learning class last semester).

Data likelihood for BNs
(Running example in the slide figures: a BN over Flu, Allergy, Sinus, Headache, Nose.)
- Given the structure, the log likelihood of fully observed data decomposes over the CPTs:
  log P(D | θ) = Σ_j Σ_i log P(x_i^(j) | Pa_{X_i}^(j), θ)

Marginal likelihood
- What if S (Sinus) is hidden? The data likelihood becomes a marginal likelihood: we must sum over the possible values of S, and the likelihood no longer decomposes per CPT.

Log likelihood for BNs with hidden data
- Marginal likelihood, where O is observed and H is hidden:
  ℓ(θ : D) = Σ_j log P(o^(j) | θ) = Σ_j log Σ_h P(o^(j), h | θ)

E-step for BNs
- The E-step computes the probability of the hidden variables h given the observations o:
  Q^(t+1)(h | o^(j)) = P(h | o^(j), θ^(t))
- This corresponds to inference in the BN.

The M-step for BNs
- Maximization step: as in the fully observed case, but use expected counts instead of counts.
- If learning requires Count(h, o), use E_{Q^(t+1)}[Count(h, o)].

M-step for each CPT
- The M-step decomposes per CPT.
- Standard MLE with fully observed data:
  θ_{x_i | pa_i} = Count(x_i, pa_i) / Count(pa_i)
- The M-step uses expected counts instead:
  θ^(t+1)_{x_i | pa_i} = ExCount^(t+1)(x_i, pa_i) / ExCount^(t+1)(pa_i)
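To make the E-step/M-step recipe concrete, here is a minimal sketch (not from the lecture) for the running example, assuming the structure Flu → Sinus ← Allergy with Sinus → Headache and Sinus → Nose, all variables binary, and Sinus hidden in every data point. The synthetic "true" parameters, the initialization, and the tiny pseudocount that guards against 0/0 are all illustrative assumptions.

```python
# Minimal EM sketch for a BN with one hidden variable (Sinus).
# Assumed structure: Flu -> Sinus <- Allergy, Sinus -> Headache, Sinus -> Nose.
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n):
    """Sample fully observed points from a fixed BN, then hide Sinus."""
    f = rng.random(n) < 0.10                                   # P(Flu=1)
    a = rng.random(n) < 0.30                                   # P(Allergy=1)
    ps_true = np.where(f, 0.9, 0.2) * np.where(a, 1.0, 0.5)    # P(Sinus=1 | F, A)
    s = rng.random(n) < ps_true
    h = rng.random(n) < np.where(s, 0.8, 0.1)                  # P(Headache=1 | S)
    r = rng.random(n) < np.where(s, 0.7, 0.05)                 # P(Nose=1 | S)
    return f.astype(int), a.astype(int), h.astype(int), r.astype(int)

f, a, h, r = sample_data(5000)

# Parameters theta. Flu and Allergy are fully observed, so their MLE is fixed.
p_f, p_a = f.mean(), a.mean()
p_s = np.full((2, 2), 0.5)        # P(S=1 | F=f, A=a), initial guess
p_h = np.array([0.3, 0.6])        # P(H=1 | S=s), initial guess
p_n = np.array([0.3, 0.6])        # P(N=1 | S=s), initial guess

def bern(x, p):
    return np.where(x == 1, p, 1.0 - p)

for it in range(50):
    # E-step: Q(S=1 | f, a, h, n) by inference -- here simple enumeration over S.
    w1 = p_s[f, a] * bern(h, p_h[1]) * bern(r, p_n[1])
    w0 = (1.0 - p_s[f, a]) * bern(h, p_h[0]) * bern(r, p_n[0])
    q = w1 / (w0 + w1)            # expected count of S=1 contributed by each point

    # M-step: per-CPT MLE with expected counts (eps pseudocount avoids 0/0).
    eps = 1e-6
    for fv in (0, 1):
        for av in (0, 1):
            mask = (f == fv) & (a == av)
            p_s[fv, av] = (q[mask].sum() + eps) / (mask.sum() + 2 * eps)
    for sv, qs in ((1, q), (0, 1.0 - q)):
        p_h[sv] = (qs @ h + eps) / (qs.sum() + 2 * eps)
        p_n[sv] = (qs @ r + eps) / (qs.sum() + 2 * eps)

print("P(F=1) =", round(p_f, 2), " P(A=1) =", round(p_a, 2))
print("P(S=1 | F, A):\n", np.round(p_s, 2))
print("P(H=1 | S):", np.round(p_h, 2), " P(N=1 | S):", np.round(p_n, 2))
```

Since Sinus is never observed, its two values are only pinned down up to relabeling (see the identifiability slide below); the deliberately asymmetric initialization of P(H=1 | S) breaks that symmetry.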
Computing expected counts
- The M-step requires expected counts.
- We observe O = o. For a set of variables A, we must compute ExCount(A = a).
- Some of the variables in A will be observed in example j: denote these by A_O = a_O^(j).
- Some of the variables in A will be hidden: denote these by A_H.
- Use inference (this is what the E-step computes) to obtain the expected counts:
  ExCount^(t+1)(A_O = a_O, A_H = a_H) = Σ_j 1[a_O = a_O^(j)] Q^(t+1)(A_H = a_H | o^(j))

Data need not be hidden in the same way
- When data is fully observed, a data point is a complete assignment x^(j) to all the variables.
- When data is partially observed, a data point is an assignment o^(j) to the observed variables only.
- The unobserved variables can be different for different data points (e.g., different patients may have different symptoms recorded).
- Same framework, just change the definition of expected counts: for each point j, with its own set of observed variables, and a set of variables A, ExCount^(t+1)(A = a) sums that point's posterior over whichever variables of A happen to be hidden in it (a code sketch of this computation appears at the end of these notes).

Poster printing
- Poster session: Friday Dec 1st, 3-6pm in the NSH Atrium. There will be a popular vote for best poster. Invite your friends!
- Please be ready to set up your poster at 2:45pm sharp.
- We will provide posterboards, easels and pins. The posterboards are 30x40 inches.
- We don't have a specific poster format for you to use. You can either bring a big poster or print a set of regular-sized pages and pin them together.
- Unfortunately, we don't have a budget to pay for printing. If you are an SCS student, SCS has a poster printer you can use which prints on a 36" wide roll of paper. If you are a student outside SCS, you will need to check with your department to see if there are printing facilities for big posters (I don't know what is offered outside SCS), or print a set of regular-sized pages.
- We are looking forward to a great poster session!

EM for BNs & identifiability: a superficial discussion
- What happens if a leaf is never observed? Its CPT is summed out of every term of the marginal likelihood (and sums to one), so the data carry no information about its parameters: they are not identifiable.

Learning structure with missing data [K&F 18.4]
- Known BN structure: use expected counts; the learning algorithm doesn't change.
- Unknown BN structure: we can use expected counts and score each candidate model, as when we talked about structure learning. But this is very slow: e.g., a greedy algorithm would need to redo inference for every edge we test…
- (Much faster) Structure-EM. Iterate:
  - compute expected counts;
  - do some structure search (e.g., many greedy steps);
  - repeat.
- Theorem: converges to a local optimum of the marginal log-likelihood (details in the book).

What you need to know about learning BNs with missing data
- EM for Bayes nets:
  - E-step: inference computes the expected counts; only expected counts over X_i and Pa_{X_i} are needed.
  - M-step: MLE of each CPT, using the expected counts in place of the observed counts.
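As referenced in the "Data need not be hidden in the same way" slide above, here is a small sketch (again not from the lecture) of accumulating expected counts when each data point has its own pattern of missing variables. It uses a toy chain A → B → C with made-up CPTs, and replaces proper BN inference by brute-force enumeration over whichever variables are hidden in each point; all names and numbers are illustrative assumptions.

```python
# Expected counts when the missing variables differ from data point to data point.
from itertools import product

# Toy BN over binary variables, A -> B -> C, given as CPTs.
p_a = {1: 0.4, 0: 0.6}                                        # P(A=a)
p_b = {(1, 1): 0.7, (0, 1): 0.3, (1, 0): 0.2, (0, 0): 0.8}    # P(B=b | A=a), key (b, a)
p_c = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.3, (0, 0): 0.7}    # P(C=c | B=b), key (c, b)

def joint(a, b, c):
    return p_a[a] * p_b[(b, a)] * p_c[(c, b)]

def expected_counts(data, query_vars):
    """ExCount over query_vars: each point contributes its posterior over its
    own hidden variables, which may differ from point to point."""
    names = ("A", "B", "C")
    counts = {}
    for point in data:                       # point: dict of the observed vars only
        hidden = [v for v in names if v not in point]
        # Posterior over this point's hidden variables, by enumeration.
        weights = {}
        for vals in product((0, 1), repeat=len(hidden)):
            full = dict(point, **dict(zip(hidden, vals)))
            weights[vals] = joint(full["A"], full["B"], full["C"])
        z = sum(weights.values())
        for vals, w in weights.items():
            full = dict(point, **dict(zip(hidden, vals)))
            key = tuple(full[v] for v in query_vars)
            counts[key] = counts.get(key, 0.0) + w / z
    return counts

# Three points, each with a different missingness pattern.
data = [{"A": 1, "C": 1},           # B missing
        {"B": 0},                   # A and C missing
        {"A": 0, "B": 1, "C": 0}]   # fully observed
print(expected_counts(data, ("B", "C")))
```

In the M-step, these fractional counts take the place of Count(·) in the per-CPT MLE formulas above; for larger networks the enumeration would be replaced by exact or approximate inference.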