Machine learning lecture 24 Tommi S Jaakkola MIT CSAIL tommi csail mit edu Outline Learning Bayesian networks complete data estimating the parameters with fixed structure learning the graph structure Learning Bayesian networks incomplete data EM and structural EM Review Tommi Jaakkola MIT CSAIL 2 Probabilities and conditional tables Simple example three discrete variables each taking m possible values x1 P x1 x1 x2 P x2 x2 x3 P x3 x1 x2 x3 x1 x2 We assume that the model is fully parameterized in the sense that x1 x2 and x3 x1 x2 for each distinct configuration of x1 and x2 are unrestricted and can be chosen independently of each other Tommi Jaakkola MIT CSAIL 3 Likelihood and complete data When the observed data points are complete the likelihood has a simple form P D G n Y P xt1 P xt2 P xt3 xt1 xt2 t 1 Y P x1 N x1 x1 Tommi Jaakkola MIT CSAIL P x2 N x2 x2 Y Y x1 x2 Y P x3 x1 x2 N x1 x2 x3 x3 4 Likelihood and complete data When the observed data points are complete the likelihood has a simple form P D G n Y P xt1 P xt2 P xt3 xt1 xt2 t 1 Y xN1 x1 x1 xN2 x2 x2 Y Y x1 x2 Y N x x x3 x3 x11 x22 x3 Each conditional table such as x3 x1 x2 for a fixed x1 and x2 can be estimated separately based on the observed counts N x1 x2 x3 Tommi Jaakkola MIT CSAIL 5 ML parameter estimates Y P D G xN1 x1 Y x1 x2 Y Y x1 x2 xN2 x2 N x x x3 x3 x11 x22 x3 The maximum likelihood estimates of parameters x1 x2 1 x1 x2 m x1 x2 for each fixed configuration of x1 and x2 are simply normalized counts cf Markov models x3 x1 x2 where N x1 x2 Tommi Jaakkola MIT CSAIL N x1 x2 x3 N x1 x2 P x3 N x1 x2 x3 6 Bayesian estimates the prior We introduce independent priors e g P3 x1 x2 across variables and for each distinct configuration of the parents Y P G P1 P2 P3 x1 x2 x1 x2 x1 P x1 x1 x2 P x2 x2 x3 P x3 x1 x2 x3 x1 x2 Tommi Jaakkola MIT CSAIL 7 Bayesian estimates the prior We introduce independent priors e g P3 x1 x2 across variables and for each distinct configuration of the parents Moreover we assume that these priors are Dirichlet 1 Y N 0 x1 x2 x3 1 P3 x1 x2 0 x3 x1 x2 Z x 3 with hyper parameters parameters of the N 0 x1 x2 x3 0 interpreted as prior counts Tommi Jaakkola MIT CSAIL prior 8 Bayesian estimates the prior We introduce independent priors e g P3 x1 x2 across variables and for each distinct configuration of the parents Moreover we assume that these priors are Dirichlet 1 Y N 0 x1 x2 x3 1 P3 x1 x2 0 x3 x1 x2 Z x 3 with hyper parameters parameters of the N 0 x1 x2 x3 0 interpreted as prior counts prior This prior is concentrated around 0 N x1 x2 x3 0 x3 x1 x2 N 0 x1 x2 P 0 0 where N x1 x2 N x1 x2 x3 and more so for x3 larger values of N 0 x1 x2 Tommi Jaakkola MIT CSAIL 9 Bayesian estimates the posterior The posterior is also Dirichlet prior likelihood z zY Y 0 N x x x N x x x 1 P3 x1 x2 D x3 x11 x22 3 x3 x11 x22 3 x3 x3 1 Y N 0 x1 x2 x3 N x1 x2 x3 1 x3 x1 x2 Z x 3 with hyper parameters N 0 x1 x2 x3 N x1 x2 x3 Tommi Jaakkola MIT CSAIL 10 Outline Learning Bayesian networks complete data estimating the parameters with fixed structure learning the graph structure Learning Bayesian networks incomplete data EM and structural EM Review Tommi Jaakkola MIT CSAIL 11 Bayesian score x x 1 2 We can use the marginal P x1 x1 P x2 x2 likelihood Bayesian score as a model graph selection x3 criterion P x3 x1 x2 x3 x1 x2 Z P D G P D G P G d Tommi Jaakkola MIT CSAIL 12 Bayesian score x x 1 2 We can use the marginal P x1 x1 P x2 x2 likelihood Bayesian score as a model graph selection x3 criterion P x3 x1 x2 x3 x1 x2 Z P D G P D G P G d The form of the likelihood and the prior Y Y Y Y N x x x N x1 N x2 x3 x11 x22 3 P D G x1 x2 x1 P G x1 x2 x3 x2 Y N 0 x1 x1 x1 z P1 Y N 0 x2 x2 x2 z P2 Y Y N 0 x1 x2 x3 x3 x1 x2 x1 x2 x3 z P3 x x 1 2 permit us to evaluate the Bayesian score locally Tommi Jaakkola MIT CSAIL 13 Bayesian score graphs x1 x2 P x1 x1 P x2 x2 x3 P x3 x1 x2 x3 x1 x2 The Bayesian score reduces to a product of local terms Z P D G P D G P G d Z Y Y Y Y N x x x N x N x 1 2 3 2 x 1 P G d x2 x3 x1 x2 1 x1 x2 P D1 G P D2 G x1 x2 x3 Y P D3 x1 x2 G x1 x2 Tommi Jaakkola MIT CSAIL 14 Learning Bayesian networks We can perform a greedy search over equivalence classes of Bayesian networks based on the score x1 x2 x1 x3 x2 x1 x3 x1 x1 x3 Tommi Jaakkola MIT CSAIL x2 x1 x1 x2 x3 x2 x3 x2 x3 x3 x2 x3 x1 x3 x2 x1 x2 x1 x2 x3 x1 x2 x3 15 Review for the final The final is comprehensive Major concepts regression active learning classification margins kernels feature selection over fitting regularization generalization model selection latent variable models estimation with incomplete data clustering objectives graphs and probabilities inference Tommi Jaakkola MIT CSAIL 16
View Full Document
Unlocking...