Machine learning: lecture 24
Tommi S. Jaakkola, MIT CSAIL ([email protected])

• Learning Bayesian networks: complete data
  – estimating the parameters with fixed structure
  – learning the graph structure
• Learning Bayesian networks: incomplete data
  – EM and structural EM
• Review

Probabilities and conditional tables

• Simple example: three discrete variables, each taking m possible values, with the graph x1 → x3 ← x2 and parameterization

    P(x_1) = \theta_{x_1}, \quad P(x_2) = \theta_{x_2}, \quad P(x_3 | x_1, x_2) = \theta_{x_3 | x_1, x_2}

We assume that the model is fully parameterized in the sense that \{\theta_{x_1}\}, \{\theta_{x_2}\}, and \{\theta_{x_3 | x_1, x_2}\} (one for each distinct configuration of x_1 and x_2) are unrestricted and can be chosen independently of each other.

Likelihood and complete data

• When the observed data points are complete, the likelihood has a simple form:

    P(D | G, \theta) = \prod_{t=1}^{n} P(x_1^t) \, P(x_2^t) \, P(x_3^t | x_1^t, x_2^t)
                     = \prod_{x_1} \theta_{x_1}^{N(x_1)} \times \prod_{x_2} \theta_{x_2}^{N(x_2)} \times \prod_{x_1, x_2} \prod_{x_3} \theta_{x_3 | x_1, x_2}^{N(x_1, x_2, x_3)}

Each conditional table, such as \theta_{x_3 | x_1, x_2} for a fixed x_1 and x_2, can therefore be estimated separately based on the observed counts N(x_1, x_2, x_3).

ML parameter estimates

• The maximum likelihood estimates of the parameters \theta_{\cdot | x_1, x_2} = \{\theta_{1 | x_1, x_2}, \ldots, \theta_{m | x_1, x_2}\} for each fixed configuration of x_1 and x_2 are simply normalized counts (cf. Markov models):

    \hat{\theta}_{x_3 | x_1, x_2} = \frac{N(x_1, x_2, x_3)}{N(x_1, x_2)}, \quad \text{where } N(x_1, x_2) = \sum_{x_3} N(x_1, x_2, x_3)

Bayesian estimates: the prior

• We introduce independent priors, e.g., P_3(\theta_{\cdot | x_1, x_2}), across variables and for each distinct configuration of the parents:

    P(\theta | G) = P_1(\theta_\cdot) \times P_2(\theta_\cdot) \times \prod_{x_1, x_2} P_3(\theta_{\cdot | x_1, x_2})

• Moreover, we assume that these priors are Dirichlet:

    P_3(\theta_{\cdot | x_1, x_2}) = \frac{1}{Z_0} \prod_{x_3} \theta_{x_3 | x_1, x_2}^{N_0(x_1, x_2, x_3) - 1}

with hyper-parameters (parameters of the prior) N_0(x_1, x_2, x_3) \ge 0, interpreted as prior "counts". This prior is concentrated around

    \theta^0_{x_3 | x_1, x_2} = \frac{N_0(x_1, x_2, x_3)}{N_0(x_1, x_2)}, \quad \text{where } N_0(x_1, x_2) = \sum_{x_3} N_0(x_1, x_2, x_3),

and more so for larger values of N_0(x_1, x_2).

Bayesian estimates: the posterior

• The posterior is also Dirichlet:

    P_3(\theta_{\cdot | x_1, x_2} | D) \propto \underbrace{\prod_{x_3} \theta_{x_3 | x_1, x_2}^{N(x_1, x_2, x_3)}}_{\text{likelihood}} \cdot \underbrace{\prod_{x_3} \theta_{x_3 | x_1, x_2}^{N_0(x_1, x_2, x_3) - 1}}_{\text{prior}}
                                       = \frac{1}{Z} \prod_{x_3} \theta_{x_3 | x_1, x_2}^{N_0(x_1, x_2, x_3) + N(x_1, x_2, x_3) - 1}

with hyper-parameters N_0(x_1, x_2, x_3) + N(x_1, x_2, x_3).
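To make the count-based estimation concrete, here is a minimal Python sketch (my own illustration, not lecture code; the toy data and array layout are assumptions) that tabulates the counts N(x_1, x_2, x_3) from complete data and forms the ML estimate of the table \theta_{x_3 | x_1, x_2} as normalized counts:

```python
import numpy as np

m = 3  # each variable takes one of m values {0, ..., m-1}
rng = np.random.default_rng(0)
data = rng.integers(0, m, size=(1000, 3))  # toy complete data: rows (x1, x2, x3)

# tabulate counts N(x1, x2, x3)
N = np.zeros((m, m, m))
for x1, x2, x3 in data:
    N[x1, x2, x3] += 1

# ML estimate: theta_hat(x3 | x1, x2) = N(x1, x2, x3) / N(x1, x2)
N_parents = N.sum(axis=2, keepdims=True)         # N(x1, x2)
theta_hat = N / N_parents                        # assumes every (x1, x2) was observed
assert np.allclose(theta_hat.sum(axis=2), 1.0)   # one distribution per parent config
```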
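Continuing the same sketch, the Bayesian update only adds the prior pseudo-counts N_0 to the observed counts; the posterior-mean estimate computed below is one convenient summary of the resulting Dirichlet posterior (the uniform choice N_0 = 1 is illustrative, not from the lecture; the MAP estimate would instead use N_0 + N - 1 in the numerator):

```python
# Dirichlet prior as pseudo-counts N0(x1, x2, x3); uniform 1.0 is an
# illustrative choice with a uniform prior mean
N0 = np.ones((m, m, m))

# the posterior is Dirichlet with counts N0 + N, so the posterior-mean
# estimate is just a smoothed normalized count
post = N0 + N
theta_bayes = post / post.sum(axis=2, keepdims=True)
```

Note that the smoothing also handles parent configurations that never occur in the data, where the ML estimate is undefined.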
Bayesian score

• We can use the marginal likelihood (Bayesian score) as a model (graph) selection criterion:

    P(D | G) = \int P(D | G, \theta) \, P(\theta | G) \, d\theta

• The form of the likelihood and the prior,

    P(D | G, \theta) = \prod_{x_1} \theta_{x_1}^{N(x_1)} \times \prod_{x_2} \theta_{x_2}^{N(x_2)} \times \prod_{x_1, x_2} \prod_{x_3} \theta_{x_3 | x_1, x_2}^{N(x_1, x_2, x_3)}

    P(\theta | G) \propto \underbrace{\prod_{x_1} \theta_{x_1}^{N_0(x_1) - 1}}_{P_1(\theta_\cdot)} \times \underbrace{\prod_{x_2} \theta_{x_2}^{N_0(x_2) - 1}}_{P_2(\theta_\cdot)} \times \underbrace{\prod_{x_1, x_2} \prod_{x_3} \theta_{x_3 | x_1, x_2}^{N_0(x_1, x_2, x_3) - 1}}_{P_3(\theta_{\cdot | x_1, x_2})}

permits us to evaluate the Bayesian score locally.

Bayesian score: graphs

• The Bayesian score reduces to a product of local terms, one per variable and parent configuration:

    P(D | G) = \int P(D | G, \theta) \, P(\theta | G) \, d\theta
             = P(D_1 | G) \times P(D_2 | G) \times \prod_{x_1, x_2} P(D_{3 | x_1, x_2} | G)

(A small numerical sketch of this local computation appears after the review outline below.)

Learning Bayesian networks

• We can perform a greedy search over (equivalence classes of) Bayesian networks based on the score.

(Figure: the candidate three-node network structures over x_1, x_2, x_3 considered during the search.)

Review for the final

• The final is comprehensive
• Major concepts:
  – regression, active learning
  – classification, margins, kernels, feature selection
  – over-fitting, regularization, generalization, model selection
  – latent variable models, estimation with incomplete data
  – clustering, objectives
  – graphs and probabilities, inference
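As the sketch promised above: integrating out each Dirichlet-multinomial family in the Bayesian score gives a closed-form ratio of Gamma functions per parent configuration. The following self-contained code is my own illustration under assumed toy data and uniform pseudo-counts, not lecture code (the closed form itself is the standard Dirichlet-multinomial integral):

```python
import numpy as np
from scipy.special import gammaln

def local_log_score(N, N0):
    """Log marginal likelihood of one (conditional) probability table.

    N, N0: observed counts and Dirichlet pseudo-counts of the same shape,
    with the child variable on the last axis. For each parent configuration,
    integrating the Dirichlet-multinomial gives
        log G(N0.) - log G(N0. + N.) + sum_x [log G(N0 + N) - log G(N0)]
    where G is the Gamma function and '.' marks a sum over the child values.
    """
    N0p, Np = N0.sum(axis=-1), N.sum(axis=-1)
    per_cfg = (gammaln(N0p) - gammaln(N0p + Np)
               + (gammaln(N0 + N) - gammaln(N0)).sum(axis=-1))
    return np.sum(per_cfg)

# toy counts for the graph x1 -> x3 <- x2 (m = 2 for brevity)
m = 2
rng = np.random.default_rng(1)
data = rng.integers(0, m, size=(500, 3))
N3 = np.zeros((m, m, m))
for x1, x2, x3 in data:
    N3[x1, x2, x3] += 1

# total log Bayesian score: one local term per family
log_score = (local_log_score(N3.sum(axis=(1, 2)), np.ones(m))    # x1 term
             + local_log_score(N3.sum(axis=(0, 2)), np.ones(m))  # x2 term
             + local_log_score(N3, np.ones((m, m, m))))          # x3 | x1, x2
```

Because the score decomposes into per-family terms, a greedy structure search only needs to recompute the local terms of the families whose parent sets changed at each step.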