Machine learning: lecture 14
Tommi S. Jaakkola
MIT CSAIL
[email protected]

• Probabilistic modeling
  – mixture models and the EM algorithm
  – mixtures of experts, examples
  – hierarchical models

Review: mixture density

• Data generation process:

  (Figure: the two component densities P(x|y = 1) and P(x|y = 2), together with the prior P(y) over the components y = 1, 2.)

      p(x | \theta) = \sum_{j=1,2} p_j \, p(x | \mu_j, \Sigma_j)    (mixture of Gaussians)

• Any data point x could have been generated in two ways; the component responsible for generating x needs to be inferred.

Review: the EM algorithm

E-step: softly assign examples to mixture components

      \hat{p}(j|i) \leftarrow P(y_i = j \mid x_i, \theta), \quad \text{for all } j = 1, 2 \text{ and } i = 1, \ldots, n

M-step: re-estimate the parameters (separately for the two Gaussians) based on the soft assignments.

      \hat{p}_j \leftarrow \frac{\hat{n}_j}{n}, \quad \text{where } \hat{n}_j = \sum_{i=1}^{n} \hat{p}(j|i)

      \hat{\mu}_j \leftarrow \frac{1}{\hat{n}_j} \sum_{i=1}^{n} \hat{p}(j|i)\, x_i

      \hat{\Sigma}_j \leftarrow \frac{1}{\hat{n}_j} \sum_{i=1}^{n} \hat{p}(j|i)\, (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T

Mixture demo

The EM algorithm

• Each iteration of the EM algorithm monotonically increases the (log-)likelihood of the n training examples x_1, ..., x_n:

      l(D; \theta) = \sum_{i=1}^{n} \log \underbrace{\left[\, p_1\, p(x_i | \mu_1, \Sigma_1) + p_2\, p(x_i | \mu_2, \Sigma_2) \,\right]}_{p(x_i | \theta)}

  where θ = {p_1, p_2, µ_1, µ_2, Σ_1, Σ_2} contains all the parameters of the mixture model.

  (Plot: log-likelihood as a function of EM iteration.)

The EM algorithm: why does it work?

• To show that EM increases the log-likelihood of the data after each iteration we resort to the following auxiliary objective

      J(q; \theta) = \sum_i \sum_{j=1,2} q(j|i) \log \frac{p_j\, p(x_i | \mu_j, \Sigma_j)}{q(j|i)}

  where q(j|i) are soft assignments of examples to mixture components. We view both q and θ as parameters to be optimized.

• We need to specify
  – how we will use this objective (algorithm)
  – how the algorithm relates to EM
  – how the objective relates to the log-likelihood

The auxiliary objective: algorithm

• We maximize this objective in the following alternating steps. Let θ^(0) be the initial setting of the parameters; then at iteration k:

      q^{(k)} = \arg\max_q J(q; \theta^{(k)}) \quad \text{(E-step)}

      \theta^{(k+1)} = \arg\max_\theta J(q^{(k)}; \theta) \quad \text{(M-step)}

• Clearly these steps lead to a monotonically increasing value of the objective J(q; θ):

      J(q^{(k-1)}; \theta^{(k)}) \le J(q^{(k)}; \theta^{(k)}) \le J(q^{(k)}; \theta^{(k+1)})

Relation to EM

• Suppose we set q(j|i) to be the posterior assignments p̂(j|i) based on the current setting of the parameters θ:

      J(\hat{p}; \theta) = \sum_i \sum_{j=1,2} \hat{p}(j|i) \log \frac{p_j\, p(x_i | \mu_j, \Sigma_j)}{\hat{p}(j|i)}

                         = \sum_i \sum_{j=1,2} \hat{p}(j|i) \log \left[ p_j\, p(x_i | \mu_j, \Sigma_j) \right] + \text{const.}

                         = \sum_{j=1,2} \underbrace{\left[ \sum_i \hat{p}(j|i) \log\, p_j\, p(x_i | \mu_j, \Sigma_j) \right]}_{\text{weighted log-likelihood}} + \text{const.}

  The estimation of θ then reduces to finding ML Gaussians based on a weighted training set, separately for each component ⇒ the M-step of the EM algorithm.
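As a concrete illustration of the E-step, M-step, and log-likelihood above, here is a minimal NumPy/SciPy sketch of EM for a two-component Gaussian mixture. The toy data, the initialization, and the helper names (e_step, m_step, log_likelihood) are assumptions made for this example rather than code from the lecture; the loop at the end checks the monotone-increase property discussed in the remaining slides.

```python
# Minimal EM sketch for a two-component Gaussian mixture (illustrative only).
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, p, mu, Sigma):
    """Soft assignments p_hat[i, j] = P(y_i = j | x_i, theta)."""
    # unnormalized posteriors: p_j * N(x_i | mu_j, Sigma_j)
    w = np.column_stack([p[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                         for j in range(2)])
    return w / w.sum(axis=1, keepdims=True)

def m_step(X, p_hat):
    """Weighted ML estimates: mixing weights, means, covariances."""
    n = X.shape[0]
    n_j = p_hat.sum(axis=0)                          # \hat{n}_j
    p = n_j / n                                      # \hat{p}_j
    mu = [p_hat[:, j] @ X / n_j[j] for j in range(2)]
    Sigma = [(p_hat[:, j, None] * (X - mu[j])).T @ (X - mu[j]) / n_j[j]
             for j in range(2)]
    return p, mu, Sigma

def log_likelihood(X, p, mu, Sigma):
    """l(D; theta) = sum_i log sum_j p_j N(x_i | mu_j, Sigma_j)."""
    w = np.column_stack([p[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                         for j in range(2)])
    return np.log(w.sum(axis=1)).sum()

# Toy data and initialization (assumed, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
p, mu, Sigma = np.array([0.5, 0.5]), [X[0], X[-1]], [np.eye(2), np.eye(2)]

prev = -np.inf
for _ in range(20):
    p_hat = e_step(X, p, mu, Sigma)                  # E-step
    p, mu, Sigma = m_step(X, p_hat)                  # M-step
    ll = log_likelihood(X, p, mu, Sigma)
    assert ll >= prev - 1e-9                         # monotone increase, as claimed
    prev = ll
print("final log-likelihood:", prev)
```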
Relation to the log-likelihood

• The objective

      J(q; \theta) = \sum_i \sum_{j=1,2} q(j|i) \log \frac{p_j\, p(x_i | \mu_j, \Sigma_j)}{q(j|i)}

  has two relevant properties for our purposes:

      J(q; \theta) \le l(D; \theta) \qquad (1)

      J(\hat{p}; \theta) = l(D; \theta) \qquad (2)

  where p̂(j|i) = P(j | x_i, θ) are the posterior assignments.

Proof of property 1

For any setting of q(j|i),

      J(q; \theta) = \sum_i \sum_{j=1,2} q(j|i) \log \frac{p_j\, p(x_i | \mu_j, \Sigma_j)}{q(j|i)}

                   \le \sum_i \log \sum_{j=1,2} q(j|i)\, \frac{p_j\, p(x_i | \mu_j, \Sigma_j)}{q(j|i)}

                   = \sum_i \log \sum_{j=1,2} p_j\, p(x_i | \mu_j, \Sigma_j)

                   = \sum_i \log p(x_i | \theta) = l(D; \theta)

where the inequality follows from Jensen's inequality (log is concave and \sum_{j=1,2} q(j|i) = 1 for each i).

Proof of property 2

By setting q(j|i) equal to the posterior assignments

      \hat{p}(j|i) = \frac{p_j\, p(x_i | \mu_j, \Sigma_j)}{\sum_{j'=1,2} p_{j'}\, p(x_i | \mu_{j'}, \Sigma_{j'})} = \frac{p_j\, p(x_i | \mu_j, \Sigma_j)}{p(x_i | \theta)}

we get

      J(\hat{p}; \theta) = \sum_i \sum_{j=1,2} \hat{p}(j|i) \log \frac{p_j\, p(x_i | \mu_j, \Sigma_j)}{\hat{p}(j|i)}

                         = \sum_i \sum_{j=1,2} \hat{p}(j|i) \log p(x_i | \theta)

                         = \sum_i \log p(x_i | \theta) = l(D; \theta)

since the ratio inside the logarithm equals p(x_i | θ) and \sum_{j=1,2} \hat{p}(j|i) = 1.

Completing the argument

• The algorithm and properties:

      q^{(k)} = \arg\max_q J(q; \theta^{(k)}) \quad \text{(E-step)}

      \theta^{(k+1)} = \arg\max_\theta J(q^{(k)}; \theta) \quad \text{(M-step)}

      J(q; \theta) \le l(D; \theta) \quad \text{(property 1)}

      J(\hat{p}; \theta) = l(D; \theta) \quad \text{(property 2)}

• We can now set up the following chain of inequalities:

      l(D; \theta^{(k)}) = J(q^{(k)}; \theta^{(k)}) \qquad \text{(E-step, prop. 2)}

                         \le J(q^{(k)}; \theta^{(k+1)}) \qquad \text{(M-step)}

                         \le l(D; \theta^{(k+1)}) \qquad \text{(prop. 1)}

  The first equality holds because, by properties 1 and 2, the E-step maximizer q^{(k)} is exactly the posterior p̂ computed under θ^{(k)}. Each EM iteration can therefore only increase (never decrease) the log-likelihood of the training data.
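The two bounding properties are easy to check numerically. The sketch below is not from the lecture: the toy data, the fixed parameter values, and the random soft assignments q are all assumptions chosen purely to exercise the bound J(q; θ) ≤ l(D; θ) and the equality at the posterior p̂.

```python
# Numeric check of properties 1 and 2 for a two-component Gaussian mixture.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))                         # toy data (assumed)
p = np.array([0.3, 0.7])                             # assumed mixing weights
mu = [np.array([-1.0, 0.0]), np.array([1.0, 0.5])]   # assumed means
Sigma = [np.eye(2), 2.0 * np.eye(2)]                 # assumed covariances

# joint weights w[i, j] = p_j * N(x_i | mu_j, Sigma_j)
w = np.column_stack([p[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                     for j in range(2)])
log_lik = np.log(w.sum(axis=1)).sum()                # l(D; theta)

def J(q):
    """J(q; theta) = sum_i sum_j q(j|i) log [ p_j N(x_i|mu_j,Sigma_j) / q(j|i) ]."""
    return np.sum(q * (np.log(w) - np.log(q)))

q_rand = rng.dirichlet([1.0, 1.0], size=X.shape[0])  # arbitrary soft assignments
p_hat = w / w.sum(axis=1, keepdims=True)             # posterior assignments

print(J(q_rand) <= log_lik + 1e-9)                   # property 1: True
print(np.isclose(J(p_hat), log_lik))                 # property 2: True
```

Both printed checks should come out True for any draw of the random q, since property 1 holds for every soft assignment and property 2 pins down the maximizer at the posterior.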