MIT 6.867 - Machine learning: lecture 14

Machine learning: lecture 14
Tommi S. Jaakkola, MIT CSAIL

• Probabilistic modeling
  – mixture models and the EM algorithm
  – mixtures of experts, examples
  – hierarchical models

Review: mixture density

• Data generation process: (Figure: a component label y ∈ {1, 2} is drawn from P(y), and x is then drawn from the corresponding component density P(x|y=1) or P(x|y=2).)

    p(x|\theta) = \sum_{j=1,2} p_j \, p(x|\mu_j, \Sigma_j)    (mixture of Gaussians)

• Any data point x could have been generated in two ways; the component responsible for generating x needs to be inferred.

Review: the EM algorithm

E-step: softly assign examples to mixture components

    \hat{p}(j|i) \leftarrow P(y_i = j \,|\, x_i, \theta),    for all j = 1, 2 and i = 1, ..., n

M-step: re-estimate the parameters (separately for the two Gaussians) based on the soft assignments:

    \hat{p}_j \leftarrow \frac{\hat{n}_j}{n},  where  \hat{n}_j = \sum_{i=1}^n \hat{p}(j|i)

    \hat{\mu}_j \leftarrow \frac{1}{\hat{n}_j} \sum_{i=1}^n \hat{p}(j|i) \, x_i

    \hat{\Sigma}_j \leftarrow \frac{1}{\hat{n}_j} \sum_{i=1}^n \hat{p}(j|i) \, (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T
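The E-step and M-step above translate almost directly into code. The sketch below is a minimal illustration using NumPy and SciPy and is not part of the lecture; the function name em_step and the variable names (X, resp, n_hat, ...) are illustrative choices, and the two-component case is hard-coded to match the slides.

```python
# A minimal sketch of one EM iteration for a two-component Gaussian mixture,
# mirroring the E-step / M-step updates above. Names are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, p, mu, Sigma):
    """One EM iteration for a 2-component Gaussian mixture.

    X:     (n, d) data matrix
    p:     (2,)   mixing proportions p_j
    mu:    list of two (d,) mean vectors
    Sigma: list of two (d, d) covariance matrices
    """
    n = X.shape[0]

    # E-step: soft assignments  p_hat(j|i) = p_j N(x_i; mu_j, Sigma_j) / p(x_i|theta)
    dens = np.column_stack([
        p[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
        for j in range(2)
    ])                                        # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted ML estimates, separately for each component
    n_hat = resp.sum(axis=0)                  # effective counts n_hat_j
    p_new = n_hat / n
    mu_new, Sigma_new = [], []
    for j in range(2):
        m = resp[:, j] @ X / n_hat[j]                          # weighted mean
        diff = X - m
        S = (resp[:, j, None] * diff).T @ diff / n_hat[j]      # weighted covariance
        mu_new.append(m)
        Sigma_new.append(S)
    return p_new, mu_new, Sigma_new
```

In practice one would call em_step repeatedly and stop once the log-likelihood changes by less than some tolerance; that log-likelihood is exactly the monotone quantity discussed next.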
Mixture demo

The EM-algorithm

• Each iteration of the EM-algorithm monotonically increases the (log-)likelihood of the n training examples x_1, ..., x_n:

    l(D; \theta) = \sum_{i=1}^n \log \overbrace{\big[\, p_1 \, p(x_i|\mu_1, \Sigma_1) + p_2 \, p(x_i|\mu_2, \Sigma_2) \,\big]}^{p(x_i|\theta)}

  where \theta = \{p_1, p_2, \mu_1, \mu_2, \Sigma_1, \Sigma_2\} contains all the parameters of the mixture model.

  (Figure: log-likelihood of the training data plotted over EM iterations.)

The EM algorithm: why does it work?

• To show that EM increases the log-likelihood of the data after each iteration, we resort to the following auxiliary objective

    J(q; \theta) = \sum_i \sum_{j=1,2} q(j|i) \log \frac{p_j \, p(x_i|\mu_j, \Sigma_j)}{q(j|i)}

  where q(j|i) are soft assignments of examples to mixture components. We view both q and \theta as parameters to be optimized.

• We need to specify
  – how we will use this objective (algorithm)
  – how the algorithm relates to EM
  – how the objective relates to the log-likelihood

The auxiliary objective: algorithm

• We maximize this objective in the following alternating steps. Let \theta^{(0)} be the initial setting of the parameters; then at iteration k:

    q^{(k)} = \arg\max_q J(q; \theta^{(k)})    (E-step)
    \theta^{(k+1)} = \arg\max_\theta J(q^{(k)}; \theta)    (M-step)

• Clearly these steps lead to a monotonically increasing value of the objective J(q; \theta):

    J(q^{(k-1)}; \theta^{(k)}) \le J(q^{(k)}; \theta^{(k)}) \le J(q^{(k)}; \theta^{(k+1)})

Relation to EM

• Suppose we set q(j|i) to be the posterior assignments \hat{p}(j|i) based on the current setting of the parameters \theta:

    J(\hat{p}; \theta) = \sum_i \sum_{j=1,2} \hat{p}(j|i) \log \frac{p_j \, p(x_i|\mu_j, \Sigma_j)}{\hat{p}(j|i)}

    = \sum_i \sum_{j=1,2} \hat{p}(j|i) \log p_j \, p(x_i|\mu_j, \Sigma_j) + \text{const.}

    = \sum_{j=1,2} \underbrace{\Big[ \sum_i \hat{p}(j|i) \log p_j \, p(x_i|\mu_j, \Sigma_j) \Big]}_{\text{weighted log-likelihood}} + \text{const.}

  The estimation of \theta therefore reduces to finding ML Gaussians based on a weighted training set, separately for each component ⇒ the M-step of the EM-algorithm.

Relation to the log-likelihood

• The objective J has two relevant properties for our purposes:

    J(q; \theta) \le l(D; \theta)    (1)
    J(\hat{p}; \theta) = l(D; \theta)    (2)

  where \hat{p}(j|i) = P(j|x_i, \theta) are the posterior assignments.

Proof of property 1

For any setting of q(j|i), by Jensen's inequality (log is concave, so the average of logs is at most the log of the average):

    J(q; \theta) = \sum_i \sum_{j=1,2} q(j|i) \log \frac{p_j \, p(x_i|\mu_j, \Sigma_j)}{q(j|i)}

    \le \sum_i \log \sum_{j=1,2} q(j|i) \, \frac{p_j \, p(x_i|\mu_j, \Sigma_j)}{q(j|i)}

    = \sum_i \log \sum_{j=1,2} p_j \, p(x_i|\mu_j, \Sigma_j)

    = \sum_i \log p(x_i|\theta) = l(D; \theta)

Proof of property 2

By setting q(j|i) equal to the posterior assignments

    \hat{p}(j|i) = \frac{p_j \, p(x_i|\mu_j, \Sigma_j)}{\sum_{j'=1,2} p_{j'} \, p(x_i|\mu_{j'}, \Sigma_{j'})} = \frac{p_j \, p(x_i|\mu_j, \Sigma_j)}{p(x_i|\theta)}

we get

    J(\hat{p}; \theta) = \sum_i \sum_{j=1,2} \hat{p}(j|i) \log \frac{p_j \, p(x_i|\mu_j, \Sigma_j)}{\hat{p}(j|i)}

    = \sum_i \sum_{j=1,2} \hat{p}(j|i) \log p(x_i|\theta)

    = \sum_i \log p(x_i|\theta) = l(D; \theta)

since the ratio inside the log equals p(x_i|\theta) and \sum_{j=1,2} \hat{p}(j|i) = 1.

Completing the argument

• The algorithm and properties:

    q^{(k)} = \arg\max_q J(q; \theta^{(k)})    (E-step)
    \theta^{(k+1)} = \arg\max_\theta J(q^{(k)}; \theta)    (M-step)
    J(q; \theta) \le l(D; \theta)    (property 1)
    J(\hat{p}; \theta) = l(D; \theta)    (property 2)

• We can now set up the following chain of inequalities:

    l(D; \theta^{(k)}) = J(q^{(k)}; \theta^{(k)})    (E-step, prop. 2)
    \le J(q^{(k)}; \theta^{(k+1)})    (M-step)
    \le l(D; \theta^{(k+1)})    (prop. 1)

  so the log-likelihood cannot decrease from one EM iteration to the next.
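Properties (1) and (2) are easy to check numerically. The snippet below is a small self-contained sketch in the same illustrative style as the earlier one (the helper names component_densities, log_lik and J_aux, and the synthetic data, are assumptions, not from the lecture): it evaluates J(q; θ) at arbitrary soft assignments and at the posterior assignments, and compares both against l(D; θ).

```python
# Numerical check of J(q; theta) <= l(D; theta) and J(p_hat; theta) = l(D; theta)
# for a two-component Gaussian mixture. Illustrative sketch only.
import numpy as np
from scipy.stats import multivariate_normal

def component_densities(X, p, mu, Sigma):
    """(n, 2) array with entries p_j * N(x_i; mu_j, Sigma_j)."""
    return np.column_stack([
        p[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
        for j in range(2)
    ])

def log_lik(X, p, mu, Sigma):
    """l(D; theta) = sum_i log p(x_i | theta)."""
    return np.log(component_densities(X, p, mu, Sigma).sum(axis=1)).sum()

def J_aux(q, X, p, mu, Sigma):
    """J(q; theta) = sum_i sum_j q(j|i) log[ p_j N(x_i; mu_j, Sigma_j) / q(j|i) ].

    Assumes q(j|i) > 0 for all i, j.
    """
    dens = component_densities(X, p, mu, Sigma)
    return np.sum(q * (np.log(dens) - np.log(q)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                         # synthetic data
p = np.array([0.5, 0.5])
mu = [np.zeros(2), np.ones(2)]
Sigma = [np.eye(2), np.eye(2)]

dens = component_densities(X, p, mu, Sigma)
post = dens / dens.sum(axis=1, keepdims=True)         # posterior p_hat(j|i)
q_rand = rng.dirichlet([1.0, 1.0], size=X.shape[0])   # arbitrary soft assignments

print(J_aux(q_rand, X, p, mu, Sigma) <= log_lik(X, p, mu, Sigma))           # property 1
print(np.isclose(J_aux(post, X, p, mu, Sigma), log_lik(X, p, mu, Sigma)))   # property 2
```

Both checks should print True (up to floating-point error): any q gives a lower bound on the log-likelihood, and the posterior assignments make the bound tight, which is exactly what the chain of inequalities above relies on.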

