Princeton COS 424 - Mixture Models and Expectation-Maximization

Mixture Models and Expectation-Maximization
David M. Blei
March 9, 2012

EM for mixtures of multinomials

• The graphical model for a mixture of multinomials. [Plate diagram: observed words x_{dn} in plates of size N and D, a per-document cluster assignment z_d, cluster proportions \pi, and cluster parameters \theta_k in a plate of size K.]
• How should we fit the parameters? Maximum likelihood.
• There is an issue:

      \log p(x_{1:D} \mid \pi, \theta_{1:K}) = \sum_{d=1}^{D} \log p(x_d \mid \pi, \theta_{1:K})                                      (1)
                                             = \sum_{d=1}^{D} \log \sum_{z_d=1}^{K} p(z_d \mid \pi)\, p(x_d \mid z_d, \theta_{1:K})   (2)

• In the second line, the log cannot help simplify the summation.
• Why is that summation there? Because z_d is a hidden variable.
• If z_d were observed, what happens? Recall classification. The likelihood would decompose.
• In general, fitting hidden variable models with maximum likelihood is hard. The EM algorithm can help. (But there are exceptions in both directions: some hidden variable models don't require EM, and for others EM is not enough.)
• The expectation-maximization algorithm is a general-purpose technique for fitting parameters in models with hidden variables.
• Here is the algorithm for mixtures (in English).
• REPEAT:
  1. For each document d, compute the conditional distribution of its cluster assignment z_d given the current setting of the parameters \pi^{(t)} and \theta^{(t)}_{1:K}.
  2. Obtain \pi^{(t+1)} and \theta^{(t+1)}_{1:K} by computing "MLEs" for each cluster, but weighting each data point by its posterior probability of being in that cluster.
• Notice that this resembles k-means. What are the differences?
• This strategy simply works. You can apply it in all the settings that we saw at the end of the slides. But we need to be precise about what we mean by "weighting."
• REPEAT:
  1. E-step: For each document d, compute \lambda_d = p(z_d \mid x_d, \pi^{(t)}, \theta^{(t)}_{1:K}).
  2. M-step: For each cluster label's distribution over terms,

         \theta^{(t+1)}_{k,v} = \frac{\sum_{d=1}^{D} \lambda_{d,k}\, n_v(x_d)}{\sum_{d=1}^{D} \lambda_{d,k}\, n(x_d)}    (3)

     and for the cluster proportions,

         \pi^{(t+1)}_{k} = \frac{\sum_{d=1}^{D} \lambda_{d,k}}{D}    (4)

• Do we know how to do the E-step?

      p(z_d = k \mid x_d, \pi^{(t)}, \theta^{(t)}_{1:K}) \propto \pi_k \prod_{v=1}^{V} \theta_{k,v}^{\,n_v(x_d)}    (5)

  This is precisely the prediction computation from classification (e.g., from "is this email Spam or Ham?" to "is this email class 1, 2, ..., 10?"). We emphasize that the parameters (in the E-step) are fixed.
• How about the M-step for the cluster-conditional distributions \theta_k?
  – The numerator is the expected number of times I saw word v in class k. What is the expectation taken with respect to? The posterior distribution of the hidden variable, the variable that determines which cluster parameter each word was generated from.
  – The denominator is the expected number of words (total) that came from class k.
• How about the M-step for the cluster proportions \pi?
  – The numerator is the expected number of times I saw a document from cluster k.
  – The denominator is the number of documents.
• These are like traditional MLEs, but with expected counts. This is how EM works. (A short code sketch of these updates follows at the end of this section.)
• It is actually maximizing the likelihood. Let's see why.
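To make the E-step (5) and the M-steps (3) and (4) concrete, here is a minimal sketch in Python/NumPy. It assumes the corpus is stored as a D x V matrix of word counts; the function name, the random Dirichlet initialization, and the small smoothing constant are illustrative choices, not part of the notes.

import numpy as np

def em_multinomial_mixture(counts, K, n_iters=100, seed=0):
    """counts: D x V array with counts[d, v] = n_v(x_d)."""
    D, V = counts.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                    # cluster proportions pi
    theta = rng.dirichlet(np.ones(V), size=K)   # K x V cluster-term distributions theta_k

    for _ in range(n_iters):
        # E-step, equation (5): lambda_{d,k} is proportional to
        # pi_k * prod_v theta_{k,v}^{n_v(x_d)}, computed in log space and
        # renormalized per document.
        log_post = np.log(pi)[None, :] + counts @ np.log(theta).T   # D x K
        log_post -= log_post.max(axis=1, keepdims=True)
        lam = np.exp(log_post)
        lam /= lam.sum(axis=1, keepdims=True)   # lambda_d = p(z_d | x_d, pi, theta)

        # M-step, equation (3): expected word counts per cluster, renormalized
        # over the vocabulary (small constant keeps theta strictly positive).
        expected_counts = lam.T @ counts + 1e-12
        theta = expected_counts / expected_counts.sum(axis=1, keepdims=True)

        # M-step, equation (4): expected number of documents per cluster over D.
        pi = lam.sum(axis=0) / D

    return pi, theta, lam

# Example usage with a toy 5-document, 4-term count matrix (made up for illustration).
toy_counts = np.array([[5, 1, 0, 0],
                       [4, 2, 1, 0],
                       [0, 0, 6, 3],
                       [0, 1, 5, 4],
                       [3, 2, 0, 1]])
pi_hat, theta_hat, lam_hat = em_multinomial_mixture(toy_counts, K=2)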
Mixtures of Gaussians

• The mixture of Gaussians model is equivalent, except x_d is now a Gaussian observation.
• The mixture components are means \mu_{1:K} and variances \sigma^2_{1:K}. The proportions are \pi.
• Mixtures of Gaussians can express different kinds of distributions:
  – Multimodal
  – Heavy-tailed
• As a density estimator, arbitrarily complex densities can be approximated with a mixture of Gaussians (given enough components).
• We fit these mixtures with EM.
• The E-step is conceptually the same. With the current setting of the parameters, compute the conditional distribution of the hidden variable z_d for each data point x_d,

      p(z_d \mid x_d, \pi, \mu_{1:K}, \sigma^2_{1:K}) \propto p(z_d \mid \pi)\, p(x_d \mid z_d, \mu_{1:K}, \sigma^2_{1:K}).    (6)

  This is just like for the mixtures of multinomials, only the data are now Gaussian.
• The M-step computes the empirical mean and variance for each component.
  – In \mu_k, each point is weighted by its posterior probability of being in the cluster.
  – In \sigma^2_k, weight (x_d - \hat{\mu}_k)^2 by the same posterior probability.
• The complex density estimate comes out of the predictive distribution

      p(x \mid \pi, \mu_{1:K}, \sigma^2_{1:K}) = \sum_{z} p(z \mid \pi)\, p(x \mid z, \mu_{1:K}, \sigma^2_{1:K})    (7)
                                               = \sum_{k=1}^{K} \pi_k\, p(x \mid \mu_k, \sigma^2_k)               (8)

  This is K Gaussian bumps, each weighted by \pi_k.

Why does this work?

• EM maximizes the likelihood.
• Generic model:
  – The hidden variables are z.
  – The observations are x.
  – The parameters are \theta.
  – The joint distribution is p(z, x \mid \theta).
• The complete log likelihood is

      \ell_c(\theta; x, z) = \log p(z, x \mid \theta)    (9)

• For a distribution q(z), the expected complete log likelihood is E_q[\log p(Z, x \mid \theta)].
• Jensen's inequality: for a concave function f and \lambda \in (0, 1),

      f(\lambda x + (1 - \lambda) y) \geq \lambda f(x) + (1 - \lambda) f(y)    (10)

• Draw a picture with
  – x
  – y
  – \lambda x + (1 - \lambda) y
  – f(\lambda x + (1 - \lambda) y)
  – \lambda f(x) + (1 - \lambda) f(y)
• This generalizes to probability distributions. For concave f,

      f(E[X]) \geq E[f(X)].    (11)

• Jensen's bound and L(q, \theta):

      \ell(\theta; x) = \log p(x \mid \theta)                                       (12)
                      = \log \sum_{z} p(x, z \mid \theta)                           (13)
                      = \log \sum_{z} q(z) \frac{p(x, z \mid \theta)}{q(z)}         (14)
                      \geq \sum_{z} q(z) \log \frac{p(x, z \mid \theta)}{q(z)}      (15)
                      = L(q, \theta).                                               (16)

  This bound on the log likelihood depends on a distribution over the hidden variables q(z) and on the parameters \theta.
• EM is a coordinate ascent algorithm. We optimize L with respect to each argument, holding the other fixed:

      q^{(t+1)}(z) = \arg\max_{q} L(q, \theta^{(t)})              (17)
      \theta^{(t+1)} = \arg\max_{\theta} L(q^{(t+1)}, \theta)     (18)

• In the E-step, we find q^*(z). This equals the posterior p(z \mid x, \theta):

      L(p(z \mid x, \theta), \theta) = \sum_{z} p(z \mid x, \theta) \log \frac{p(x, z \mid \theta)}{p(z \mid x, \theta)}    (19)
                                     = \sum_{z} p(z \mid x, \theta) \log p(x \mid \theta)                                   (20)
                                     = \log p(x \mid \theta)                                                                (21)
                                     = \ell(\theta; x)                                                                      (22)

• In the M-step, we only need to worry about the expected complete log likelihood because

      L(q, \theta) = \sum_{z} q(z) \log p(x, z \mid \theta) - \sum_{z} q(z) \log q(z).    (23)

  The second term does not depend on the parameter \theta. The first term is the expected complete log likelihood.
• Perspective: optimize the bound; tighten the bound.
• This finds a local optimum and is sensitive to starting points.
• Test for convergence: after the E-step, we can compute the likelihood at the point \theta^{(t)}. (The sketch at the end of these notes checks this numerically.)

Big picture

• With EM, we can fit all kinds of hidden variable models with maximum likelihood. EM has had a huge impact on statistics, machine learning, signal processing, computer vision, natural language processing, speech recognition, genomics, and other fields.
• Missing data was no longer a block. It opened up the idea:
  – Imagine I can observe everything. Then I can fit the MLE.
  – But I cannot observe everything. With EM, I can still try to fit the MLE.
• For example:
  – Used to "fill in" missing pixels in tomography.
  – Modeling nonresponses in sample surveys.
  – Many applications in
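To wrap up, here is a minimal sketch of EM for a one-dimensional mixture of Gaussians, in the same style as the earlier sketch. It follows the E-step (6), the posterior-weighted M-step, and the convergence test from the "Why does this work?" section, and asserts that the log likelihood never decreases across iterations. The synthetic data, the choice K = 2, the tolerance, and all names are illustrative and not taken from the notes.

import numpy as np

def em_gaussian_mixture(x, K, n_iters=200, tol=1e-8, seed=0):
    """x: 1-D array of observations."""
    rng = np.random.default_rng(seed)
    D = x.shape[0]
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)   # initialize means at random data points
    sigma2 = np.full(K, x.var())
    prev_ll = -np.inf

    for _ in range(n_iters):
        # E-step, equation (6): posterior over z_d, proportional to pi_k * N(x_d | mu_k, sigma2_k).
        dens = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        weighted = pi[None, :] * dens                     # D x K
        ll = np.sum(np.log(weighted.sum(axis=1)))         # log p(x | pi, mu, sigma2), cf. (7)-(8)
        lam = weighted / weighted.sum(axis=1, keepdims=True)

        # Coordinate ascent on the bound L(q, theta): the log likelihood never decreases.
        assert ll >= prev_ll - 1e-9
        if ll - prev_ll < tol:
            break
        prev_ll = ll

        # M-step: posterior-weighted means and variances, and expected cluster proportions.
        Nk = lam.sum(axis=0)
        mu = (lam * x[:, None]).sum(axis=0) / Nk
        sigma2 = (lam * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
        pi = Nk / D

    return pi, mu, sigma2, ll

# Example on synthetic data: two Gaussian bumps, so the fitted density should be bimodal.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gaussian_mixture(data, K=2))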

