Toronto CSC 2515 - Lecture 5 - Mixture models, EM and Variational Inference

CSC2515 Fall 2007: Introduction to Machine Learning
Lecture 5: Mixture models, EM and variational inference

All lecture slides will be available as .ppt, .ps, & .htm at www.cs.toronto.edu/~hinton
Many of the figures are provided by Chris Bishop from his textbook "Pattern Recognition and Machine Learning".

Slides in this lecture:
•Overview
•Clustering
•The k-means algorithm
•Why K-means converges
•Local minima
•Soft k-means
•A generative view of clustering
•The mixture of Gaussians generative model
•Fitting a mixture of Gaussians
•The E-step: Computing responsibilities
•The M-step: Computing new mixing proportions
•More M-step: Computing the new means
•More M-step: Computing the new variances
•How do we know that the updates improve things?
•Why EM converges
•The expected energy of a datapoint
•The entropy term
•The E-step chooses the assignment probabilities that minimize the cost function (with the parameters of the Gaussians held fixed)
•The M-step chooses the parameters that minimize the cost function (with the assignment probabilities held fixed)
•The advantage of using F to understand EM
•An incremental EM algorithm
•Beyond mixture models: Directed Acyclic Graphical models
•Ways to define the conditional probabilities
•What is easy and what is hard in a DAG?
•Explaining away
•An apparently crazy idea
•Approximate inference
•A trade-off between how well the model fits the data and the accuracy of inference
•Two ways to derive F
•An MDL approach to clustering
•How many bits must we send?
•Using a Gaussian agreed distribution
•What is the best variance to use?
•Sending a value assuming a mixture of two equal Gaussians
•The bits-back argument
•Using another message to make random decisions
•The general case
•What is the best distribution?
•Free Energy
•A Canadian example
•EM as coordinate descent in Free Energy
•Stochastic MDL using the wrong distribution over codes
•How many components does a mixture need?

Overview
•Clustering with K-means and a proof of convergence that uses energies.
•Clustering with a mixture of Gaussians and a proof of convergence that uses free energies.
•The MDL view of clustering and the bits-back argument.
•The MDL justification for incorrect inference.

Clustering
•We assume that the data was generated from a number of different classes. The aim is to cluster data from the same class together.
  –How do we decide the number of classes?
  –Why not put each datapoint into a separate class?
•What is the objective function that is optimized by sensible clusterings?

The k-means algorithm
•Assume the data lives in a Euclidean space.
•Assume we want k classes.
•Assume we start with randomly located cluster centers.
•The algorithm alternates between two steps:
  Assignment step: Assign each datapoint to the closest cluster center.
  Refitting step: Move each cluster center to the center of gravity of the data assigned to it.
(Figure: assignments of the datapoints, followed by the refitted means.)
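The two alternating steps can be written down directly. The sketch below is not from the slides; the function name, the initialization from k random datapoints, and the handling of empty clusters are assumptions of this illustration.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: alternate assignment and refitting steps."""
    rng = np.random.default_rng(seed)
    # Start with randomly located cluster centers (here: k distinct datapoints).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = None

    for _ in range(n_iters):
        # Assignment step: assign each datapoint to its closest cluster center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (N, k)
        new_assignments = dists.argmin(axis=1)

        # Convergence test from the slides: stop if no assignment changed.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments

        # Refitting step: move each center to the center of gravity of its data.
        for j in range(k):
            members = X[assignments == j]
            if len(members):            # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)

    return centers, assignments

# usage, e.g.: centers, labels = kmeans(np.random.randn(500, 2), k=3)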
Why K-means converges
•Whenever an assignment is changed, the sum of squared distances of the datapoints from their assigned cluster centers is reduced.
•Whenever a cluster center is moved, the sum of squared distances of the datapoints from their currently assigned cluster centers is reduced.
•Test for convergence: if the assignments do not change in the assignment step, we have converged.

Local minima
•There is nothing to prevent k-means getting stuck at local minima.
•We could try many random starting points.
•We could try non-local split-and-merge moves: simultaneously merge two nearby clusters and split a big cluster into two.
(Figure: a bad local optimum.)

Soft k-means
•Instead of making hard assignments of datapoints to clusters, we can make soft assignments. One cluster may have a responsibility of .7 for a datapoint and another may have a responsibility of .3.
  –Allows a cluster to use more information about the data in the refitting step.
  –What happens to our convergence guarantee?
  –How do we decide on the soft assignments?

A generative view of clustering
•We need a sensible measure of what it means to cluster the data well.
  –This makes it possible to judge different methods.
  –It may make it possible to decide on the number of clusters.
•An obvious approach is to imagine that the data was produced by a generative model.
  –Then we can adjust the parameters of the model to maximize the probability that it would produce exactly the data we observed.

The mixture of Gaussians generative model
•First pick one of the k Gaussians with a probability that is called its "mixing proportion".
•Then generate a random point from the chosen Gaussian.
•The probability of generating the exact data we observed is zero, but we can still try to maximize the probability density.
  –Adjust the means of the Gaussians.
  –Adjust the variances of the Gaussians on each dimension.
  –Adjust the mixing proportions of the Gaussians.
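To make the two-step generative story concrete, here is a small sketch (not from the slides) that samples from a mixture of axis-aligned Gaussians; the function name and the example parameter values are assumptions of the illustration.

import numpy as np

def sample_mog(n, mixing, means, stds, seed=0):
    """Sample n points from a mixture of axis-aligned Gaussians."""
    rng = np.random.default_rng(seed)
    # First pick one of the k Gaussians for each point, with probability equal to its mixing proportion.
    components = rng.choice(len(mixing), size=n, p=mixing)
    # Then generate a random point from the chosen Gaussian.
    noise = rng.standard_normal((n, means.shape[1]))
    return means[components] + stds[components] * noise

# e.g. a 2-D mixture with mixing proportions 0.7 and 0.3 (illustrative numbers only)
X = sample_mog(1000,
               mixing=np.array([0.7, 0.3]),
               means=np.array([[0.0, 0.0], [4.0, 4.0]]),
               stds=np.array([[1.0, 1.0], [0.5, 2.0]]))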
Fitting a mixture of Gaussians
•The EM algorithm alternates between two steps:
  E-step: Compute the posterior probability that each Gaussian generates each datapoint.
  M-step: Assuming that the data really was generated this way, change the parameters of each Gaussian to maximize the probability that it would generate the data it is currently responsible for.
(Figure: datapoints labelled with example responsibilities under two Gaussians, e.g. .95/.05 and .5/.5.)

The E-step: Computing responsibilities
•In order to adjust the parameters, we must first solve the inference problem: which Gaussian generated each datapoint?
  –We cannot be sure, so it's a distribution over all possibilities.
•Use Bayes theorem to get the posterior probabilities:

  p(i \mid \mathbf{x}^c) = \frac{p(i)\, p(\mathbf{x}^c \mid i)}{p(\mathbf{x}^c)} = \frac{p(i)\, p(\mathbf{x}^c \mid i)}{\sum_j p(j)\, p(\mathbf{x}^c \mid j)}

  p(\mathbf{x}^c \mid i) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi}\,\sigma_{i,d}} \exp\!\left( -\frac{(x^c_d - \mu_{i,d})^2}{2\sigma_{i,d}^2} \right)

  Here p(i \mid \mathbf{x}^c) is the posterior for Gaussian i, the prior p(i) is its mixing proportion, and the product runs over all D data dimensions.

The M-step: Computing new mixing proportions
•Each Gaussian gets a certain amount of posterior probability for each datapoint.
•The optimal mixing proportion to use (given these posterior probabilities) is just the fraction of the data that the Gaussian gets responsibility for:

  \pi_i^{\text{new}} = \frac{1}{N} \sum_{c=1}^{N} p(i \mid \mathbf{x}^c)

  where N is the number of training cases and \mathbf{x}^c is the data for training case c.

More M-step: Computing the new means
•We just take the center of gravity of the data that the Gaussian is responsible for:

  \boldsymbol{\mu}_i^{\text{new}} = \frac{\sum_c p(i \mid \mathbf{x}^c)\, \mathbf{x}^c}{\sum_c p(i \mid \mathbf{x}^c)}

  –Just like in K-means, except the data is weighted by the posterior probability of the Gaussian.
  –The new mean is guaranteed to lie in the convex hull of the data, but it could be a big initial jump.

More M-step: Computing the new variances
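Putting the E-step and M-step together: the sketch below is my own consolidation, not code from the slides. It implements the responsibility, mixing-proportion, and mean updates above for axis-aligned Gaussians; the per-dimension variance update at the end is the standard responsibility-weighted formula and is an assumption here, since that slide's content is not included above.

import numpy as np

def em_mog(X, k, n_iters=50, seed=0, min_var=1e-6):
    """EM for a mixture of axis-aligned Gaussians (sketch of the updates above)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mixing = np.full(k, 1.0 / k)                               # equal mixing proportions to start
    means = X[rng.choice(N, size=k, replace=False)].astype(float)  # k random datapoints
    variances = np.ones((k, D))                                # unit variance on each dimension

    for _ in range(n_iters):
        # E-step: responsibilities p(i | x^c) via Bayes theorem, in log space for stability.
        log_prior = np.log(mixing)                                           # (k,)
        diff = X[:, None, :] - means[None, :, :]                             # (N, k, D)
        log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                          + diff ** 2 / variances[None, :, :]).sum(axis=2)   # (N, k)
        log_post = log_prior[None, :] + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)                              # responsibilities

        # M-step: new mixing proportions = fraction of the data each Gaussian is responsible for.
        total = resp.sum(axis=0)                                             # (k,)
        mixing = total / N
        # New means: responsibility-weighted center of gravity of the data.
        means = (resp.T @ X) / total[:, None]
        # New per-dimension variances (assumed standard update; not shown in the slides above).
        diff = X[:, None, :] - means[None, :, :]
        variances = (resp[:, :, None] * diff ** 2).sum(axis=0) / total[:, None]
        variances = np.maximum(variances, min_var)              # guard against collapsing variances

    return mixing, means, variances, resp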

