6.867 Machine learning, lecture 17 (Jaakkola)
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology.

Lecture topics:
• Mixture models and clustering, k-means
• Distance and clustering

Mixture models and clustering

We have so far used mixture models as flexible ways of constructing probability models for prediction tasks. The motivation behind the mixture model was that the available data may include unobserved (sub)groups, and by incorporating such structure in the model we could obtain more accurate predictions. We are also interested in uncovering that group structure. This is a clustering problem. For example, in the case of modeling exam scores it would be useful to understand what types of students there are. In a biological context, it would be useful to uncover which genes are active (expressed) in which cell types when the measurements come from tissue samples involving multiple cell types in unknown quantities. Clustering problems are ubiquitous.

There are many different types of clustering algorithms. Some generate a series of nested clusters by merging simple clusters into larger ones (hierarchical agglomerative clustering), while others try to find a pre-specified number of clusters that best capture the data (e.g., k-means). Which algorithm is the most appropriate to use depends in part on what we are looking for, so it is very useful to know more than one clustering method.

Mixture models as generative models require us to articulate the type of clusters or sub-groups we are looking to identify. The simplest type of clusters we could look for are spherical Gaussian clusters, i.e., we would be estimating Gaussian mixtures of the form

    P(x; \theta, m) = \sum_{j=1}^{m} P(j) \, N(x; \mu_j, \sigma_j^2 I)    (1)

where the parameters θ include {P(j)}, {µ_j}, and {σ_j²}. Note that we are estimating the mixture models with a different objective in mind: we are more interested in finding where the clusters are than in how good the mixture model is as a generative model. There are many questions to resolve. For example, how many such spherical Gaussian clusters are there in the data? This is a model selection problem. If such clusters exist, do we have enough data to identify them? If we have enough data, can we hope to find the clusters via the EM algorithm? Is our approach robust, i.e., does our method degrade gracefully when data contain "background samples" or impurities along with spherical Gaussian clusters? Can we make the clustering algorithm more robust? Similar questions apply to other clustering algorithms. We will touch on some of these issues in the context of each algorithm.

Mixtures and K-means

It is often helpful to try to search for simple clusters. For example, consider a mixture of two Gaussians where the variances of the spherical Gaussians may be different:

    P(x; \theta) = \sum_{j=1}^{2} P(j) \, N(x; \mu_j, \sigma_j^2 I)    (2)

Points far away from the two means, in whatever direction, would always be assigned to the Gaussian with the larger variance (the tails of that distribution approach zero much more slowly; the posterior assignment is based on the ratio of probabilities). While this is perfectly fine for modeling the density, it does not quite agree with the clustering goal.
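To make this effect concrete, here is a minimal numerical sketch (my own illustration, not code from the lecture; the helper names and the example numbers are assumptions) that evaluates the posterior assignment probabilities P(j | x, θ) for a two-component mixture as in Eq. (2). A point far from both means ends up assigned to the larger-variance component even though it lies closer to the other mean.

```python
# A minimal sketch of posterior assignments for a spherical Gaussian mixture
# with unequal variances (Eq. (2)).  Names and numbers below are illustrative.
import numpy as np

def log_spherical_gaussian(x, mu, var):
    """Log density of N(x; mu, var * I) for a d-dimensional point x."""
    d = x.shape[0]
    return -0.5 * d * np.log(2 * np.pi * var) - 0.5 * np.sum((x - mu) ** 2) / var

def posteriors(x, priors, means, variances):
    """Posterior assignment probabilities P(j | x, theta), computed stably in log space."""
    log_p = np.array([np.log(pj) + log_spherical_gaussian(x, mu, var)
                      for pj, mu, var in zip(priors, means, variances)])
    log_p -= log_p.max()                 # avoid underflow before exponentiating
    p = np.exp(log_p)
    return p / p.sum()

# Two components with equal priors but different variances.
priors    = [0.5, 0.5]
means     = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
variances = [1.0, 4.0]                   # component 2 has the heavier tails

# A point far from both means is claimed by the larger-variance component,
# even though it is closer to the first mean.
x_far = np.array([-10.0, 10.0])
print(posteriors(x_far, priors, means, variances))   # puts essentially all mass on component 2
```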
To avoid such issues (and to keep the discussion simpler) we will restrict the covariance matrices to be all identical and equal to σ²I, where σ² is common to all clusters. Moreover, we will fix the mixing proportions to be uniform, P(j) = 1/m. With such restrictions we can more easily understand the type of clusters we will discover. The resulting simplified mixture model has the form

    P(x; \theta) = \frac{1}{m} \sum_{j=1}^{m} N(x; \mu_j, \sigma^2 I)    (3)

Let's begin by trying to understand how points are assigned to clusters within the EM algorithm. To this end, we can see how the mixture model partitions the space into regions where, within each region, a specific component has the highest posterior probability. To start with, consider any two components i and j and the boundary where P(i|x, θ) = P(j|x, θ), i.e., the set of points for which components i and j have the same posterior probability. Since the Gaussians have equal prior probabilities, this happens when N(x; µ_j, σ²I) = N(x; µ_i, σ²I). Both Gaussians have the same spherical covariance matrix, so the probability they assign to a point is based on the Euclidean distance to their mean vectors. The posteriors therefore can be equal only when the component means are equidistant from the point:

    \|x - \mu_i\|^2 = \|x - \mu_j\|^2  or  2 x^T (\mu_j - \mu_i) = \|\mu_j\|^2 - \|\mu_i\|^2    (4)

The boundary is therefore linear in x. We can draw such boundaries between any pair of components as illustrated in Figure 1. The pairwise comparisons induce a Voronoi partition of the space where, for example, the region enclosing µ1 in the figure corresponds to all the points whose closest mean is µ1. This is also the region where P(1|x, θ) takes the highest value among all the posterior probabilities.

Figure 1: Voronoi regions resulting from pairwise comparison of posterior assignment probabilities. Each line corresponds to a decision between the posterior probabilities of two components. The line is highlighted (solid) when the boundary is an active constraint for deciding the highest-probability component.

Note that the posterior assignment probabilities P(j|x, θ) evaluated in the E-step of the EM algorithm do not vanish across the boundaries. The regions merely highlight where one assignment has a higher probability than the others. The overall variance parameter σ² controls how sharply the posterior probabilities change when we cross each boundary. Small σ² results in very sharp transitions (e.g., from near zero to nearly one), while the posteriors change smoothly when σ² is large.

K-means. We can define a simpler and faster version of the EM algorithm by just
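Before picking up the k-means thread, here is a minimal sketch (my own illustration; the function name and the numbers are assumptions, not from the notes) of the E-step posteriors for the restricted model of Eq. (3). Because the priors are uniform and the covariance is shared, P(j | x, θ) is a softmax of −‖x − µ_j‖²/(2σ²): on the bisector of two means the posteriors tie, exactly as Eq. (4) says, and shrinking σ² drives the posteriors toward a hard nearest-mean assignment.

```python
# E-step posteriors for the restricted mixture of Eq. (3): common variance
# sigma^2 and uniform mixing proportions P(j) = 1/m.
import numpy as np

def e_step_posteriors(x, means, sigma2):
    """P(j | x, theta) for equal priors and a shared spherical covariance sigma2 * I."""
    sq_dists = np.array([np.sum((x - mu) ** 2) for mu in means])
    logits = -sq_dists / (2.0 * sigma2)
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    return p / p.sum()

means = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]

# On the bisecting boundary between the first two means (x1 = 2), their posteriors agree.
x_boundary = np.array([2.0, 1.0])
print(e_step_posteriors(x_boundary, means, sigma2=1.0))

# Just off the boundary, a small sigma^2 gives a near-hard assignment while a
# large sigma^2 leaves the posteriors smooth -- the transition sharpness noted above.
x_off = np.array([2.2, 1.0])
print(e_step_posteriors(x_off, means, sigma2=0.05))   # ~ (0, 1, 0): nearly hard
print(e_step_posteriors(x_off, means, sigma2=25.0))   # close to uniform over the three
```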

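The paragraph above begins to define k-means as a simpler and faster version of EM. As a hedged sketch of that standard algorithm (my own code, not reproduced from these notes), the usual formulation replaces the soft posteriors with hard assignments to the closest mean and then moves each mean to the average of its assigned points:

```python
# A sketch of the standard k-means loop: hard nearest-mean assignments
# alternated with mean updates.  Offered as an illustration of the textbook
# algorithm; it is not taken from the lecture notes.
import numpy as np

def kmeans(X, m, n_iters=100, seed=0):
    """Cluster the rows of X (n x d) into m clusters; returns (means, assignments)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=m, replace=False)]   # initialize at m data points
    for _ in range(n_iters):
        # Hard "E-step": assign each point to its closest mean.
        sq_dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # n x m
        assign = sq_dists.argmin(axis=1)
        # "M-step": move each mean to the average of its assigned points
        # (keep the old mean if a cluster happens to receive no points).
        new_means = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else means[j]
                              for j in range(m)])
        if np.allclose(new_means, means):   # converged: assignments can no longer change
            break
        means = new_means
    # Recompute assignments once so they are consistent with the returned means.
    assign = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return means, assign
```

Usage would look like `means, labels = kmeans(X, m=3)` for an n × d data matrix X. Because each point now contributes to exactly one mean, both steps are cheaper than the corresponding soft E- and M-steps of the mixture model, which is the sense in which k-means is simpler and faster.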
