MIT 6 867 - Mixture models and clustering - D2181693

Home> Schools> Massachusetts Institute of Technology> Electrical Engineering and Computer Science (6) > 6 867> Mixture models and clustering

DOC PREVIEW

MIT 6 867 - Mixture models and clustering

School name Massachusetts Institute of Technology

Course 6 867- Machine Learning

Pages 8

This preview shows page 1-2-3 out of 8 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 8 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 8 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 8 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 8 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

� 1 6.867 Machine learning, lecture 17 (Jaakkola) Lecture topics: • Mixture models and clustering, k-means • Distance and clustering Mixture models and clustering We have so far used mixture models as ﬂexible ways of constructing probability models for prediction tasks. The motivation behind the mixture model was that the available data may include unobserved (sub)groups and by incorporating such structure in the model we could obtain more accurate predictions. We are also interested in uncovering that group structure. This is a clustering problem. For example, in the case of modeling exam scores it would be useful to understand what types of students there are. In a biological context, it would be useful to uncover which genes are active (expressed) in which cell types when the measurements are from tissue samples involving multiple cell types in unknown quantities. Clustering problems are ubiquitous. There are many diﬀerent types of clustering algorithms. Some generate a series of nested clusters by merging simple clusters into larger ones (hierarchical agglomerative clustering), while others try to ﬁnd a pre-speciﬁed number of clusters that best capture the data (e.g., k-means). Which algorithm is the most appropriate to use depends in part on what we are looking for. So it is very useful to know more than one clustering method. Mixture models as generative models require us to articulate the type of clusters or sub-groups we are looking to identify. The simplest type of clusters we could look for are spherical Gaussian clusters, i.e., we would be estimating Gaussian mixtures of the form mP (x; θ, m) = P (j)N(x; µj , σj 2I) (1) j=1 where the parameters θ include {P (j)}, {µj }, and {σj 2}. Note that we are estimating the mixture models with a diﬀerent objective in mind. We are more interested in ﬁnding where the clusters are than how good the mixture model is as a generative model. There are many ques tions to resolve. For example, how many such spherical Gaussian clusters are there in the data? This is a model selection problem. If such clusters exist, do we have enough data to identify them? If we have enough data, can we hope to ﬁnd the clusters via the EM algorithm? Is our approach robust, i.e., does our method degrade Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].� 2 6.867 Machine learning, lecture 17 (Jaakkola) gracefully when data contain “background samples” or impurities along with spherical Gaussian clusters? Can we make the clustering algorithm more robust? Similar questions apply to other clustering algorithms. We will touch on some of these issues in the context of each algorithm. Mixtures and K-means It is often helpful to try to search for s imple c lusters. For example, consider a mixture of two Gaussians where the variances of the spherical Gaussians may be diﬀerent: 2P (x; θ) = P (j)N(x; µj , σj 2I) (2) j=1 Points far away from the two means, in whatever direction, would always b e assigned to the Gaussian with the larger variance (tails of that distribution approach zero much slower; the posterior assignment is based on the ratio of probabilities). While this is perfectly ﬁne for modeling the density, it does not quite agree with the clustering goal. To avoid such issues (and to keep the discussion simpler) we will restrict the covariance matrices to be all identical and equal to σ2I where σ2 is common to all clusters. Moreover, we will ﬁx the mixing proportions to be uniform P (j) = 1/m. With such restrictions we can more easily understand the type of clusters we will discover. The resulting simpliﬁed mixture model has the form m1 � P (x; θ) = N(x; µj , σ2I) (3) m j=1 Let’s begin by trying to understand how points are assigned to clusters within the EM algorithm. To this end, we can see how the mixture mo de l partitions the space into regions where within each region a speciﬁc component has the highest posterior prob-ability. To start with, consider any two components i and j and the boundary where P (i|x, θ) = P (j|x, θ), i.e., the set of points for which the component i and j have the same posterior probability. Since the Gaussians have equal prior probabilities, this happens when N(x; µj , σ2I) = N(x; µi, σ2I). Both Gaussians have the same spherical covariance matrix so the probability that they assign to points is based on the Euclidean distances to their mean vectors. The posteriors therefore can be equal only when the component means are equidistant from the points: 2 2 2 2�x − µi� = �x − µj � or 2x T (µj − µi) = �µj � − �µj � (4) Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare(http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].3 6.867 Machine learning, lecture 17 (Jaakkola) The boundary is therefore linear1 in x. We can draw such boundaries between any pair of components as illustrated in Figure 1. The pairwise comparisons induce a Voronoi partition of the space where, for example, the region enclosing µ1 in the ﬁgure corresponds to all the points whose closest mean is µ1. This is also the region where P (1|x, θ) takes the highest value among all the posterior probabilities. µ2 2x 21 3 x µ1 x µ3 1 3 Figure 1: Voronoi regions resulting from pairwise comparison of posterior assignment prob-abilities. Each line corresponds to a decision between posterior probabilities of two compo-nents. The line is highlighted (solid) when the boundary is an active constraint for deciding the highest probability component. Note that the posterior assignment probabilities P (j|x, θ) evaluated in the E-step of the EM algorithm do not vanish across the boundaries. The regions merely highlight when one assignment has a higher probability than the others. The overall variance parameter σ2 controls how sharply the posterior probabilities change when we cross each boundary. Small σ2 results in very sharp transitions (e.g., from near zero to nearly one) while the posteriors change smoothly when σ2 is large. K-means. We can deﬁne a simpler and faster version of the EM algorithm by just

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3 out of 8 pages.

MIT 6 867 - Mixture models and clustering

Sign up for free to view:

Please select your school