6 867 Machine learning lecture 17 Jaakkola 1 Lecture topics Mixture models and clustering k means Distance and clustering Mixture models and clustering We have so far used mixture models as exible ways of constructing probability models for prediction tasks The motivation behind the mixture model was that the available data may include unobserved sub groups and by incorporating such structure in the model we could obtain more accurate predictions We are also interested in uncovering that group structure This is a clustering problem For example in the case of modeling exam scores it would be useful to understand what types of students there are In a biological context it would be useful to uncover which genes are active expressed in which cell types when the measurements are from tissue samples involving multiple cell types in unknown quantities Clustering problems are ubiquitous There are many di erent types of clustering algorithms Some generate a series of nested clusters by merging simple clusters into larger ones hierarchical agglomerative clustering while others try to nd a pre speci ed number of clusters that best capture the data e g k means Which algorithm is the most appropriate to use depends in part on what we are looking for So it is very useful to know more than one clustering method Mixture models as generative models require us to articulate the type of clusters or sub groups we are looking to identify The simplest type of clusters we could look for are spherical Gaussian clusters i e we would be estimating Gaussian mixtures of the form P x m m P j N x j j2 I 1 j 1 where the parameters include P j j and j2 Note that we are estimating the mixture models with a di erent objective in mind We are more interested in nding where the clusters are than how good the mixture model is as a generative model There are many questions to resolve For example how many such spherical Gaussian clusters are there in the data This is a model selection problem If such clusters exist do we have enough data to identify them If we have enough data can we hope to nd the clusters via the EM algorithm Is our approach robust i e does our method degrade Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 17 Jaakkola 2 gracefully when data contain background samples or impurities along with spherical Gaussian clusters Can we make the clustering algorithm more robust Similar questions apply to other clustering algorithms We will touch on some of these issues in the context of each algorithm Mixtures and K means It is often helpful to try to search for simple clusters For example consider a mixture of two Gaussians where the variances of the spherical Gaussians may be di erent P x 2 P j N x j j2 I 2 j 1 Points far away from the two means in whatever direction would always be assigned to the Gaussian with the larger variance tails of that distribution approach zero much slower the posterior assignment is based on the ratio of probabilities While this is perfectly ne for modeling the density it does not quite agree with the clustering goal To avoid such issues and to keep the discussion simpler we will restrict the covariance matrices to be all identical and equal to 2 I where 2 is common to all clusters Moreover we will x the mixing proportions to be uniform P j 1 m With such restrictions we can more easily understand the type of clusters we will discover The resulting simpli ed mixture model has the form m 1 P x N x j 2 I m j 1 3 Let s begin by trying to understand how points are assigned to clusters within the EM algorithm To this end we can see how the mixture model partitions the space into regions where within each region a speci c component has the highest posterior prob ability To start with consider any two components i and j and the boundary where P i x P j x i e the set of points for which the component i and j have the same posterior probability Since the Gaussians have equal prior probabilities this happens when N x j 2 I N x i 2 I Both Gaussians have the same spherical covariance matrix so the probability that they assign to points is based on the Euclidean distances to their mean vectors The posteriors therefore can be equal only when the component means are equidistant from the points x i 2 x j 2 or 2xT j i j 2 j 2 4 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 17 Jaakkola 3 The boundary is therefore linear1 in x We can draw such boundaries between any pair of components as illustrated in Figure 1 The pairwise comparisons induce a Voronoi partition of the space where for example the region enclosing 1 in the gure corresponds to all the points whose closest mean is 1 This is also the region where P 1 x takes the highest value among all the posterior probabilities 1 2 x 2 x 1 2 3 x 3 1 3 Figure 1 Voronoi regions resulting from pairwise comparison of posterior assignment prob abilities Each line corresponds to a decision between posterior probabilities of two compo nents The line is highlighted solid when the boundary is an active constraint for deciding the highest probability component Note that the posterior assignment probabilities P j x evaluated in the E step of the EM algorithm do not vanish across the boundaries The regions merely highlight when one assignment has a higher probability than the others The overall variance parameter 2 controls how sharply the posterior probabilities change when we cross each boundary Small 2 results in very sharp transitions e g from near zero to nearly one while the posteriors change smoothly when 2 is large K means We can de ne a simpler and faster version of the EM algorithm by just assigning each point to the component with the highest posterior probability its closest mean Geometrically the points within each Voronoi region are assigned to one component The M step then simply repositions each mean vector to the center of the points within the region The updated means de ne new Voronoi regions and so on The resulting algorithm for nding cluster means is known as the K means algorithm E step assign each point xt to its closest mean i e jt arg minj xt j 2 M step recompute j s as means of the assigned points 1 The linearity
View Full Document
Unlocking...