Machine Learning
Mixture Models and EM
Sargur Srihari, [email protected]

Plan of Discussion
• Examples of mixture models
• K-means algorithm for finding clusters in a data set
• Latent variable view of mixture distributions
  – Assigning data points to specific components of a mixture
• General technique for finding maximum likelihood estimators in latent variable models: the Expectation Maximization (EM) algorithm
• EM algorithm
  – Gaussian mixture models motivate EM
  – Latent variable viewpoint
  – K-means seen as a non-probabilistic limit of EM applied to a mixture of Gaussians
  – EM in generality
  – Infinite mixture models

Definition
• A mixture model is a probabilistic model representing sub-populations within an overall population
  – Without requiring that the sub-population of each data item be identified
• Constructing such models is called unsupervised learning or clustering

Example of a Mixture Model
• Three-component model for a single variable, shown with the estimated component densities

More Examples
• Mixture models are also called Finite Mixture Models (FMMs)

Bernoulli Mixture Model
• Handwritten digit data
• A mixture model for the digits 0-4 identifies 12 components

Role of Mixture Models
• Mixture models provide:
  – A framework for building complex probability distributions
  – A method for clustering data
• Viewed statistically:
  – A complex distribution is expressed in terms of a more tractable joint distribution of observed and latent variables
  – The distribution of the observed variables alone is then obtained by marginalization

K-means Clustering
• Given a data set {x_1,..,x_N} in D-dimensional Euclidean space
• Partition it into K clusters, where the number K is given
• 1-of-K coding
  – Indicator variables r_nk ∈ {0,1}, k = 1,..,K, describe which of the K clusters data point x_n is assigned to

Distortion Measure
• Sum of squared errors:

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

• Goal: find values for the {r_nk} and the {μ_k} that minimize J
  – Can be done by an iterative procedure
  – Each iteration has two steps: successive optimization with respect to the r_nk and the μ_k

Two Updating Stages
• First choose initial values for the μ_k
• First phase: minimize J with respect to the r_nk, keeping the μ_k fixed
• Second phase: minimize J with respect to the μ_k, keeping the r_nk fixed
• The two stages correspond to the E (expectation) and M (maximization) steps of the EM algorithm
  – Expectation: what is the expected class?
  – Maximization: what is the maximum likelihood estimate of the mean?

E Step: Determination of the Indicators r_nk
• Because J is a linear function of the r_nk, this optimization is performed easily (closed-form solution)
• Terms involving different n are independent
  – Therefore we can optimize for each n separately
  – Choose r_nk = 1 for whichever value of k gives the minimum value of ||x_n − μ_k||²
• Thus

    r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \|^2 \\ 0 & \text{otherwise} \end{cases}

• Interpretation: assign x_n to the cluster whose mean is closest

M Step: Optimization of the Means μ_k
• Hold the r_nk fixed
• The objective function J is a quadratic function of μ_k
• Minimize it by setting the derivative with respect to μ_k to zero:

    2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0

• Solving gives

    \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

  – The denominator equals the number of points assigned to cluster k
• Interpretation: set μ_k to the mean of all data points x_n assigned to cluster k
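To make the two updating stages concrete, here is a minimal NumPy sketch of the batch algorithm described above; the function name, convergence test, and empty-cluster guard are illustrative choices, not taken from the slides.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimize the distortion J by alternating E and M steps.
    X: (N, D) data matrix; K: number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize means to a random subset of K data points
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # E step: dists[n, k] = ||x_n - mu_k||^2, assign each point to its nearest mean
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # M step: set each mean to the average of its assigned points
        # (keep the old mean if a cluster happens to be empty)
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                           else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):   # assignments can no longer change
            break
        mu = new_mu
    J = ((X - mu[assign]) ** 2).sum()  # final distortion
    return mu, assign, J

The argmin line implements the closed-form solution for r_nk above, and the per-cluster mean implements the solution for μ_k.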
Termination of K-Means
• The two phases:
  – re-assigning data points to clusters
  – re-computing the means of the clusters
• They are repeated until there is no further change in the assignments
• Since each phase reduces J, convergence is assured
• The algorithm may, however, converge to a local minimum of J

Illustration of K-means
• Old Faithful dataset
• A single Gaussian is a poor fit
• We choose K = 2
• The data set is standardized so that each variable has zero mean and unit standard deviation
• Assigning each data point to the nearest cluster center is then equivalent to asking on which side it lies of the perpendicular bisector of the line joining the two cluster centers

K-means Iterations
• [Figure: successive E and M steps on the Old Faithful data, from the initial choice of means (parameters) to the final clusters and means]
• E step: parameters are fixed, the distributions (assignments) are optimized
• M step: distributions (assignments) are fixed, the parameters are optimized

Cost Function after Each Iteration
• J for the Old Faithful data, evaluated after each E step and M step
• Here a poor initial value was chosen for the cluster centers
  – Several steps are needed for convergence
  – A better choice is to initialize the means to a random subset of K data points
• K-means is itself used to initialize the parameters of a Gaussian mixture model before applying EM

Implementation of K-means
• A direct implementation can be slow
  – In the E step, Euclidean distances are computed between every mean and every data point
    • ||x_n − μ_k||² is computed for n = 1,..,N and k = 1,..,K
• Faster implementations exist
  – Precompute trees in which nearby points lie in the same sub-tree
  – Use the triangle inequality to avoid unnecessary distance calculations

On-line Stochastic Version
• Instead of batch-processing the entire data set, points can be processed one at a time
• Apply the Robbins-Monro procedure to finding the roots of the regression function given by the derivative of J with respect to μ_k (see the sketch below)
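For concreteness, here is a minimal sketch of the sequential update the Robbins-Monro treatment yields: move the nearest mean toward each newly arriving point, μ_k ← μ_k + η_n (x_n − μ_k), with a decaying learning rate η_n. The function name and the 1/count rate schedule are illustrative choices, not prescribed by the slides.

import numpy as np

def online_kmeans_step(mu, counts, x):
    """One on-line K-means update: move the nearest mean toward point x.
    mu: (K, D) float array of current means; counts: (K,) points seen per cluster."""
    k = ((mu - x) ** 2).sum(axis=1).argmin()  # assign x to its nearest cluster
    counts[k] += 1
    eta = 1.0 / counts[k]                     # decaying Robbins-Monro-style rate
    mu[k] += eta * (x - mu[k])                # mu_k <- mu_k + eta (x_n - mu_k)
    return mu, counts

With η_n = 1/(points assigned to cluster k so far), each mean is exactly the running average of the points assigned to it, consistent with the batch M-step solution.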