Machine Learning
Mixture Models and EM
Sargur Srihari, [email protected]

Plan of Discussion
• Examples of mixture models
• K-means algorithm for finding clusters in a data set
• Latent variable view of mixture distributions
  – Assigning data points to specific components of a mixture
• General technique for finding maximum likelihood estimators in latent variable models: the Expectation Maximization (EM) algorithm
• EM algorithm
  – Gaussian mixture models motivate EM
  – Latent variable viewpoint
  – K-means seen as a non-probabilistic limit of EM applied to a mixture of Gaussians
  – EM in generality
  – Infinite mixture models

Definition
• A mixture model is a probabilistic model representing sub-populations within an overall population
  – Without requiring that the sub-population of each data item be identified
• Constructing such models is called unsupervised learning or clustering

Example of a Mixture Model
• Three-component model for a single variable, shown with the estimated component densities

More Examples
• Mixture models are also called Finite Mixture Models (FMMs)

Bernoulli Mixture Model
• Handwritten digit data
• A mixture model for the digits 0-4 identifies 12 components

Role of Mixture Models
• Mixture models provide:
  – A framework for building complex probability distributions
  – A method for clustering data
• Viewed statistically:
  – A complex distribution is expressed in terms of a more tractable joint distribution of observed and latent variables
  – The distribution of the observed variables alone is then obtained by marginalization

K-means Clustering
• Given a data set {x_1,..,x_N} in D-dimensional Euclidean space
• Partition it into K clusters, where the number K is given
• 1-of-K coding
  – Indicator variables r_nk ∈ {0,1}, k = 1,..,K, describe which of the K clusters data point x_n is assigned to

Distortion Measure
• Sum of squared errors:

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

• Goal: find values for the {r_nk} and the {μ_k} that minimize J
  – Can be done by an iterative procedure
  – Each iteration has two steps: successive optimization with respect to the r_nk and the μ_k

Two Updating Stages
• First choose initial values for the μ_k
• First phase: minimize J with respect to the r_nk, keeping the μ_k fixed
• Second phase: minimize J with respect to the μ_k, keeping the r_nk fixed
• The two stages correspond to the E (expectation) and M (maximization) steps of the EM algorithm
  – Expectation: what is the expected class?
  – Maximization: what is the maximum likelihood estimate of the mean?

E Step: Determination of the Indicators r_nk
• Because J is a linear function of the r_nk, this optimization is performed easily (closed-form solution)
• Terms involving different n are independent
  – Therefore we can optimize for each n separately
  – Choose r_nk = 1 for whichever value of k gives the minimum value of ||x_n − μ_k||²
• Thus

    r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \|^2 \\ 0 & \text{otherwise} \end{cases}

• Interpretation: assign x_n to the cluster whose mean is closest

M Step: Optimization of the Means μ_k
• Hold the r_nk fixed
• The objective function J is a quadratic function of μ_k
• Minimize it by setting the derivative with respect to μ_k to zero:

    2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0

• Solving gives

    \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

  – The denominator equals the number of points assigned to cluster k
• Interpretation: set μ_k to the mean of all data points x_n assigned to cluster k
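To make the two updating stages concrete, here is a minimal NumPy sketch of the batch algorithm described above; the function name, convergence test, and empty-cluster guard are illustrative choices, not taken from the slides.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimize the distortion J by alternating E and M steps.
    X: (N, D) data matrix; K: number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize means to a random subset of K data points
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # E step: dists[n, k] = ||x_n - mu_k||^2, assign each point to its nearest mean
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # M step: set each mean to the average of its assigned points
        # (keep the old mean if a cluster happens to be empty)
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                           else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):   # assignments can no longer change
            break
        mu = new_mu
    J = ((X - mu[assign]) ** 2).sum()  # final distortion
    return mu, assign, J

The argmin line implements the closed-form solution for r_nk above, and the per-cluster mean implements the solution for μ_k.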
Termination of K-Means
• The two phases:
  – re-assigning data points to clusters
  – re-computing the means of the clusters
• They are repeated until there is no further change in the assignments
• Since each phase reduces J, convergence is assured
• The algorithm may, however, converge to a local minimum of J

Illustration of K-means
• Old Faithful dataset
• A single Gaussian is a poor fit
• We choose K = 2
• The data set is standardized so that each variable has zero mean and unit standard deviation
• Assigning each data point to the nearest cluster center is then equivalent to asking on which side it lies of the perpendicular bisector of the line joining the two cluster centers

K-means Iterations
• [Figure: successive E and M steps on the Old Faithful data, from the initial choice of means (parameters) to the final clusters and means]
• E step: parameters are fixed, the distributions (assignments) are optimized
• M step: distributions (assignments) are fixed, the parameters are optimized

Cost Function after Each Iteration
• J for the Old Faithful data, evaluated after each E step and M step
• Here a poor initial value was chosen for the cluster centers
  – Several steps are needed for convergence
  – A better choice is to initialize the means to a random subset of K data points
• K-means is itself used to initialize the parameters of a Gaussian mixture model before applying EM

Implementation of K-means
• A direct implementation can be slow
  – In the E step, Euclidean distances are computed between every mean and every data point
    • ||x_n − μ_k||² is computed for n = 1,..,N and k = 1,..,K
• Faster implementations exist
  – Precompute trees in which nearby points lie in the same sub-tree
  – Use the triangle inequality to avoid unnecessary distance calculations

On-line Stochastic Version
• Instead of batch-processing the entire data set, points can be processed one at a time
• Apply the Robbins-Monro procedure to finding the roots of the regression function given by the derivative of J with respect to μ_k (see the sketch below)
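For concreteness, here is a minimal sketch of the sequential update the Robbins-Monro treatment yields: move the nearest mean toward each newly arriving point, μ_k ← μ_k + η_n (x_n − μ_k), with a decaying learning rate η_n. The function name and the 1/count rate schedule are illustrative choices, not prescribed by the slides.

import numpy as np

def online_kmeans_step(mu, counts, x):
    """One on-line K-means update: move the nearest mean toward point x.
    mu: (K, D) float array of current means; counts: (K,) points seen per cluster."""
    k = ((mu - x) ** 2).sum(axis=1).argmin()  # assign x to its nearest cluster
    counts[k] += 1
    eta = 1.0 / counts[k]                     # decaying Robbins-Monro-style rate
    mu[k] += eta * (x - mu[k])                # mu_k <- mu_k + eta (x_n - mu_k)
    return mu, counts

With η_n = 1/(points assigned to cluster k so far), each mean is exactly the running average of the points assigned to it, consistent with the batch M-step solution.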