Unsupervised Learning
Clustering
K-means

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
November 18th, 2009
©2005-2009 Carlos Guestrin

Some Data
[Figure: scatterplot of unlabeled data points.]

K-means
1. Ask user how many clusters they'd like. (e.g., k = 5)
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to. (Thus each center "owns" a set of datapoints.)
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. ...Repeat until terminated!

K-means
Randomly initialize k centers: $\mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}$

Classify: assign each point $j \in \{1, \ldots, m\}$ to the nearest center:
$$C(j) \leftarrow \arg\min_i \; \lVert \mu_i - x_j \rVert_2^2$$

Recenter: $\mu_i$ becomes the centroid of its points:
$$\mu_i \leftarrow \arg\min_{\mu} \sum_{j : C(j) = i} \lVert \mu - x_j \rVert_2^2$$
Equivalent to $\mu_i \leftarrow$ average of its points!

What is K-means optimizing?
Potential function $F(\mu, C)$ of centers $\mu$ and point allocations $C$:
$$F(\mu, C) = \sum_{j=1}^{m} \lVert \mu_{C(j)} - x_j \rVert_2^2$$
Optimal K-means: $\min_{\mu} \min_{C} F(\mu, C)$

Does K-means converge??? Part 1
Optimize the potential function: fix $\mu$, optimize $C$.
With the centers fixed, assigning each point to its closest center minimizes each term of $F$ separately, so the Classify step can only decrease $F(\mu, C)$.

Does K-means converge??? Part 2
Optimize the potential function: fix $C$, optimize $\mu$.
With the assignments fixed, $F$ is a sum of quadratics in each $\mu_i$, minimized by the centroid of the points center $i$ owns, so the Recenter step can only decrease $F(\mu, C)$.

Coordinate descent algorithms
Want: $\min_a \min_b F(a, b)$
Coordinate descent:
  fix a, minimize over b
  fix b, minimize over a
  repeat
Converges!!!
  if F is bounded
  to a (often good) local optimum
  as we saw in the applet (play with it!)
K-means is a coordinate descent algorithm!
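The lecture leaves the two steps as pseudocode and handwritten derivations; as an illustration only, here is a minimal NumPy sketch of the Classify/Recenter loop (the function name `kmeans` and its arguments are my own, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means sketch: alternate Classify and Recenter until the
    assignments stop changing (coordinate descent on the potential F)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Randomly initialize k centers by picking k distinct data points.
    centers = X[rng.choice(m, size=k, replace=False)].astype(float)
    assignments = np.full(m, -1)

    for _ in range(max_iters):
        # Classify: assign each point j to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # neither step can decrease F further: local optimum
        assignments = new_assignments
        # Recenter: each center jumps to the centroid of the points it owns.
        for i in range(k):
            owned = X[assignments == i]
            if len(owned) > 0:
                centers[i] = owned.mean(axis=0)

    # Potential F(mu, C): sum of squared distances to the assigned centers.
    F = ((X - centers[assignments]) ** 2).sum()
    return centers, assignments, F
```

Each pass can only lower F, and F ≥ 0, so the loop terminates at a local optimum of the potential, which is exactly the coordinate-descent argument above.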
EM
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
November 18th, 2009
©2005-2009 Carlos Guestrin

(One) bad case for k-means
  Clusters may overlap
  Some clusters may be "wider" than others

Gaussian Bayes Classifier Reminder
$$P(y = i \mid x_j) = \frac{p(x_j \mid y = i)\, P(y = i)}{p(x_j)}$$
$$P(y = i \mid x_j) \propto \frac{1}{(2\pi)^{m/2}\,|\Sigma_i|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right) P(y = i)$$

Predicting wealth from age
[Figure: joint density over age and wealth.]

Learning modelyear, mpg ---> maker
$$\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2
\end{pmatrix}$$

General: O(m²) parameters
A full covariance matrix (as above): every pairwise covariance is free.

Aligned: O(m) parameters
A diagonal covariance gives axis-aligned Gaussians:
$$\Sigma = \begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 & 0 \\
0 & \sigma_2^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\
0 & 0 & \cdots & 0 & \sigma_m^2
\end{pmatrix}$$

Spherical: O(1) cov parameters
A single variance shared by all dimensions:
$$\Sigma = \sigma^2 I = \begin{pmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{pmatrix}$$

Next… back to Density Estimation
What if we want to do density estimation with multimodal or "clumpy" data?

But we don't see class labels!!!
MLE: $\arg\max \prod_j P(y_j, x_j)$
But we don't know the $y_j$'s!!!
Maximize the marginal likelihood:
$$\arg\max \prod_j P(x_j) \;=\; \arg\max \prod_j \sum_{i=1}^{k} P(y_j = i, x_j)$$

Special case: spherical Gaussians and hard assignments
If $P(X \mid Y = i)$ is spherical, with the same $\sigma$ for all classes:
$$P(x_j \mid y = i) \propto \exp\!\left( -\frac{1}{2\sigma^2} \lVert x_j - \mu_i \rVert^2 \right)$$
If each $x_j$ belongs to exactly one class $C(j)$ (hard assignment), the marginal likelihood is:
$$\prod_{j=1}^{m} \sum_{i=1}^{k} P(x_j, y = i) \;\propto\; \prod_{j=1}^{m} \exp\!\left( -\frac{1}{2\sigma^2} \lVert x_j - \mu_{C(j)} \rVert^2 \right)$$
Same as K-means!!!

The GMM assumption
• There are k components.
• Component i has an associated mean vector µ_i.
• Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I.
Each data point is generated according to the following recipe:
1. Pick a component at random: choose component i with probability P(y = i).
2. Datapoint ~ N(µ_i, σ²I)

The General GMM assumption
• There are k components.
• Component i has an associated mean vector µ_i.
• Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i.
Each data point is generated according to the following recipe:
1. Pick a component at random: choose component i with probability P(y = i).
2. Datapoint ~ N(µ_i, Σ_i)
(A sketch of this recipe in code follows at the end of these notes.)

Unsupervised Learning: not as hard as it looks
Sometimes…
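To make the generative recipe from the GMM slides concrete, here is a small sketch, assuming NumPy and SciPy, that samples points exactly as described (pick component i with probability P(y=i), then draw from N(µ_i, Σ_i)) and evaluates the marginal likelihood $\prod_j \sum_i P(y_j = i, x_j)$ that must be maximized when the labels are unobserved. The helper names and the 3-component example parameters are mine, not the lecture's.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_gmm(weights, means, covs, n, seed=0):
    """General GMM recipe: 1) pick component i with probability P(y=i),
    2) draw the datapoint from N(mu_i, Sigma_i)."""
    rng = np.random.default_rng(seed)
    ys = rng.choice(len(weights), size=n, p=weights)
    xs = np.array([rng.multivariate_normal(means[i], covs[i]) for i in ys])
    return xs, ys

def log_marginal_likelihood(X, weights, means, covs):
    """log prod_j sum_i P(y=i) N(x_j; mu_i, Sigma_i): the objective to
    maximize when the labels y_j are not observed."""
    per_component = np.column_stack([
        w * multivariate_normal.pdf(X, mean=mu, cov=cov)
        for w, mu, cov in zip(weights, means, covs)
    ])
    return np.log(per_component.sum(axis=1)).sum()

# A hypothetical 3-component mixture in 2-D (parameters made up for illustration).
weights = [0.5, 0.3, 0.2]
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0]), np.array([0.0, 4.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.array([[1.0, 0.8], [0.8, 1.0]])]

X, y = sample_gmm(weights, means, covs, n=500)
print(log_marginal_likelihood(X, weights, means, covs))
```

With the general per-component covariances Σ_i, each cluster can be wider or tilted differently, which is exactly the situation the "(One) bad case for k-means" slide points at.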