Unsupervised Learning – Clustering: K-means
Machine Learning 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
November 18th, 2009
© 2005-2009 Carlos Guestrin

Some Data
[Figure: a scatter plot of unlabeled 2-d data points.]

K-means
1. Ask the user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to. (Thus each center "owns" a set of datapoints.)
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. ...Repeat until terminated!

K-means
Randomly initialize k centers: \mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}.
Classify: assign each point j \in \{1, \ldots, m\} to the nearest center:
    C(j) \leftarrow \arg\min_i \|\mu_i - x_j\|^2
Recenter: \mu_i becomes the centroid of the points it owns:
    \mu_i^{(t+1)} \leftarrow \arg\min_\mu \sum_{j : C(j) = i} \|\mu - x_j\|^2
Equivalent to: \mu_i \leftarrow average of its points.

What is K-means optimizing?
A potential function F(\mu, C) of the centers \mu and the point allocations C:
    F(\mu, C) = \sum_{j=1}^m \|\mu_{C(j)} - x_j\|^2
Optimal K-means: \min_\mu \min_C F(\mu, C).

Does K-means converge? Part 1
Optimize the potential function: fix \mu, optimize C.
Assigning each point to its nearest center can only decrease F.

Does K-means converge? Part 2
Optimize the potential function: fix C, optimize \mu.
Moving each center to the mean of the points it owns can only decrease F.

Coordinate descent algorithms
Want: \min_a \min_b F(a, b).
Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat.
Converges (if F is bounded) to a (often good) local optimum, as we saw in the applet (play with it!).
K-means is a coordinate descent algorithm.
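To make the coordinate-descent view concrete, here is a minimal NumPy sketch of the K-means loop described above. It is an illustration under assumed conventions (names such as `kmeans`, `X`, and `potential` are mine, not from the slides): the classify step fixes the centers and optimizes the assignments C, the recenter step fixes C and optimizes the centers, so F(\mu, C) never increases.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means (coordinate descent) on an (m, d) data matrix X."""
    rng = np.random.default_rng(seed)
    # Randomly initialize k centers by picking k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Classify: assign each point to its nearest center (fix mu, optimize C).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        C = dists.argmin(axis=1)
        # Recenter: each center jumps to the centroid of the points it owns (fix C, optimize mu).
        new_centers = np.array([
            X[C == i].mean(axis=0) if np.any(C == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):  # centers stopped moving: converged
            break
        centers = new_centers
    potential = ((X - centers[C]) ** 2).sum()  # F(mu, C) = sum_j ||mu_C(j) - x_j||^2
    return centers, C, potential
```

Since each step can only lower a bounded potential, the loop stops at a (possibly local) optimum that depends on the random initialization.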
EM
Machine Learning 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
November 18th, 2009

One bad case for K-means
Clusters may overlap.
Some clusters may be "wider" than others.

Gaussian Bayes Classifier Reminder
    P(y = i \mid x_j) = \frac{p(x_j \mid y = i)\, P(y = i)}{p(x_j)}
    p(x_j \mid y = i)\, P(y = i) = \frac{1}{(2\pi)^{m/2} \|\Sigma_i\|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right) P(y = i)

Predicting wealth from age
[Figures: wealth plotted against age.]

Learning: modelyear, mpg, maker

General: O(m^2) parameters
Full covariance matrix:
    \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}

Aligned: O(m) parameters
Diagonal covariance matrix:
    \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_m^2)

Spherical: O(1) covariance parameters
    \Sigma = \sigma^2 I

Next... back to Density Estimation
What if we want to do density estimation with multimodal or "clumpy" data?

But we don't see class labels!
MLE: \arg\max \prod_j P(y_j, x_j)
But we don't know the y_j's.
Maximize the marginal likelihood instead:
    \arg\max \prod_j P(x_j) = \arg\max \prod_j \sum_{i=1}^k P(y_j = i, x_j)

Special case: spherical Gaussians and hard assignments
If P(X \mid Y = i) is spherical, with the same \sigma for all classes:
    P(x_j \mid y = i) \propto \exp\!\left(-\tfrac{1}{2\sigma^2}\|x_j - \mu_i\|^2\right)
If each x_j belongs to exactly one class C(j) (hard assignment), the marginal likelihood becomes:
    \prod_{j=1}^m \sum_{i=1}^k P(x_j, y = i) \propto \prod_{j=1}^m \exp\!\left(-\tfrac{1}{2\sigma^2}\|x_j - \mu_{C(j)}\|^2\right)
Maximizing this is the same as K-means!

The GMM assumption
There are k components.
Component i has an associated mean vector \mu_i.
Each component generates data from a Gaussian with mean \mu_i and covariance matrix \sigma^2 I.
Each data point is generated according to the following recipe:
1. Pick a component at random: choose component i with probability P(y = i).
2. Datapoint \sim N(\mu_i, \sigma^2 I).

The General GMM assumption
There are k components; component i has an associated mean vector \mu_i.
Each component generates data from a Gaussian with mean \mu_i and covariance matrix \Sigma_i.
Each data point is generated according to the following recipe:
1. Pick a component at random: choose component i with probability P(y = i).
2. Datapoint \sim N(\mu_i, \Sigma_i).

Unsupervised Learning: not as hard as it looks
Sometimes easy. Sometimes impossible.
(In case you're wondering what these diagrams are, they show 2-d …)
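As an illustration of the two-step generative recipe in the GMM slides above, here is a small sketch that samples data from a general GMM. The mixture weights, means, and covariances in the example are assumed values chosen only for demonstration, not parameters from the lecture.

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Draw n points from a GMM following the two-step recipe above."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights)
    # Step 1: pick a component at random, component i with probability P(y = i).
    y = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw each datapoint from N(mu_i, Sigma_i) for its chosen component.
    X = np.array([rng.multivariate_normal(means[i], covs[i]) for i in y])
    return X, y

# Example with assumed parameters: three 2-d components.
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0]), np.array([0.0, 5.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])]
X, y = sample_gmm(500, weights=[0.5, 0.3, 0.2], means=means, covs=covs)
```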
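A companion sketch (again with assumed helper names, and reusing X, means, and covs from the sampling example above) evaluates the marginal log-likelihood \log \prod_j \sum_{i=1}^k P(y_j = i, x_j) that the density-estimation slides say we should maximize when the labels y_j are unobserved.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """Marginal log-likelihood: sum_j log sum_i P(y=i) * N(x_j; mu_i, Sigma_i)."""
    # Per-point, per-component joint densities P(y=i) * p(x_j | y=i), shape (m, k).
    joint = np.column_stack([
        w * multivariate_normal.pdf(X, mean=mu, cov=cov)
        for w, mu, cov in zip(weights, means, covs)
    ])
    return np.log(joint.sum(axis=1)).sum()

# Evaluate on the sample drawn in the previous sketch (assumed parameters).
print(gmm_log_likelihood(X, [0.5, 0.3, 0.2], means, covs))
```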