Slide 1: Unsupervised Learning or Clustering (cont.): K-means and Gaussian Mixture Models
Machine Learning 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
April 5th, 2006

Slide 2: Some Data
[figure: unlabeled data points]

Slide 3: K-means
1. Ask user how many clusters they'd like (e.g., k = 5)
2. Randomly guess k cluster center locations
3. Each datapoint finds out which center it's closest to
4. Each center finds the centroid of the points it owns...
5. ...and jumps there
6. Repeat until terminated!

Slide 4: K-means
- Randomly initialize k centers: \mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}
- Classify: assign each point j \in \{1, \ldots, m\} to the nearest center
- Recenter: \mu_i becomes the centroid of its points; equivalent to \mu_i \leftarrow average of its points

Slide 5: What is K-means optimizing?
- Potential function F(\mu, C) of centers \mu and point allocations C:
  F(\mu, C) = \sum_{j=1}^{m} \|\mu_{C(j)} - x_j\|^2
- Optimal K-means: \min_{\mu} \min_{C} F(\mu, C)

Slide 6: Does K-means converge? Part 1
- Optimize the potential function: fix \mu, optimize C

Slide 7: Does K-means converge? Part 2
- Optimize the potential function: fix C, optimize \mu

Slide 8: Coordinate descent algorithms
- Want: \min_a \min_b F(a, b)
- Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat
- Converges (if F is bounded) to a (often good) local optimum, as we saw in the applet (play with it!)
- K-means is a coordinate descent algorithm (a minimal code sketch of the loop appears after the GMM-assumption slides below)

Slide 9: One bad case for K-means
- Clusters may overlap
- Some clusters may be wider than others

Slide 10: Gaussian Bayes Classifier reminder
P(y = i \mid x_j) = \frac{p(x_j \mid y = i)\, P(y = i)}{p(x_j)}
P(y = i \mid x_j) \propto \frac{1}{(2\pi)^{m/2} \|\Sigma_i\|^{1/2}} \exp\!\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right) P(y = i)

Slides 11-12: Predicting wealth from age
[figures]

Slide 13: Learning: modelyear, mpg, maker
\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}

Slide 14: General: O(m^2) parameters
[same general covariance matrix \Sigma as above]

Slides 15-16: Aligned: O(m) parameters
\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_m^2 \end{pmatrix}

Slides 17-18: Spherical: O(1) covariance parameters
\Sigma = \sigma^2 I = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix}

Slide 19: Next... back to Density Estimation
- What if we want to do density estimation with multimodal or clumpy data?

Slide 20: But we don't see class labels!
- MLE: \max_\theta \prod_j P(y_j, x_j)
- But we don't know the y_j's!
- Maximize the marginal likelihood: \max_\theta \prod_j P(x_j) = \max_\theta \prod_j \sum_{i=1}^{k} P(y_j = i, x_j)

Slide 21: Special case: spherical Gaussians and hard assignments
P(x_j \mid y = i) = \frac{1}{(2\pi)^{m/2} \|\Sigma_i\|^{1/2}} \exp\!\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right)
- If P(X \mid Y = i) is spherical, with the same \sigma for all classes:
  P(x_j \mid y = i) \propto \exp\!\left( -\frac{1}{2\sigma^2} \|x_j - \mu_i\|^2 \right)
- If each x_j belongs to one class C(j) (hard assignment), the marginal likelihood is:
  \prod_{j=1}^{m} \sum_{i=1}^{k} P(x_j, y = i) \propto \prod_{j=1}^{m} \exp\!\left( -\frac{1}{2\sigma^2} \|x_j - \mu_{C(j)}\|^2 \right)
- Same as K-means!

Slides 22-25: The GMM assumption
- There are k components
- Component i has an associated mean vector \mu_i
- Each component generates data from a Gaussian with mean \mu_i and covariance matrix \sigma^2 I
- Each data point is generated according to the following recipe:
  1. Pick a component at random: choose component i with probability P(y = i)
  2. Datapoint \sim N(\mu_i, \sigma^2 I)
[figure: example with means \mu_1, \mu_2, \mu_3]
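Referring back to the K-means loop of slides 3-8, the following is a minimal sketch (not part of the original slides) of the coordinate-descent iteration, assuming Euclidean distance, initialization by picking k random data points, and termination when the centers stop moving; NumPy is used for the vectorized distance computation.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: coordinate descent on F(mu, C) = sum_j ||mu_C(j) - x_j||^2."""
    rng = np.random.default_rng(seed)
    # Randomly initialize k centers by picking k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Classify: assign each point to its nearest center (fix mu, optimize C).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recenter: each center moves to the centroid of its points (fix C, optimize mu).
        new_centers = np.array([
            X[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: the potential F can no longer decrease
        centers = new_centers
    return centers, assign
```

Each pass performs exactly the two steps of slides 6-7: the assignment step fixes \mu and optimizes C, the recentering step fixes C and optimizes \mu, so F(\mu, C) never increases.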
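Before moving to the general GMM, here is a small sketch (again not from the slides) of the generative recipe of slides 22-25 for a spherical GMM. The function name and the means, mixing probabilities P(y = i), and \sigma in the example call are made-up illustration values.

```python
import numpy as np

def sample_spherical_gmm(mus, priors, sigma, n, seed=0):
    """Draw n points by the recipe above: pick component i w.p. P(y=i), then x ~ N(mu_i, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    mus = np.asarray(mus, dtype=float)        # shape (k, d): one mean vector per component
    k, d = mus.shape
    ys = rng.choice(k, size=n, p=priors)      # step 1: component labels y_j drawn from P(y)
    xs = mus[ys] + sigma * rng.standard_normal((n, d))  # step 2: x_j ~ N(mu_{y_j}, sigma^2 I)
    return xs, ys

# e.g. three components in 2-d with unequal mixing weights (illustration values only)
X, y = sample_spherical_gmm(mus=[[0, 0], [5, 5], [0, 6]], priors=[0.5, 0.25, 0.25], sigma=1.0, n=300)
```

Replacing the isotropic-noise line with a per-component draw such as rng.multivariate_normal(mus[i], Sigmas[i]) gives the general GMM of the next slide, where each component has its own covariance \Sigma_i.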
Slide 26: The General GMM assumption
- There are k components
- Component i has an associated mean vector \mu_i
- Each component generates data from a Gaussian with mean \mu_i and covariance matrix \Sigma_i
- Each data point is generated according to the following recipe:
  1. Pick a component at random: choose component i with probability P(y = i)
  2. Datapoint \sim N(\mu_i, \Sigma_i)
[figure: example with means \mu_1, \mu_2, \mu_3]

Slide 27: Unsupervised Learning: not as hard as it looks
- Sometimes easy
- Sometimes impossible
- ...and sometimes in between
(In case you're wondering what these diagrams are, they show 2-d unlabeled data (X vectors) distributed in 2-d space. The top one has three very clear Gaussian centers.)

Slide 28: Marginal likelihood for the general case
P(x_j \mid y = i) = \frac{1}{(2\pi)^{m/2} \|\Sigma_i\|^{1/2}} \exp\!\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right)
Marginal likelihood:
\prod_{j=1}^{m} P(x_j) = \prod_{j=1}^{m} \sum_{i=1}^{k} P(x_j, y = i)
= \prod_{j=1}^{m} \sum_{i=1}^{k} \frac{1}{(2\pi)^{m/2} \|\Sigma_i\|^{1/2}} \exp\!\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right) P(y = i)

Slide 29: Special case 2: spherical Gaussians and soft assignments
- If P(X \mid Y = i) is spherical, with the same \sigma for all classes:
  P(x_j \mid y = i) \propto \exp\!\left( -\frac{1}{2\sigma^2} \|x_j - \mu_i\|^2 \right)
- If we are uncertain about the class of each x_j (soft assignment), the marginal likelihood is:
  \prod_{j=1}^{m} \sum_{i=1}^{k} P(x_j, y = i) \propto \prod_{j=1}^{m} \sum_{i=1}^{k} \exp\!\left( -\frac{1}{2\sigma^2} \|x_j - \mu_i\|^2 \right) P(y = i)

Slide 30: Unsupervised Learning: Mediumly Good News
We now have a procedure such that if you give me a guess at \mu_1, \mu_2, \ldots, \mu_k, I can tell you the probability of the unlabeled data given those \mu's.
Suppose the x's are 1-dimensional. There are two classes, w1 and w2, with P(y_1) = 1/3, P(y_2) = 2/3, \sigma = 1. (From Duda and Hart)
There are 25 unlabeled datapoints: x_1 = 0.608, x_2 = 1.590, x_3 = 0.235, x_4 = 3.949, ..., x_{25} = 0.712

Slide 31: Duda & Hart's Example
We can graph the prob. dist. function of the data given our \mu_1 and \mu_2 estimates. We can also graph the …
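A short sketch (not from the slides) of the "procedure" claimed on slide 30: given a guess at the means, the known class priors, and \sigma, it evaluates the log of the marginal likelihood from slide 29 for 1-dimensional data. The guessed means -2.0 and 2.0 are arbitrary illustration values, and only the four datapoints listed on the slide are used here.

```python
import numpy as np

def log_marginal_likelihood(x, mus, priors, sigma=1.0):
    """log prod_j sum_i P(y=i) N(x_j; mu_i, sigma^2) for 1-d data x and guessed means mus."""
    x = np.asarray(x, dtype=float)[:, None]            # shape (m, 1)
    mus = np.asarray(mus, dtype=float)[None, :]        # shape (1, k)
    priors = np.asarray(priors, dtype=float)[None, :]  # shape (1, k)
    densities = np.exp(-(x - mus) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return np.sum(np.log(np.sum(priors * densities, axis=1)))

# First few of the slide's 25 unlabeled points; priors and sigma as in the Duda & Hart setup.
some_x = [0.608, 1.590, 0.235, 3.949]
print(log_marginal_likelihood(some_x, mus=[-2.0, 2.0], priors=[1/3, 2/3], sigma=1.0))
```

Trying different guesses for the means and comparing the resulting values is the sense in which slide 30 calls this "mediumly good news": any guess at the \mu's can be scored, even though nothing yet says how to find the best one.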