EM

Machine Learning 10-701/15-781
Carlos Guestrin
Carnegie Mellon University
November 19th, 2007

(c) 2005-2007 Carlos Guestrin

K-means

1. Ask user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to.
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated!

K-means

- Randomly initialize k centers: $\mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}$.
- Classify: assign each point $j \in \{1, \ldots, m\}$ to the nearest center:

  $$C(j) \leftarrow \arg\min_i \|\mu_i - \mathbf{x}_j\|_2^2$$

- Recenter: $\mu_i$ becomes the centroid of its points:

  $$\mu_i^{(t+1)} \leftarrow \arg\min_{\mu} \sum_{j : C(j) = i} \|\mu - \mathbf{x}_j\|_2^2$$

  Equivalent to $\mu_i \leftarrow$ average of its points!

Does K-means converge? Part 2: Optimize potential function

$$\min_{\mu} \min_{C} F(\mu, C) = \min_{\mu} \min_{C} \sum_{j=1}^{m} \|\mu_{C(j)} - \mathbf{x}_j\|_2^2$$

- Fix C, optimize $\mu$.

Coordinate descent algorithms

- Want: $\min_a \min_b F(a, b)$.
- Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat.
- Converges!!! (if F is bounded) to a (often good) local optimum, as we saw in the applet (play with it!).
- K-means is a coordinate descent algorithm!
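The classify/recenter loop above translates directly into code. Below is a minimal NumPy sketch of K-means as described on these slides; the function name, the initialization scheme (k random data points), and the iteration cap are illustrative assumptions, not something the lecture fixes.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: X is (m, d), k is the number of clusters."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Randomly guess k cluster center locations (here: k distinct data points).
    centers = X[rng.choice(m, size=k, replace=False)]
    for _ in range(n_iters):
        # Classify: each datapoint finds out which center it's closest to.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Recenter: each center becomes the centroid of the points it owns.
        new_centers = np.array([
            X[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):  # terminated: no center moved
            break
        centers = new_centers
    return centers, assign
```

Note that the recenter step is exactly the "average of its points" update, so each half-iteration can only decrease the potential F(mu, C), which is the coordinate-descent view of convergence above.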
One bad case for K-means

- Clusters may overlap.
- Some clusters may be "wider" than others.

Gaussian Bayes Classifier reminder

$$P(y=i \mid \mathbf{x}_j) = \frac{p(\mathbf{x}_j \mid y=i)\, P(y=i)}{p(\mathbf{x}_j)}$$

$$P(y=i \mid \mathbf{x}_j) \propto \frac{1}{(2\pi)^{m/2}\,\|\Sigma_i\|^{1/2}} \exp\!\left[-\frac{1}{2}(\mathbf{x}_j - \mu_i)^T \Sigma_i^{-1} (\mathbf{x}_j - \mu_i)\right] P(y=i)$$

Predicting wealth from age

[Figure slides: plots of wealth against age.]

Learning modelyear, mpg, maker

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$

General: O(m^2) parameters

(The same full covariance matrix as above.)

Aligned: O(m) parameters

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma_2^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma_3^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma_m^2 \end{pmatrix}$$

Spherical: O(1) cov parameters

$$\Sigma = \sigma^2 I = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix}$$

Next... back to density estimation

What if we want to do density estimation with multimodal or clumpy data?

But we don't see class labels!

- MLE: $\arg\max \prod_j P(y_j, \mathbf{x}_j)$
- But we don't know the $y_j$'s!!!
- Maximize the marginal likelihood:

  $$\arg\max \prod_j P(\mathbf{x}_j) = \arg\max \prod_j \sum_{i=1}^{k} P(y_j = i, \mathbf{x}_j)$$

Special case: spherical Gaussians and hard assignments

- If $P(X \mid Y=i)$ is spherical, with the same $\sigma$ for all classes:

  $$P(\mathbf{x}_j \mid y=i) \propto \exp\!\left[-\frac{1}{2\sigma^2}\|\mathbf{x}_j - \mu_i\|_2^2\right]$$

- If each $\mathbf{x}_j$ belongs to one class $C(j)$ (hard assignment), the marginal likelihood is:

  $$\prod_{j=1}^{m} \sum_{i=1}^{k} P(\mathbf{x}_j, y=i) \propto \prod_{j=1}^{m} \exp\!\left[-\frac{1}{2\sigma^2}\|\mathbf{x}_j - \mu_{C(j)}\|_2^2\right]$$

- Same as K-means!!!

The GMM assumption

- There are k components. Component i has an associated mean vector $\mu_i$.
- Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$.
- Each data point is generated according to the following recipe:
  1. Pick a component at random: choose component i with probability $P(y=i)$.
  2. Datapoint $\sim N(\mu_i, \sigma^2 I)$.

The general GMM assumption

- There are k components. Component i has an associated mean vector $\mu_i$.
- Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$.
- Each data point is generated according to the following recipe:
  1. Pick a component at random: choose component i with probability $P(y=i)$.
  2. Datapoint $\sim N(\mu_i, \Sigma_i)$.

Unsupervised learning: not as hard as it looks

- Sometimes easy.
- Sometimes impossible.
- And sometimes in between.

[The diagrams on these slides show 2-d unlabeled data (x vectors) distributed in 2-d space; the top one has three very clear Gaussian centers.]

Marginal likelihood for the general case

$$\prod_{j=1}^{m} P(\mathbf{x}_j) = \prod_{j=1}^{m} \sum_{i=1}^{k} P(\mathbf{x}_j, y=i)$$

$$= \prod_{j=1}^{m} \sum_{i=1}^{k} \frac{1}{(2\pi)^{m/2}\,\|\Sigma_i\|^{1/2}} \exp\!\left[-\frac{1}{2}(\mathbf{x}_j - \mu_i)^T \Sigma_i^{-1} (\mathbf{x}_j - \mu_i)\right] P(y=i)$$

Special case 2: spherical Gaussians and soft assignments

- If $P(X \mid Y=i)$ is spherical, with the same $\sigma$ for all classes:

  $$P(\mathbf{x}_j \mid y=i) \propto \exp\!\left[-\frac{1}{2\sigma^2}\|\mathbf{x}_j - \mu_i\|_2^2\right]$$

- Uncertain about the class of each $\mathbf{x}_j$ (soft assignment), the marginal likelihood is:

  $$\prod_{j=1}^{m} \sum_{i=1}^{k} P(\mathbf{x}_j, y=i) \propto \prod_{j=1}^{m} \sum_{i=1}^{k} \exp\!\left[-\frac{1}{2\sigma^2}\|\mathbf{x}_j - \mu_i\|_2^2\right] P(y=i)$$
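To make the soft-assignment likelihood concrete, here is a short NumPy sketch that evaluates the log marginal likelihood of a spherical Gaussian mixture, i.e., the product over datapoints of the sum over components shown above (with the normalizing constant included). The function and argument names are my own, and working in log space with a log-sum-exp is a standard numerical safeguard, not something the slides prescribe.

```python
import numpy as np

def log_marginal_likelihood(X, mus, priors, sigma):
    """log of prod_j sum_i P(x_j, y=i) for a spherical GMM.
    X: (m, d) datapoints; mus: (k, d) means; priors: (k,) values P(y=i)."""
    m, d = X.shape
    # Squared distances ||x_j - mu_i||^2, shape (m, k).
    sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    # log N(x_j; mu_i, sigma^2 I) + log P(y=i), shape (m, k).
    log_joint = (-0.5 * sq / sigma**2
                 - 0.5 * d * np.log(2 * np.pi * sigma**2)
                 + np.log(priors))
    # log P(x_j) = logsumexp over components, then sum over j for the log product.
    mx = log_joint.max(axis=1, keepdims=True)
    log_px = mx[:, 0] + np.log(np.exp(log_joint - mx).sum(axis=1))
    return log_px.sum()

# Example usage with made-up 2-d data and two guessed means:
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3.0])
print(log_marginal_likelihood(X, np.array([[0., 0.], [3., 3.]]),
                              np.array([0.5, 0.5]), sigma=1.0))
```

The soft assignments $P(y=i \mid \mathbf{x}_j)$ are just the rows of `log_joint` exponentiated and normalized, which is the quantity the next part of the lecture builds on.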
Unsupervised learning: mediumly good news

We now have a procedure such that, if you give me a guess at $\mu_1, \mu_2, \ldots, \mu_k$, I can tell you the probability of the unlabeled data given those $\mu$'s.

Suppose the x's are 1-dimensional. (From Duda and Hart.) There are two classes, $w_1$ and $w_2$, with $P(y_1) = 1/3$, $P(y_2) = 2/3$, and $\sigma = 1$. There are 25 unlabeled datapoints:

$x_1 = 0.608$, $x_2 = -1.590$, $x_3 = 0.235$, $x_4 = 3.949$, ..., $x_{25} = -0.712$

Duda & Hart's example

We can graph the probability distribution function of the data given our $\mu_1$ and $\mu_2$ estimates. We can also graph the true function from which the data was randomly generated. They are close. Good. The 2nd solution tries to put the "2/3" hump where the "1/3" hump should go, and vice versa. In this example, unsupervised is almost as good as ...
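A small sketch of that "procedure": given guesses for $\mu_1$ and $\mu_2$, score the unlabeled data under the two-component, $\sigma = 1$ mixture from the slide. Only five of the 25 datapoints ($x_1$ through $x_4$ and $x_{25}$) appear in the text, so the code below uses just those as a stand-in, and the candidate means passed at the end are arbitrary illustrative guesses.

```python
import numpy as np

# Duda & Hart setup from the slide: two classes, P(y1)=1/3, P(y2)=2/3, sigma=1.
PRIORS = np.array([1/3, 2/3])
SIGMA = 1.0

# Only five of the 25 unlabeled datapoints are listed on the slide
# (x1..x4 and x25); the rest are elided in the source.
X = np.array([0.608, -1.590, 0.235, 3.949, -0.712])

def log_lik(mu1, mu2, x=X, priors=PRIORS, sigma=SIGMA):
    """log prod_j [ P(y1) N(x_j; mu1, sigma^2) + P(y2) N(x_j; mu2, sigma^2) ]"""
    mus = np.array([mu1, mu2])
    dens = (np.exp(-0.5 * ((x[:, None] - mus[None, :]) / sigma) ** 2)
            / (sigma * np.sqrt(2 * np.pi)))
    return np.log((dens * priors).sum(axis=1)).sum()

# "Give me a guess at mu_1, mu_2 and I can tell you the prob of the data":
print(log_lik(-2.0, 2.0), log_lik(2.0, -2.0))  # compare two candidate guesses
```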