COS 424: Interacting with Data
Lecture # 8, February 28, 2008
Lecturer: David Blei
Scribes: Ana Pop, Hanjun Kim

1 Announcements

Homework 2 will come out soon. It will be very long, so start early, please.

2 Introduction

In the previous lectures we talked about supervised methods: we learn from given data and then apply this knowledge to a new data point, predicting the category of the new point. We will now look at unsupervised learning methods, where it is not clear what categories, if any, the data fall into.

3 k-means Clustering

3.1 Clustering

Clustering is used to segment data into groups of similar items. It is useful for automatically organizing data, for finding hidden structure in the data, and as a form of compression (for example, representing 3000 data points together by a single bin labeled 1).

Some examples of when this would be useful are predicting buying patterns of customers, finding patterns/groups of genes to learn the structure of the data at a higher level, and grouping MySpace users according to different interests. Google could use clustering to group search results for "jaguar" into "car", "animal", and "OS" categories.

3.2 Clustering Set-Up

Clustering data such as emails, gene expression profiles, or purchase histories are represented as $D = \{x_1, \dots, x_N\}$. Since the data are p-dimensional, we write each point as $x_n = \langle x_{n,1}, \dots, x_{n,p} \rangle$. The distance between two data points is given by a distance function $d(x_n, x_m)$. The assignments to the K groups we want to divide the data into are $\{z_1, \dots, z_N\}$, where $z_n \in \{1, \dots, K\}$. So we want to assign a label z_n to each data point x_n. Note that we could assign labels randomly, but we want the assignment to be meaningful.

3.3 K-Means on Example Data

Consider the data in Figure 1. A good distance function to use is the squared Euclidean distance $d(x_n, x_m) = \sum_{i=1}^{p} (x_{n,i} - x_{m,i})^2 = \|x_n - x_m\|^2$. We now want to segment the data into k groups. We choose k = 4 for now because finding k is complicated.

Figure 1: 500 2-dimensional data points $x_n = \langle x_{n,1}, x_{n,2} \rangle$

Consider the intuitive steps that the k-means algorithm takes, as shown in Figure 2. We begin with k = 4 randomly placed initial means, assign each data point to its closest mean, recompute each mean from the points assigned to it, and repeat. The detailed algorithm is as follows:

1. Initialization
   (a) Data is $x_{1:N}$
   (b) Randomly pick initial cluster means $m_{1:k}$
2. Repeat
   (a) Assign each data point to its closest mean,
       $z_n = \arg\min_{i \in \{1,\dots,k\}} d(x_n, m_i)$
   (b) Recompute each cluster mean as the average of the data points assigned to it,
       $m_k = \frac{1}{N_k} \sum_{n : z_n = k} x_n$
3. Until assignments $z_{1:N}$ do not change

Figure 2: Progression of the k-means clustering algorithm on the data from Figure 1

3.4 Objective Function

We measure how well the algorithm is doing using the sum of squared distances of each point to its assigned mean,
$F(z_{1:N}, m_{1:k}) = \frac{1}{2} \sum_{n=1}^{N} \|x_n - m_{z_n}\|^2$.
Remember that x_n is a data point and m_{z_n} is the mean it is assigned to. The objective function for our example is shown in Figure 3. We detect convergence by looking at the relative change in F between successive rounds and stop when we have reached what we deem a negligible difference.

Figure 3: Objective function for the clusters in Figure 2
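To make the procedure concrete, here is a minimal sketch of the algorithm and objective in Python/NumPy. The synthetic data, the choice k = 4, and the initialization scheme are illustrative assumptions, not something prescribed by the lecture.

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        """Plain k-means: alternate assignments and mean updates until
        the assignments z_{1:N} stop changing (or max_iters is hit)."""
        rng = np.random.default_rng(seed)
        N = X.shape[0]
        # 1(b): randomly pick k data points as the initial cluster means.
        means = X[rng.choice(N, size=k, replace=False)].astype(float)
        z = np.full(N, -1)
        for _ in range(max_iters):
            # 2(a): assign each point to its closest mean
            # (squared Euclidean distance).
            dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
            new_z = dists.argmin(axis=1)
            if np.array_equal(new_z, z):
                break  # 3: assignments did not change
            z = new_z
            # 2(b): recompute each mean as the average of its assigned points.
            for j in range(k):
                if np.any(z == j):
                    means[j] = X[z == j].mean(axis=0)
        # Objective F: half the sum of squared distances to assigned means.
        F = 0.5 * ((X - means[z]) ** 2).sum()
        return z, means, F

    # Toy example: 500 two-dimensional points, as in Figure 1.
    X = np.random.default_rng(1).normal(size=(500, 2))
    z, means, F = kmeans(X, k=4)
    print(F)

Because the objective can have multiple local minima (see the next subsection), one would typically run this for several seeds and keep the clustering with the smallest F.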
3.5 Coordinate Descent

k-means is a coordinate descent algorithm. First, it assigns each point to its closest mean while keeping the means fixed, thus minimizing F (from the previous section) with respect to z_{1:N}. Second, it computes the new mean of every cluster while keeping the assignments fixed, thus minimizing F with respect to m_{1:k}. Note that because k-means alternates between minimizing over these two sets of variables, and F is not jointly convex in them, the algorithm is not guaranteed to find the global minimum. It can get stuck in local minima, so it is essential to run the algorithm multiple times from different initializations.

3.6 Compressing Images

Take the application of compressing a picture. In this case, we want to replace pixels (coordinates) with a small set of color assignments (means), effectively using k-means to compress the image. The progression of the coloring of the image is shown in Figure 4 for different values of k. In this particular application, the objective function tells us how distorted the picture is compared to the original one. Notice that the picture becomes less distorted the more clusters we use.

Figure 4: Using k-means to compress an image with k = 2, 4, 8, 16, 32, 256

3.7 K-Medoids

So far we have only used Euclidean distance as a distance measure. However, when we have discrete multivariate data, data that should not be clustered into circles, or data on different scales, Euclidean distance is not appropriate. Instead, we use the k-medoids algorithm, which does not require us to compute means, only distances between data points. The k-medoids algorithm is as follows.

1. Initialization
   (a) Data is $x_{1:N}$
   (b) Pick initial cluster centers $m_{1:k}$ from among the data points
2. Repeat
   (a) Assign each data point to its closest center,
       $z_n = \arg\min_{i \in \{1,\dots,k\}} d(x_n, m_i)$
   (b) In each cluster, find the data point that is closest to the other data points in the cluster,
       $i_k = \arg\min_{n : z_n = k} \sum_{m : z_m = k} d(x_n, x_m)$
   (c) Set the new cluster centers to these data points, $m_k = x_{i_k}$
3. Until assignments $z_{1:N}$ do not change

3.8 Choosing k

This is a hard problem. An intuitive approach is to choose k so that we end up with "natural" clusters, but this is not very well defined. One heuristic is to look for a kink in the objective function, as in Figure 6. Notice that up to k = 4, each successive increase in k yields a large improvement in the objective. But after k = 4, we do not improve so drastically anymore. This suggests that k = 4 is the correct value.

4 Hierarchical Clustering

4.1 Introduction

Hierarchical clustering is widely used. It builds a tree over the data by merging similar groups of points, so visualizing this tree is a good summary of the data. Its advantage over k-means is that there is no need to pick k in advance, because it uses a measure of distance between groups of data points. To perform agglomerative clustering, we begin by placing every data point in its own cluster. Then we iteratively merge the closest groups, not necessarily individual data points but groups of already clustered points. We repeat until we have merged all the data into a single cluster. Sample iterations from this process are shown in Figure 7.

4.2 Dendrogram

Running the algorithm results in a sequence of groupings. Each level of the tree is a segmentation of the data. The algorithm is monotonic, so the similarity between merged clusters decreases with each level.
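As a rough sketch of the agglomerative procedure, and of cutting the resulting tree into flat clusters, here is a hypothetical example using SciPy's hierarchical clustering routines; the toy data, the single-linkage choice, and the cut at three clusters are assumptions for illustration, not something the lecture specifies.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    # Toy data standing in for the points clustered in Figure 7.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

    # Agglomerative clustering: start with every point in its own cluster and
    # repeatedly merge the two closest groups. Single linkage defines the
    # distance between groups as the minimum pairwise distance between members.
    Z = linkage(pdist(X), method='single')

    # Each level of the tree is a segmentation of the data; cutting the tree
    # at a chosen number of clusters recovers a flat grouping.
    labels = fcluster(Z, t=3, criterion='maxclust')
    print(np.bincount(labels)[1:])  # cluster sizes

    # scipy.cluster.hierarchy.dendrogram(Z) would draw the tree
    # (requires matplotlib).

Reading the merge distances in Z from bottom to top shows the monotonic behavior described above: each merge happens at a distance at least as large as the previous one.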

