Clustering

The Problem of Clustering
•Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as nearby as possible.

Example
[Diagram: a scatter of points x in the plane, falling visually into a few clusters.]

Applications
•E-business-related applications of clustering tend to involve very high-dimensional spaces.
•The problem looks deceptively easy in a 2-dimensional, Euclidean space.

Example: Clustering CD’s
•Intuitively, music divides into categories, and customers prefer one or a few categories.
•But who’s to say what the categories really are?
•Represent a CD by the customers who bought it.
•Similar CD’s have similar sets of customers, and vice-versa.

The Space of CD’s
•Think of a space with one dimension for each customer.
•Values are 0 or 1 only in each dimension.
•A CD’s point in this space is (x1, x2, …, xk), where xi = 1 iff the i th customer bought the CD.
•Compare with the “correlated items” matrix: rows = customers; columns = CD’s.

Distance Measures
•Two kinds of spaces:
•Euclidean: points have a location in space, and dist(x,y) = sqrt(sum of the squared differences in each dimension).
  •Some alternatives, e.g., Manhattan distance = sum of the magnitudes of the differences in each dimension.
•Non-Euclidean: there is a distance measure giving dist(x,y), but no “point location.”
  •Obeys the triangle inequality: d(x,y) ≤ d(x,z) + d(z,y).
  •Also, d(x,x) = 0; d(x,y) ≥ 0; d(x,y) = d(y,x).

Examples of Euclidean Distances
•x = (5,5); y = (9,8).
•L2-norm: dist(x,y) = sqrt(4² + 3²) = 5.
•L1-norm: dist(x,y) = 4 + 3 = 7.
[Diagram: right triangle from x to y, with horizontal leg 4, vertical leg 3, hypotenuse 5.]

Non-Euclidean Distances
•Jaccard measure for binary vectors = ratio of the intersection (of components with 1) to the union.
•Cosine measure = angle between the vectors from the origin to the points in question.

Jaccard Measure
•Example: p1 = 00111; p2 = 10011.
•Size of intersection = 2; union = 4; J.M. = 1/2.
•Need to make a distance function satisfying the triangle inequality and other laws.
•dist(p1,p2) = 1 − J.M. works.
  •dist(x,x) = 0, etc.

Cosine Measure
•Think of a point as a vector from the origin (0,0,…,0) to its location.
•Two points’ vectors make an angle, whose cosine is the normalized dot-product of the vectors.
•Example: p1 = 00111; p2 = 10011.
•p1.p2 = 2; |p1| = |p2| = sqrt(3).
•cos(θ) = 2/3.

Example
[Diagram: p1 and p2 drawn as bit-vectors in the customer space.]

Cosine-Measure Diagram
[Diagram: vectors p1 and p2 from the origin; the projection of p1 onto p2 has length p1.p2/|p2|; dist(p1, p2) = θ, the angle between the vectors.]

Methods of Clustering
•Hierarchical:
  •Initially, each point is in a cluster by itself.
  •Repeatedly combine the two “closest” clusters into one.
•Centroid-based:
  •Estimate the number of clusters and their centroids.
  •Place points into the closest cluster.

Hierarchical Clustering
•Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest?
•Euclidean case: each cluster has a centroid = average of its points.
  •Measure intercluster distances by the distances of the centroids.

Example
[Diagram: points o at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); successive merge centroids x at (1,1), (1.5,1.5), (4.5,0.5), (4.7,1.3).]

And in the Non-Euclidean Case?
•The only “locations” we can talk about are the points themselves.
•Approach 1: Pick a point from a cluster to be the clustroid = the point with minimum maximum distance to the other points.
  •Treat the clustroid as if it were the centroid when computing intercluster
distances.

Example
[Diagram: two clusters of points 1–6, with the clustroid of each marked and the intercluster distance drawn between the two clustroids.]

Other Approaches
•Approach 2: Let the intercluster distance be the minimum of the distances between any two points, one from each cluster.
•Approach 3: Pick a notion of “cohesion” of clusters, e.g., maximum distance from the clustroid.
  •Merge the clusters whose combination is most cohesive.

k-Means
•Assumes a Euclidean space.
•Starts by picking k, the number of clusters.
•Initialize the clusters by picking one point per cluster.
  •For instance, pick one point at random, then k − 1 other points, each as far away as possible from the previously picked points.

Populating Clusters
•For each point, place it in the cluster whose centroid is nearest.
•After all points are assigned, recompute the centroids of the k clusters.
•Reassign all points to their closest centroid.
  •Sometimes this moves points between clusters.

Example
[Diagram: eight points 1–8 being assigned to the two centroids x.]

How Do We Deal With Big Data?
•Random-sample approaches.
  •E.g., CURE takes a sample, gets a rough outline of the clusters in main memory, then assigns points to the closest cluster.
•BFR (Bradley-Fayyad-Reina) is a k-means variant that compresses points near the center of clusters.
  •Also compresses groups of points that are close to one another.
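The Jaccard and cosine measures above can be computed directly from the bit-vectors. The following is a minimal Python sketch (the function names are illustrative, not from the slides) that reproduces the worked numbers for p1 = 00111 and p2 = 10011.

```python
import math

def jaccard_dist(p, q):
    # Jaccard measure = |intersection| / |union| over the 1-components;
    # dist = 1 - J.M. satisfies the triangle inequality.
    inter = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(p, q) if a == 1 or b == 1)
    return 1 - inter / union

def cosine_angle(p, q):
    # Angle between the vectors from the origin; its cosine is the
    # normalized dot-product of the vectors.
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return math.acos(dot / (norm_p * norm_q))

p1 = [0, 0, 1, 1, 1]
p2 = [1, 0, 0, 1, 1]
print(jaccard_dist(p1, p2))            # 0.5, matching J.M. = 1/2
print(math.cos(cosine_angle(p1, p2)))  # ≈ 2/3, the slides' value
```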
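Approach 1's clustroid rule (the point with minimum maximum distance to the other points in its cluster) can be sketched as a one-liner over any distance function. This is an illustrative Python sketch with helper names of my own; the example cluster is made up for demonstration.

```python
import math

def euclidean(a, b):
    # L2 distance between two points given as coordinate tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def clustroid(cluster, dist):
    # The clustroid is the point whose maximum distance to the
    # other points in the cluster is smallest.
    return min(cluster,
               key=lambda p: max((dist(p, q) for q in cluster if q != p),
                                 default=0.0))

cluster = [(0, 0), (2, 0), (1, 1), (4, 0)]
print(clustroid(cluster, euclidean))  # (2, 0): its farthest neighbor is only 2 away
```

The same function works in a non-Euclidean space by passing, e.g., the Jaccard distance instead of `euclidean`.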
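The k-means steps above (initialize with one point plus the farthest remaining points, assign each point to its nearest centroid, recompute the centroids, repeat until assignments stop changing) can be sketched in Python. The helper names are mine, not the slides'; for determinism the first centroid is the first point rather than a random one. It uses the six points from the hierarchical-clustering example with k = 2.

```python
import math

def dist(a, b):
    # Euclidean (L2) distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def init_centroids(points, k):
    # Slide heuristic: first point arbitrary, then k-1 more points,
    # each as far as possible from those picked so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids

def k_means(points, k):
    centroids = init_centroids(points, k)
    assignment = None
    while True:
        # Assign each point to the cluster whose centroid is nearest.
        new = [min(range(k), key=lambda i: dist(p, centroids[i]))
               for p in points]
        if new == assignment:          # no point moved: done
            return centroids, assignment
        assignment = new
        # Recompute each centroid as the average of its points.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))

points = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
centroids, assignment = k_means(points, 2)
print(sorted(centroids))  # roughly (1, 1) and (4.67, 1.33), as in the example
```

On these points the result matches the centroids x = (1,1) and x = (4.7,1.3) shown in the hierarchical-clustering diagram.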