Clustering
Stanford CS 206

The Problem of Clustering

Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as nearby as possible.

Example

[Figure: a scatter of points in two dimensions in which several clusters are visually apparent.]

Applications

E-business-related applications of clustering tend to involve very high-dimensional spaces. The problem looks deceptively easy in a 2-dimensional, Euclidean space.

Example: Clustering CD's

Intuitively, music divides into categories, and customers prefer one or a few categories. But who's to say what the categories really are?
Represent a CD by the set of customers who bought it: similar CD's have similar sets of customers, and vice-versa.

The Space of CD's

Think of a space with one dimension for each customer, where each dimension takes only the values 0 or 1. A CD's point in this space is (x1, x2, ..., xk), where xi = 1 iff the i-th customer bought the CD.
Compare with the "correlated items" matrix: rows = customers; columns = CD's.

Distance Measures

Two kinds of spaces:
Euclidean: points have a location in space, and dist(x, y) = the square root of the sum of the squares of the differences in each dimension.
• Some alternatives exist, e.g., Manhattan distance = the sum of the magnitudes of the differences.
Non-Euclidean: there is a distance measure giving dist(x, y), but no "point location."
• It obeys the triangle inequality: d(x, y) ≤ d(x, z) + d(z, y).
• Also, d(x, x) = 0; d(x, y) > 0 for x ≠ y; and d(x, y) = d(y, x).

Examples of Euclidean Distances

Let x = (5, 5) and y = (9, 8).
L2-norm: dist(x, y) = sqrt(4^2 + 3^2) = 5.
L1-norm: dist(x, y) = 4 + 3 = 7.
[Figure: x and y as corners of a right triangle with legs 4 and 3 and hypotenuse 5.]

Non-Euclidean Distances

Jaccard measure for binary vectors = the ratio of the size of the intersection (of components with 1) to the size of the union.
Cosine measure = the angle between the vectors from the origin to the two points in question.

Jaccard Measure

Example: p1 = 00111; p2 = 10011. Size of intersection = 2; size of union = 4; so J.M. = 1/2.
We need to turn this into a distance function satisfying the triangle inequality and the other distance laws. dist(p1, p2) = 1 - J.M. works: dist(x, x) = 0, etc.

Cosine Measure

Think of a point as a vector from the origin (0, 0, ..., 0) to its location. Two points' vectors make an angle, whose cosine is the normalized dot-product of the vectors.
Example: p1 = 00111; p2 = 10011. Then p1.p2 = 2 and |p1| = |p2| = sqrt(3), so cos(θ) = 2/(sqrt(3) * sqrt(3)) = 2/3.

Cosine-Measure Diagram

[Diagram: vectors from the origin to p1 and p2; the projection of p1 onto p2 has length p1.p2/|p2|, and dist(p1, p2) = θ, the angle between the two vectors.]
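To make the distance definitions above concrete, here is a minimal Python sketch (not part of the original slides; the function names are illustrative) that computes the L2, L1, Jaccard, and cosine distances for the slides' running examples:

```python
import math

def l2_dist(x, y):
    # Euclidean (L2) distance: sqrt of the sum of squared per-dimension differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_dist(x, y):
    # Manhattan (L1) distance: sum of the magnitudes of the differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def jaccard_dist(p, q):
    # p and q are 0/1 vectors; dist = 1 - (size of intersection / size of union).
    inter = sum(1 for a, b in zip(p, q) if a and b)
    union = sum(1 for a, b in zip(p, q) if a or b)
    return 1 - inter / union

def cosine_dist(p, q):
    # The angle (in radians) between the vectors from the origin to p and q.
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return math.acos(dot / norms)

print(l2_dist((5, 5), (9, 8)))   # 5.0
print(l1_dist((5, 5), (9, 8)))   # 7
p1, p2 = (0, 0, 1, 1, 1), (1, 0, 0, 1, 1)
print(jaccard_dist(p1, p2))      # 0.5, i.e., 1 - J.M. with J.M. = 1/2
print(math.cos(cosine_dist(p1, p2)))  # 0.666..., i.e., cos(theta) = 2/3
```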
works.dist(x,x) = 0, etc.11Cosine MeasureThink of a point as a vector from the origin (0,0,…,0) to its location.Two points’ vectors make an angle, whose cosine is the normalized dot-product of the vectors.Example p1 = 00111; p2 = 10011.p1.p2 = 2; |p1| = |p2| = sqrt(3).cos( ) = 2/3.12Example010011 11010110000111013Cosine-Measure Diagramp1p2p1.p2|p2|dist(p1, p2) = 14Methods of ClusteringHierarchical:Initially, each point in cluster by itself.Repeatedly combine the two “closest” clusters into one.Centroid-based:Estimate number of clusters and their centroids.Place points into closest cluster.15Hierarchical ClusteringKey problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest?Euclidean case: each cluster has a centroid = average of its points.Measure intercluster distances by distances of centroids.16Example (5,3)o (1,2)oo (2,1) o (4,1)o (0,0) o (5,0)x (1.5,1.5)x (4.5,0.5)x (1,1)x (4.7,1.3)17And in the Non-Euclidean Case?The only “locations” we can talk about are the points themselves.Approach 1: Pick a point from a cluster to be the clustroid = point with minimum maximum distance to other points.Treat clustroid as if it were centroid, when computing intercluster distances.18Example1 23456interclusterdistanceclustroidclustroid19Other ApproachesApproach 2: let the intercluster distance be the minimum of the distances between any two pairs of points, one from each cluster.Approach 3: Pick a notion of “cohesion” of clusters, e.g., maximum distance from the clustroid.Merge clusters whose combination is most cohesive.20k-MeansAssumes Euclidean space.Starts by picking k, the number of clusters.Initialize clusters by picking one point per cluster.For instance, pick one point at random, then k -1 other points, each as far away as possible from the previous points.21Populating ClustersFor each point, place it in the cluster whose centroid it is nearest.After all points are assigned, fix the centroids of the k clusters.Reassign all points to their closest centroid.Sometimes moves points between clusters.22Example1234567 8xx23How Do We Deal With Big Data?Random-sample approaches.E.g., CURE takes a sample, gets a rough outline of the clusters in main memory, then assigns points to the closest cluster.BFR (Bradley-Fayyad-Reina) is a k-means variant that compresses points near the center of clusters.Also compresses groups of

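For the non-Euclidean case, here is a sketch of Approach 1's clustroid (again illustrative, reusing the Jaccard distance from the earlier sketch): choose the point whose maximum distance to the other points of the cluster is smallest.

```python
def jaccard_dist(p, q):
    # 1 - (size of intersection / size of union) for 0/1 vectors.
    inter = sum(1 for a, b in zip(p, q) if a and b)
    union = sum(1 for a, b in zip(p, q) if a or b)
    return 1 - inter / union

def clustroid(cluster, dist):
    # Clustroid = the point with minimum maximum distance to the other points.
    return min(cluster,
               key=lambda p: max((dist(p, q) for q in cluster if q != p), default=0))

cluster = [(0, 0, 1, 1, 1), (1, 0, 0, 1, 1), (1, 1, 0, 0, 1)]
print(clustroid(cluster, jaccard_dist))  # (1, 0, 0, 1, 1)
```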

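Finally, a compact k-means sketch following the two k-means slides above: farthest-point initialization, then repeated assign-and-recompute rounds. The convergence test and tie-breaking are illustrative details the slides leave unspecified.

```python
import math

def init_centroids(points, k):
    # Pick one point (the slides suggest picking it at random; here, the first),
    # then k - 1 more, each as far as possible from those chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def k_means(points, k):
    dim = len(points[0])
    centroids = init_centroids(points, k)
    while True:
        # Place each point in the cluster whose centroid is nearest.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Fix the centroids of the k clusters; keep an empty cluster's old centroid.
        new_centroids = [
            tuple(sum(p[d] for p in c) / len(c) for d in range(dim)) if c
            else centroids[i]
            for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # no centroid moved, so assignments are stable
            return clusters
        centroids = new_centroids

pts = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
print(k_means(pts, 2))
```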