Chapter 9: Data Clustering

In this chapter, we examine the problem of data clustering. Consider the two data sets in Figure 9.1. In the first graph, we see data that have no class labels, but there seems to be a natural separation of the data into three clumps. A natural clustering algorithm would produce three clusters, and this would be an unsupervised task. On the other hand, the data may include class labels, as seen in the second graph (Class 1 is triangles, Class 2 is asterisks), and in this supervised learning task, we would want to find a function that would tell us which class is assigned to each data point.

In this chapter, we review the main clustering algorithms currently in use. We will see that they share many common characteristics, including the assumption that the data set under study can be accurately clustered using spatial information alone. This leads us to consider a clustering algorithm designed specifically for attracting sets of dynamical systems, and culminates in the development of a space-time clustering algorithm.

9.1 Background

We will first look at the unsupervised clustering task. In this case, the input to the algorithm is a data set, and the desired output is the number of clusters used and the membership function that maps each data point to its corresponding cluster index.

Definition: Unsupervised Clustering. In the unsupervised task, we are given a data set X whose elements are vectors x^{(i)} ∈ ℝ^n. We want a membership function whose domain is ℝ^n and whose output is the cluster index (or label). Defining this function as m, we have

    m(x^{(i)}) = g_i,

where g_i is the integer class label for the data point, typically one of the integers from 1 to k, where k may be specified in advance or output by the algorithm.

Figure 9.1: In the first graph, we see data that have no class labels, but there seems to be a natural separation of the data into three clumps. An ideal clustering algorithm would produce the three clusters, and this would be an unsupervised task. On the other hand, the data may include class labels, as seen in the second graph (Class 1 is triangles, Class 2 is asterisks), and in this supervised learning task, we would want to find a function that would tell us which class is assigned to each data point.

One of the things that makes this problem so interesting is that it is very ill-posed: there are an infinite number of ways of building a membership function. For two extreme examples, consider these two membership functions:

    m_1(x^{(i)}) = 1,

where i is the index for the data points. Another function that is about as useful as m_1 is the following:

    m_2(x^{(i)}) = i.

In the first case, we have only one class, and in the second case, we have as many classes as there are data points! In order to get a useful algorithm, we will need to define some kind of error function that we can then minimize. The easiest way to do this is through Voronoi cells.

Definition: Let {c^{(i)}}_{i=1}^{k} be points in ℝ^n. These points form k Voronoi cells, where the j-th cell is defined as the set of points that are closer to center c^{(j)} than to any other center:

    V_j = { x ∈ ℝ^n : ‖x − c^{(j)}‖ ≤ ‖x − c^{(i)}‖, i = 1, 2, ..., k }.

The points {c^{(i)}}_{i=1}^{k} are called cluster centers. In the uncommon occurrence that a point x lies on the border between cells, it is customary to include it in the cell whose index is smaller (although one could fashion the decision on the problem at hand).
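Assuming the tie-breaking convention above, a membership function of this type takes only a few lines of Matlab. The following is a minimal sketch (the function name nearestCenter and all variable names are ours, not from the text): it returns the index of the Voronoi cell containing a point x, given the cluster centers stored as the rows of a matrix C.

    % Return the index j of the Voronoi cell containing the point x.
    % x is a 1-by-n row vector; C is k-by-n, one cluster center per row.
    % The elementwise subtraction relies on implicit expansion (R2016b+).
    function j = nearestCenter(x, C)
        d = sum((C - x).^2, 2);   % squared distance from x to each center
        [~, j] = min(d);          % min returns the first (smallest) index
    end                           % on a tie, matching the convention above

Note that comparing squared distances gives the same answer as comparing distances, so the square root can be skipped.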
The reader might note that a Voronoi cell has a piecewise linear border.

Examples in "Nature"

1. In [?], Voronoi cells are used to define the "area potentially available around a tree". That is, each tree in a stand represents the center of a cell.

2. Using a map of the campus with the emergency telephone boxes marked out and used as the centers, we could always tell where the closest phone is located.

We can draw a Voronoi diagram by hand: between neighboring centers, draw a line (that will be erased at the end), then draw the perpendicular bisector of each line drawn. See Figure 9.2. This can get complicated fairly quickly, so we will be using Matlab to produce the plots. The algorithms that produce these plots are very interesting (see, for example, [?]) but are beyond the scope of our text.

Figure 9.2: Sample Voronoi diagram sketched by hand. First, draw a line between neighboring centers (dotted in the figure). These are guides to drawing the actual borders, shown in solid black, which are the perpendicular bisectors of the dotted lines.

Matlab Example: Matlab has the plotting algorithm built in. For example, Figure 9.3 shows the output of the following (yours will be slightly different, due to using random cluster centers):

    X=randn(10,2);
    voronoi(X(:,1),X(:,2));

We can also have Matlab return the vertices of the Voronoi cells, so that we can plot them manually:

    [vx,vy]=voronoi(X(:,1),X(:,2));
    plot(vx,vy,'k-',X(:,1),X(:,2),'r*');

Figure 9.3: The output of our Matlab example, running the voronoi algorithm on 10 random points in ℝ².

In the algorithms that we work with, it will be convenient to have a function that identifies the points within a given cluster; the characteristic function is typical for that purpose, and is defined as 0 or 1, depending on whether or not the data point belongs to the cluster:

    χ_i(x) = 1 if m(x) = i, and 0 otherwise.

One general way to measure the goodness of a clustering algorithm is to use a measure called the distortion error for a given cluster i, together with the total distortion error:

    E_i = (1/N_i) Σ_{k=1}^{N} ‖x^{(k)} − c^{(i)}‖² χ_i(x^{(k)}),        E_total = Σ_{j=1}^{p} E_j,

where N_i is the number of points in cluster i, and N is the total number of points overall. You might notice that the values being summed are only nonzero for those data points x^{(k)} that are in cluster i, so E_i is the mean squared distance from the points of cluster i to its center. The total error is simply the sum of the individual "errors", or distortions. (Some authors do not square the norms, but squaring is more convenient for the theory, and does not change the end results.)

9.2 The LBG Algorithm

The LBG (Linde-Buzo-Gray, see [22]) algorithm is perhaps the fastest of this chapter, and the easiest to visualize. The membership function for the LBG algorithm is defined so that we are in cluster i if we are in the Voronoi cell defined by the i-th cluster center.
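Before going further, it may help to see how the pieces defined so far fit together in code. The following is a minimal Matlab sketch (the function name lbgPass and all variable names are ours, not from the text): it assigns each point to its Voronoi cell, computes the per-cluster and total distortion errors from the formulas above, and then moves each center to the mean of its cluster, a Lloyd-style update that we offer only as an illustration, not as the text's full LBG procedure.

    % One LBG-style pass over the data: assign each point to its Voronoi
    % cell, compute the distortion errors, and update the centers.
    % X is N-by-n (one data point per row); C is p-by-n (one center per row).
    function [labels, E, Etotal, C] = lbgPass(X, C)
        N = size(X, 1);
        p = size(C, 1);
        labels = zeros(N, 1);
        for k = 1:N
            d = sum((C - X(k,:)).^2, 2);   % squared distance to each center
            [~, labels(k)] = min(d);       % nearest center = Voronoi cell
        end
        E = zeros(p, 1);
        for i = 1:p
            members = (labels == i);       % plays the role of chi_i
            Ni = sum(members);
            if Ni > 0                      % leave empty clusters unchanged
                E(i) = sum(sum((X(members,:) - C(i,:)).^2, 2)) / Ni;
                C(i,:) = mean(X(members,:), 1);  % mean (Lloyd-style) update
            end
        end
        Etotal = sum(E);
    end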


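Continuing the sketch above, a single pass on random data might look like the following (again, all names are illustrative):

    X = randn(100, 2);               % 100 sample points in the plane
    C = X(randperm(100, 3), :);      % seed 3 centers at random data points
    [labels, E, Etotal, C] = lbgPass(X, C);
    Etotal                           % total distortion for this clustering

Repeating such passes until the total distortion stops decreasing is the basic iteration behind this family of clustering algorithms.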