Spectral Clustering with Perturbed Data

Home> Academic Documents> Spectral Clustering with Perturbed Data

DOC PREVIEW

This preview shows page 1-2-3-4 out of 12 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Spectral Clustering with Perturbed DataLing HuangIntel [email protected] YanUC [email protected] I. JordanUC [email protected] TaftIntel [email protected] clustering is useful for a wide-ranging set of applications in areas such asbiological data analysis, image processing and data mining. However, the com-putational and/or communication resources required by the method in processinglarge-scale data are often prohibitively high, and practitioners are often required toperturb the original data in various ways (quantization, downsampling, etc) beforeinvoking a spectral algorithm. In this paper, we use stochastic perturbation theoryto study the effects of data perturbation on the performance of spectral clustering.We show that the error under perturbation of spectral clustering is closely relatedto the perturbation of the eigenvectors of the Laplacian matrix. From this resultwe derive approximate upper bounds on the clustering error. We show that thisbound is tight empirically across a wide range of problems, suggesting that it canbe used in practical settings to determine the amount of data reduction allowed inorder to meet a specification of permitted loss in clustering performance.1 IntroductionA critical problem in machine learning is that of scaling: Algorithms should be effective compu-tationally and statistically as various dimensions of a problem are scaled. One general tool forapproaching large-scale problems is that of clustering or partitioning, in essence an appeal to theprinciple of divide-and-conquer. However, while the output of a clustering algorithm may yield aset of smaller-scale problems that may be easier to tackle, clustering algorithms can themselves becomplex, and large-scale clustering often requires the kinds of preprocessing steps that are invokedfor other machine learning algorithms [1], including proto-clustering steps such as quantization,downsampling and compression. Such preprocessing steps also arise in the distributed sensing anddistributed computing setting, where communication and storage limitations may preclude transmit-ting the original data to centralized processors.A number of recent works have begun to tackle the issue of determining the tradeoffs that ariseunder various “perturbations” of data, including quantization and downsampling [2, 3, 4]. Most ofthese analyses have been undertaken in the context of well-studied domains such as classification,regression and density estimation, for which there are existing statistical analyses of the effect ofnoise on performance. Although extrinsic noise differs conceptually from perturbations to dataimposed by a data analyst to cope with resource limitations, the mathematical issues arising in thetwo cases are similar and the analyses of noise have provided a basis for the study of the tradeoffsarising from perturbations.In this paper we focus on spectral clustering, a class of clustering methods that are based on eigen-decompositions of affinity, dissimilarity or kernel matrices [5, 6, 7, 8]. These algorithms often out-perform traditional clustering algorithms such as the K-means algorithm or hierarchical clustering.To date, however, their impact on real-world, large-scale problems has been limited; in particular,a distributed or “in-network” version of spectral clustering has not yet appeared. Moreover, therehas been little work on the statistical analysis of spectral clustering, and thus there is little theory toguide the design of distributed algorithms. There is an existing literature on numerical techniques for1Procedure SpectralClustering (x1, . . . , xn)Input: n data samples {xi}ni=1, xi∈ RdOutput: Bipartition S and¯S of the input data1. Compute the similarity matrix K:Kij= exp“−kxi−xjk22σ2k”, ∀xi, xj2. Compute the diagonal degree matrix D:Di=Pnj=1Kij3. Compute the normalized Laplacian matrix:L = I − D−1K4. Find the second eigenvector v2of L5. Obtain the two partitions using v2:6. S = {[i] : v2i> 0},¯S = {[i] : v2i≤ 0}Figure 1: A spectral bipartitioning algorithm.????6666Laplacianmatrix errorSimilaritymatrix errorEqn. (7)− (13)Lemma 2 &Eqn. (5), (6)Lemma 3 or 4Proposition 1Assumption AσdKdLk˜v2− v2k2ηEigen errorData errorMis-clusteringratePerturbation analysisError propagationFigure 2: Perturbation analysis: from clusteringerror to data perturbation error.scaling spectral clustering (including downsampling [9, 10] and the relaxation of precision require-ments for the eigenvector computation [7]), but this literature does not provide end-to-end, practicalbounds on error rates as a function of data perturbations.In this paper we present the first end-to-end analysis of the effect of data perturbations on spectralclustering. Our focus is quantization, but our analysis is general and can be used to treat other kindsof data perturbation. Indeed, given that our approach is based on treating perturbations as randomvariables, we believe that our methods will also prove useful in developing statistical analyses ofspectral clustering (although that is not our focus in this paper).The paper is organized as follows. In Section 2, we provide a brief introduction to spectral clustering.Section 3 contains the main results of the paper; specifically we introduce the mis-clustering rateη, and present upper bounds on η due to data perturbations. In Section 4, we present an empiricalevaluation of our analyses. Finally, in Section 5 we present our conclusions.2 Spectral clustering and data perturbation2.1 Background on spectral clustering algorithmsGiven a set of data points {xi}ni=1, xi∈ R1×dand some notion of similarity between all pairs of datapoints xiand xj, spectral clustering attempts to divide the data points into groups such that points inthe same group are similar and points in different groups are dissimilar. The point of departure of aspectral clustering algorithm is a weighted similarity graph G(V, E), where the vertices correspondto data points and the weights correspond to the pairwise similarities. Based on this weighted graph,spectral clustering algorithms form the graph Laplacian and compute an eigendecomposition of thisLaplacian [5, 6, 7]. While some algorithms use multiple eigenvectors and find a k-way clusteringdirectly, the most widely studied algorithms form a bipartitioning of the data by thresholding thesecond eigenvector of the Laplacian (the eigenvector with the


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-4 out of 12 pages.

Please select your school