Princeton COS 598B - Linear Time Algorithms

Linear Time Algorithms for Clustering Problems in Any Dimensions

Amit Kumar¹, Yogish Sabharwal², and Sandeep Sen³

¹ Dept of Comp Sc & Engg, Indian Institute of Technology, New Delhi-110016, [email protected]
² IBM India Research Lab, Block-I, IIT Delhi, Hauz Khas, New Delhi-110016, [email protected]
³ Dept of Comp Sc & Engg, Indian Institute of Technology, Kharagpur, [email protected]

Abstract. We generalize the k-means algorithm presented by the authors [14] and show that the resulting algorithm can solve a larger class of clustering problems that satisfy certain properties (existence of a random sampling procedure and tightness). We prove these properties for the k-median and the discrete k-means clustering problems, resulting in O(2^{(k/ε)^{O(1)}} dn) time (1 + ε)-approximation algorithms for these problems. These are the first algorithms for these problems that are linear in the size of the input (nd for n points in d dimensions) and independent of the dimension in the exponent, assuming k and ε are fixed. A key ingredient of the k-median result is a (1 + ε)-approximation algorithm for the 1-median problem which has running time O(2^{(1/ε)^{O(1)}} d). The previous best known algorithm for this problem had linear running time.

1 Introduction

The problem of clustering a group of data items into similar groups is one of the most widely studied problems in computer science. Clustering has applications in a variety of areas, for example, data mining, information retrieval, image processing, and web search ([5, 7, 16, 9]). Given the wide range of applications, many different definitions of clustering exist in the literature ([8, 4]). Most of these definitions begin by defining a notion of distance (similarity) between two data items and then try to form clusters so that data items with small distance between them get clustered together.

Often, clustering problems arise in a geometric setting, i.e., the data items are points in a high-dimensional Euclidean space.
In such settings, it is natural to define the distance between two points as the Euclidean distance between them. Two of the most popular definitions of clustering are the k-means clustering problem and the k-median clustering problem. Given a set of points P, the k-means clustering problem seeks a set K of k centers such that ∑_{p∈P} d(p, K)² is minimized, whereas the k-median clustering problem seeks a set K of k centers such that ∑_{p∈P} d(p, K) is minimized. Note that the points in K can be arbitrary points in the Euclidean space. Here d(p, K) refers to the distance between p and the closest center in K. We can think of this as each point in P being assigned to its closest center; the points assigned to the same center form a cluster. These problems are NP-hard even for k = 2 (when the dimension is not fixed). Interestingly, the center in the optimal solution to the 1-mean problem is the same as the center of mass of the points. However, in the case of the 1-median problem, also known as the Fermat-Weber problem, no such closed form is known. We show that despite the lack of such a closed form, we can obtain an approximation to the optimal 1-median in O(1) time (independent of the number of points). There exist variations of these clustering problems, for example, the discrete versions, where the centers that we seek are constrained to lie in the input set of points.

1.1 Related work

A lot of research has been devoted to solving these problems exactly (see [11] and the references therein). Even the best known algorithms for the k-median and the k-means problem take at least Ω(nd) time. Recently, some work has been devoted to finding (1 + ε)-approximation algorithms for these problems, where ε can be an arbitrarily small constant. This has led to algorithms with much improved running times. Further, if we look at the applications of these problems, they often involve mapping subjective features to points in the Euclidean space.
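To make the two objectives above concrete, the following is a minimal sketch (not from the paper) that evaluates both costs for a given candidate center set K; the point set and centers are made-up illustrative data:

```python
import math

def cost(points, centers, power):
    """Clustering cost: sum over all points of d(p, K)^power, where
    d(p, K) is the Euclidean distance from p to its nearest center.
    power=2 gives the k-means objective, power=1 the k-median objective."""
    total = 0.0
    for p in points:
        nearest = min(math.dist(p, c) for c in centers)
        total += nearest ** power
    return total

# Two well-separated groups of points and one center near each group.
P = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
K = [(0.5, 0.0), (10.5, 0.0)]

kmeans_cost = cost(P, K, 2)   # each point is 0.5 away: 4 * 0.25 = 1.0
kmedian_cost = cost(P, K, 1)  # 4 * 0.5 = 2.0
```

Note that the hard part of the problems is not evaluating this cost but choosing K; both problems are NP-hard for k = 2 in high dimensions, as stated above.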
Since there is an error inherent in this mapping, finding a (1 + ε)-approximate solution does not lead to a deterioration in the solution for the actual application.

The following table summarizes the recent results for these problems in the context of (1 + ε)-approximation algorithms. Some of these algorithms are randomized, with the expected running time holding for any input.

  Problem            Result                                                 Reference
  1-median           O(n/ε²)                                                Indyk [12]
  k-median           O(n^{O(1/ε)+1}) for d = 2                              Arora [1]
                     O(n + ϱ k^{O(1)} log^{O(1)} n) (discrete also),        Har-Peled et al. [10]
                       where ϱ = exp[O((1 + log(1/ε))/ε)^{d−1}]
  discrete k-median  O(ϱ n log n log k)                                     Kolliopoulos et al. [13]
  k-means            O(n/ε^d) for k = 2                                     Inaba et al. [11]
                     O(n ε^{−2k²d} log^k n)                                 Matousek [15]
                     O(g(k, ε) n log^k n),                                  de la Vega et al. [6]
                       where g(k, ε) = exp[(k³/ε⁸)(ln(k/ε)) ln k]
                     O(n + k^{k+2} ε^{−(2d+1)k} log^{k+1} n log^k(1/ε))     Har-Peled et al. [10]
                     O(2^{(k/ε)^{O(1)}} dn)                                 Kumar et al. [14]

1.2 Our contributions

In this paper, we generalize the algorithm of the authors [14] to a wide range of clustering problems. We define a general class of clustering problems and show that if certain conditions are satisfied, we can get linear time (1 + ε)-approximation algorithms for these problems. We then use our general framework to get the following results. Given a set of n points P in ℝ^d, we present

1. a randomized algorithm that generates a candidate center set of size O(2^{(1/ε)^{O(1)}}), such that at least one of the points in this set is a (1 + ε)-approximate 1-median of P with constant probability. The running time of the algorithm is O(2^{(1/ε)^{O(1)}} d), assuming that the points are stored in a suitable data structure such that a point can be randomly sampled in constant time. This improves on the algorithm of Badoiu et al. [3], which generates a candidate center set of size O(2^{1/ε⁴} log n) in time O(d 2^{1/ε⁴} log n).
2. a randomized (1 + ε)-approximation algorithm for the 1-median problem which runs in time O(2^{(1/ε)^{O(1)}} d), assuming that the points are stored in a suitable data structure such that a point can be randomly sampled in constant time.
3. a randomized (1 + ε)-approximation algorithm for the k-median problem which runs in O(2^{(k/ε)^{O(1)}} nd) time.
4. a randomized (1 + ε)-approximation algorithm for the discrete k-means clustering problem which runs in O(2^{(k/ε)^{O(1)}} nd) time.

All our algorithms yield the desired result with constant probability (which can be made as close to 1 as we wish by a constant number of repetitions). As mentioned earlier, we generalize the result of the authors in [14] to solve a larger class of clustering problems.
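The two ingredients named above, random sampling of points and boosting a constant success probability by repetition, can be illustrated with a toy sketch. This is emphatically not the authors' algorithm: `sample_candidate` here is a hypothetical stand-in that just picks the best input point from a small uniform sample, while the paper's procedure is far more involved. It only shows the shape of "run a constant-probability procedure a constant number of times and keep the best result":

```python
import math
import random

def one_median_cost(points, center):
    """1-median (Fermat-Weber) objective: sum of distances to the center."""
    return sum(math.dist(p, center) for p in points)

def sample_candidate(points, sample_size=20):
    """Hypothetical stand-in for a randomized candidate generator:
    return the best input point among a small uniform sample.
    Succeeds only with some probability on large inputs."""
    sample = random.sample(points, min(sample_size, len(points)))
    return min(sample, key=lambda c: one_median_cost(points, c))

def boosted(points, repetitions=5):
    """Repeat the constant-success-probability procedure a constant
    number of times and keep the cheapest candidate; the failure
    probability decays exponentially in the number of repetitions."""
    candidates = [sample_candidate(points) for _ in range(repetitions)]
    return min(candidates, key=lambda c: one_median_cost(points, c))
```

Because the repetition count is a constant depending only on the desired success probability, this boosting step does not affect the asymptotic running times listed in the contributions above.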

