A Simple Linear Time (1 + ε)-Approximation Algorithm for k-Means Clustering in Any Dimensions

Amit Kumar
Dept. of Computer Science & Engg., IIT Delhi, New Delhi-110016, India
[email protected]

Yogish Sabharwal
IBM India Research Lab, Block-I, IIT Delhi, New Delhi-110016, India
[email protected]

Sandeep Sen [1]
Dept. of Computer Science & Engg., IIT Delhi, New Delhi-110016, India
[email protected]

[1] Author's present address: Dept. of Computer Science and Engineering, IIT Kharagpur 721302.

Abstract

We present the first linear time (1 + ε)-approximation algorithm for the k-means problem for fixed k and ε. Our algorithm runs in O(nd) time, which is linear in the size of the input. Another feature of our algorithm is its simplicity: the only technique involved is random sampling.

1. Introduction

The problem of clustering a group of data items into similar groups is one of the most widely studied problems in computer science. Clustering has applications in a variety of areas, for example, data mining, information retrieval, image processing, and web search ([5, 7, 14, 9]). Given the wide range of applications, many different definitions of clustering exist in the literature ([8, 4]). Most of these definitions begin by defining a notion of distance between two data items and then try to form clusters so that data items with small distance between them get clustered together.

Often, clustering problems arise in a geometric setting, i.e., the data items are points in a high dimensional Euclidean space. In such settings, it is natural to define the distance between two points as the Euclidean distance between them. One of the most popular definitions of clustering is the k-means clustering problem. Given a set of points P, the k-means clustering problem seeks to find a set K of k centers such that

$\sum_{p \in P} d(p, K)^2$

is minimized. Note that the points in K can be arbitrary points in the Euclidean space. Here d(p, K) refers to the distance between p and the closest center in K. We can think of this as each point in P being assigned to the closest center in K. The points that get assigned to the same center form a cluster. The k-means problem is NP-hard even for k = 2. Another popular definition of clustering is the k-median problem. This is defined in the same manner as the k-means problem except that the objective function is $\sum_{p \in P} d(p, K)$. Observe that the distance measure used in the definition of the k-means problem is not a metric. This might lead one to believe that solving the k-means problem is more difficult than the k-median problem. However, in this paper, we give strong evidence that this may not be the case.

A lot of research has been devoted to solving the k-means problem exactly (see [11] and the references therein). Even the best known exact algorithms for this problem take at least $\Omega(n^d)$ time. Recently, some work has been devoted to finding (1 + ε)-approximation algorithms for the k-means problem, where ε can be an arbitrarily small constant. This has led to algorithms with much improved running times. Further, if we look at the applications of the k-means problem, they often involve mapping subjective features to points in Euclidean space. Since there is an error inherent in this mapping, finding a (1 + ε)-approximate solution does not lead to a deterioration in the solution for the actual application.

In this paper, we give the first truly linear time (1 + ε)-approximation algorithm for the k-means problem. Treating k and ε as constants, our algorithm runs in O(nd) time, which is linear in the size of the input. Another feature of our algorithm is its simplicity: the only technique involved is random sampling.
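To make the two objectives above concrete, here is a minimal sketch of how the k-means cost $\sum_{p \in P} d(p, K)^2$ and the k-median cost $\sum_{p \in P} d(p, K)$ of a candidate center set K can be evaluated. The paper contains no code; the function names and the use of Python are ours, and d(p, K) is computed by brute force over the centers.

```python
import math

def dist(p, q):
    # Euclidean distance between two points given as equal-length tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans_cost(P, K):
    # Sum over p in P of d(p, K)^2, where d(p, K) is the distance from p
    # to its closest center in K (each point is assigned to that center).
    return sum(min(dist(p, c) for c in K) ** 2 for p in P)

def kmedian_cost(P, K):
    # Same assignment rule, but the objective sums plain distances;
    # this is the k-median objective described above.
    return sum(min(dist(p, c) for c in K) for p in P)

# Example: two well-separated pairs of points and two candidate centers.
P = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
K = [(0.5, 0.0), (10.5, 0.0)]
print(kmeans_cost(P, K))   # each point is at distance 0.5: 4 * 0.25 = 1.0
print(kmedian_cost(P, K))  # 4 * 0.5 = 2.0
```

Note the contrast the text draws: squaring the distances is what makes the k-means measure violate the triangle inequality, so d(p, K)^2 is not a metric.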
1.1. Related work

The fastest exact algorithm for the k-means clustering problem was proposed by Inaba et al. [11]. They observed that the number of Voronoi partitions of k points in $\Re^d$ is $O(n^{kd})$, and so the optimal k-means clustering can be determined exactly in time $O(n^{kd+1})$. They also proposed a randomized (1 + ε)-approximation algorithm for the 2-means clustering problem with running time $O(n/ε^d)$.

Matousek [13] proposed a deterministic (1 + ε)-approximation algorithm for the k-means problem with running time $O(n ε^{-2k^2 d} \log^k n)$. Badoiu et al. [3] proposed a (1 + ε)-approximation algorithm for the k-median clustering problem with running time $O(2^{(k/ε)^{O(1)}} d^{O(1)} n \log^{O(k)} n)$. Their algorithm can be extended to get a (1 + ε)-approximation algorithm for the k-means clustering problem with a similar running time. de la Vega et al. [6] proposed a (1 + ε)-approximation algorithm for the k-means problem which works well for points in high dimensions. The running time of this algorithm is $O(g(k, ε) \, n \log^k n)$, where $g(k, ε) = \exp[(k^3/ε^8)(\ln(k/ε)) \ln k]$.

Recently, Har-Peled et al. [10] proposed a (1 + ε)-approximation algorithm for k-means clustering whose running time is $O(n + k^{k+2} ε^{-(2d+1)k} \log^{k+1} n \log^k \frac{1}{ε})$. Their algorithm is also fairly complicated and relies on several results in computational geometry that depend exponentially on the number of dimensions, so it is suitable for low dimensions only.

There exist other definitions of clustering, for example, k-median clustering, where the objective is to minimize the sum of the distances to the nearest center, and k-center clustering, where the objective is to minimize the maximum distance (see [1, 2, 3, 10, 12] and the references therein).

1.2. Our contributions

We present a linear time (1 + ε)-approximation algorithm for the k-means problem. Treating k and ε as constants, the running time of our algorithm improves on the previously known algorithms for this problem. However, the algorithm due to Har-Peled and Mazumdar [10] deserves careful comparison. Note that their algorithm, though linear in n, is not linear in the input size of the problem, which is dn (for n points in d dimensions). Therefore, their algorithm is better only for low dimensions; for d = Ω(log n), our algorithm is much faster. Even the use of the Johnson-Lindenstrauss lemma would not make the running times comparable, as it has its own overheads. Many recent algorithms rely on techniques like exponential grids or scaling that have high overheads. For instance, normalizing with respect to the minimum distance between points may incur an extra Ω(n) cost per point, depending on the computational model. In [3], the authors used rounding techniques based on approximations of the optimal k-center value without specifying the cost incurred.
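As a hedged illustration of the random-sampling technique the abstract highlights, the sketch below rests on the sampling lemma of Inaba et al. [11]: the centroid of a uniform random sample of m points from P is, in expectation, a (1 + 1/m)-approximation to the optimal 1-means cost. Everything here (the names, the retry loop, the choice of Python) is our own illustration of that lemma, not the paper's k-means algorithm, whose details lie beyond this excerpt.

```python
import math
import random

def centroid(S):
    # Coordinate-wise mean of a non-empty list of equal-length tuples.
    n, d = len(S), len(S[0])
    return tuple(sum(p[i] for p in S) / n for i in range(d))

def one_mean_cost(P, c):
    # Sum of squared Euclidean distances from every point of P to c.
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in P)

def sampled_one_mean(P, eps, trials=10):
    # By the Inaba et al. lemma, a uniform sample of size m = ceil(1/eps)
    # has E[cost of its centroid] <= (1 + 1/m) * OPT <= (1 + eps) * OPT,
    # so keeping the best of a few independent trials is near-optimal
    # with good probability.
    m = max(1, math.ceil(1.0 / eps))
    best_cost, best_center = float("inf"), None
    for _ in range(trials):
        S = [random.choice(P) for _ in range(m)]  # uniform, with replacement
        c = centroid(S)
        cost = one_mean_cost(P, c)
        if cost < best_cost:
            best_cost, best_center = cost, c
    return best_center
```

Each trial costs O(nd) to evaluate, so for fixed ε the whole procedure is linear in the input size, which conveys the flavor of the O(nd) bound claimed above.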

