Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
(preliminary version)

Piotr Indyk*    Rajeev Motwani†
Department of Computer Science
Stanford University
Stanford, CA 94305
{indyk,[email protected]

30, 1999

Abstract

The nearest neighbor problem is the following: Given a set of n points P = {p1, ..., pn} in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q ∈ X. We focus on the particularly interesting case of the d-dimensional Euclidean space where X = R^d under some l_p norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: Find a point p ∈ P that is an ε-approximate nearest neighbor of the query q, in that for all p' ∈ P, d(p, q) ≤ (1 + ε) d(p', q).

We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a truly sublinear query time; and (b) query time polynomial in log n and d, and only a mildly exponential preprocessing cost Õ(n) × O(1/ε)^d. Further, applying a classical geometric lemma on random projections (for which we give a simpler proof), we obtain the first known algorithm with polynomial preprocessing and query time polynomial in d and log n. Unfortunately, for small ε, the latter is a purely theoretical result since the exponent depends on 1/ε. Experimental results indicate that our first algorithm offers orders of magnitude improvement on running times over real data sets.
Its key ingredient is the notion of locality-sensitive hashing, which may be of independent interest; here, we give applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering algorithms.

* Supported by a Stanford Graduate Fellowship and NSF Award CCR-9357849.
† Supported by a Sloan Fellowship, an IBM Faculty Partnership Award, an ARO MURI Grant DAAH04-96-1-0007, and NSF Young Investigator Award CCR-9357849.

1 Introduction

The nearest neighbor search (NNS) problem is: Given a set of n points P = {p1, ..., pn} in a metric space X with distance function d, preprocess P so as to efficiently answer queries for finding the point in P closest to a query point q ∈ X. We focus on the particularly interesting case of the d-dimensional Euclidean space where X = R^d under some l_p norm. The low-dimensional case is well-solved [27], so the main issue is that of dealing with the "curse of dimensionality" [17]. The problem was originally posed in the 1960s by Minsky and Papert [54, pp. 222-225], and despite decades of effort the current solutions are far from satisfactory. In fact, for large d, in theory or in practice, they provide little improvement over a brute-force algorithm which compares a query q to each p ∈ P. The known algorithms are of two types: (a) low preprocessing cost but query time linear in n and d; and (b) query time sublinear in n and polynomial in d, but with severely exponential preprocessing cost n^d.
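The brute-force baseline mentioned above can be made concrete with a short sketch. This is an illustrative implementation only (function and variable names are our own, not the paper's); it shows the O(nd) per-query cost that exact algorithms struggle to beat in high dimensions:

```python
import math

def nearest_neighbor(P, q):
    """Brute-force exact nearest neighbor under the Euclidean (l2) norm.

    Scans all n points and computes a d-dimensional distance for each,
    so each query costs O(n * d) time -- the baseline the text refers to.
    """
    best, best_dist = None, math.inf
    for p in P:
        dist = math.dist(p, q)  # Euclidean distance between p and q
        if dist < best_dist:
            best, best_dist = p, dist
    return best, best_dist
```

For large d, the data structures discussed in the paper aim to answer such queries without touching every point.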
This unfortunate situation carries over to average-case analysis, and even to the ε-approximate nearest neighbors (ε-NNS) problem: Find a point p ∈ P that is an ε-approximate nearest neighbor of the query q, in that for all p' ∈ P, d(p, q) ≤ (1 + ε) d(p', q).

We present two algorithms for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a truly sublinear query time; and (b) query time polynomial in log n and d, and only a mildly exponential preprocessing cost Õ(n) × O(1/ε)^d. Further, by applying a classical geometric lemma on random projections (for which we give a simpler proof), we obtain the first known algorithm with polynomial preprocessing and query time polynomial in d and log n. Unfortunately, for small ε, this is a purely theoretical result as the exponent depends on 1/ε. Experimental results [37] indicate that the first algorithm offers orders of magnitude improvement on running times over real data sets. Its key ingredient is the notion of locality-sensitive hashing, which may be of independent interest; we give applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering.

Motivation. The nearest neighbors problem is of major importance to a variety of applications, usually involving similarity searching. Some examples are: data compression [36]; databases and data mining [13, 39]; information retrieval [11, 21, 58]; image and video databases [29, 31, 56, 61]; machine learning [19]; pattern recognition [20, 26]; and statistics and data analysis [22, 45]. Typically, the features of the objects of interest (documents, images, etc.) are represented as points in R^d and a distance metric is used to measure (dis)similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands.
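The ε-NNS condition stated above is easy to check directly. The following sketch (names and structure are ours, for illustration) verifies that a candidate point p satisfies d(p, q) ≤ (1 + ε) d(p', q) for all p' ∈ P, i.e., that p is within a (1 + ε) factor of the true nearest-neighbor distance:

```python
import math

def is_eps_approximate_nn(P, q, p, eps):
    """Check the epsilon-NNS condition from the text: p is an
    eps-approximate nearest neighbor of q iff
    d(p, q) <= (1 + eps) * d(p', q) for every p' in P."""
    d_star = min(math.dist(pp, q) for pp in P)  # exact NN distance
    return math.dist(p, q) <= (1 + eps) * d_star
```

Note that for any ε > 0 there may be many points satisfying the condition, which is exactly the slack that the approximate algorithms exploit.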
For example, in multimedia applications such as IBM's QBIC (Query by Image Content), the number of features could be several hundreds [29, 31]. In information retrieval for text documents, vector-space representations involve several thousands of dimensions, and it is considered to be a dramatic improvement that dimension-reduction techniques, such as LSI (latent semantic indexing) [9, 11, 21], principal components analysis [40], or the Karhunen-Loeve transform [44, 50], can reduce the dimensionality to a mere few hundreds!

Of late, there has been an increasing interest in avoiding the curse of dimensionality by resorting to approximate nearest neighbor searching. Since the selection of features and the use of a distance metric in the applications are rather heuristic and merely an attempt to make mathematically precise what is after all an essentially aesthetic notion of similarity, it seems like an overkill to insist on the absolute nearest neighbor; in fact, determining an ε-approximate nearest neighbor for a reasonable value of ε, say a small constant, should suffice for most practical purposes. Unfortunately, even this relaxation of goals has not removed the curse of dimensionality, although the recent results of Kleinberg [46] give some improvements.

Previous Work. Samet [59] surveys a variety of data structures for nearest neighbors.
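The random-projection idea behind the classical geometric lemma invoked earlier can be sketched in a few lines. This is a minimal illustration under standard assumptions (Gaussian entries scaled by 1/sqrt(k)), not necessarily the exact construction or parameters used in the paper; all names are ours:

```python
import math
import random

def random_projection(points, k, seed=0):
    """Map d-dimensional points to k dimensions via a random Gaussian
    matrix scaled by 1/sqrt(k), so that pairwise Euclidean distances
    are approximately preserved with high probability (the
    Johnson-Lindenstrauss phenomenon). Illustrative sketch only."""
    rng = random.Random(seed)
    d = len(points[0])
    # One random row of unit-variance Gaussians per output dimension.
    R = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)]
         for _ in range(k)]
    # Apply the linear map R to every point.
    return [[sum(row[i] * p[i] for i in range(d)) for row in R]
            for p in points]
```

The point of the lemma is that k can be taken on the order of log n (for fixed distortion), independent of d, which is what makes query time polynomial in d and log n possible.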