Stanford CS 468 - Towards Removing the Curse of Dimensionality

Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
(preliminary version)

Piotr Indyk*        Rajeev Motwani†
Department of Computer Science
Stanford University
Stanford, CA 94305

… 30, 1999

Abstract

The nearest neighbor problem is the following: Given a set of n points P = {p_1, ..., p_n} in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q ∈ X. We focus on the particularly interesting case of the d-dimensional Euclidean space where X = R^d under some l_p norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: Find a point p ∈ P that is an ε-approximate nearest neighbor of the query q, in that for all p′ ∈ P, d(p, q) ≤ (1 + ε) d(p′, q).

We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a truly sublinear query time; and (b) query time polynomial in log n and d, and only a mildly exponential preprocessing cost Õ(n) × O(1/ε)^d. Further, applying a classical geometric lemma on random projections (for which we give a simpler proof), we obtain the first known algorithm with polynomial preprocessing and query time polynomial in d and log n. Unfortunately, for small ε, the latter is a purely theoretical result since the exponent depends on 1/ε. Experimental results indicate that our first algorithm offers orders of magnitude improvement on running times over real data sets. Its key ingredient is the notion of locality-sensitive hashing, which may be of independent interest; here, we give applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering algorithms.

* Supported by a Stanford Graduate Fellowship and NSF Award CCR-9357849.
† Supported by a Sloan Fellowship, an IBM Faculty Partnership Award, an ARO MURI Grant DAAH04-96-1-0007, and NSF Young Investigator Award CCR-9357849.

1 Introduction

The nearest neighbor search (NNS) problem is: Given a set of n points P = {p_1, ..., p_n} in a metric space X with distance function d, preprocess P so as to efficiently answer queries for finding the point in P closest to a query point q ∈ X. We focus on the particularly interesting case of the d-dimensional Euclidean space where X = R^d under some l_p norm. The low-dimensional case is well-solved [27], so the main issue is that of dealing with the "curse of dimensionality" [17]. The problem was originally posed in the 1960s by Minsky and Papert [54, pp. 222-225], and despite decades of effort the current solutions are far from satisfactory. In fact, for large d, in theory or in practice, they provide little improvement over a brute-force algorithm which compares a query q to each p ∈ P. The known algorithms are of two types: (a) low preprocessing cost but query time linear in n and d; and (b) query time sublinear in n and polynomial in d, but with severely exponential preprocessing cost n^d.
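As a concrete reference point, the following minimal Python sketch (not from the paper; the function names and the choice of the Euclidean l_2 norm are illustrative assumptions) implements the brute-force linear scan described above and checks the ε-approximate nearest neighbor condition d(p, q) ≤ (1 + ε) d(p′, q) for all p′ ∈ P.

```python
import math

def dist(p, q):
    # Euclidean (l2) distance; the paper allows any l_p norm.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def brute_force_nn(P, q):
    # O(n * d) linear scan: compare the query to every data point.
    return min(P, key=lambda p: dist(p, q))

def is_eps_approximate_nn(p, P, q, eps):
    # p is an eps-approximate nearest neighbor of q iff
    # d(p, q) <= (1 + eps) * d(p', q) for every p' in P.
    best = dist(brute_force_nn(P, q), q)
    return dist(p, q) <= (1 + eps) * best

# Illustrative usage with hypothetical 2-d points:
P = [(0.0, 0.0), (1.0, 1.0), (2.0, 5.0)]
q = (0.9, 1.2)
p = brute_force_nn(P, q)
assert is_eps_approximate_nn(p, P, q, eps=0.1)
```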
This unfortunate situation carries over to average-case analysis, and even to the ε-approximate nearest neighbors (ε-NNS) problem: Find a point p ∈ P that is an ε-approximate nearest neighbor of the query q, in that for all p′ ∈ P, d(p, q) ≤ (1 + ε) d(p′, q).

We present two algorithms for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a truly sublinear query time; and (b) query time polynomial in log n and d, and only a mildly exponential preprocessing cost Õ(n) × O(1/ε)^d. Further, by applying a classical geometric lemma on random projections (for which we give a simpler proof), we obtain the first known algorithm with polynomial preprocessing and query time polynomial in d and log n. Unfortunately, for small ε, this is a purely theoretical result as the exponent depends on 1/ε. Experimental results [37] indicate that the first algorithm offers orders of magnitude improvement on running times over real data sets. Its key ingredient is the notion of locality-sensitive hashing, which may be of independent interest; we give applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering.

Motivation. The nearest neighbors problem is of major importance to a variety of applications, usually involving similarity searching. Some examples are: data compression [36]; databases and data mining [13, 39]; information retrieval [11, 21, 58]; image and video databases [29, 31, 56, 61]; machine learning [19]; pattern recognition [20, 26]; and statistics and data analysis [22, 45]. Typically, the features of the objects of interest (documents, images, etc.) are represented as points in R^d and a distance metric is used to measure (dis)similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands. For example, in multimedia applications such as IBM's QBIC (Query by Image Content), the number of features could be several hundreds [29, 31]. In information retrieval for text documents, vector-space representations involve several thousands of dimensions, and it is considered to be a dramatic improvement that dimension-reduction techniques, such as LSI (latent semantic indexing) [9, 11, 21], principal components analysis [40], or the Karhunen-Loève transform [44, 50], can reduce the dimensionality to a mere few hundreds!

Of late, there has been an increasing interest in avoiding the curse of dimensionality by resorting to approximate nearest neighbor searching. Since the selection of features and the use of a distance metric in the applications are rather heuristic, and merely an attempt to make mathematically precise what is after all an essentially aesthetic notion of similarity, it seems like overkill to insist on the absolute nearest neighbor; in fact, determining an ε-approximate nearest neighbor for a reasonable value of ε, say a small constant, should suffice for most practical purposes. Unfortunately, even this relaxation of goals has not removed the curse of dimensionality, although the recent results of Kleinberg [46] give some improvements.

Previous Work. Samet [59] surveys a variety of data structures for …
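To illustrate the locality-sensitive hashing idea mentioned above: the aim is to hash the data so that nearby points collide far more often than distant ones, and to answer a query by examining only the points in the query's buckets. The sketch below uses random-hyperplane (sign) signatures, a standard LSH family for angular similarity; it is a hedged illustration of the general idea, not the specific hash family constructed in this paper, and all names and parameters (k, num_tables) are assumptions.

```python
import random

class HyperplaneLSH:
    """Minimal locality-sensitive hashing sketch (illustrative only):
    k-bit random-hyperplane signatures, repeated over several tables."""

    def __init__(self, dim, k=8, num_tables=10, seed=0):
        rng = random.Random(seed)
        # One set of k random hyperplane normals per table.
        self.planes = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
                       for _ in range(num_tables)]
        self.tables = [dict() for _ in range(num_tables)]

    def _signature(self, planes, p):
        # Sign of the projection onto each hyperplane normal; points at a
        # small angle agree on most bits, so they share buckets more often.
        return tuple(1 if sum(w * x for w, x in zip(h, p)) >= 0 else 0
                     for h in planes)

    def insert(self, p):
        for planes, table in zip(self.planes, self.tables):
            table.setdefault(self._signature(planes, p), []).append(p)

    def query(self, q):
        # Gather candidates from the buckets q hashes to, then rank the
        # (hopefully small) candidate set by exact distance.
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._signature(planes, q), []))
        return min(candidates,
                   key=lambda p: sum((a - b) ** 2 for a, b in zip(p, q)),
                   default=None)

# Illustrative usage with hypothetical 2-d points:
index = HyperplaneLSH(dim=2)
for point in [(0.0, 0.0), (1.0, 1.0), (2.0, 5.0)]:
    index.insert(point)
print(index.query((0.9, 1.2)))  # likely (1.0, 1.0); may be None if no bucket matches
```

In this sketch, a larger k makes buckets more selective (fewer false candidates), while more tables raise the chance that a true near neighbor lands in at least one shared bucket.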
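The "classical geometric lemma on random projections" cited in the abstract refers to the fact that projecting points onto a random low-dimensional subspace approximately preserves pairwise distances. Below is a hedged sketch of that style of dimension reduction (the function name and scaling are illustrative assumptions; this is not the paper's algorithm itself).

```python
import math
import random

def random_projection(points, target_dim, seed=0):
    """Project d-dimensional points to target_dim dimensions with a random
    Gaussian matrix scaled by 1/sqrt(target_dim). Pairwise Euclidean
    distances are preserved up to a (1 +/- eps) factor with high probability
    once target_dim is on the order of log(n) / eps^2, independent of d."""
    rng = random.Random(seed)
    d = len(points[0])
    R = [[rng.gauss(0, 1) / math.sqrt(target_dim) for _ in range(d)]
         for _ in range(target_dim)]
    return [tuple(sum(r * x for r, x in zip(row, p)) for row in R)
            for p in points]

# Illustrative usage: map hypothetical 1000-dimensional points down to 50.
high_dim = [tuple(random.gauss(0, 1) for _ in range(1000)) for _ in range(5)]
low_dim = random_projection(high_dim, target_dim=50)
```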

