Princeton COS 598B - Towards Removing the Curse of Dimensionality

Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality

PIOTR INDYK*   RAJEEV MOTWANI
Department of Computer Science, Stanford University, Stanford, CA 94305
{indyk,rajeev}@cs.stanford.edu

Abstract

The nearest neighbor problem is the following: given a set of n points P = {p_1, ..., p_n} in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q ∈ X. We focus on the particularly interesting case of the d-dimensional Euclidean space where X = R^d under some l_p norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem: find a point p ∈ P that is an ε-approximate nearest neighbor of the query q, in that for all p' ∈ P, d(p, q) ≤ (1 + ε) d(p', q).

We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a truly sublinear query time (for ε > 1); and (b) query time polynomial in log n and d, and only a mildly exponential preprocessing cost O(n) × O(1/ε)^d. Further, applying a classical geometric lemma on random projections (for which we give a simpler proof), we obtain the first known algorithm with polynomial preprocessing and query time polynomial in d and log n. Unfortunately, for small ε, the latter is a purely theoretical result since the exponent depends on 1/ε. Experimental results indicate that our first algorithm offers orders of magnitude improvement in running times over real data sets. Its key ingredient is the notion of locality-sensitive hashing, which may be of independent interest; here, we give applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering algorithms.

* Supported by a Stanford Graduate Fellowship and NSF …

1 Introduction

The nearest neighbor search (NNS) problem is: given a set of n points P = {p_1, ..., p_n} in a metric space X with distance function d, preprocess P so as to efficiently answer queries for finding the point in P closest to a query point q ∈ X. We focus on the particularly interesting case of the d-dimensional Euclidean space where X = R^d under some l_p norm. The low-dimensional case is well-solved [26], so the main issue is that of dealing with the "curse of dimensionality" [16]. The problem was originally posed in the 1960s by Minsky and Papert [53, pp. 222-225], and despite decades of effort the current solutions are far from satisfactory. In fact, for large d, in theory or in practice, they provide little improvement over a brute-force algorithm which compares a query q to each p ∈ P. The known algorithms are of two types: (a) low preprocessing cost but query time linear in n and d; and (b) query time sublinear in n and polynomial in d, but with severely exponential preprocessing cost n^O(d). This unfortunate situation carries over to average-case analysis, and even to the ε-approximate nearest neighbors (ε-NNS) problem: find a point p ∈ P that is an ε-approximate nearest neighbor of the query q, in that for all p' ∈ P, d(p, q) ≤ (1 + ε) d(p', q).
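To make the ε-NNS definition concrete, here is a minimal Python sketch, not from the paper: a brute-force exact search (the baseline that, for large d, known data structures barely improve on) together with a predicate testing the ε-approximate condition. The Euclidean l_2 metric and all function names are illustrative assumptions.

```python
import math

def dist(p, q):
    """Euclidean (l_2) distance; the problem statement allows any l_p norm."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def exact_nn(P, q):
    """Brute-force nearest neighbor: O(n * d) work per query."""
    return min(P, key=lambda p: dist(p, q))

def is_eps_approx_nn(p, P, q, eps):
    """True iff dist(p, q) <= (1 + eps) * dist(p', q) for every p' in P,
    i.e. p is an eps-approximate nearest neighbor of q."""
    return all(dist(p, q) <= (1 + eps) * dist(pp, q) for pp in P)

# Hypothetical usage:
P = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
q = (0.9, 1.2)
p_star = exact_nn(P, q)                          # (1.0, 1.0)
assert is_eps_approx_nn(p_star, P, q, eps=0.0)   # the exact NN is 0-approximate
print(is_eps_approx_nn((0.0, 0.0), P, q, eps=2.0))  # False: beyond (1 + eps) times the optimum
```

Any ε-NNS algorithm may return such a p in place of the exact neighbor; the contribution of the results above is to find one with far less work than the linear scan in this sketch.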
We present two algorithms for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a truly sublinear query time (for ε > 1); and (b) query time polynomial in log n and d, and only a mildly exponential preprocessing cost O(n) × O(1/ε)^d. Further, by applying a classical geometric lemma on random projections (for which we give a simpler proof), we obtain the first known algorithm with polynomial preprocessing and query time polynomial in d and log n. Unfortunately, for small ε, this is a purely theoretical result as the exponent depends on 1/ε. Experimental results [36] indicate that the first algorithm offers orders of magnitude improvement in running times over real data sets. Its key ingredient is the notion of locality-sensitive hashing, which may be of independent interest; we give applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering. (Toy sketches of the random-projection and hashing ideas appear below.)

Motivation. The nearest neighbors problem is of major importance to a variety of applications, usually involving similarity searching. Some examples are: data compression [36]; databases and data mining [12, 38]; information retrieval [10, 20, 57]; image and video databases [28, 30, 55, 60]; machine learning [18]; pattern recognition [19, 25]; and statistics and data analysis [21, 44]. Typically, the features of the objects of interest (documents, images, etc.) are represented as points in R^d and a distance metric is used to measure the (dis)similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands. For example, in multimedia applications such as IBM's QBIC (Query by Image Content), the number of features could be several hundreds [28, 30]. In information retrieval for text documents, vector-space representations involve several thousands of dimensions, and it is considered a dramatic improvement that dimension-reduction techniques, such as LSI (latent semantic indexing) [8, 10, 20], principal components analysis [39], or the Karhunen-Loève transform [43, 49], can reduce the dimensionality to a mere few hundreds! Of late, there has been an increasing interest in avoiding the curse of dimensionality by resorting to approximate nearest-neighbor searching. Since the selection of features and …
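Two of the ingredients named above can be illustrated with short, self-contained sketches. Both are toy illustrations under stated assumptions, not the paper's constructions, which this preview does not show.

First, the "classical geometric lemma on random projections" is a Johnson–Lindenstrauss-type statement: a random linear map to roughly O(log n / ε²) dimensions preserves all pairwise distances up to a (1 ± ε) factor with high probability. A minimal numpy sketch (the Gaussian projection and the sizes n, d, k are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 1000, 200      # illustrative sizes; k grows like O(log n / eps^2)
X = rng.normal(size=(n, d))   # hypothetical data set of n points in R^d

# Random Gaussian projection to R^k; the 1/sqrt(k) scaling makes squared
# distances unbiased estimates of the originals.
A = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ A

orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(f"original {orig:.3f}, projected {proj:.3f}, rel. error {abs(proj - orig) / orig:.3f}")
```

Second, locality-sensitive hashing. As one classical example of such a family, the sketch below uses bit sampling for binary vectors under Hamming distance: each of L tables hashes a point by k randomly chosen coordinates, so points at small Hamming distance collide in at least one table with good probability. The class name and the parameter values are hypothetical:

```python
import random
from collections import defaultdict

class BitSamplingLSH:
    """Toy LSH index for binary vectors under Hamming distance."""

    def __init__(self, dim, k=8, L=10, seed=0):
        rng = random.Random(seed)
        # Each table hashes by k randomly chosen coordinate positions.
        self.projections = [rng.sample(range(dim), k) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, v, proj):
        return tuple(v[i] for i in proj)

    def insert(self, v):
        for proj, table in zip(self.projections, self.tables):
            table[self._key(v, proj)].append(v)

    def query(self, q):
        """Candidates colliding with q in at least one table; a final
        linear scan over this (hopefully small) set picks the best."""
        seen = {tuple(v)
                for proj, table in zip(self.projections, self.tables)
                for v in table.get(self._key(q, proj), [])}
        return [list(v) for v in seen]

# Hypothetical usage on 16-bit vectors:
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(1)
points = [[rng.randint(0, 1) for _ in range(16)] for _ in range(100)]
index = BitSamplingLSH(dim=16)
for p in points:
    index.insert(p)

q = points[0][:]   # query: a stored point with one bit flipped
q[3] ^= 1
cands = index.query(q)
best = min(cands, key=lambda v: hamming(v, q)) if cands else None
print(len(cands), hamming(best, q) if best else None)
```

The k/L trade-off mirrors the general LSH recipe: a larger k makes each table more selective (fewer spurious collisions), while a larger L raises the probability that a true near neighbor collides somewhere.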

