Princeton COS 598B - Towards Removing the Curse of Dimensionality

Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality

PIOTR INDYK*   RAJEEV MOTWANI
Department of Computer Science, Stanford University, Stanford, CA 94305
{indyk,rajeev}@cs.stanford.edu

Abstract

The nearest neighbor problem is the following: given a set of n points P = {p_1, ..., p_n} in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q ∈ X. We focus on the particularly interesting case of the d-dimensional Euclidean space where X = R^d under some l_p norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem: find a point p ∈ P that is an ε-approximate nearest neighbor of the query q, in that for all p' ∈ P, d(p, q) ≤ (1 + ε) d(p', q).

We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a truly sublinear query time (for ε > 1); and (b) query time polynomial in log n and d, and only a mildly exponential preprocessing cost O(n) × O(1/ε)^d. Further, applying a classical geometric lemma on random projections (for which we give a simpler proof), we obtain the first known algorithm with polynomial preprocessing and query time polynomial in d and log n. Unfortunately, for small ε, the latter is a purely theoretical result since the exponent depends on 1/ε. Experimental results indicate that our first algorithm offers orders of magnitude improvement in running times over real data sets. Its key ingredient is the notion of locality-sensitive hashing, which may be of independent interest; here, we give applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering algorithms.

* Supported by a Stanford Graduate Fellowship and NSF …

1 Introduction

The nearest neighbor search (NNS) problem is: given a set of n points P = {p_1, ..., p_n} in a metric space X with distance function d, preprocess P so as to efficiently answer queries for finding the point in P closest to a query point q ∈ X. We focus on the particularly interesting case of the d-dimensional Euclidean space where X = R^d under some l_p norm. The low-dimensional case is well-solved [26], so the main issue is that of dealing with the "curse of dimensionality" [16]. The problem was originally posed in the 1960s by Minsky and Papert [53, pp. 222-225], and despite decades of effort the current solutions are far from satisfactory. In fact, for large d, in theory or in practice, they provide little improvement over a brute-force algorithm which compares a query q to each p ∈ P. The known algorithms are of two types: (a) low preprocessing cost but query time linear in n and d; and (b) query time sublinear in n and polynomial in d, but with severely exponential preprocessing cost n^O(d). This unfortunate situation carries over to average-case analysis, and even to the ε-approximate nearest neighbors (ε-NNS) problem: find a point p ∈ P that is an ε-approximate nearest neighbor of the query q, in that for all p' ∈ P, d(p, q) ≤ (1 + ε) d(p', q).
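To make the ε-NNS definition concrete, here is a minimal Python sketch, not from the paper: a brute-force exact search (the baseline that, for large d, known data structures barely improve on) together with a predicate testing the ε-approximate condition. The Euclidean l_2 metric and all function names are illustrative assumptions.

```python
import math

def dist(p, q):
    """Euclidean (l_2) distance; the problem statement allows any l_p norm."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def exact_nn(P, q):
    """Brute-force nearest neighbor: O(n * d) work per query."""
    return min(P, key=lambda p: dist(p, q))

def is_eps_approx_nn(p, P, q, eps):
    """True iff dist(p, q) <= (1 + eps) * dist(p', q) for every p' in P,
    i.e. p is an eps-approximate nearest neighbor of q."""
    return all(dist(p, q) <= (1 + eps) * dist(pp, q) for pp in P)

# Hypothetical usage:
P = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
q = (0.9, 1.2)
p_star = exact_nn(P, q)                          # (1.0, 1.0)
assert is_eps_approx_nn(p_star, P, q, eps=0.0)   # the exact NN is 0-approximate
print(is_eps_approx_nn((0.0, 0.0), P, q, eps=2.0))  # False: beyond (1 + eps) times the optimum
```

Any ε-NNS algorithm may return such a p in place of the exact neighbor; the contribution of the results above is to find one with far less work than the linear scan in this sketch.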
We present two algorithms for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a truly sublinear query time (for ε > 1); and (b) query time polynomial in log n and d, and only a mildly exponential preprocessing cost O(n) × O(1/ε)^d. Further, by applying a classical geometric lemma on random projections (for which we give a simpler proof), we obtain the first known algorithm with polynomial preprocessing and query time polynomial in d and log n. Unfortunately, for small ε, this is a purely theoretical result as the exponent depends on 1/ε. Experimental results [36] indicate that the first algorithm offers orders of magnitude improvement in running times over real data sets. Its key ingredient is the notion of locality-sensitive hashing, which may be of independent interest; we give applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering. (Toy sketches of the random-projection and hashing ideas appear below.)

Motivation. The nearest neighbors problem is of major importance to a variety of applications, usually involving similarity searching. Some examples are: data compression [36]; databases and data mining [12, 38]; information retrieval [10, 20, 57]; image and video databases [28, 30, 55, 60]; machine learning [18]; pattern recognition [19, 25]; and statistics and data analysis [21, 44]. Typically, the features of the objects of interest (documents, images, etc.) are represented as points in R^d and a distance metric is used to measure the (dis)similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands. For example, in multimedia applications such as IBM's QBIC (Query by Image Content), the number of features could be several hundreds [28, 30]. In information retrieval for text documents, vector-space representations involve several thousands of dimensions, and it is considered a dramatic improvement that dimension-reduction techniques, such as LSI (latent semantic indexing) [8, 10, 20], principal components analysis [39], or the Karhunen-Loève transform [43, 49], can reduce the dimensionality to a mere few hundreds! Of late, there has been an increasing interest in avoiding the curse of dimensionality by resorting to approximate nearest-neighbor searching. Since the selection of features and …
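Two of the ingredients named above can be illustrated with short, self-contained sketches. Both are toy illustrations under stated assumptions, not the paper's constructions, which this preview does not show.

First, the "classical geometric lemma on random projections" is a Johnson–Lindenstrauss-type statement: a random linear map to roughly O(log n / ε²) dimensions preserves all pairwise distances up to a (1 ± ε) factor with high probability. A minimal numpy sketch (the Gaussian projection and the sizes n, d, k are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 1000, 200      # illustrative sizes; k grows like O(log n / eps^2)
X = rng.normal(size=(n, d))   # hypothetical data set of n points in R^d

# Random Gaussian projection to R^k; the 1/sqrt(k) scaling makes squared
# distances unbiased estimates of the originals.
A = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ A

orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(f"original {orig:.3f}, projected {proj:.3f}, rel. error {abs(proj - orig) / orig:.3f}")
```

Second, locality-sensitive hashing. As one classical example of such a family, the sketch below uses bit sampling for binary vectors under Hamming distance: each of L tables hashes a point by k randomly chosen coordinates, so points at small Hamming distance collide in at least one table with good probability. The class name and the parameter values are hypothetical:

```python
import random
from collections import defaultdict

class BitSamplingLSH:
    """Toy LSH index for binary vectors under Hamming distance."""

    def __init__(self, dim, k=8, L=10, seed=0):
        rng = random.Random(seed)
        # Each table hashes by k randomly chosen coordinate positions.
        self.projections = [rng.sample(range(dim), k) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, v, proj):
        return tuple(v[i] for i in proj)

    def insert(self, v):
        for proj, table in zip(self.projections, self.tables):
            table[self._key(v, proj)].append(v)

    def query(self, q):
        """Candidates colliding with q in at least one table; a final
        linear scan over this (hopefully small) set picks the best."""
        seen = {tuple(v)
                for proj, table in zip(self.projections, self.tables)
                for v in table.get(self._key(q, proj), [])}
        return [list(v) for v in seen]

# Hypothetical usage on 16-bit vectors:
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(1)
points = [[rng.randint(0, 1) for _ in range(16)] for _ in range(100)]
index = BitSamplingLSH(dim=16)
for p in points:
    index.insert(p)

q = points[0][:]   # query: a stored point with one bit flipped
q[3] ^= 1
cands = index.query(q)
best = min(cands, key=lambda v: hamming(v, q)) if cands else None
print(len(cands), hamming(best, q) if best else None)
```

The k/L trade-off mirrors the general LSH recipe: a larger k makes each table more selective (fewer spurious collisions), while a larger L raises the probability that a true near neighbor collides somewhere.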

