Entropy based Nearest Neighbor Search in High Dimensions

Rina Panigrahy*

November 4, 2005

* Department of Computer Science, Stanford University, Stanford, CA 94305. Supported in part by a Stanford Graduate Fellowship, NSF Grants EIA-0137761 and ITR-0331640, and a grant from SNRC. [email protected]

Abstract

In this paper we study the problem of finding the approximate nearest neighbor of a query point in high dimensional space, focusing on the Euclidean space. The earlier approaches use locality-preserving hash functions (that tend to map nearby points to the same value) to construct several hash tables, ensuring that the query point hashes to the same bucket as its nearest neighbor in at least one table. Our approach is different: we use one (or a few) hash table and hash several randomly chosen points in the neighborhood of the query point, showing that at least one of them will hash to the bucket containing its nearest neighbor. We show that the number of randomly chosen points required in the neighborhood of the query point q depends on the entropy of the hash value h(p) of a random point p at the same distance from q as its nearest neighbor, given q and the locality preserving hash function h chosen randomly from the hash family. Precisely, we show that if the entropy $I(h(p) \mid q, h) = M$ and g is a bound on the probability that two far-off points will hash to the same bucket, then we can find the approximate nearest neighbor in $O(n^\rho)$ time and near linear $\tilde{O}(n)$ space, where $\rho = M/\log(1/g)$. Alternatively, we can build a data structure of size $\tilde{O}(n^{1/(1-\rho)})$ to answer queries in $\tilde{O}(d)$ time. By applying this analysis to the locality preserving hash functions in [17, 21, 6] and adjusting the parameters, we show that the c-approximate nearest neighbor can be computed in time $\tilde{O}(n^\rho)$ and near linear space, where $\rho \approx 2.06/c$ as c becomes large.

1 Introduction

In this paper we study the problem of finding the nearest neighbor of a query point in high dimensional Euclidean space: given a database of n points in a d-dimensional space, find the nearest neighbor of a query point. This fundamental problem arises in several applications, including data mining, information retrieval, and image search, where distinctive features of the objects are represented as points in $\mathbb{R}^d$ [25, 27, 4, 7, 11, 10, 24, 8]. While the exact problem seems to suffer from the "curse of dimensionality" (that is, either the query time or the space required is exponential in d [9, 23]), many efficient techniques have been devised for finding an approximate solution whose distance from the query point is at most $1+\epsilon$ times its distance from the nearest neighbor [2, 20, 17, 21, 12]. The best known algorithm for finding a $(1+\epsilon)$-approximate nearest neighbor of a query point runs in time $\tilde{O}(d \log n)$ using a data structure of size $(nd)^{O(1/\epsilon^2)}$. Since the exponent of the space requirement grows as $1/\epsilon^2$, in practice this may be prohibitively expensive for small $\epsilon$. Indeed, since even a space complexity of $(nd)^2$ may be too large, perhaps it makes more sense to interpret these results as efficient, practical algorithms for c-approximate nearest neighbor where c is a constant greater than one.
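The abstract contrasts two query strategies; as a point of reference, here is a minimal sketch of the classical multi-table lookup: draw several hash functions from the family, build one table per function, and hash the query once into each, hoping it collides with its nearest neighbor somewhere. The function names, the dict-based tables, and the numpy point representation are illustrative assumptions, not the exact construction of [17]:

    import numpy as np

    def build_lsh_tables(points, hash_fns):
        """Index every point in one table per hash function."""
        tables = []
        for h in hash_fns:
            table = {}
            for p in points:  # p is a numpy array; h(p) must be hashable
                table.setdefault(h(p), []).append(p)
            tables.append(table)
        return tables

    def multi_table_query(q, hash_fns, tables, c, r):
        """Hash q once into each table; report any bucket-mate within c*r."""
        for h, table in zip(hash_fns, tables):
            for cand in table.get(h(q), []):
                if np.linalg.norm(cand - q) <= c * r:
                    return cand
        return None

Since the guarantee comes from using many independent tables, each indexing all n points, the space usage is superlinear in n; avoiding that cost is the point of the single-table scheme sketched further below.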
This setting is also meaningful in practice, as typically when we are given a query point we are really interested in finding a neighbor that is much closer to the query point than the other points. The query point (say, an image) really represents the 'same object' as the nearest neighbor we expect it to 'match', except that the two may differ a little due to noise or inherent errors in how well points represent their objects; but the query point is expected to be quite far from the other points in the database, which essentially represent 'different objects'.

For these parameters, Indyk and Motwani [17] provide an algorithm for finding the c-approximate nearest neighbor in time $\tilde{O}(d + n^{1/c})$ using an index of size $\tilde{O}(n^{1+1/c})$ (while their paper states a query time of $\tilde{O}(d n^{1/c})$, if d is large this can easily be converted to $\tilde{O}(d + n^{1/c})$ by dimension reduction); with a data structure of near linear size, for the Hamming space, the algorithms in [17, 21] require a query time of $n^{O(\log c / c)}$. To put this in perspective, finding a 2-approximate nearest neighbor requires time $O(\sqrt{n})$ and an index of size $O(n\sqrt{n})$. The exponent was improved slightly in [6] for c in [1, 10]: instead of 1/c it was β/c, where β is a constant slightly less than 1 for c < 10; for example, when c = 2 they can reduce the exponent to approximately 0.42, implying a running time of $n^{0.42}$ and an index of size $n^{1.42}$. Their simulation results indicate that while locality sensitive hashing gives faster query times than other data structures based on kd-trees, this comes at the expense of using a lot more space. They work with the following decision version of the c-approximate nearest neighbor problem: given a query point and a parameter r for the distance to its nearest neighbor, find any neighbor of the query point that is at distance at most cr. It is well known that the reduction to the decision version adds only a logarithmic factor in the time and space complexity [17, 12].

In their formulation, they use a locality sensitive hash function that maps points in the space to a discrete space, where nearby points are likely to get hashed to the same value and far-off points are likely to get hashed to different values. Precisely, given a parameter m that denotes a lower bound on the probability that two points at most r apart hash to the same bucket, and g an upper bound on the probability that two points more than cr apart hash to the same bucket, they show that such a hash function can find a c-approximate nearest neighbor in $\tilde{O}(d + n^\rho)$ time using a data structure of size $\tilde{O}(n^{1+\rho})$, where $\rho = \log(1/m)/\log(1/g)$.

Their approach is to construct several hash tables to ensure that the query point hashes to the same bucket as its nearest neighbor in at least one table. Our approach is different: we use one (or a few) hash table and hash several randomly chosen points in the neighborhood of the query point, showing that at least one of them will hash to the bucket containing its nearest neighbor. We show that the number of randomly chosen points required in the neighborhood of the query point q depends on the entropy of the hash value h(p) of a random point p at distance r from q, given q and the locality preserving hash function h chosen randomly from the hash family. Precisely, we show that if the entropy $I(h(p) \mid q, h) = M$, then we can find the approximate nearest neighbor in $O(n^\rho)$ time and near linear $\tilde{O}(n)$ space, where $\rho = M/\log(1/g)$.
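As a counterpart to the multi-table sketch above, here is a minimal sketch of the single-table scheme just described, under the same illustrative assumptions: keep one table built with a single hash function h, and at query time probe it with randomly perturbed copies of q at distance r. The perturbation routine and the probe budget (heuristically on the order of $2^M$ probes) are assumptions made for illustration, not the paper's exact procedure:

    import numpy as np

    def random_point_at_distance(q, r, rng):
        """A random point at distance r from q, in a uniformly random direction."""
        u = rng.standard_normal(q.shape)
        return q + r * (u / np.linalg.norm(u))

    def entropy_query(q, r, h, table, num_probes, c, rng=None):
        """Probe a single hash table with perturbed copies of q.

        h          -- one locality preserving hash function, fixed at build time
        table      -- dict mapping hash value -> list of indexed points
        num_probes -- heuristically on the order of 2^M, M = I(h(p) | q, h)
        Returns any indexed point within distance c*r of q, or None.
        """
        rng = rng or np.random.default_rng()
        for _ in range(num_probes):
            p = random_point_at_distance(q, r, rng)  # mimic a random point at
            for cand in table.get(h(p), []):         # the nearest neighbor's distance
                if np.linalg.norm(cand - q) <= c * r:
                    return cand
        return None

The same table is shared across all probes, which is how the near linear $\tilde{O}(n)$ space of the abstract is achieved: the random probes play the role that the extra hash tables play in the classical scheme.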

