Fast Locality-Sensitive Hashing

Anirban Dasgupta, Ravi Kumar, Tamás Sarlós
Yahoo! Research
701 First Avenue, Sunnyvale, CA 94089
{anirban,ravikumar,stamas}@yahoo-inc.com

ABSTRACT

Locality-sensitive hashing (LSH) is a basic primitive in several large-scale data processing applications, including nearest-neighbor search, de-duplication, clustering, etc. In this paper we propose a new and simple method to speed up the widely-used Euclidean realization of LSH. At the heart of our method is a fast way to estimate the Euclidean distance between two d-dimensional vectors; this is achieved by the use of randomized Hadamard transforms in a non-linear setting. This decreases the running time of a (k, L)-parameterized LSH from O(dkL) to O(d log d + kL). Our experiments show that using the new LSH in nearest-neighbor applications can improve their running times by significant amounts. To the best of our knowledge, this is the first running time improvement to LSH that is both provable and practical.

Categories and Subject Descriptors. H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.4.m [Information Systems Applications]: Miscellaneous; G.3 [Mathematics of Computing]: Probability and Statistics—Probabilistic algorithms

General Terms. Algorithms, Experimentation, Performance, Theory

Keywords. Nearest neighbor search, Locality-sensitive hashing, Hadamard transform, Dimension reduction

1. INTRODUCTION

Locality-sensitive hashing (LSH) is a basic primitive in large-scale data processing algorithms that are designed to operate on objects (with features) in high dimensions. The idea behind LSH [23, 22] is the following: construct a family of functions that hash objects into buckets such that objects that are similar will be hashed to the same bucket with high probability. Here, the type of the objects and the notion of similarity between them determine the particular hash function family. Typical instances include the Jaccard coefficient as similarity when the underlying objects are sets and the ℓ2 norm as distance (i.e., dissimilarity) or the cosine/angle as similarity when the underlying objects are vectors.

With such a powerful primitive, many large-scale data processing problems that previously suffered from the "curse of dimensionality" suddenly become tractable. For instance, in conjunction with standard indexing techniques, it becomes possible to search for nearest neighbors efficiently [16]: given a query, hash the query into a bucket, use the objects in the bucket as candidates for near neighbors, and rank them according to their similarity to the query. Likewise, popular operations such as near-duplicate detection [7, 28, 19], all-pairs similarity [6], similarity join/record-linkage [25], and temporal correlation [10] are made easier.

In this paper objects are vectors and the distance between vectors is the Euclidean (ℓ2) norm. This choice is motivated by many applications in text processing, image/video indexing/retrieval, natural language processing, etc.
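To make the (k, L) parameterization and the bucket-based query procedure above concrete, the following Python sketch implements a standard ℓ2 LSH index, not the accelerated variant proposed in this paper. It uses the quantized random-projection hash h(x) = floor((a·x + b)/w), i.e., the p-stable construction discussed below; the class name, bucket width w, and default parameter values are illustrative assumptions rather than details taken from the paper.

import numpy as np

class EuclideanLSH:
    """Minimal (k, L) LSH index for l2 distance (illustrative sketch only)."""

    def __init__(self, dim, k=8, L=16, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w
        # For each of the L tables: k random Gaussian directions and random offsets.
        self.A = rng.standard_normal((L, k, dim))   # projection vectors
        self.b = rng.uniform(0.0, w, size=(L, k))   # offsets in [0, w)
        self.tables = [dict() for _ in range(L)]
        self.points = []

    def _keys(self, x):
        # O(d k L) work per point/query: exactly the cost the paper reduces.
        proj = np.floor((self.A @ np.asarray(x, dtype=float) + self.b) / self.w).astype(int)
        return [tuple(row) for row in proj]          # one k-tuple key per table

    def add(self, x):
        idx = len(self.points)
        self.points.append(np.asarray(x, dtype=float))
        for table, key in zip(self.tables, self._keys(x)):
            table.setdefault(key, []).append(idx)

    def query(self, q, top=5):
        q = np.asarray(q, dtype=float)
        candidates = set()
        # Hash the query into one bucket per table; pool the bucket contents as candidates.
        for table, key in zip(self.tables, self._keys(q)):
            candidates.update(table.get(key, []))
        # Rank candidates by their true Euclidean distance to the query.
        return sorted(candidates, key=lambda i: np.linalg.norm(self.points[i] - q))[:top]

In this baseline, each query spends O(dkL) time on hashing before any candidates are ranked; the paper's contribution, described next, is to replace that projection work with Hadamard-transform-based sketches.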
When the similarity measure between vectors is their angle, Charikar [9] gave a very simple LSH called SIMHASH: the hash of an input vector is the sign of its inner product with a random unit vector. It can be shown that the probability that the hashes of two vectors agree is a function of the angle between the underlying vectors [17]. Datar et al. [12] present constructions based on stable distributions for the ℓp norm; their construct is almost identical to Charikar's in the ℓ2 case. Since then, there have been efforts to make these LSH constructions more efficient and practical [27, 13, 5]. Each of these LSH constructions works by first using a randomized estimator of the similarity measure, pasting multiple such constructions to create one hash function with an appropriately small collision rate, and then using multiple such hash functions to get the right tradeoff between the collision rate and recall.

Main results. In this paper we obtain algorithmic improvements to the query time of the basic LSH in the ℓ2 setting. At a high level, we improve the query time of the LSH for estimating ℓ2 for d-dimensional vectors from O(dkL) to O(d log d + kL), where roughly, each hash function maps into a vector of k bits and L different hash functions are used. Experiments on different, large, and high-dimensional datasets show that these theoretical gains translate to a typical improvement of 20% or more in the LSH query time. We also extend our results to the angle-based similarity, where we improve the query-time to ε-approximate the angle between two d-dimensional vectors from O(d/ε²) to O(d log 1/ε + 1/ε²); we postpone the details of this result to the full version of the paper. To the best of our knowledge, this is the first query-time improvement to LSH that is both provable and practical.

Our improvement consists of two new algorithms. The first algorithm, called ACHash, works by computing the following sequence applied to the input vector: randomly flip the signs of each coordinate of the vector, apply the Hadamard transform, and compute the inner product with a sparse Gaussian vector. This particular sequence of operations is directly inspired by the fast Johnson–Lindenstrauss transform proposed by Ailon and Chazelle [1] for dimension reduction. The second algorithm, called DHHash, works by computing the following sequence applied to the input vector: randomly flip the signs of each coordinate of the vector, apply the Hadamard transform, multiply each coordinate by independent unit Gaussians, and apply another Hadamard transform. DHHash, even though it has only comparable theoretical guarantees to ACHash, performs much better in practice. The use of a Gaussian operator sandwiched by Hadamard transforms is a novel step in DHHash.

Clearly, both the algorithms exploit the computational advantages of the Hadamard transform and that is where we gain the improved query time. The space requirements of naive
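The two operation sequences described above can be sketched in a few lines of Python. The fast Walsh-Hadamard transform costs O(d log d), which is where the speedup over O(dk) dense projections per table comes from. This is only a sketch under stated assumptions: the sparsity level of the Gaussian matrix in ACHash, and taking the first k coordinates of the DHHash output as the bucket coordinates, are our illustrative choices, since the preview does not spell out those details.

import numpy as np

def fwht(x):
    # Fast Walsh-Hadamard transform in O(d log d); len(x) must be a power of 2.
    y = np.asarray(x, dtype=float).copy()
    n, h = len(y), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b
            y[i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(n)                     # orthonormal scaling (a convention chosen here)

rng = np.random.default_rng(0)
d, k = 1024, 16                                # illustrative dimension (power of 2) and sketch length
D = rng.choice([-1.0, 1.0], size=d)            # random sign flips
G = rng.standard_normal(d)                     # independent unit Gaussians (used by DHHash)
# Sparse Gaussian matrix for ACHash; 5% density is an assumption, not the paper's setting.
P = rng.standard_normal((k, d)) * (rng.random((k, d)) < 0.05)

def achash_sketch(x):
    # ACHash: flip signs, apply Hadamard, take inner products with sparse Gaussian vectors.
    return P @ fwht(D * x)

def dhhash_sketch(x):
    # DHHash: flip signs, apply Hadamard, scale each coordinate by a Gaussian, apply Hadamard again.
    # Keeping the first k output coordinates is our assumption about how the key is extracted.
    return fwht(G * fwht(D * x))[:k]

x = rng.standard_normal(d)
print(achash_sketch(x).shape, dhhash_sketch(x).shape)

Under these assumptions, both sketches replace the k dense projections of a single hash table with O(d log d) transform work plus cheap per-coordinate operations, matching the O(d log d + kL) query cost claimed in the abstract.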

