Fast Approximate Similarity Search Based on Degree-Reduced Neighborhood Graphs

Home> Academic Documents> Fast Approximate Similarity Search Based on Degree-Reduced Neighborhood Graphs

DOC PREVIEW

This preview shows page 1-2-3 out of 9 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 9 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 9 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 9 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 9 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Fast Approximate Similarity SearchBased on Degree-Reduced Neighborhood GraphsKazuo AoyamaNTT Communication ScienceLaboratories, NTT Corporation2-4, Hikaridai, Seika-cho,Soraku-gun, Kyoto, 619-0237,[email protected] SaitoUniversity of Shizuoka52-1, Yada, Suruga-ku,Shizuoka, Shizuoka,422-8526, [email protected] SawadaNTT Communication ScienceLaboratories, NTT Corporation2-4, Hikaridai, Seika-cho,Soraku-gun, Kyoto, 619-0237,[email protected] UedaNTT Communication ScienceLaboratories, NTT Corporation2-4, Hikaridai, Seika-cho,Soraku-gun, Kyoto, 619-0237,[email protected] paper presents a fast approximate similarity searchmetho d for finding the most similar object to a given queryobject from an object set with a dissimilarity with a successprobability exceeding a given value. As a search index, theproposed method utilizes a degree-r educed k-nearest neigh-b or (k-DR ) graph constructed from the object set with thedissimilarity, and explores the k-DR graph along its edgesusing a greedy search (GS ) algorithm starting from multi-ple initial vertices with parallel processing. In the graph-construction stage, the str uctur al parameter k of the k -DRgraph is determined so that the probability with which atleast one search trial of those w ith multiple initial verticessucceeds is more than the given success probability. To es-timate the greedy-search success probability, we introducethe concept of a basin in the k-DR graph. The experimentalresults on a real data set verify the approximation schemeand high search performance of the proposed method anddemonstrate that it is superior to E2LSH in terms of theexp ected search cost.Categories and Subject DescriptorsI.2 [Artificial Intelligence]: Problem Solving, ControlMetho ds, and Search—graph and tree search strategies;H.3 [Information Storage and Retrieval]: InformationSearch and Retrieval—search processGeneral TermsAlgorithms, Performance, TheoryPermission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.KDD’11, August 21–24, 2011, San Diego, California, USA.Copyright 2011 ACM 978-1-4503-0813-7/11/08 ...$10.00.KeywordsSimilarity search, Index structure, Neighborho od graph,Approximate algorithm1. INTRODUCTIONA similarity search that finds valid objects similar to agiven quer y object has been a fundamental technique formany applications in a variety of fields such as data min-ing, pattern recognition, computational geometry, compu-tational biology, and statistics [5, 9]. In these fields, someapplications r equir e to obtain a solution quickly, even if it isnot always exact, rather than an exact solution after a longcomputation time. An exact similarity search, in particular,is often a time-consuming task when employing a large-scaledata set with medium to high intrinsic dimensionality esti-mated based on certain definitions [5, 12, 14]. In such cases,an approximate similarity search is a promising approachto improving search efficiency, although this comes at theexp ense of degrading the quality of the search result.Of the many approaches to achieving an approximate sim-ilarity search, we focus on a class that gives probabilisticguarantees of the quality of the search results. A representa-tive approach from this class is the locality-sensitive hashing(LSH) scheme [1, 2, 10, 21]. The LSH scheme performs a fastsimilarity search w ith a given success probability by reduc-ing the search space. More specifically, the neighborhoods ofa query object in an or iginal space are embedded in bucketsrepresented by an identical sequence of hash values obtainedby a set of hash functions such that the collision probabilityis much higher for objects that are close to each other thanfor those that are far apart. A small set of buckets with hashvalues identical to those of the query object corresponds tothe search space. The LSH scheme is an efficient technique,but it is only applicable to limited metric spaces such as aEuclidean space with a limited distance measure.There is a graph-based approach [8, 11, 17, 20] for simi-larity searches, which is applicable to various types of spacesb ecause it employs a graph representation of a given data set1055with a dissimilarity. Experimental results obtained with thisapproach have been reported for large-scale document datasets with high intrinsic dimensionality [3], high-dimensionalimage data sets [19], and a non-metric space consisting ofa Gaussian mixture model set and a Kullback-Leibler di-vergence as a dissimilarity [4]. However, the graph-basedapproach has two problems. First, the task of construct-ing a gr aph as a search index from a given data set witha dissimilarity incurs a high computational cost. For in-stance, the computational complexity of constructing a k-nearest neighbor (k-NN ) graph is O(n2) with a brute-forceapproach, where n denotes the number of objects in the dataset. The second major problem is that the accuracy of thesearch result is unclear because a graph-based approach isheuristic. The former problem of constructing a k-NN gr aphis equivalent to the all-k-nearest-neighbor problem, to whichconsiderable research efforts have been devoted [6, 18]. Wecan utilize their results for constructing the graph index.The latter problem has not really been studied. To improvethe availability of the graph-based approach, we tackle theproblem of how to control the search accuracy.This paper presents a fast approximate similarity searchmetho d based on a neighborhood-graph index. The pro-p osed method has two main algorithms: one is for construct-ing a graph as a search index and the other for searchingon the graph. The graph-construction algorithm builds asp ecial neighborhood graph, which is called a degree-reducedk-nearest neighbor (k-DR ) graph, from a given data set witha dissimilarity. As the search algorithm, we adopt a greedysearch (GS ) algorithm that efficiently finds the vertex clos-est from a query vertex by exploring the k-DR graph alongits edges. We determine the graph structural parameter kin the graph-construction stage so that the GS


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3 out of 9 pages.

Please select your school