U of M CSCI 8715 - Optimal Multi-Step k-Nearest Neighbor Search - D2166655

Home> Schools> University of Minnesota- Twin Cities> Computer Science (CSCI) > CSCI 8715> Optimal Multi-Step k-Nearest Neighbor Search

DOC PREVIEW

U of M CSCI 8715 - Optimal Multi-Step k-Nearest Neighbor Search

School name University of Minnesota- Twin Cities

Course Csci 8715- Spatial Databases and Applications

Pages 12

This preview shows page 1-2-3-4 out of 12 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Optimal Multi-Step k-Nearest Neighbor Search Thomas Seidl Hans-Peter Kriegel University of Munich, Germany University of Munich, Germany Institute for Computer Science Institute for Computer Science http://www.dbs.informatik.uni-muenchen.de http://www.dbs.informatik.uni-muenchen.de seidlBdbs.informatik.uni-muenchen.de [email protected] Abstract For an increasing number of modern database applica- tions, efficient support of similarity search becomes an important task. Along with the complexity of the objects such as images, molecules and mechanical parts, also the complexity of the similarity models increases more and more. Whereas algorithms that are directly based on in- dexes work well for simple medium-dimensional similar- ity distance functions, they do not meet the efficiency re- quirements of complex high-dimensional and adaptable distance functions. The use of a multi-step query process- ing strategy is recommended in these cases, and our in- vestigations substantiate that the number of candidates which are produced in the filter step and exactly evalu- ated in the refinement step is a fundamental efficiency parameter. After revealing the strong performance shortcomings of the state-of-the-art algorithm for k-nearest neighbor search [Korn et al. 19961, we present a novel multi-step algorithm which is guaranteed to pro- duce the minimum number of candidates. Experimental evaluations demonstrate the significant performance gain over the previous solution, and we observed average improvement factors of up to 120 for the number of can- didates and up to 48 for the total runtime. 1 INTRODUCTION More and more applications of database systems require the efficient support of similarity search. Examples include molecular biology [BMH 921, medical imaging [Kor+ 961, Permission to make digital or hard copies of all or part of this work for personal or classroom we is granted without fee provided that copiee e.re not made or distributed for profit or commercial sdvan- tsge and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD ‘96 Seattle, WA, USA 0 1996 ACM 0-69791-995-5/96/006...$5.00 CAD/CAM systems (BK 971, and multimedia databases [Fal+ 941 [Haf+ 951 [SK 971 among many others [Jag 911 [AFS 931 [GM 931 [FRM 941 [ALSS 951. In all of these ap- proaches, similarity is defined in terms of a more or less com- plex similarity distance function. The smaller the similarity distance value, the more similar are two objects. Typical que- ry types are the similarity range query which is specified by a query object and a similarity distance range [0, E], and the k-nearest neighbor query which is specified by a query object and a number k for the k most similar objects to be retrieved. Whereas single-step algorithms for similarity search al- ready meet the requirements of very large databases, these so- lutions suffer from the increasing complexity of the objects and of the similarity distance functions. For classic spatial queries such as point queries and region queries, multi-step algorithms have been developed to efficiently support com- plex objects [OM 881 [BHKS 931. The paradigm of multi- step query processing has already been extended to complex similarity search, and available algorithms aim at similarity range queries [AFS 931 [FRM 941 and k-nearest neighbor queries [Kor+ 961. However, we observed a bad performance of the latter solution in our experiments on large image and biomolecular databases. Starting from a theoretical analysis of the situation, we develop a novel, optimal multi-step algo- rithm for k-nearest neighbor search that implies a minimum number of exact object distance evaluations. The paper is organized as follows: In the remainder of this introduction, we specify our problem of complex similarity search. Section 2 is dedicated to algorithms for similarity search and incremental similarity ranking that directly work on index structures in a way they are employed by our new method. In section 3, we present the available multi-step al- gorithm for k-nearest neighbor search of [Kor+ 961 including the significant efficiency shortcomings of the solution. Ex- periments substantiate that the number of candidates is a fun- damental efficiency parameter. We present our novel algo- rithm in section 4 along with a proof that it exactly generates the minimum number of candidates. The experimental eval- uation in section 5 demonstrates the substantial performance improvement before the paper is concluded in section 6. 1541.1 Simple and Complex Similarity Distance Functions Suppose the simple case that the objects of interest are rep- resented by low- or medium-dimensional feature vectors. Then, the similarity distance of two objects is typically de- fined by an appropriate distance function of the points in the feature space such as the Euclidean distance, for instance. Ex- amples include the section coding approach [BK 971, angular profiles [BMH 921, and 2-D contour features [GM 931. For such pure feature-based similarity models, single-step system architectures are appropriate: While managing the feature points by a multidimensional access method, query process- ing is performed by one of the k-nearest neighbor search al- gorithms that are available from the literature (cf. section 2). Some of the algorithms for similarity search are restricted to similarity models which are completely defined by a dis- tance function of low- or medium-dimensional feature vec- tors. In practice, however, complex similarity distance func- tions occur that may not be represented by a simple feature vector distance or that are too high in their dimensionality as they could be efficiently managed by multidimensional index structures. Examples include the Max-Morphological Dis- tance [ Kor+ 961, the approximation-based similarity of 3-D surface segments [KSS

View Full Document