BU CS 565 - A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets - D1331627

Home> Schools> Boston University> Computer Science (CS) > CS 565> A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets

BU CS 565 - A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets

School name Boston University

Course Cs 565- Advanced Java Programming

Pages 12

Download Save

Unformatted text preview:

Fastlk?ap: A Fast Algorithm for Indexing, Data-Mining andVisualization of Traditional and Multimedia DatasetsChristos Faloutsos”AT&T Bell LaboratoriesMurray Hill, NJAbstractA very promising idea for fast searching in traditional andmultimedia databases is to map objects into points in k-dspace, using k feature-extraction functions, provided by adomain expert [25]. Thus, we can subsequently use highlyfine-tuned spatial access methods (SAMS), to answer severaltypes of queries, including the ‘Query By Example’ type(which translates to a range query); the ‘all pairs’ query(which translates to a spatial join[8]); the nearest-neighboror best-match query, etc.However, designing feature extraction functions can behard. It is relatively easier for a domain expert to assess thesimilarity/distance of two objects. Given only the distanceinformation though, it is not obvious how to map objectsinto points.This is exactly the topic of this paper. We describe a fastalgorithm to map objects into points in some k-dimensionalspace (k is user-defined), such that the dis-similarit ies arepreserved. There are two benefits from this mapping: (a)efficient ret rieval, in conjunction with a SAM, as discussedbefore and (b) visualization and data-mining: the objectscan now be plotted as points in 2-d or 3-d space, revealingpot ential clusters, correlations among attributes and otherregularities that data-mining is looklng for.We introduce an older method from pat te~n recognition,namely, MultLDirnensionral Scaling (MDS) [51]; althoughunsuitable for indexing,we use it as yardstick for ourmet hod. Then, we propose a much faster algorithm to solvethe problem in hand, while in addition it allows for indexing.Experiments on real and synthetic data indeed show thatthe proposed algorithm is significantly faster than MDS,(being linear, as opposed to quadratic, on the database size*On leave from Univ. of Maryland, College Park. This work..-was partially supported by the Institute of Systems Research andby the National Science Foundation underGrants No. CDR-8803012, EEC-94-02384, IRI-8958546 and IRI-9205273), withmatching funds from Empress Software Inc.and ThinkingMachines Inc.Permission to copy without fee all or part of this material isgranted provided that the copies are not made or distributed fordirect commercial advantage, the ACM copyright notice and thetitle of the publication and its date appear, and notice is giventhat copyin is by permission of the Association of Computing?Machinery. o copy otherwise, or to republish, requiresa fee and/or specific permission.SIGMOD’95,San Jose, CA USA@ 1995 ACM 0-89791-731 -6/95/0005 ..$3.50King-Ip (David) LinDept. of Computer ScienceUniv. of Maryland, College ParkN), while it manages to preserve distances and the overallstructure of the data-set.1IntroductionThe objective of this work is to provide a retrievaland visualization tool for large collections of traditionalas well as ‘exotic’ and multimedia datasets. Anexcellent idea, suggested by Jagadish [25], was torely on domain experts to derive k feature-extractionfunctions, thus mapping each object into a point ink-dimensional space. Then the problem is reduced tostoring, retrieving and displaying k-dimensional points,for which there is a plethora of algorithms available.However, it is not always easy to derive the abovefeature-extraction functions.Consider the case, eg.,of typed English words, where the distance functionis the editing distance (minimum number of insertion,deletions and substitutions to transform one stringto the other). It is not clear which the featuresshould be in this case. Similarly, in matching digitizedvoice excerpts, we typically have to do some time-warping [44], which makes it difficult to design feature-extraction functions.Overcoming these difficulties is exactly the motiva-tion behind this work. Generalizing the approach by Ja-gadish, we try to map objects into k-dimensional points,assuming that a domain expert has only provided uswith a distance/dis-similarity function D(*, *). Noticethat this setting includes the case of features, by usingeg., the Euclidean distance between two feature vectorsas the distance function between the corresponding ob-jects.Given such a set of objects and the distance function220, users would like (a) to find objects similar to agiven query object, (b) to find the pairs of objects thatare most similar to each other, as well as (c) to visualizethe distribution of objects into some appropriatelychosen space, in order to check for clusters and otherregularities.Next, we shall use the following terminology:Definition 1 The k-dimensional point Pi that corre-163spends to the object Oi, will be called ‘the image’ ofobject Oi. That is, P; ~ (z; ,1, z;, z, . . .,x~,~)Definition 2 The k-dimensional space containing the‘images’ will be called target space.Some of the applications that motivated the presentwork are listed next. Some distance functions are alsodescribed.Image and, in general, multimedia databases: Ina collection of shapes [25] we would like to findshapes similar to a give one; in a collection ofcolor images, we would like to find images withsimilar colors, shapes or texture [35]. There weused the Euclidean distance between appropriatelyselected feature vectors (color attributes, momentsof inertia for shape, etc.) Search-by-content ishighly desirable in multimedia databases, with audio(voice, music), video etc. [33]. For example, usersmight want to retrieve, music scores, or video clipsthat are similar to a target music score or videoclip, Once the similarity (or dis-similarity) functionhas been determined, our proposed method can beimmediately applied.Medical databases, where l-d objects (eg., ECGS),2-d images (eg., X-rays) and 3-d images (eg., MRIbrain scans) [5] are stored. Ability to retrievequickly past cases with similar symptoms would bevaluable for diagnosis, as well as for medical teachingand research purposes.Notice that the distancefunctions are complicated, typically requiring somewarping of the two images, to make sure thatthe anatomical structures (eg., bones) are properlyaligned, before we consider the differences [50]. Thiswarping makes it difficult to find features that wouldadequately describe each image (and therefore, mapit into a point in feature space).Time series, with, eg. financial data, such as stockprices, sales numbers etc., or scientific databases,with time series of sensor data, weather[11],geological,

View Full Document