UCF COT 4810 - A Guided Tour to Approximate String Matching - D621989

Home> Schools> University of Central Florida> Computer Theory (COT) > COT 4810> A Guided Tour to Approximate String Matching

DOC PREVIEW

UCF COT 4810 - A Guided Tour to Approximate String Matching

School name University of Central Florida

Course Cot 4810- Topics in Computer Science

Pages 58

This preview shows page 1-2-3-4-27-28-29-30-55-56-57-58 out of 58 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 58 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

A Guided Tour to Approximate String MatchingGONZALO NAVARROUniversity of ChileWe survey the current techniques to cope with the problem of string matching thatallows errors. This is becoming a more and more relevant issue for many fast growingareas such as information retrieval and computational biology. We focus on onlinesearching and mostly on edit distance, explaining the problem and its relevance, itsstatistical behavior, its history and current developments, and the central ideas of thealgorithms and their complexities. We present a number of experiments to compare theperformance of the different algorithms and show which are the best choices. Weconclude with some directions for future work and open problems.Categories and Subject Descriptors: F.2.2 [Analysis of algorithms and problemcomplexity]: Nonnumerical algorithms and problems—Pattern matching,Computations on discrete structures; H.3.3 [Information storage and retrieval]:Information search and retrieval—Search processGeneral Terms: AlgorithmsAdditional Key Words and Phrases: Edit distance, Levenshtein distance, online stringmatching, text searching allowing errors1. INTRODUCTIONThis work focuses on the problem of stringmatching that allows errors, also calledapproximate string matching. The generalgoal is to perform string matching of a pat-tern in a text where one or both of themhave suffered some kind of (undesirable)corruption. Some examples are recoveringthe original signals after their transmis-sion over noisy channels, finding DNA sub-sequences after possible mutations, andtext searching where there are typing orspelling errors.Partially supported by Fondecyt grant 1-990627.Author’s address: Department of Computer Science, University of Chile, Blanco Erncalada 2120, Santiago,Chile, e-mail: [email protected] to make digital or hard copies of part or all of this work for personal or classroom use is grantedwithout fee provided that copies are not made or distributed for profit or direct commercial advantage andthat copies show this notice on the first page or initial screen of a display along with the full citation.Copyrights for components of this work owned by others than ACM must be honored. Abstracting withcredit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use anycomponent of this work in other works, requires prior specific permission and/or a fee. Permissions maybe requested from Publications Dept, ACM Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212)869-0481, or [email protected]°2001 ACM 0360-0300/01/0300-0031 $5.00The problem, in its most general form,is to find a text where a text given pat-tern occurs, allowing a limited number of“errors” in the matches. Each applicationuses a different error model, which defineshow different two strings are. The idea forthis “distance” between strings is to makeit small when one of the strings is likely tobe an erroneous variant of the other underthe error model in use.The goal of this survey is to presentan overview of the state of the art in ap-proximate string matching. We focus ononline searching (that is, when the textACM Computing Surveys, Vol. 33, No. 1, March 2001, pp. 31–88.32 G. Navarrocannot be preprocessed to build an in-dex on it), explaining the problem and itsrelevance, its statistical behavior, its his-tory and current developments, and thecentral ideas of the algorithms and theircomplexities. We also consider some vari-ants of the problem. We present a num-ber of experiments to compare the per-formance of the different algorithms andshow the best choices. We conclude withsome directions for future work and openproblems.Unfortunately, the algorithmic nature ofthe problem strongly depends on the typeof “errors” considered, and the solutionsrange from linear time to NP-complete.The scope of our subject is so broad that weare forced to narrow our focus on a subsetof the possible error models. We consideronly those defined in terms of replacingsome substrings by others at varying costs.In this light, the problem becomes mini-mizing the total cost to transform the pat-tern and its occurrence in text to makethem equal, and reporting the text posi-tions where this cost is low enough.One of the best studied cases of this er-ror model is the so-called edit distance,which allows us to delete, insert and sub-stitute simple characters (with a differentone) in both strings. If the different oper-ations have different costs or the costs de-pend on the characters involved, we speakof general edit distance. Otherwise, if allthe operations cost 1, we speak of simpleedit distance or just edit distance (ed). Inthis last case we simply seek for the min-imum number of insertions, deletions andsubstitutions to make both strings equal.For instance ed ("survey,""surgery") =2. The edit distance has received a lotof attention because its generalized ver-sion is powerful enough for a wide rangeof applications. Despite the fact thatmost existing algorithms concentrate onthe simple edit distance, many of themcan be easily adapted to the generalizededit distance, and we pay attention tothis issue throughout this work. More-over, the few algorithms that exist forthe general error model that we con-sider are generalizations of edit distancealgorithms.On the other hand, most of the algo-rithms designed for the edit distance areeasily specialized to other cases of inter-est. For instance, by allowing only in-sertions and deletions at cost 1, we cancompute the longest common subsequence(LCS) between two strings. Another sim-plification that has received a lot of atten-tion is the variant that allows only substi-tutions (Hamming distance).An extension of the edit distance en-riches it with transpositions (i.e. a sub-stitution of the form ab → ba at cost 1).Transpositions are very important in textsearching applications because they aretypical typing errors, but few algorithmsexist to handle them. However, many algo-rithms for edit distance can be easily ex-tended to include transpositions, and wekeep track of this fact in this work.Since the edit distance is by far thebest studied case, this survey focuses ba-sically on the simple edit distance. How-ever, we also pay attention to extensionssuch as generalized edit distance, trans-positions and general substring substitu-tion, as well as to simplifications such asLCS and Hamming distance. In addition,we also pay attention to some

View Full Document