Fast Algorithms for Sorting and Searching Strings

Home> Academic Documents> Fast Algorithms for Sorting and Searching Strings

DOC PREVIEW

This preview shows page 1-2-3 out of 10 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Fast Algorithms for Sorting and Searching StringsJon L. Bentley* Robert Sedgewick#AbstractWe present theoretical algorithms for sorting andsearching multikey data, and derive from them practical Cimplementations for applications in which keys are charac-ter strings. The sorting algorithm blends Quicksort andradix sort; it is competitive with the best known C sortcodes. The searching algorithm blends tries and binarysearch trees; it is faster than hashing and other commonlyused search methods. The basic ideas behind the algo-rithms date back at least to the 1960s, but their practicalutility has been overlooked. We also present extensions tomore complex string problems, such as partial-matchsearching.1. IntroductionSection 2 briefly reviews Hoare’s [9] Quicksort andbinary search trees. We emphasize a well-known isomor-phism relating the two, and summarize other basic facts.The multikey algorithms and data structures are pre-sented in Section 3. Multikey Quicksort orders a set of nvectors with k components each. Like regular Quicksort, itpartitions its input into sets less than and greater than agiven value; like radix sort, it moves on to the next fieldonce the current input is known to be equal in the givenfield. A node in a ternary search tree represents a subset ofvectors with a partitioning value and three pointers: one tolesser elements and one to greater elements (as in a binarysearch tree) and one to equal elements, which are then pro-cessed on later fields (as in tries). Many of the structuresand analyses have appeared in previous work, but typicallyas complex theoretical constructions, far removed frompractical applications. Our simple framework opens thedoor for later implementations.The algorithms are analyzed in Section 4. Many of theanalyses are simple derivations of old results.Section 5 describes efficient C programs derived fromthe algorithms. The first program is a sorting algorithm__________________* Bell Labs, Lucent Technologies, 700 Mountain Avenue, Murray Hill,NJ 07974; [email protected].# Princeton University, Princeton, NJ, 08544; [email protected] is competitive with the most efficient string sortingprograms known. The second program is a symbol tableimplementation that is faster than hashing, which is com-monly regarded as the fastest symbol table implementa-tion. The symbol table implementation is much morespace-efficient than multiway trees, and supports moreadvanced searches.In many application programs, sorts use a Quicksortimplementation based on an abstract compare operation,and searches use hashing or binary search trees. These donot take advantage of the properties of string keys, whichare widely used in practice. Our algorithms provide a nat-ural and elegant way to adapt classical algorithms to thisimportant class of applications.Section 6 turns to more difficult string-searching prob-lems. Partial-match queries allow ‘‘don’t care’’ characters(the pattern ‘‘so.a’’, for instance, matches soda and sofa).The primary result in this section is a ternary search treeimplementation of Rivest’s partial-match searching algo-rithm, and experiments on its performance. ‘‘Near neigh-bor’’ queries locate all words within a given Hamming dis-tance of a query word (for instance, code is distance 2from soda). We give a new algorithm for near neighborsearching in strings, present a simple C implementation,and describe experiments on its efficiency.Conclusions are offered in Section 7.2. BackgroundQuicksort is a textbook divide-and-conquer algorithm.To sort an array, choose a partitioning element, permutethe elements such that lesser elements are on one side andgreater elements are on the other, and then recursively sortthe two subarrays. But what happens to elements equal tothe partitioning value? Hoare’s partitioning method isbinary: it places lesser elements on the left and greater ele-ments on the right, but equal elements may appear oneither side.Algorithm designers have long recognized the desir-ability and difficulty of a ternary partitioning method.Sedgewick [22] observes on page 244: ‘‘Ideally, we wouldlike to get all [equal keys] into position in the file, with all- 2 -the keys with a smaller value to their left, and all the keyswith a larger value to their right. Unfortunately, noefficient method for doing so has yet been devised....’’Dijkstra [6] popularized this as ‘‘The Problem of the DutchNational Flag’’: we are to order a sequence of red, whiteand blue pebbles to appear in their order on Holland’sensign. This corresponds to Quicksort partitioning whenlesser elements are colored red, equal elements are white,and greater elements are blue. Dijkstra’s ternary algorithmrequires linear time (it looks at each element exactly once),but code to implement it has a significantly larger constantfactor than Hoare’s binary partitioning code.Wegner [27] describes more efficient ternary partition-ing schemes. Bentley and McIlroy [2] present a ternarypartition based on this counterintuitive loop invariant:= < ? > =a b c dThe main partitioning loop has two inner loops. The firstinner loop moves up the index b: it scans over lesser ele-ments, swaps equal elements to a, and halts on a greaterelement. The second inner loop moves down the index ccorrespondingly: it scans over greater elements, swapsequal elements to d, and halts on a lesser element. Themain loop then swaps the elements pointed to by b and c,increments b and decrements c, and continues until b andc cross. (Wegner proposed the same invariant, but main-tained it with more complex code.) Afterwards, the equalelements on the edges are swapped to the middle of thearray, without any extraneous comparisons. This code par-titions an n-element array using n − 1 comparisons.Quicksort has been extensively analyzed by authorsincluding Hoare [9], van Emden [26], Knuth [11], andSedgewick [23]. Most detailed analyses involve the har-monic numbers Hn=Σ1≤i≤n1/ i.Theorem 1. [Hoare] A Quicksort that partitionsaround a single randomly selected element sorts n dis-tinct items in 2nHn+ O(n)∼∼1. 386 n lg n expectedcomparisons.A common variant of Quicksort partitions around themedian of a random sample.Theorem 2. [van Emden] A Quicksort that partitionsaround the median of 2t + 1 randomly selected ele-ments sorts n distinct items in 2nHn/ (H2t + 2− Ht + 1)+ O(n) expected comparisons.By increasing t, we can push the


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3 out of 10 pages.

Please select your school