Robert Sedgewick and Kevin Wayne • Copyright © 2005 • http://www.Princeton.EDU/~cos2266. Strings2StringsString. Ordered list of characters.Ex. Natural languages, Java programs, genomic sequences, …."The digital information that underlies biochemistry, cell biology, anddevelopment can be represented by a simple string G's, A's, T's and C's.This string is the root data structure of an organism's biology." -M. V. Olson3Using Strings in JavaString concatenation. Append one string to end of another string.Substring. Extract a contiguous list of characters from a string.String s = "strings"; // s = "STRINGS"char c = s.charAt(2); // c = 'R'String t = s.substring(2, 7); // t = "RINGS"String u = s + t; // u = "STRINGSRINGS"S T R I N G S0 1 2 3 4 5 64String Implementation in JavaMemory. 28 + 2N bytes for virgin string!public final class String implements Comparable<String> { private char[] value; // characters private int offset; // index of first char into array private int count; // length of string private int hash; // cache of hashCode private String(int offset, int count, char[] value) { this.offset = offset; this.count = count; this.value = value; } public String substring(int from, int to) { return new String(offset + from, to - from, value); } . . .}could use byte array instead of String to save space5String vs. StringBuilderString. [immutable] Fast substring, slow concatenation.StringBuilder. [mutable] Slow substring, fast append.public static String reverse(String s) { String r = ""; for (int i = s.length() - 1; i >= 0; i--) r += s.charAt(i); return r;}quadratic timepublic static String reverse(String s) { StringBuilder r = new StringBuilder(""); for (int i = s.length() - 1; i >= 0; i--) r.append(s.charAt(i)); return r.toString();}linear timeRobert Sedgewick and Kevin Wayne • Copyright © 2005 • http://www.Princeton.EDU/~cos226Radix SortingReference: Chapter 13, Algorithms in Java, 3rd Edition, Robert Sedgewick.7Radix SortingRadix sorting.! Specialized sorting solution for strings.! Same ideas for bits, digits, etc.Applications.! Sorting strings.! Full text indexing.! Plagiarism detection.! Burrows-Wheeler transform. [see data compression]! Computational molecular biology.8An Application: Redundancy DetectorLongest repeated substring.! Given a string of N characters, find the longest repeated substring.! Ex: a a c a a g t t t a c a a g c! Application: computational molecular biology.Dumb brute force.! Try all indices i and j, and all match lengths k, and check.! O(W N3 ) time, where W is length of longest match.a a c a a g t t t a c a a g cijk k9An Application: Redundancy DetectorLongest repeated substring.! Given a string of N characters, find the longest repeated substring.! Ex: a a c a a g t t t a c a a g c! Application: computational molecular biology.Brute force.! Try all indices i and j for start of possible match, and check.! O(W N2 ) time, where W is length of longest match.ija a c a a g t t t a c a a g c10A Sorting SolutionSuffix sort.! Form N suffixes of original string.! Sort to bring longest repeated substrings together.a a c a a g t t t a c a a g ca c a a g t t t a c a a g cc a a g t t t a c a a g ca a g t t t a c a a g ca g t t t a c a a g cg t t t a c a a g ct t t a c a a g ct t a c a a g ct a c a a g ca c a a g cc a a g ca a g ca g cg cca a c a a g t t t a c a a g ca c a a g t t t a c a a g cc a a g t t t a c a a g ca a g t t t a c a a g ca g t t t a c a a g cg t t t a c a a g ct t t a c a a g ct t a c a a g ct a c a a g ca c a a g cc a a g ca a g ca g cg cc11String SortingNotation.! String = variable length sequence of characters.! W = max # characters per string.! N = # input strings.! R = radix.Java syntax.! Array of strings: String[] a! Number of strings: N = a.length! The ith string: a[i]! The dth character of the ith string: a[i].charAt(d)! Strings to be sorted: a[0], …, a[N-1]256 for extended ASCII, 65,536 for original UNICODE 12Suffix Sorting: Java ImplementationJava implementation.public class LRS { public static void main(String[] args) { String s = StdIn.readAll(); int N = s.length(); String[] suffixes = new String[N]; for (int i = 0; i < N; i++) suffixes[i] = s.substring(i, N); Arrays.sort(suffixes); System.out.println(lcp(suffixes)); }}read inputcreate suffixes(linear time)sort and find longest match(bottleneck)longest common prefix of adjacent strings% java LRS < mobydick.txt,- Such a funny, sporty, gamy, jesty, joky, hoky-poky lad, is the Ocean, oh! Th13String Sorting PerformanceQuicksort W N log N †9.5Worst Case Moby DickBrute W N236,000 §Suffix (sec)String SortW = max length of string.N = number of strings.1.2 million for Moby Dick§ estimate† probabilistic guarantee15Key Indexed CountingKey indexed counting.! Count frequencies of each letter. [0th character]d a ba d dc a bf a df e eb a dd a db e ef e db e de b ba c e01234567891011a 0b 2c 3d 1e 2f 1g 3int N = a.length;int[] count = new int[256+1];for (int i = 0; i < N; i++) { char c = a[i].charAt(d); count[c+1]++;}frequenciescountad = 016Key Indexed CountingKey indexed counting.! Count frequencies of each letter. [0th character]! Compute cumulative frequencies.for (int i = 1; i < 256; i++) count[i] += count[i-1]; cumulative countsa 0b 2c 5d 6e 8f 9gd a ba d dc a bf a df e eb a dd a db e ef e db e de b ba c e0123456789101112a counta 0b 2c 3d 1e 2f 1g 317Key Indexed CountingKey indexed counting.! Count frequencies of each letter. [0th character]! Compute cumulative frequencies.! Use cumulative frequencies to rearrange strings.a 0b 2c 5d 6e 8f 9a d da c eb a db e eb e dc a bd a bd a de b bf a df e ef e d01234567891011gd a ba d dc a bf a df e eb a dd a db e ef e db e de b ba c e0123456789101112for (int i = 0; i < N; i++) { char c = a[i].charAt(d); temp[count[c]++] = a[i];} rearrangetempa count30Key Indexed CountingKey indexed counting.! Count frequencies of each letter. [0th character]! Compute cumulative frequencies.! Use cumulative frequencies to rearrange strings.a 2b 5c 6d 8e 9f 1201234567891011a d da c eb a db e eb e dc a bd a bd a de b bf a df e ef e d01234567891011g 12for (int i = 0; i < N; i++) a[i] = temp[i]; copy backtempa counta d da c eb a db e eb e dc a bd a bd a de b bf a df e ef e d31LSD Radix SortLeast significant digit radix sort. Ancient method used for card-sorting.Lysergic Acid Diethylamide, Circa 1960 Card Sorter, Circa 196032LSD Radix
View Full Document