Suffix ArraysCMSC 423Suffix Arrays•Even though Suffix Trees are O(n) space, the constant hidden by the big-Oh notation is somewhat “big”: ≈ 20 bytes / character in good implementations.•If you have a 10Gb genome, 20 bytes / character = 200Gb to store your suffix tree. “Linear” but large.•Suffix arrays are a more efficient way to store the suffixes that can do most of what suffix trees can do, but just a bit slower.•Slight space vs. time tradeoff.Example Suffix Array•Idea: lexicographically sort all the suffixes.•Store the starting indices of the suffixes in an array.s = attcatg$attcatg$ttcatg$tcatg$catg$atg$tg$g$$12345678$atg$attcatg$catg$g$tcatg$tg$ttcatg$85147362suffix of sindex of suffixsort the suffixes alphabeticallythe indices just “come along for the ride”Example Suffix Array•Idea: lexicographically sort all the suffixes.•Store the starting indices of the suffixes in an array.s = attcatg$attcatg$ttcatg$tcatg$catg$atg$tg$g$$1234567885147362suffix of sindex of suffixsort the suffixes alphabeticallythe indices just “come along for the ride”Another Example Suffix Array•Idea: lexicographically sort all the suffixes.•Store the starting indices of the suffixes in an array.s = cattcat$cattcat$attcat$ttcat$tcat$cat$at$t$$12345678$at$attcat$cat$cattcat$t$tcat$ttcat$86251743suffix of sindex of suffixsort the suffixes alphabeticallythe indices just “come along for the ride”Another Example Suffix Array•Idea: lexicographically sort all the suffixes.•Store the starting indices of the suffixes in an array.s = cattcat$cattcat$attcat$ttcat$tcat$cat$at$t$$1234567886251743suffix of sindex of suffixsort the suffixes alphabeticallythe indices just “come along for the ride”Search via Suffix Arrays•Does string “at” occur in s?•Binary search to find “at”.•What about “tt”?s = cattcat$86251743$at$attcat$cat$cattcat$t$tcat$ttcat$√Counting via Suffix Arrays•How many times does “at” occur in the string?•All the suffixes that start with “at” will be next to each other in the array.•Find one suffix that starts with “at” (using binary search).•Then count the neighboring sequences that start with at.s = cattcat$$at$attcat$cat$cattcat$t$tcat$ttcat$86251743K-mer countingProblem: Given a string s, an integer k, output all pairs (b, i) such that b is a length-k substring of s that occurs exactly i times. 1. Build a suffix array.2. Walk down the suffix array, keeping a CurrentCount countIf the current suffix has length < k, skip itIf the current suffix starts with the same length-k string as the previous suffix:increment CurrentCountelseoutput CurrentCount and previous length-k suffix CurrentCount := 1Output CurrentCount & length-k suffix.$at$attcat$cat$cattcat$t$tcat$ttcat$86251743k = 2CurrentCount112 1 (at,2)21 (ca,2)1 (t$,1)1 (tc,1)1 (tt,1)Constructing Suffix Arrays•Easy O(n2 log n) algorithm: sort the n suffixes, which takes O(n log n) comparisons, where each comparison takes O(n).•There are several direct O(n) algorithms for constructing suffix arrays that use very little space.• The Skew Algorithm is one that is based on divide-and-conquer.•An simple O(n) algorithm: build the suffix tree, and exploit the relationship between suffix trees and suffix arrays (next slide)Relationship Between Suffix Trees & Suffix ArraysRed #s = starting position of the suffix ending at that leafEdges leaving each node are sorted by label (left-to-right).$∑ = {$,a,c,t}atcattcat$$tcat$$tcat$$tcat$cattcat$123456788s = 6251743Leaf labels left to right: 86251743Relationship Between Suffix Trees & Suffix ArraysRed #s = starting position of the suffix ending at that leafEdges leaving each node are sorted by label (left-to-right).$∑ = {$,a,c,t}atcattcat$$tcat$$tcat$$tcat$cattcat$123456788s = 6251743Leaf labels left to right: 86251743s = cattcat$$at$attcat$cat$cattcat$t$tcat$ttcat$86251743Recap•Suffix arrays can be used to search and count substrings.•Construction:•Easily constructed in O(n2 log n)•Simple algorithms to construct them in O(n) time using possibly O(n2) space.•More complicated algorithms to construct them in O(n) time using at most O(n) space.•More space efficient than suffix trees: just storing the original string + a list of
View Full Document