DOC PREVIEW
UMD CMSC 423 - Suffix Arrays

This preview shows page 1-2-3-4 out of 13 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Suffix ArraysCMSC 423Suffix Arrays•Even though Suffix Trees are O(n) space, the constant hidden by the big-Oh notation is somewhat “big”: ≈ 20 bytes / character in good implementations.•If you have a 10Gb genome, 20 bytes / character = 200Gb to store your suffix tree. “Linear” but large.•Suffix arrays are a more efficient way to store the suffixes that can do most of what suffix trees can do, but just a bit slower.•Slight space vs. time tradeoff.Example Suffix Array•Idea: lexicographically sort all the suffixes.•Store the starting indices of the suffixes in an array.s = attcatg$attcatg$ttcatg$tcatg$catg$atg$tg$g$$12345678$atg$attcatg$catg$g$tcatg$tg$ttcatg$85147362suffix of sindex of suffixsort the suffixes alphabeticallythe indices just “come along for the ride”Example Suffix Array•Idea: lexicographically sort all the suffixes.•Store the starting indices of the suffixes in an array.s = attcatg$attcatg$ttcatg$tcatg$catg$atg$tg$g$$1234567885147362suffix of sindex of suffixsort the suffixes alphabeticallythe indices just “come along for the ride”Another Example Suffix Array•Idea: lexicographically sort all the suffixes.•Store the starting indices of the suffixes in an array.s = cattcat$cattcat$attcat$ttcat$tcat$cat$at$t$$12345678$at$attcat$cat$cattcat$t$tcat$ttcat$86251743suffix of sindex of suffixsort the suffixes alphabeticallythe indices just “come along for the ride”Another Example Suffix Array•Idea: lexicographically sort all the suffixes.•Store the starting indices of the suffixes in an array.s = cattcat$cattcat$attcat$ttcat$tcat$cat$at$t$$1234567886251743suffix of sindex of suffixsort the suffixes alphabeticallythe indices just “come along for the ride”Search via Suffix Arrays•Does string “at” occur in s?•Binary search to find “at”.•What about “tt”?s = cattcat$86251743$at$attcat$cat$cattcat$t$tcat$ttcat$√Counting via Suffix Arrays•How many times does “at” occur in the string?•All the suffixes that start with “at” will be next to each other in the array.•Find one suffix that starts with “at” (using binary search).•Then count the neighboring sequences that start with at.s = cattcat$$at$attcat$cat$cattcat$t$tcat$ttcat$86251743K-mer countingProblem: Given a string s, an integer k, output all pairs (b, i) such that b is a length-k substring of s that occurs exactly i times. 1. Build a suffix array.2. Walk down the suffix array, keeping a CurrentCount countIf the current suffix has length < k, skip itIf the current suffix starts with the same length-k string as the previous suffix:increment CurrentCountelseoutput CurrentCount and previous length-k suffix CurrentCount := 1Output CurrentCount & length-k suffix.$at$attcat$cat$cattcat$t$tcat$ttcat$86251743k = 2CurrentCount112 1 (at,2)21 (ca,2)1 (t$,1)1 (tc,1)1 (tt,1)Constructing Suffix Arrays•Easy O(n2 log n) algorithm: sort the n suffixes, which takes O(n log n) comparisons, where each comparison takes O(n).•There are several direct O(n) algorithms for constructing suffix arrays that use very little space.• The Skew Algorithm is one that is based on divide-and-conquer.•An simple O(n) algorithm: build the suffix tree, and exploit the relationship between suffix trees and suffix arrays (next slide)Relationship Between Suffix Trees & Suffix ArraysRed #s = starting position of the suffix ending at that leafEdges leaving each node are sorted by label (left-to-right).$∑ = {$,a,c,t}atcattcat$$tcat$$tcat$$tcat$cattcat$123456788s = 6251743Leaf labels left to right: 86251743Relationship Between Suffix Trees & Suffix ArraysRed #s = starting position of the suffix ending at that leafEdges leaving each node are sorted by label (left-to-right).$∑ = {$,a,c,t}atcattcat$$tcat$$tcat$$tcat$cattcat$123456788s = 6251743Leaf labels left to right: 86251743s = cattcat$$at$attcat$cat$cattcat$t$tcat$ttcat$86251743Recap•Suffix arrays can be used to search and count substrings.•Construction:•Easily constructed in O(n2 log n)•Simple algorithms to construct them in O(n) time using possibly O(n2) space.•More complicated algorithms to construct them in O(n) time using at most O(n) space.•More space efficient than suffix trees: just storing the original string + a list of


View Full Document

UMD CMSC 423 - Suffix Arrays

Documents in this Course
Midterm

Midterm

8 pages

Lecture 7

Lecture 7

15 pages

Load more
Download Suffix Arrays
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Suffix Arrays and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Suffix Arrays 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?