UMD CMSC 423 - Lecture 7

CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 7
Exact string matching
Suffix trees
Suffix arrays

CMSC423 Fall 2008

Basic idea: 1-D dynamic programming

Can Z[i] be computed with the help of Z[j] for j < i?

Assume there exists j < i such that j + Z[j] - 1 > i. Then Z[i - j + 1] provides information about Z[i].
If there is no such j, simply compare characters T[i..] to T[0..], since they have not been seen before.

[Figure: positions j and i in T, the match of length Z[j], and the corresponding prefix position i - j + 1]

Three cases

Let j < i be the coordinate that maximizes j + Z[j] - 1 (intuitively, the Z[j] that extends the furthest).

I.   Z[i - j + 1] < Z[j] - i + j - 1  =>  Z[i] = Z[i - j + 1]
II.  Z[i - j + 1] > Z[j] - i + j - 1  =>  Z[i] = Z[j] - i + j - 1
III. Z[i - j + 1] = Z[j] - i + j - 1  =>  Z[i] = ??, compare from i + Z[i - j + 1]

[Figure: one diagram per case, each marking j, i, i - j + 1 and Z[j]]

Time complexity analysis

• Why do these tricks save us time?
1. Cases I and II take constant time per Z-value computed, so the total time spent in these cases is O(n).
2. Case III might involve 1 or more comparisons per Z-value. However:
   - every successful comparison (match) shifts the rightmost character that has been visited
   - every unsuccessful comparison terminates the "round" and the algorithm moves on to the next Z-value
   so the total time spent in case III cannot be more than the number of characters in the text.
Overall running time is O(n).

Space complexity?

• If using the Z algorithm for matching, how many Z-values do we need to store?
[Figure: the concatenated string P$T, the pattern followed by the separator $ and the text]
• Only need to remember the Z-values for the pattern and the "farthest reaching Z-value" (Z[j] in what we discussed before).

Z algorithm, not just for matching

• Lempel-Ziv compression (e.g. gzip)
• Note: other exact matching algorithms are used for data compression (e.g. the Burrows-Wheeler transform relates to suffix arrays)
[Figure: a text of length n with the span from i to i + Z[i] - 1 marked]
If Z[i] = 0, just send/store the character T[i]; otherwise, instead of sending T[i..i+Z[i]-1] (Z[i] characters/bytes), simply send Z[i] (one number).

Knuth-Morris-Pratt algorithm

Given a Pattern and a Text, preprocess the Pattern to compute
sp[i] = length of the longest prefix of P that matches a proper suffix of P[0..i]

[Figure: pattern P with sp[i] marked, aligned against text T at positions i and j, and the shifted pattern P']

Compare P with T until finding a mismatch (at coordinate i + 1 in P and j + 1 in T). Shift P such that its first sp[i] characters match T[j - sp[i] + 1 .. j]. Continue matching from T[j+1], P[sp[i]+1].

Boyer-Moore algorithm

Preprocess the pattern, computing, for every i, L[i] = largest coordinate < n such that P[i..n] matches a suffix of P[1..L[i]] (inverted Z function).

[Figure: pattern P with i and L[i] marked, aligned against text T at position j, and the shifted pattern P']

Match the pattern backwards (starting at the right) until a mismatch. Shift the pattern such that P[L[i] - n + i + 1] matches at T[j]. Repeat.
Bad character rule: find character T[j - 1] in P and shift until it matches. Choose the longest shift (between the suffix and character rules).

Suffix trees

Intro to suffix trees

• Used in fast exact matching
• Basic idea: extend a trie, a structure for storing multiple strings
[Figure: trie storing the strings their, there, was, when, with edge labels t, he, ir, re and w, as, hen]

Suffix tree

• Extends the trie of all suffixes of a string
  1 ATCATG
  2 TCATG
  3 CATG
  4 ATG
  5 TG
  6 G
[Figure: suffix tree of ATCATG with edge labels AT, T, CATG, G and leaves 4, 1, 6, 5, 2, 3]

Suffix tree ...cont

• To store in linear space, just store a range in the sequence instead of a string
• To ensure suffixes end at leaves, add a $ character at the end of the string
• ATCATG$

root
  AT (1,2)
    G$ (6,7)      leaf 4
    CATG$ (3,7)   leaf 1
  G$ (6,7)        leaf 6
  T (2,2)
    G$ (6,7)      leaf 5
    CATG$ (3,7)   leaf 2
  CATG$ (3,7)     leaf 3
  $ (7,7)         leaf 7

Suffix links

• Link every node labeled aS for some string S to the node labeled S (note: it always exists)
[Figure: the suffix tree of ATCATG$ with suffix links added]

Suffix trees for matching

• Suffix trees use O(n) space
• Suffix trees can be constructed in O(n) time
• Is CAT part of ATCATG?
• Match from the root, character by character
• If we run out of query, we have found a match
• Otherwise, there is no match
• Intuition: CAT is the prefix of some suffix
[Figure: the suffix tree of ATCATG$]

Suffix links: useful for substring matches

• Does any part of AGATG match string AGCAGT?
[Figure: suffix tree of AGCAGT$, with edge labels AG (1,2), G (2,2), CAGT$ (3,7), T$ (6,7) ...]
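
The Z-value computation and its use for exact matching, covered at the start of the lecture, can be sketched in Python roughly as follows. This is a minimal illustration, not the course's reference implementation. It assumes 0-based indexing and a half-open interval [left, right) for the match that extends furthest, so the index offsets differ by one from the slide notation. The names z_array and find_occurrences are hypothetical. For simplicity it stores every Z-value rather than only those of the pattern.

```python
def z_array(s):
    """Return z where z[i] is the length of the longest substring of s
    starting at i that matches a prefix of s (0-based; z[0] = len(s))."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    # s[left:right] is the previously found match that extends furthest right.
    left, right = 0, 0
    for i in range(1, n):
        if i < right:
            # Position i lies inside that match, so reuse the Z-value of the
            # corresponding prefix position, capped at the end of the match.
            z[i] = min(z[i - left], right - i)
        # Extend by explicit comparisons. When the reused value is strictly
        # inside the match (case I) this loop stops immediately.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def find_occurrences(pattern, text, sep="$"):
    """Return the 0-based start positions of pattern in text, using the Z
    array of pattern + sep + text. Assumes sep occurs in neither string."""
    if not pattern:
        return []
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    m = len(pattern)
    z = z_array(pattern + sep + text)
    # A Z-value equal to len(pattern) marks a full occurrence of the pattern.
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]


if __name__ == "__main__":
    print(z_array("aabcaabxaaz"))
    print(find_occurrences("CAT", "ATCATG"))
```

With the lecture's example, find_occurrences("CAT", "ATCATG") reports position 2 (0-based), the same occurrence that the suffix-tree walk from the root finds.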

