
CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 8
Sequence alignment: exact alignment, inexact alignment, dynamic programming, gapped alignment
CMSC423 Fall 2008

Suffix trees for matching
- Suffix trees use O(n) space.
- Suffix trees can be constructed in O(n) time.
- Is CAT part of ATCATG?
  - Match from the root, character by character.
  - If you run out of query, you have found a match; otherwise there is no match.
  - Intuition: CAT is the prefix of some suffix.
[Figure: suffix tree of ATCATG; edges are labeled with substrings and (start, end) index pairs, e.g. AT (1, 2), CATG (3, 7), T (2, 2), G (6, 7).]

Suffix links
- Useful for substring matches.
- Does any part of AGATG match the string AGCAGT?
[Figure: suffix tree of AGCAGT; edge labels include AG (1, 2), CAGT (3, 7), G (2, 2), T (6, 7).]

Other uses
- Finding repeats: internal nodes with multiple children.
  - DNA that occurs in multiple places in the genome.
- Longest common substring of two strings:
  - Build a suffix tree of both strings.
  - Find the lowest (deepest) internal node that has leaves from both strings.
  - Or build a suffix tree on one string and use suffix links to find the longest match.
- Note: the running time for matching is O(|Pattern|), not O(|Pattern| + |Text|), though O(|Text|) was spent in pre-processing.

Why do we care?
- Suffix trees are used for:
  - mapping reads to a genome (e.g. personal genomics)
  - comparing genomes (comparative genomics)
  - finding repeats
  - identifying genome signatures

Exact matching: what to expect on exams
- Build a suffix tree for a string.
- Answer some questions about one of the algorithms, e.g. for the Z algorithm: is it necessary that j be the farthest-reaching Z-value, or just any Z-value extending past i?
- Do something with the help of some of the algorithms, e.g. look for repeats that occur exactly twice, etc.

Suffix arrays
- Suffix trees are expensive: 20 bytes/base.
- Suffix arrays: lexicographically sort all suffixes. For ATCATG:

    Suffix   Start position
    ATCATG   1
    ATG      4
    CATG     3
    G        6
    TCATG    2
    TG       5

- Can quickly find the correct suffix through binary search.
- Note: much less space, but longer running time (incurs a log n term).

Suffix arrays and compression
Burrows-Wheeler transform
- Example: BANANA. Write out all rotations of BANANA$, sort them, and take the last column (the character before each suffix):

    Rotations   Sorted    Last column
    BANANA$     $BANANA   A
    ANANA$B     A$BANAN   N
    NANA$BA     ANA$BAN   N
    ANA$BAN     ANANA$B   B
    NA$BANA     BANANA$   $
    A$BANAN     NA$BANA   A
    $BANANA     NANA$BA   A

- BWT = ANNB$AA, which is then compressed.
- Note: characters in the last column occur in the same order as in the first column. This is useful for matching within the BWT.

BWT string matching
- Look for BANA.
- Start at the end of the pattern; match right to left.
  - Find the character in the rightmost column.
  - Identify the corresponding range in the first column.
  - Switch back to the last column.
- How do we know the first A in the pattern is the 2nd and 3rd A from the top of the matrix?
- Note: additional data is needed: the number of times each letter appears before every position of the BWT.

    Position   BWT   A B N $
    1          A     0 0 0 0
    2          N     1 0 0 0
    3          N     1 0 1 0
    4          B     1 0 2 0
    5          $     1 1 2 0
    6          A     1 1 2 1
    7          A     2 1 2 1

- Running time: O(len(P)) operations; each may cost O(log(len(T))).

Exact alignment recap
- Exact matching can be done efficiently: O(|Text| + |Pattern|).
- Key idea: preprocess the data to keep track of similar regions, then use that information to jump over places where no match can occur.
- Z, KMP, B-M
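
The suffix-array lookup from the "Suffix arrays" slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_suffix_array` and `find_with_suffix_array` are hypothetical, and the array is built by naively sorting the suffixes, which is far slower than the constructions used in practice. Positions are 1-based to match the slide.

```python
def build_suffix_array(text):
    """Return the 1-based start positions of all suffixes of text,
    sorted lexicographically (naive construction, for illustration only)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_with_suffix_array(text, suffix_array, pattern):
    """Binary-search the sorted suffixes for one that starts with pattern.

    Returns the 1-based start position of a match, or None if the
    pattern does not occur in text.
    """
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only the first len(pattern) characters of the suffix.
        prefix = text[suffix_array[mid] - 1:][:len(pattern)]
        if prefix < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(suffix_array) and text[suffix_array[lo] - 1:].startswith(pattern):
        return suffix_array[lo]
    return None


text = "ATCATG"
sa = build_suffix_array(text)
print(sa)                                       # suffix start positions in sorted order
print(find_with_suffix_array(text, sa, "CAT"))  # start position of CAT, if present
```

The binary search performs O(log n) comparisons, each of up to len(pattern) characters, which is where the extra log n term mentioned on the slide comes from.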
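
The right-to-left matching procedure from the "BWT string matching" slide can likewise be sketched in Python. The names `bwt` and `count_matches` are hypothetical, `$` is assumed to be an end marker that sorts before every letter, and the transform is built by sorting all rotations, which is only practical for short examples. The sketch stores the full table of "number of times each letter appears before every position", so each lookup is a direct index rather than the O(log(len(T))) step mentioned on the slide.

```python
from collections import Counter


def bwt(text, end="$"):
    """Burrows-Wheeler transform: last column of the sorted rotations of text + end."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_matches(last, pattern):
    """Count occurrences of pattern in the original text, given only its BWT."""
    counts = Counter(last)

    # first[c] = index of the first row whose first-column character is c.
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    # occ[c][i] = number of times c appears in last[:i].
    occ = {c: [0] for c in counts}
    for ch in last:
        for c in occ:
            occ[c].append(occ[c][-1] + (c == ch))

    lo, hi = 0, len(last)  # half-open range of candidate rows
    for ch in reversed(pattern):  # match right to left
        if ch not in first:
            return 0
        # Map occurrences of ch in the last column to rows in the first column.
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


last_column = bwt("BANANA")
print(last_column)
print(count_matches(last_column, "BANA"))
print(count_matches(last_column, "ANA"))
```

The `occ` table is what answers the slide's question about which A in the last column corresponds to which A in the first column: the k-th occurrence of a letter in the last column is the k-th row starting with that letter.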



UMD CMSC 423 - Lecture 8 Sequence alignment
