Unformatted text preview:

CMSC423 Bioinformatic Algorithms Databases and Tools Alignment heuristics CMSC423 Fall 2009 1 Heuristics What if limit the of differences allowed E g we expect the sequences to be very similar Compute banded alignment stay within of differences k from the diagonal Optimal alignment cannot stray too far from diagonal k k O km running time and space What if we do not know k Do binary search to find it CMSC423 Fall 2009 2 Exclusion methods Assume P must match T with at most k errors Find places in T where P cannot match Split P into floor n k 1 sized chunks If P matches T with less than k errors at least one chunk matches with no errors Use any exact matching algorithm to find places where a chunk matches T then run dynamic programming in that vicinity Running time on average O m CMSC423 Fall 2009 3 Exclusion methods Putative alignment Text Pattern Exact match CMSC423 Fall 2009 4 Famous approaches FASTA Pearson et al Take all k mers substrings of length k from Pattern and identify whether and where they match in the Text Assume the k mer starting at pos n i in Pattern matches at position j in Text remember j i the diagonal on which the match occured Identify heavy diagonals diagonals where many k mers match then refine the diagonals with Smith Waterman Also look for off diagonal matches to account for gaps CMSC423 Fall 2009 5 Famous approaches BLAST Altschul et al Find short k mer matches Also search for possible inexact matches e g all k mers within 1 difference from current one Extend exact matches with Smith Waterman algorithm Assign probabilistic scores to matches what is the probability of finding a match with the same S W alignment score just by chance e g matching a random string CMSC423 Fall 2009 6 Aside viewing alignments with dot plots axes two sequences genomes dots regions that match in the two genomes Chaining approach Extends the FASTA idea Search for exact matches Find the longest consistent chain of exact matches Fill in the gaps in the chain using Smith Waterman This is the approach used by MUMmer Delcher et al MUM maximally unique match see mummer sourceforge net CMSC423 Fall 2009 8 Chaining in 1 D Input multiple overlapping intervals on a line Output highest weight set of non overlapping intervals Weight could be length of interval or Smith Waterman score etc CMSC423 Fall 2009 9 Chaining in 1D Basic idea dynamic programming V j weight of best chain ending with interval j V j maxk j intervals k j do not overlap V k weight j i e find all possible ways of building a chain ending at j and pick the best one the key to all dynamic programming algorithms largest value in V array Where do we find the answer How do we find the actual chain backtracking O n 2 What is the running time CMSC423 Fall 2009 10 Chaining in 1D Sort the endpoints starts ends of the intervals For every interval j store V j best score of a chain ending in j MAX store highest V j seen sofar Process endpoints in increasing order of x coordinate If we encounter left end start of interval j V j weight j MAX If we encounter right end end of interval j MAX max V j MAX Running time O n log n from sorting CMSC423 Fall 2009 11 Chaining in 1 D Input multiple overlapping intervals on a line Output highest weight set of non overlapping intervals Weight could be length of interval or Smith Waterman score etc Rationale The pattern can have multiple inconsistent exact matches to the text we want to pick a longest consistent set T P CMSC423 Fall 2009 12 Path planning and dynamic programming One intuitive way to think about dynamic programming similar to finding shortest path between two points at each point ask what are all possible ways to get here pick the best shortest fastest etc NYC Harrisburg Frederick Philly Baltimore DC CMSC423 Fall 2009 13 Chaining in 2 D Easy to do in O n2 n of intervals View alignments as boxes All boxes in a chain must follow each other in a diagonal order i e the range of the x coordinates and y coordinates of any two boxes in a chain cannot overlap Similar to 1 D approach except at each step we must check if current box can extend any of the previously built chains V j maxall previous boxes k V k weight j More complex algorithm leads to O n log n running time CMSC423 Fall 2009 14 Suffix trees dynamic programming Idea find inexact seeds rather than exact matches Observation if two sequences match within x identity there must be some short subsequence that also matches with at least x identity e f i k l Why is this useful You can backtrack quickly if error rate exceeded short sequences will have to be almost perfect CMSC423 Fall 2009 15 suffix trees dynamic programming A3 2 1 1 N2 1 0 1 A1 0 1 2 0 1 2 3 A NN CMSC423 Fall 2009 16


View Full Document

UMD CMSC 423 - Alignment heuristics

Documents in this Course
Midterm

Midterm

8 pages

Lecture 7

Lecture 7

15 pages

Load more
Loading Unlocking...
Login

Join to view Alignment heuristics and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Alignment heuristics and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?