UMD CMSC 423 - Inexact alignment dynamic programming, gapped alignment - D2454003

Home> Schools> University of Maryland, College Park> Computer Science (CMSC) > CMSC 423> Inexact alignment dynamic programming, gapped alignment

UMD CMSC 423 - Inexact alignment dynamic programming, gapped alignment

School name University of Maryland, College Park

Course Cmsc 423- Bioinformatic Algorithms, Databases, and Tools

Pages 16

Download Save

Unformatted text preview:

CMSC423 Bioinformatic Algorithms Databases and Tools Lecture 9 inexact alignment dynamic programming gapped alignment CMSC423 Fall 2008 1 Recap CMSC423 Fall 2008 2 Global alignment recap Score i j is the maximum of 1 Score i 1 j 1 Value S1 i 1 S2 j 1 S1 i 1 S2 j 1 aligned 2 Score i 1 j Value S1 i S1 i aligned to gap 3 Score i j 1 Value S2 j S2 j aligned to gap G T C A G A C A G C G T A G AGCGTAG GTCAGAC Value A A 10 Value A G 5 Value A 2 3 Global alignment recap Score i j is the maximum of 1 Score i 1 j 1 Value S1 i 1 S2 j 1 S1 i 1 S2 j 1 aligned 2 Score i 1 j Value S1 i S1 i aligned to gap 3 Score i j 1 Value S2 j S2 j aligned to gap G T C A G A C 0 4 8 12 16 20 24 28 A 4 5 9 13 2 6 10 14 G 8 6 2 2 6 8 4 0 C 12 2 1 12 8 4 3 14 G 16 2 3 8 7 18 14 10 T 20 6 8 4 3 14 13 9 A 24 10 4 3 14 10 24 20 G 28 14 0 1 10 24 20 19 AG C GTAG GTCAG AC Value A A 10 Value A G 5 Value A 4 4 Local alignment recap Score i j is the maximum of 0 0 1 Score i 1 j 1 Value S1 i 1 S2 j 1 S1 i 1 S2 j 1 aligned 2 Score i 1 j Value S1 i S1 i aligned to gap 3 Score i j 1 Value S2 j S2 j aligned to gap G T C A G A C A G C G T A G AGCGTAG GTCAGAC Value A A 10 Value A G 5 Value A 2 5 Alignment scores CMSC423 Fall 2008 6 Where do the alignment scores come from PAM matrices PAM1 based on frequency of mutations between closely related proteins within 1 evolutionary step PAM 2 within 2 evolutionary steps PAM 250 commonly used BLOSUM matrices Frequency of mutations between proteins that are x similar BLOSUM100 based on proteins that are exactly the same e g score A A is defined but not score A G BLOSUM62 commonly used gap scores usually determined empirically CMSC423 Fall 2008 7 BLOSUM62 CMSC423 Fall 2008 8 Heuristics CMSC423 Fall 2008 9 Heuristics What if limit the of differences allowed E g we expect the sequences to be very similar Compute banded alignment stay within of differences k from the diagonal Optimal alignment cannot stray too far from diagonal k k O km running time and space What if we do not know k Do binary search to find it CMSC423 Fall 2008 10 Exclusion methods Assume P must match T with at most k errors Find places in T where P cannot match Split P into floor n k 1 sized chunks If P matches T with less than k errors at least one chunk matches with no errors Use any exact matching algorithm to find places where a chunk matches T then run dynamic programming in that vicinity Running time on average O m CMSC423 Fall 2008 11 Exclusion methods Putative alignment Text Pattern Exact match CMSC423 Fall 2008 12 Famous approaches FASTA Pearson et al Take all k mers substrings of length k from Pattern and identify whether and where they match in the Text Assume the k mer starting at pos n i in Pattern matches at position j in Text remember j i the diagonal on which the match occured Identify heavy diagonals diagonals where many k mers match then refine the diagonals with Smith Waterman Also look for off diagonal matches to account for gaps CMSC423 Fall 2008 13 Famous approaches BLAST Altschul et al Find short k mer matches Also search for possible inexact matches e g all k mers within 1 difference from current one Extend exact matches with Smith Waterman algorithm Assign probabilistic scores to matches what is the probability of finding a match with the same S W alignment score just by chance e g matching a random string CMSC423 Fall 2008 14 Chaining approach Extends the FASTA idea Search for exact matches Find the longest consistent chain of exact matches Fill in the gaps in the chain using Smith Waterman This is the approach used by MUMmer Delcher et al MUM maximally unique match see mummer sourceforge net CMSC423 Fall 2008 15 Chaining in 1 D Input multiple overlapping intervals on a line Output highest weight set of non overlapping intervals Weight could be length of interval or Smith Waterman score etc Sort the endpoints starts ends of the intervals For every interval j store V j best score of a chain ending in j MAX store highest V j seen sofar Process endpoints in increasing order of x coordinate If we encounter left end start of interval j V j weight j MAX If we encounter right end end of interval j MAX max V j MAX CMSC423 Fall 2008 Running time 16

View Full Document


School:
Email:
New Password:
Confirm Password:

UMD CMSC 423 - Inexact alignment dynamic programming, gapped alignment

Sign up for free to view:

Please select your school