
CMSC423: Bioinformatic Algorithms, Databases, and Tools
Lecture 12: chaining algorithms, multiple alignment
CMSC423 Fall 2008

Jobs
- Applied Predictive Technologies is looking for the best students (forwarded by Daniel Hackner).
- Focus on databases, not bioinformatics.

Chaining in 1-D
- Input: multiple overlapping intervals on a line.
- Output: the highest-weight set of non-overlapping intervals.
- The weight could be the length of the interval, a Smith-Waterman score, etc.
- Rationale: the pattern can have multiple inconsistent exact matches to the text; we want to pick a longest consistent set.
[Figure: matches between the text T and the pattern P]

Path planning and dynamic programming
- One intuitive way to think about dynamic programming: it is similar to finding the shortest path between two points.
- At each point, ask "what are all the possible ways to get here?" and pick the best (shortest, fastest, etc.).
[Figure: route map with NYC, Harrisburg, Frederick, Philly, Baltimore, DC]

Chaining in 1-D
- Sort the endpoints (starts and ends) of the intervals.
- For every interval $j$, store $V[j]$, the best score of a chain ending in $j$.
- $\mathrm{MAX}$ stores the highest $V[j]$ seen so far.
- Process the endpoints in increasing order of x-coordinate:
  - at the left end (start) of interval $j$: $V[j] = \mathrm{weight}(j) + \mathrm{MAX}$
  - at the right end (end) of interval $j$: $\mathrm{MAX} = \max(V[j], \mathrm{MAX})$
- Running time?

Chaining in 2-D
- Easy to do in $O(n^2)$, where $n$ is the number of intervals.
- View alignments as boxes.
- All boxes in a chain must follow each other in diagonal order, i.e., the ranges of the x-coordinates and of the y-coordinates of any two boxes in a chain cannot overlap.
- Similar to the 1-D approach, except that at each step we must check whether the current box can extend any of the previously built chains:
  $V[j] = \max_{\text{all previous boxes } k} V[k] + \mathrm{weight}(j)$
- A more complex algorithm leads to $O(n \log n)$ running time.

Multiple sequence alignment
- Simultaneously identify the relationship between multiple sequences. In the rows below, gaps are shown as spaces.

HBB_HUMAN    FFESFGDLSTPDAVMGNPKVKAHGKKVL GAFSDGLAHLDNLKGTF
HBB_HORSE    FFDSFGDLSNPGAVMGNPKVKAHGKKVL HSFGEGVHHLDNLKGTF
HBA_HUMAN    YFPHF DLS HGSAQVKGHGKKVA DALTNAVAHVDDMPNAL
HBA_HORSE    YFPHF DLS HGSAQVKAHGKKVG DALTLAVGHLDDLPGAL
MYG_PHYCA    KFDRFKHLKTEAEMKASEDLKKHGVTVL TALGAILKKKGHHEAEL
GLB5_PETMA   FFPKFKGLTTADQLKKSADVRWHAERII NAVNDAVASMDDTEKMS
LGB2_LUPLU   LFSFLKGTSEVP QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL

- Note: a multiple alignment implies a (not necessarily optimal) pairwise alignment between every two of the individual sequences, for example:

HBA_HUMAN    YFPHF DLS HGSAQVKGHGKKVA DALTNAVAHVDDMPNAL
HBA_HORSE    YFPHF DLS HGSAQVKAHGKKVG DALTLAVGHLDDLPGAL

Multiple alignment: formal definition
- $M$: a multiple sequence alignment for $s_1, \ldots, s_k$.
- $D(s_i, s_j)$: the optimal score of an alignment between $s_i$ and $s_j$.
- $d(s_i, s_j)$: the score of the alignment between $s_i$ and $s_j$ induced by $M$.
- Score of $M$: $d(M) = \sum_{\text{all pairs } s_i, s_j} d(s_i, s_j)$, also called the sum-of-pairs score.
- The optimal multiple alignment minimizes $d(M)$.
- Computing the optimal $d(M)$ is NP-hard.
- Note: in multiple alignment we think of distance rather than similarity.

But here's a solution
- Dynamic programming, e.g., for 3 sequences.
- $\mathrm{Score}[i, j, k]$: the score of the optimal alignment between $s_1[1..i]$, $s_2[1..j]$, and $s_3[1..k]$.
- Do DP as usual, taking the best over the seven predecessor cells (each sequence either contributes a character or a gap, and at least one contributes a character):
  $s[i, j, k] = \max \{\, s[i-1, j-1, k-1] + \mathrm{match}(s_1[i], s_2[j], s_3[k]),\ \ldots \,\}$
  (use $\min$ instead when the score is a distance, as in the formal definition).
[Figure: axes s1, s2, s3]

But it's expensive
- 3 sequences: need to fill in the cube, $O(n^3)$.
- $k$ sequences: a $k$-dimensional cube, $O(n^k)$ time and space.
- There are tricks that can help, similar to AI techniques for reducing the search space.
- Basic idea: if we can estimate the optimal score, we can prune the search space.
- Note: these are just heuristics, not guaranteed to work faster.

Alternative: approximation algorithm
- Can we efficiently compute a multiple alignment with a score that is not too bad?
- The Star method:
  - build all $k^2$ pairwise alignments, in $O(k^2 n^2)$ time;
  - pick the sequence $s_c$ that is closest to all other sequences, i.e., $\sum_{s_i} D(s_c, s_i)$ is minimal over all choices of $s_c$;
  - iteratively align each sequence to $s_c$.
- Theorem: the sum-of-pairs score of the star alignment is at most twice as big as the optimal multiple alignment score.

Iterative alignment

SC   YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL

- Take the sequences $s_i$ in order.
- Align $s_1$ with $s_c$; this results in gaps being inserted in both sequences:

SC   YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL
S1   YFPHFDLSHG AQVKG GKKVADALTNAVAHVDDMPNAL

- Align $s_2$ with $s_c$; if gaps must be inserted in $s_c$, insert them in the previously aligned sequences as well:

SC   YFPHF DLS HGSAQVKAHGKKVG DALTLAVGHLDDLPGAL
S1   YFPHF DLS HG AQVKG GKKVA DALTNAVAHVDDMPNAL
S2   FFPKFKGLTTADQLKKSADVRWHAERII NAVNDAVASMDDTEKMS

- And so on. Note: if the new gaps coincide with previously introduced gaps, there is no need to change the previously aligned sequences:

SC   YFPHF DLS HGSAQVKAHGKKVG DALTLAVGHLDDLPGAL
S1   YFPHF DLS HG AQVKG GKKVA DALTNAVAHVDDMPNAL
S2   FFPKFKGLTTADQLKKSADVRWHAERII NAVNDAVASMDDTEKMS
S3   LFSFLKGTSEVP QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL

Theorem: proof
- Theorem: the star alignment is 2-optimal.
- Assumption: distances obey the triangle inequality.
- Notation:
  - $d_{\mathrm{OPT}}(s_i, s_j)$: the score of the alignment between $s_i$ and $s_j$ within the optimal alignment;
  - $d_{\mathrm{STAR}}(s_i, s_j)$: the score of the alignment between $s_i$ and $s_j$ within the star alignment;
  - $D(s_i, s_j)$: the score of the optimal alignment between $s_i$ and $s_j$.
- Lower bound on the optimum. For each fixed $s_i$, $\sum_{s_j} D(s_i, s_j) \ge \sum_{s_j} D(s_c, s_j)$ by the choice of $s_c$, so summing over all $k$ choices of $s_i$:
  $\mathrm{OPT} = \sum_{s_i, s_j} d_{\mathrm{OPT}}(s_i, s_j) \ge \sum_{s_i, s_j} D(s_i, s_j) \ge k \sum_{s_i} D(s_i, s_c)$
- Upper bound on the star alignment. Every sequence is aligned optimally to $s_c$, so by the triangle inequality:
  $\mathrm{STAR} = \sum_{s_i, s_j} d_{\mathrm{STAR}}(s_i, s_j) \le \sum_{s_i, s_j} \left[ D(s_i, s_c) + D(s_j, s_c) \right] = \sum_{s_i, s_j} D(s_j, s_c) + \sum_{s_i, s_j} D(s_i, s_c) = 2k \sum_{s_i} D(s_i, s_c)$
- Therefore $\mathrm{STAR} \le 2 \cdot \mathrm{OPT}$. Q.E.D.
- Note: $\sum_{s_i} D(s_i, s_c)$ is the score optimized by the choice of $s_c$.
[Figure: s_c and s_j]
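The 1-D chaining sweep from the lecture can be sketched in Python as follows. This is a minimal sketch under some assumptions not stated in the slides: intervals are given as `(start, end, weight)` tuples with positive weights, intervals that merely touch at an endpoint count as overlapping, and the names `chain_1d`, `V`, and `pred` are illustrative. After the sort, which costs O(n log n), the sweep itself does constant work per endpoint.

```python
def chain_1d(intervals):
    """Return (best_score, chain) for a list of (start, end, weight) intervals.

    Assumes start <= end and positive weights. Intervals that share an
    endpoint are treated as overlapping (an assumption of this sketch).
    """
    events = []
    for j, (start, end, _weight) in enumerate(intervals):
        events.append((start, 0, j))  # kind 0 = left end; sorts before right ends at equal x
        events.append((end, 1, j))    # kind 1 = right end
    events.sort()

    V = [0] * len(intervals)          # V[j]: best score of a chain ending in interval j
    pred = [None] * len(intervals)    # predecessor of j in that chain, for traceback
    best, best_j = 0, None            # MAX from the lecture, plus the interval achieving it
    for _x, kind, j in events:
        if kind == 0:                 # left end: V[j] = weight(j) + MAX
            V[j] = intervals[j][2] + best
            pred[j] = best_j
        elif V[j] > best:             # right end: MAX = max(V[j], MAX)
            best, best_j = V[j], j

    chain = []
    j = best_j
    while j is not None:              # follow predecessors back to the start of the chain
        chain.append(j)
        j = pred[j]
    chain.reverse()
    return best, chain
```

For instance, `chain_1d([(0, 5, 5), (4, 10, 6), (6, 8, 2), (9, 12, 3)])` picks intervals 0, 2, and 3 for a total weight of 10.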
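The quadratic 2-D chaining recurrence, V[j] = max over previous boxes k of V[k] + weight(j), might look like this in Python. It is only a sketch of the simple O(n^2) version, not the O(n log n) algorithm the lecture mentions. The box representation `(x1, y1, x2, y2, weight)`, the strict "ends before it starts" test in both coordinates, and the name `chain_2d` are assumptions made for illustration.

```python
def chain_2d(boxes):
    """Return (best_score, chain) for boxes given as (x1, y1, x2, y2, weight).

    Assumes x1 <= x2, y1 <= y2, and positive weights. Box k may precede
    box j only if k ends strictly before j starts in both x and y.
    """
    n = len(boxes)
    if n == 0:
        return 0, []
    # Any valid predecessor of j has a smaller x1, so process boxes by x1.
    order = sorted(range(n), key=lambda j: boxes[j][0])
    V = [0] * n
    pred = [None] * n
    for pos, j in enumerate(order):
        x1, y1, _x2, _y2, weight = boxes[j]
        best, best_k = 0, None
        for k in order[:pos]:         # check every previously processed box
            _kx1, _ky1, kx2, ky2, _kw = boxes[k]
            if kx2 < x1 and ky2 < y1 and V[k] > best:
                best, best_k = V[k], k
        V[j] = weight + best
        pred[j] = best_k

    j = max(range(n), key=V.__getitem__)
    score = V[j]
    chain = []
    while j is not None:
        chain.append(j)
        j = pred[j]
    chain.reverse()
    return score, chain
```

With `boxes = [(0, 0, 2, 2, 3), (3, 3, 5, 5, 4), (1, 4, 4, 6, 5), (6, 6, 8, 8, 2)]`, the off-diagonal box 2 cannot join any chain, and the result is the chain 0, 1, 3 with score 9.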
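To make the sum-of-pairs score concrete, here is a small Python sketch that scores a given multiple alignment. The cost model is an assumption, since the lecture does not fix one: each column of each induced pairwise alignment costs 0 when the two symbols are equal (including two gaps) and 1 otherwise, and each unordered pair of rows is counted once. The function name `sum_of_pairs` and the `-` gap character are also illustrative.

```python
from itertools import combinations


def sum_of_pairs(rows):
    """Sum-of-pairs distance of a multiple alignment given as equal-length strings.

    Assumed unit costs: a column contributes 1 to a pair when the two
    symbols differ (mismatch or gap against a character), 0 otherwise.
    """
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for a, b in combinations(rows, 2):   # every unordered pair of rows
        total += sum(1 for x, y in zip(a, b) if x != y)
    return total
```

For the alignment `["AC-T", "ACGT", "A-GT"]` the three induced pairwise alignments cost 1, 2, and 1, so the function returns 4.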
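The three-sequence dynamic program fills a cube of (n+1)^3 cells, each from up to seven predecessors. The sketch below shows one way to do that in Python. It minimizes a distance, in line with the formal definition, rather than maximizing a similarity, and it assumes a unit-cost sum-of-pairs column score. Both choices, and the names `align3_cost` and `pair_cost`, are assumptions made for illustration.

```python
from itertools import product


def pair_cost(a, b):
    """Assumed unit cost: 0 for equal symbols (including gap/gap), 1 otherwise."""
    return 0 if a == b else 1


def align3_cost(s1, s2, s3):
    """Minimum sum-of-pairs distance of a three-way alignment (cost only)."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    inf = float("inf")
    S = [[[inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    S[0][0][0] = 0
    # The seven moves: each sequence contributes a character (1) or a gap (0),
    # and at least one sequence contributes a character.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    a = s1[pi] if di else "-"
                    b = s2[pj] if dj else "-"
                    c = s3[pk] if dk else "-"
                    column = pair_cost(a, b) + pair_cost(a, c) + pair_cost(b, c)
                    S[i][j][k] = min(S[i][j][k], S[pi][pj][pk] + column)
    return S[n1][n2][n3]
```

For example, `align3_cost("ACT", "ACGT", "AGT")` is 4, one more than the sum of the three optimal pairwise distances (1 + 1 + 1), which illustrates that the pairwise alignments induced by a multiple alignment need not all be optimal.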
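The first two steps of the Star method, computing all pairwise distances and picking the center sequence, can be sketched as below. Unit-cost edit distance stands in for D(si, sj). That is an assumption of this sketch, chosen because it obeys the triangle inequality required by the 2-approximation proof. The names `edit_distance` and `pick_center` are hypothetical, and the iterative merging of alignments is not shown.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def pick_center(seqs):
    """Return (index, total) of the sequence minimizing the sum of distances to all others."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    c = min(range(len(seqs)), key=totals.__getitem__)
    return c, totals[c]
```

With `seqs = ["ACGA", "ACGT", "ACCT", "TCGT"]`, the sequence `"ACGT"` is at distance 1 from each of the others, so `pick_center(seqs)` returns `(1, 3)`.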



UMD CMSC 423 - Lecture 12
