CMSC423 Fall 2008

CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 12
chaining algorithms
multiple alignment

Jobs
• Applied Predictive Technologies – looking for the best students – focus on databases, not bioinformatics (forwarded by Daniel Hackner)

Chaining in 1-D
• Input: multiple overlapping intervals on a line
• Output: highest-weight set of non-overlapping intervals
• Weight could be the length of the interval, its Smith-Waterman score, etc.
• Rationale? The pattern can have multiple inconsistent exact matches to the text – we want to pick a longest consistent set
[Figure: text T and pattern P]

Path "planning" and dynamic programming
• One intuitive way to think about dynamic programming
– similar to finding the shortest path between two points
– at each "point" ask: what are all the possible ways to get here?
– pick the best (shortest, fastest, etc.)
[Figure: map with DC, Frederick, Baltimore, Harrisburg, Philly, NYC]

Chaining in 1-D
• Sort the endpoints (starts, ends) of the intervals
• For every interval j, store V[j] – the best score of a chain ending in j
• MAX – the highest V[j] seen so far
• Process the endpoints in increasing order of x coordinate
• If we encounter the left end (start) of interval j
– V[j] = weight(j) + MAX
• If we encounter the right end (end) of interval j
– MAX = max{V[j], MAX}
• Running time?

Chaining in 2-D
• Easy to do in O(n^2) (n = number of intervals)
• View alignments as "boxes"
• All boxes in a chain must follow each other in a "diagonal" order, i.e. the ranges of the x coordinates and of the y coordinates of any two boxes in a chain cannot overlap
• Similar to the 1-D approach, except that at each step we must check whether the current box can extend any of the previously built chains
• $V[j] = \max_{\text{all previous boxes } k} \{ V[k] + \mathrm{weight}(j) \}$
• A more complex algorithm leads to O(n log n) running time

Multiple sequence alignment

Multiple sequence alignment
• Simultaneously identify the relationship between multiple sequences
• Note: a multiple alignment implies a (not necessarily optimal) pairwise alignment between the individual sequences

HBB_HUMAN   FFESFGDLSTPDAVMGNPKVKAHGKKVL-----GAFSDGLAHLDNLKGTF
HBB_HORSE   FFDSFGDLSNPGAVMGNPKVKAHGKKVL-----HSFGEGVHHLDNLKGTF
HBA_HUMAN   YFPHF-DLS-----HGSAQVKGHGKKVA-----DALTNAVAHVDDMPNAL
HBA_HORSE   YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVGHLDDLPGAL
MYG_PHYCA   KFDRFKHLKTEAEMKASEDLKKHGVTVL-----TALGAILKKKGHHEAEL
GLB5_PETMA  FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS
LGB2_LUPLU  LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL
            * : . . .:: *. : :. :

HBA_HUMAN   YFPHF-DLS-----HGSAQVKGHGKKVA-----DALTNAVAHVDDMPNAL
HBA_HORSE   YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVGHLDDLPGAL

Multiple alignment – formal definition
• $M$ – multiple sequence alignment for $s_1, \dots, s_k$
• $D(s_i, s_j)$ – optimal score of an alignment between $s_i$ and $s_j$
• $d(s_i, s_j)$ – score of the alignment between $s_i$ and $s_j$ induced by $M$
• score of $M$: $d(M) = \sum_{\text{all pairs } s_i, s_j} d(s_i, s_j)$
– also called sum-of-pairs
• The optimal multiple alignment minimizes $d(M)$
• Computing the optimal $d(M)$ is NP-hard
• Note: in multiple alignment we think of "distance" rather than "similarity"

But... here's a solution
• Dynamic programming solution, e.g. for 3 sequences
• Score(i, j, k) – optimal alignment between s1[1..i], s2[1..j], s3[1..k] – do DP as usual
• s(i,j,k) = max { s(i-1, j-1, k-1) + match(s1[i], s2[j], s3[k]), ...
[Figure: cube with axes s1, s2, s3]

But... it's expensive
• 3 sequences – need to fill in the cube: O(n^3)
• k sequences – k-dimensional cube: O(n^k) time/space
• There are tricks that can help – similar to AI techniques for reducing the search space
• Basic idea – if we can estimate the optimal score, we can prune the search space
• Note – these are just heuristics – not guaranteed to run faster

Alternative – approximation algorithm
• Can we efficiently compute a multiple alignment with a score that's not too bad?
• The Star method:
– build all k^2 pairwise alignments (O(k^2 n^2))
– pick the sequence $s_c$ that is closest to all other sequences: $\sum_{s_i} D(s_c, s_i)$ is minimal over all choices of $s_c$
– iteratively align each sequence to $s_c$
• Theorem: the sum-of-pairs score of the star alignment is at most twice the optimal multiple alignment score

Iterative alignment
• Take the sequences $s_i$ in order:
– align $s_1$ with $s_c$ – results in gaps being inserted in both sequences
– align $s_2$ with $s_c$ – if gaps must be inserted, insert them in the previously aligned sequences as well
– and so on (note: if gaps coincide with previously introduced gaps, there is no need to change the previously aligned sequences)

SC  YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL

SC  YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL
S1  YFPHFDLSHG-AQVKG--KKVADALTNAVAHVDDMPNAL

SC  YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVAHLDDLPGAL
S1  YFPHF-DLS-----HG-AQVKG-GKKVA-----DALTNAVAHVDDMPNAL
S2  FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS

SC  YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVAHLDDLPGAL
S1  YFPHF-DLS-----HG-AQVKG-GKKVA-----DALTNAVAHVDDMPNAL
S2  FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS
S3  LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL

Theorem proof
• Theorem: star alignment is 2-optimal
• Assumption: distances obey the triangle inequality

$$\mathrm{OPT} = \sum_{s_i, s_j} d^*(s_i, s_j) \ge \sum_{s_i, s_j} D(s_i, s_j) \ge k \sum_{s_i} D(s_i, s_c)$$
$$\mathrm{STAR} = \sum_{s_i, s_j} d(s_i, s_j) \le \sum_{s_i, s_j} \bigl( D(s_i, s_c) + D(s_j, s_c) \bigr) \quad \text{(triangle inequality)}$$
$$= \sum_{s_i, s_j} D(s_i, s_c) + \sum_{s_i, s_j} D(s_j, s_c) = 2k \sum_{s_i} D(s_i, s_c)$$
$$\Rightarrow \mathrm{STAR} / \mathrm{OPT} \le 2 \quad \text{Q.E.D.}$$

note: the sums over $s_i, s_j$ range over ordered pairs, which is where the factors $k$ and $2k$ come from
$\sum_{s_i} D(s_i, s_c)$ – the score minimized by the choice of $s_c$, so $\sum_{s_i} D(s_i, s_j) \ge \sum_{s_i} D(s_i, s_c)$ for every $s_j$
$d^*(s_i, s_j)$ – score of the alignment between $s_i$ and $s_j$ within the optimal alignment
$d(s_i, s_j)$ – score of the alignment between $s_i$ and $s_j$ within the star alignment
$D(s_i, s_j)$ – score of the optimal alignment between $s_i$,
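
The center-selection step of the Star method, and the quantity that the proof above bounds, can be illustrated with a minimal Python sketch. It assumes unit-cost edit distance as a stand-in for the pairwise distance D (the slides do not fix a scoring scheme), and the names `edit_distance` and `pick_center` are hypothetical helpers introduced only for this illustration. The iterative gap-merging step is not shown.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance; an assumed stand-in for D(s_i, s_j)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # aligning a[:i] against the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # gap in b
                curr[j - 1] + 1,           # gap in a
                prev[j - 1] + (ca != cb),  # match or mismatch
            ))
        prev = curr
    return prev[-1]


def pick_center(seqs):
    """Return (index of s_c, sum_i D(s_i, s_c), full distance matrix)."""
    k = len(seqs)
    dist = [[0] * k for _ in range(k)]
    # All pairwise distances: O(k^2) alignments, each O(n^2).
    for i, j in combinations(range(k), 2):
        dist[i][j] = dist[j][i] = edit_distance(seqs[i], seqs[j])
    totals = [sum(row) for row in dist]
    # The center is the sequence with the smallest total distance to the rest.
    center = min(range(k), key=totals.__getitem__)
    return center, totals[center], dist


if __name__ == "__main__":
    center, total, dist = pick_center(["ACGT", "ACGA", "TCGA"])
    print(center, total)
```

For the three toy sequences in the usage example, the sum of D over ordered pairs is 8 and k times the center total is 3 · 2 = 6, consistent with the lower bound on OPT used in the proof.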