UMD CMSC 423 - Lecture 12

CMSC423 Fall 2008

CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 12
chaining algorithms
multiple alignment

Jobs
• Applied Predictive Technologies
  – looking for the best students
  – focus on databases, not bioinformatics (forwarded by Daniel Hackner)

Chaining in 1-D
• Input: multiple overlapping intervals on a line
• Output: highest-weight set of non-overlapping intervals
• Weight could be length of interval, or Smith-Waterman score, etc.
• Rationale? The pattern can have multiple inconsistent exact matches to the text – we want to pick a longest consistent set
[Figure: text T and pattern P]

Path "planning" and dynamic programming
• One intuitive way to think about dynamic programming
  – similar to finding the shortest path between two points
  – at each "point" ask: what are all possible ways to get here?
  – pick the best (shortest, fastest, etc.)
[Figure: route map with DC, Frederick, Baltimore, Harrisburg, Philly, NYC]

Chaining in 1-D
• Sort the endpoints (starts, ends) of the intervals
• For every interval j, store V[j] – best score of a chain ending in j
• MAX – stores the highest V[j] seen so far
• Process endpoints in increasing order of x coordinate
• If we encounter the left end (start) of interval j
  – V[j] = weight(j) + MAX
• If we encounter the right end (end) of interval j
  – MAX = max{V[j], MAX}
• Running time?

Chaining in 2-D
• Easy to do in O(n^2) (n = number of intervals)
• View alignments as "boxes"
• All boxes in a chain must follow each other in a "diagonal" order, i.e. the ranges of the x coordinates and of the y coordinates of any two boxes in a chain cannot overlap
• Similar to the 1-D approach, except at each step we must check if the current box can extend any of the previously built chains
• V[j] = max over all previous boxes k of {V[k] + weight(j)}
• A more complex algorithm leads to O(n log n) running time

Multiple sequence alignment

Multiple sequence alignment
• Simultaneously identify the relationship between multiple sequences
• Note: a multiple alignment implies a (not necessarily optimal) pairwise alignment between the individual sequences

  HBB_HUMAN   FFESFGDLSTPDAVMGNPKVKAHGKKVL-----GAFSDGLAHLDNLKGTF
  HBB_HORSE   FFDSFGDLSNPGAVMGNPKVKAHGKKVL-----HSFGEGVHHLDNLKGTF
  HBA_HUMAN   YFPHF-DLS-----HGSAQVKGHGKKVA-----DALTNAVAHVDDMPNAL
  HBA_HORSE   YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVGHLDDLPGAL
  MYG_PHYCA   KFDRFKHLKTEAEMKASEDLKKHGVTVL-----TALGAILKKKGHHEAEL
  GLB5_PETMA  FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS
  LGB2_LUPLU  LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL
  [Conservation line, column spacing lost: * : . . .:: *. : :. :]

  HBA_HUMAN   YFPHF-DLS-----HGSAQVKGHGKKVA-----DALTNAVAHVDDMPNAL
  HBA_HORSE   YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVGHLDDLPGAL

Multiple alignment – formal definition
• M – multiple sequence alignment for s_1, ..., s_k
• D(s_i, s_j) – optimal score of alignment between s_i, s_j
• d(s_i, s_j) – score of alignment between s_i, s_j induced by M
• score of M: d(M) = ∑ over all pairs s_i, s_j of d(s_i, s_j)
• also called sum-of-pairs
• Optimal multiple alignment minimizes d(M)
• Computing optimal d(M) is NP-hard
• Note: in multiple alignment we think of "distance" rather than "similarity"

But... here's a solution
• Dynamic programming solution, e.g. for 3 sequences
• Score(i, j, k) – optimal alignment between s_1[1..i], s_2[1..j], s_3[1..k] – do DP as usual
• Score(i, j, k) = max { Score(i-1, j-1, k-1) + match(s_1[i], s_2[j], s_3[k]), ...
[Figure: axes s_1, s_2, s_3]

But... it's expensive
• 3 sequences – need to fill in the cube: O(n^3)
• k sequences – k-dimensional cube: O(n^k) time/space
• There are tricks that can help – similar to AI techniques for reducing the search space
• Basic idea – if we can estimate the optimal score, we can prune the search space
• Note – these are just heuristics – not guaranteed to work faster

Alternative – approximation algorithm
• Can we efficiently compute a multiple alignment with a score that's not too bad?
• The Star method:
  – build all k^2 pairwise alignments (O(k^2 n^2))
  – pick the sequence s_c that is closest to all other sequences: ∑_{s_i} D(s_c, s_i) is minimal over all choices of s_c
  – iteratively align each sequence to s_c
• Theorem: the sum-of-pairs score of the star alignment is at most twice the optimal multiple alignment score

Iterative alignment
• Take sequences s_i in order:
  – align s_1 with s_c – results in gaps being inserted in both sequences
  – align s_2 with s_c – if gaps must be inserted, insert them in the previously aligned sequences as well
  – and so on (note: if gaps coincide with previously introduced gaps, there is no need to change the previously aligned sequences)

  SC  YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL

  SC  YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL
  S1  YFPHFDLSHG-AQVKG--KKVADALTNAVAHVDDMPNAL

  SC  YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVAHLDDLPGAL
  S1  YFPHF-DLS-----HG-AQVKG-GKKVA-----DALTNAVAHVDDMPNAL
  S2  FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS

  SC  YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVAHLDDLPGAL
  S1  YFPHF-DLS-----HG-AQVKG-GKKVA-----DALTNAVAHVDDMPNAL
  S2  FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS
  S3  LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL

Theorem proof
• Theorem: star alignment is 2-optimal
• Assumption: distances obey the triangle inequality

  OPT  = ∑_{s_i,s_j} d*(s_i, s_j) ≥ ∑_{s_i,s_j} D(s_i, s_j) ≥ k ∑_{s_i} D(s_i, s_c)
  STAR = ∑_{s_i,s_j} d(s_i, s_j) ≤ ∑_{s_i,s_j} (D(s_i, s_c) + D(s_j, s_c))   # triangle ineq.
       = ∑_{s_i,s_j} D(s_j, s_c) + ∑_{s_i,s_j} D(s_i, s_c) = 2k ∑_{s_i} D(s_i, s_c)
  => STAR/OPT ≤ 2   Q.E.D.

note:
  ∑_{s_i} D(s_i, s_c) – the score optimized by the choice of s_c
  d*(s_i, s_j) – score of alignment between s_i, s_j within the optimal alignment
  d(s_i, s_j) – score of alignment between s_i, s_j within the star alignment
  D(s_i, s_j) – score of optimal alignment between s_i,
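
The 1-D chaining sweep from the slides can be sketched in Python as below. This is only an illustration, not code from the course: the function name `chain_1d`, the `(start, end, weight)` tuple layout, and the `prev` back-pointers used to recover the chain are all assumptions. It also assumes closed intervals, so two intervals that merely touch at an endpoint count as overlapping; that is why start events are ordered before end events at the same coordinate.

```python
def chain_1d(intervals):
    """Best-scoring chain of non-overlapping intervals.

    intervals: list of (start, end, weight) with start <= end.
    Returns (best_score, chain), where chain lists interval indices in order.
    """
    events = []
    for j, (start, end, _) in enumerate(intervals):
        events.append((start, 0, j))  # kind 0 = start; sorts before ends at equal x
        events.append((end, 1, j))    # kind 1 = end
    events.sort()

    V = [0.0] * len(intervals)        # V[j]: best score of a chain ending in j
    prev = [None] * len(intervals)    # back-pointers (an addition to the slides)
    best, best_idx = 0.0, None        # MAX from the slides, plus where it came from

    for _, kind, j in events:
        if kind == 0:
            # Left end of j: extend the best chain that has already ended.
            V[j] = intervals[j][2] + best
            prev[j] = best_idx
        elif V[j] > best:
            # Right end of j: chains ending in j may now be extended.
            best, best_idx = V[j], j

    chain = []
    j = best_idx
    while j is not None:
        chain.append(j)
        j = prev[j]
    chain.reverse()
    return best, chain


if __name__ == "__main__":
    print(chain_1d([(1, 4, 5), (3, 6, 6), (5, 8, 4), (7, 10, 2)]))
```

After the sort, each endpoint is handled in constant time, so in this sketch the sort dominates and the total cost is O(n log n).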
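
The quadratic 2-D chaining recurrence, V[j] = max over previous boxes k of {V[k] + weight(j)}, might look like this in Python. The name `chain_2d` and the `(x1, y1, x2, y2, weight)` box layout are hypothetical. The sketch assumes non-negative weights and a strict "diagonal" order, meaning box k may precede box j only if k ends before j starts in both x and y. It is the simple O(n^2) version, not the O(n log n) algorithm the slides mention.

```python
def chain_2d(boxes):
    """O(n^2) chaining of boxes given as (x1, y1, x2, y2, weight).

    Returns (best_score, chain), where chain lists box indices in order.
    """
    n = len(boxes)
    if n == 0:
        return 0.0, []

    # Any valid predecessor of j ends before j starts in x, so it also
    # starts before j; sorting by x1 therefore visits predecessors first.
    order = sorted(range(n), key=lambda j: boxes[j][0])
    V = [0.0] * n
    prev = [None] * n

    for pos, j in enumerate(order):
        x1, y1, _, _, w = boxes[j]
        best, best_k = 0.0, None
        for k in order[:pos]:
            kx2, ky2 = boxes[k][2], boxes[k][3]
            # k can precede j only if it ends strictly before j in x and y.
            if kx2 < x1 and ky2 < y1 and V[k] > best:
                best, best_k = V[k], k
        V[j] = w + best
        prev[j] = best_k

    j = max(range(n), key=V.__getitem__)
    best_score = V[j]
    chain = []
    while j is not None:
        chain.append(j)
        j = prev[j]
    chain.reverse()
    return best_score, chain


if __name__ == "__main__":
    boxes = [(0, 0, 2, 2, 3), (3, 3, 5, 5, 4), (1, 4, 2, 6, 10),
             (6, 1, 8, 2, 5), (6, 7, 9, 9, 2)]
    print(chain_2d(boxes))
```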
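
To make the sum-of-pairs score and the star method's choice of center s_c concrete, here is a minimal Python sketch. The slides do not fix a scoring scheme, so unit-cost edit distance is used here as a stand-in for D(s_i, s_j). A column counts 1 for a mismatch or a residue against a gap, and 0 for a match or for two gaps. All function names are hypothetical. The sketch does not perform the iterative merge of pairwise alignments; it only picks the center and scores an alignment that is already given.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance, used here as an assumed stand-in for D(a, b)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # ca aligned to a gap
                           cur[j - 1] + 1,              # cb aligned to a gap
                           prev[j - 1] + (ca != cb)))   # match or mismatch
        prev = cur
    return prev[-1]


def pick_center(seqs):
    """Index c minimizing the sum over i of D(s_c, s_i), and that sum."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    c = min(range(len(seqs)), key=totals.__getitem__)
    return c, totals[c]


def induced_distance(row_a, row_b):
    """d(s_i, s_j): cost of the pairwise alignment induced by two rows of M."""
    cost = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a column that is a gap in both rows is dropped
        cost += (x != y)
    return cost


def sum_of_pairs(rows):
    """d(M): sum of induced distances over all unordered pairs of rows."""
    return sum(induced_distance(a, b) for a, b in combinations(rows, 2))


if __name__ == "__main__":
    seqs = ["ACGT", "ACGA", "TCGT"]
    print(pick_center(seqs))
    print(sum_of_pairs(["AC-T", "ACGT", "A--T"]))
```

This sketch sums over unordered pairs, while the proof in the slides reads most naturally as a sum over ordered pairs. That changes both STAR and OPT by the same factor of 2, so the ratio STAR/OPT is unaffected.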

