DOC PREVIEW
UMD CMSC 423 - Inexact alignment dynamic programming, gapped alignment

This preview shows page 1-2-3-26-27-28 out of 28 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 28 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CMSC423 Fall 2009 1CMSC423: Bioinformatic Algorithms, Databases and Toolsinexact alignmentdynamic programming, gapped alignmentCMSC423 Fall 2009 2Inexact matching: why?• Redundancy in genetic code: nucleotide sequence may differ, but proteins the same• Different amino-acid sequences still fold the same way: function unchanged (generally changing an amino-acid with a similar one doesn't affect protein function)• Aligning ESTs (RNA sequences) to DNA need to account for gaps corresponding to exons• Need to account for sequencing errors•Read chap 6.1!!! (define: ortholog, paralog, homolog)S Y P T DTCTTATCCTACTGATTCATACCCCACAGACCMSC423 Fall 2009 3CMSC423 Fall 2009 4HBB_HUMAN FFESFGDLSTPDAVMGNPKVKAHGKKVL-----GAFSDGLAHLDNLKGTF HBB_HORSE FFDSFGDLSNPGAVMGNPKVKAHGKKVL-----HSFGEGVHHLDNLKGTF HBA_HUMAN YFPHF-DLS-----HGSAQVKGHGKKVA-----DALTNAVAHVDDMPNAL HBA_HORSE YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVGHLDDLPGAL MYG_PHYCA KFDRFKHLKTEAEMKASEDLKKHGVTVL-----TALGAILKKKGHHEAEL GLB5_PETMA FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS LGB2_LUPLU LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL * : . . .:: *. : :. : Several hemoglobinsFrom http://bioinfo.cnio.es/docus/courses/SEK2003Filogenias/seq_analysis/multiple.htmlCMSC423 Fall 2009 5Warm-up – Longest Common Subsequence• Given two strings of letters, identify longest string of letters that occurs, in the same order, in both stringsAG C GTAG G C G A GTCAG A• Find the longest chain of 1s, moving to the right and down11AGACTG11111A1G1111GATGCCMSC423 Fall 2009 6Dynamic programming• Idea: re-use previously computed information •LCS[i,j] – longest common subsequence of strings S1[1..i], S2[1..j]11AGACTG11111A1G1111GATGCijLCS[i,j] is the maximum of:1.if S1[i] = S2[j] LCS[i-1, j-1] + 1 else LCS[i -1, j-1]2. LCS[i – 1, j]3. LCS[i, j – 1]Goal: find LCS[m,n]CMSC423 Fall 2009 7Computing the LCS table1AGACTG01000A1G00010GATGC4433221AGACTG43332203322211000A111G222222221110010GATGC21AGACTG2011000A111G10010GATGCRow 0 and column 0 easy to fillFill the rest column by columnFind the actual sequence:trace-back pointersCMSC423 Fall 2009 8Extending to sequence alignmentAG-C-GTAG-GTCAG-A-•In LCS, mis-alignments were free•What happens if we pay for our "mistakes"? (this also allows us to account for "similar" amino-acids)–Value[A, A] = 10–Value[A,G] = -5–Value[A,-] = -2–etc.•The same dynamic programming algorithm works!CMSC423 Fall 2009 9The recurrencesAG-C-GTAG-GTCAG-A-Score[i,j] is the maximum of:1. Score[i-1, j-1] + Value[S1[i],S2[j]] AG-C-G AG-C-G -GTCAG -GTCAT2. Score[i – 1, j] + Value[S1[i], -] (S1[i] aligned to gap) AG-C-GT -GTCAG-3. Score[i, j – 1] + Value[-, S2[j]] (S2[j] aligned to gap) AG-C- -GTCACMSC423 Fall 2009 10The dynamic programming tableScore[i,j] is the maximum of:1. Score[i-1, j-1] + Value[S1[i],S2[j]] (S1[i-1], S2[j-1] aligned)2. Score[i – 1, j] + Value[S1[i], -] (S1[i] aligned to gap)3. Score[i, j – 1] + Value[-, S2[j]] (S2[j] aligned to gap)-14-12-10-8-6-4-20--12-10-8-6-4-2-AGACTG-8-6-4A468G1646GATGCValue (A, A) = 10Value (A, G) = -5Value (A, -) = -2Note: we only lookat 3 adjacent boxesCMSC423 Fall 2009 11Intuition•What is the best way to align strings S1 and S2?• just look at last character for now – what is it aligned to?S1[n]S2[m]S1[n]S2[m]S1[n]S2[m]AG-C-GTAG-GTCAG-A-CMSC423 Fall 2009 12The recurrencesAG-C-GTAG-GTCAG-A-Score[i,j] is the maximum of:1. Score[i-1, j-1] + Value[S1[i],S2[j]] AG-C-G AG-C-G -GTCAG -GTCAT2. Score[i – 1, j] + Value[S1[i], -] (S1[i] aligned to gap) AG-C-GT -GTCAG-3. Score[i, j – 1] + Value[-, S2[j]] (S2[j] aligned to gap) AG-C- -GTCACMSC423 Fall 2009 13The dynamic programming tableScore[i,j] is the maximum of:1. Score[i-1, j-1] + Value(S1[i],S2[j]) (S1[i-1], S2[j-1] aligned)2. Score[i – 1, j] + Value(S1[i], -) (S1[i] aligned to gap)3. Score[i, j – 1] + Value(-, S2[j]) (S2[j] aligned to gap)-14-12-10-8-6-4-20--12-10-8-6-4-2-AGACTG-8-6-4A468G1646GATGCValue (A, A) = 10Value (A, G) = -5Value (A, -) = -2Note: we only lookat 3 adjacent boxesCMSC423 Fall 2009 14How do you output the result?•Goal: produce the “nice” string with gaps that is shown in the examples• Idea: create the string backwards – starting from the right• As you follow backtrack pointers:–if you follow diagonal pointer – add characters to both output strings (aligned versions of original strings)–if you move up – add gap character to string represented on the y axis, add string character to string represented on x axis–if you move left – gap goes in string on x axis and character in string on y axis• When you reach (0,0) output the two aligned stringsCMSC423 Fall 2009 15Local vs. global alignment• Can we change the algorithm to allow S1 to be a substring of S2? ACAGTTGACCCGTGCAT ----TG-CC-G------• Key idea: gaps at the end of S2 are free•Simply change the first row in the DP table to 0s• Answer is no longer Score[n, m], rather the largest value in the last rowCMSC423 Fall 2009 16Sub-string alignment00000000--6-4-2-TGCA G26283018618208810GATGCAGCGTAG CGTCMSC423 Fall 2009 17Local alignment• What if we just want a region of similarity? ACAGTTGACCCGTGCAT || || | GTCATG-CC-GAGATCG• First row and column set to 0s• Allow alignment to start anywhere:Score[i,j] = max{0, case 1, case 2, case 3}• Answer is location in matrix with highest scoreCMSC423 Fall 2009 18Local alignment00000000000000CTGCTC3020A0G10GATGCAGCGTAG |||CTCGTCCMSC423 Fall 2009 19Various flavors of alignment• Alignment problem also called "edit distance" – how many changes do you have to make to a string to convert it into another one.• Edit distance also called Levenshtein distance•Local alignment – Smith-Waterman• Global alignment – Needleman-WunschCMSC423 Fall 2009 20Gap penaltiesCMSC423 Fall 2009 21How much do we pay for gaps?• In the edit-distance/alignment frameworkCost(n gaps in a row) = n * Cost(gap)• This doesn't work for e.g. RNA-DNA alignmentsACAGTTCGACTAGAGGACCTAGACCACTCTGT TTCGA----------TAGACCAC• Affine gap penaltiesCost(n gaps in a row) = Cost(gap


View Full Document

UMD CMSC 423 - Inexact alignment dynamic programming, gapped alignment

Documents in this Course
Midterm

Midterm

8 pages

Lecture 7

Lecture 7

15 pages

Load more
Download Inexact alignment dynamic programming, gapped alignment
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Inexact alignment dynamic programming, gapped alignment and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Inexact alignment dynamic programming, gapped alignment 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?