CMSC423 Fall 2009 1CMSC423: Bioinformatic Algorithms, Databases and Toolsinexact alignmentdynamic programming, gapped alignmentCMSC423 Fall 2009 2Inexact matching: why?• Redundancy in genetic code: nucleotide sequence may differ, but proteins the same• Different amino-acid sequences still fold the same way: function unchanged (generally changing an amino-acid with a similar one doesn't affect protein function)• Aligning ESTs (RNA sequences) to DNA need to account for gaps corresponding to exons• Need to account for sequencing errors•Read chap 6.1!!! (define: ortholog, paralog, homolog)S Y P T DTCTTATCCTACTGATTCATACCCCACAGACCMSC423 Fall 2009 3CMSC423 Fall 2009 4HBB_HUMAN FFESFGDLSTPDAVMGNPKVKAHGKKVL-----GAFSDGLAHLDNLKGTF HBB_HORSE FFDSFGDLSNPGAVMGNPKVKAHGKKVL-----HSFGEGVHHLDNLKGTF HBA_HUMAN YFPHF-DLS-----HGSAQVKGHGKKVA-----DALTNAVAHVDDMPNAL HBA_HORSE YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVGHLDDLPGAL MYG_PHYCA KFDRFKHLKTEAEMKASEDLKKHGVTVL-----TALGAILKKKGHHEAEL GLB5_PETMA FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS LGB2_LUPLU LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL * : . . .:: *. : :. : Several hemoglobinsFrom http://bioinfo.cnio.es/docus/courses/SEK2003Filogenias/seq_analysis/multiple.htmlCMSC423 Fall 2009 5Warm-up – Longest Common Subsequence• Given two strings of letters, identify longest string of letters that occurs, in the same order, in both stringsAG C GTAG G C G A GTCAG A• Find the longest chain of 1s, moving to the right and down11AGACTG11111A1G1111GATGCCMSC423 Fall 2009 6Dynamic programming• Idea: re-use previously computed information •LCS[i,j] – longest common subsequence of strings S1[1..i], S2[1..j]11AGACTG11111A1G1111GATGCijLCS[i,j] is the maximum of:1.if S1[i] = S2[j] LCS[i-1, j-1] + 1 else LCS[i -1, j-1]2. LCS[i – 1, j]3. LCS[i, j – 1]Goal: find LCS[m,n]CMSC423 Fall 2009 7Computing the LCS table1AGACTG01000A1G00010GATGC4433221AGACTG43332203322211000A111G222222221110010GATGC21AGACTG2011000A111G10010GATGCRow 0 and column 0 easy to fillFill the rest column by columnFind the actual sequence:trace-back pointersCMSC423 Fall 2009 8Extending to sequence alignmentAG-C-GTAG-GTCAG-A-•In LCS, mis-alignments were free•What happens if we pay for our "mistakes"? (this also allows us to account for "similar" amino-acids)–Value[A, A] = 10–Value[A,G] = -5–Value[A,-] = -2–etc.•The same dynamic programming algorithm works!CMSC423 Fall 2009 9The recurrencesAG-C-GTAG-GTCAG-A-Score[i,j] is the maximum of:1. Score[i-1, j-1] + Value[S1[i],S2[j]] AG-C-G AG-C-G -GTCAG -GTCAT2. Score[i – 1, j] + Value[S1[i], -] (S1[i] aligned to gap) AG-C-GT -GTCAG-3. Score[i, j – 1] + Value[-, S2[j]] (S2[j] aligned to gap) AG-C- -GTCACMSC423 Fall 2009 10The dynamic programming tableScore[i,j] is the maximum of:1. Score[i-1, j-1] + Value[S1[i],S2[j]] (S1[i-1], S2[j-1] aligned)2. Score[i – 1, j] + Value[S1[i], -] (S1[i] aligned to gap)3. Score[i, j – 1] + Value[-, S2[j]] (S2[j] aligned to gap)-14-12-10-8-6-4-20--12-10-8-6-4-2-AGACTG-8-6-4A468G1646GATGCValue (A, A) = 10Value (A, G) = -5Value (A, -) = -2Note: we only lookat 3 adjacent boxesCMSC423 Fall 2009 11Intuition•What is the best way to align strings S1 and S2?• just look at last character for now – what is it aligned to?S1[n]S2[m]S1[n]S2[m]S1[n]S2[m]AG-C-GTAG-GTCAG-A-CMSC423 Fall 2009 12The recurrencesAG-C-GTAG-GTCAG-A-Score[i,j] is the maximum of:1. Score[i-1, j-1] + Value[S1[i],S2[j]] AG-C-G AG-C-G -GTCAG -GTCAT2. Score[i – 1, j] + Value[S1[i], -] (S1[i] aligned to gap) AG-C-GT -GTCAG-3. Score[i, j – 1] + Value[-, S2[j]] (S2[j] aligned to gap) AG-C- -GTCACMSC423 Fall 2009 13The dynamic programming tableScore[i,j] is the maximum of:1. Score[i-1, j-1] + Value(S1[i],S2[j]) (S1[i-1], S2[j-1] aligned)2. Score[i – 1, j] + Value(S1[i], -) (S1[i] aligned to gap)3. Score[i, j – 1] + Value(-, S2[j]) (S2[j] aligned to gap)-14-12-10-8-6-4-20--12-10-8-6-4-2-AGACTG-8-6-4A468G1646GATGCValue (A, A) = 10Value (A, G) = -5Value (A, -) = -2Note: we only lookat 3 adjacent boxesCMSC423 Fall 2009 14How do you output the result?•Goal: produce the “nice” string with gaps that is shown in the examples• Idea: create the string backwards – starting from the right• As you follow backtrack pointers:–if you follow diagonal pointer – add characters to both output strings (aligned versions of original strings)–if you move up – add gap character to string represented on the y axis, add string character to string represented on x axis–if you move left – gap goes in string on x axis and character in string on y axis• When you reach (0,0) output the two aligned stringsCMSC423 Fall 2009 15Local vs. global alignment• Can we change the algorithm to allow S1 to be a substring of S2? ACAGTTGACCCGTGCAT ----TG-CC-G------• Key idea: gaps at the end of S2 are free•Simply change the first row in the DP table to 0s• Answer is no longer Score[n, m], rather the largest value in the last rowCMSC423 Fall 2009 16Sub-string alignment00000000--6-4-2-TGCA G26283018618208810GATGCAGCGTAG CGTCMSC423 Fall 2009 17Local alignment• What if we just want a region of similarity? ACAGTTGACCCGTGCAT || || | GTCATG-CC-GAGATCG• First row and column set to 0s• Allow alignment to start anywhere:Score[i,j] = max{0, case 1, case 2, case 3}• Answer is location in matrix with highest scoreCMSC423 Fall 2009 18Local alignment00000000000000CTGCTC3020A0G10GATGCAGCGTAG |||CTCGTCCMSC423 Fall 2009 19Various flavors of alignment• Alignment problem also called "edit distance" – how many changes do you have to make to a string to convert it into another one.• Edit distance also called Levenshtein distance•Local alignment – Smith-Waterman• Global alignment – Needleman-WunschCMSC423 Fall 2009 20Gap penaltiesCMSC423 Fall 2009 21How much do we pay for gaps?• In the edit-distance/alignment frameworkCost(n gaps in a row) = n * Cost(gap)• This doesn't work for e.g. RNA-DNA alignmentsACAGTTCGACTAGAGGACCTAGACCACTCTGT TTCGA----------TAGACCAC• Affine gap penaltiesCost(n gaps in a row) = Cost(gap
View Full Document