CMSC 423 Sequence Alignment Slides By Carl Kingsford Department of Computer Science University of Maryland College Park Based on Section 6 6 of Algorithm Design by Kleinberg Tardos The Problem Given Two strings a a1 a2 a3 a4 am b b1 b2 b3 b4 bn ai bi L for some alphabet L like A C G T Compute how similar the two strings are What do we mean by similarity between two strings Alignment Examples prin ciple XX prinncipal 1 gap 2 mm misspell mis pell 1 gap aa bb ccaabb X ababbbc a b 5 gaps 1 mm prin cip le prinncipal 3 gaps 0 mm prehistoric historic 3 gaps 0 mm al go rithm XX X alKhwariz mi 4 gaps 3 mm Motivation Alignment is used extensively in molecular biology where a and b are the DNA sequences of two genes see NCBI BLAST Spell checkers Inexact search of web pages NCBI BLAST NCBI BLAST Alignment gb AC115706 7 Mus musculus chromosome 8 clone RP23 382B3 complete sequence Query 1650 Sbjct 56838 Query 1710 Sbjct 56896 Query 1769 Sbjct 56948 Query 1829 Sbjct 57008 Query 1889 Sbjct 57056 Query 1943 Sbjct 57115 Query 2003 Sbjct 57169 Query 2063 Sbjct 57225 gtgtgtgtgggtgcacatttgtgtgtgtgtgcgcctgtgtgtgtgggtgcctgtgtgtgt GTGTGTGTGGAAGTGAGTTCATCTGTGTGTGCACATGTGTGTGCA TGCATGCATGTGT 1709 gtg gggcacatttgtgtgtgtgtgtgtgcctgtgtgtgggtgcacatttgtgtgtgtgc GTCCGGGCA TGCATGTCTGTGTGCATGTGTGTGTGTGTGCAT GTGTGAGTAC 1768 ctgtgtgtgtgtgcctgtgtgtgggggtgcacatttgtgtgtgtgtgtgcctgtgtgtgg CTGTGTGTGTATGCTTGTATGTGTGTGTGTGCATGTGTGTAGGTGTGTATATGTGTAAGT 1828 gggtgcacatttgtgtgtgtgtgtgcctgtgtgtgtgggtgcacatttgtgtgtgtgtgt T CATCTGTGTGTATGTGTG TGTGAGAGTGCATGCA TGTGTGTGTGAGT 1888 gcctgtgtgt gtgggtgcacatttgtgtgtgtgtgcctgtg tgtgt gggtgcac TCATCTGTGTCAGTGTATGCTTATGGGTATAACT TAACTGTGCATGTGTAAGTGTGTTC 1942 atttgtgtgtgtgtgtgcctgtgtgtgtgggtgcacatttgtgtgtgtgcctgtgtgtgg ATCTGTGTATGTGTGTG TGTGTGAGTTAGTTCA TCTGTGTGTGAGAGTGTGTGA 2002 gtgcacatttgtgtgtgtgtgcctgtgtgtgtgtgcctgtgtgtgtgggtgcacatttgt G CTCATCTGTGTGTGAGTTCATCTGTATGAGTG TGTGTATGTGTGTGTACAAATGA 2062 gtgtgtgtgtgcctgtgtgtgtgggtgcacatttgtgtgtgtgtgtgtgcctgtgtgtgt GTTCATCTGTGCATGTGTGTGTG TTTAAGTGTGTTCATCTG TGTGCGTGT 2122 56895 56947 57007 57055 57114 57168 57224 57274 Cost Function Parameters gap is the cost of leaving a gap cost x y is the cost of matching character x with character y Objective function Cost of an alignment is sum of the costs of the matches plus the number of gaps The Problem as a Matching Each sequence is a set of nodes one for each character Looking for a low cost matching between the two sequences a x x y z x b y y x x y x Cost of the matching is gap unmatched X ai bj No crossing rule cost ai bj x Consider the last character in each string Consider the last character in each string a a1 a2 a3 a4 am b b1 b2 b3 b4 bn Consider the 4 possibilities 1 am bn are matched to each other 2 am is not matched at all 3 bn is not matched at all 4 am is matched to some bj j 6 n and bn is matched to some ak k 6 m The Three Possibilities This last possibility can t happen am is matched to some bj and bn is matched to some ak If that occurred then there would be a crossing in the matching am ak bk So only possibilities are 1 am bn is in the matching 2 am is not matched 3 bn is not matched bn Recurrence Suppose for now we aren t interested in the actual alignment but only the cost of the optimal alignment Turn the 3 possibilities into a recurrence cost ai bj OPT i 1 j 1 match ai bj OPT i j min gap OPT i 1 j ai is not matched gap OPT i j 1 bj is not matched Base case OPT i 0 i gap and OPT 0 j j gap matching i characters to 0 characters must use i gaps Matrix OPT i 1 j j 9 9g 8 8g 7 7g 6 6g 5 5g 4 4g 3 3g 2 2g 1 1g 0 0 1g 2g 3g 4g 5g 6g 7g 8g 9g 10g 11g 12g 0 1 2 3 4 5 6 7 8 9 10 11 12 OPT i j OPT i j 1 OPT i 1 j 1 i Order to Fill in Cells 9 9g 8 8g 7 7g 6 6g 5 5g 4 4g 3 3g 2 2g 1 1g 0 0 1g 2g 3g 4g 5g 6g 7g 8g 9g 10g 11g 12g 0 1 2 3 4 5 6 7 8 9 10 11 12 Code Align X Y For i 1 m A i 0 i gap For j 1 n A 0 j j gap For i 1 m For j 1 n A i j min cost a i b j A i 1 j 1 gap A i 1 j gap A i j 1 EndFor EndFor Return A m n Running Time Number of subproblems m n Each subproblem takes constant time to solve Total running time O mn Finding the Actual Alignment OPT i 1 j j 9 9g 8 8g 7 7g 6 6g 5 5g 4 4g 3 3g 2 2g 1 1g 0 0 1g 2g 3g 4g 5g 6g 7g 8g 9g 10g 11g 12g 0 1 2 3 4 5 6 7 8 9 10 11 12 OPT i j OPT i j 1 OPT i 1 j 1 i Traceback 9 9g 8 8g 7 7g 6 6g 5 5g 4 4g 3 3g 2 2g 1 1g 0 0 1g 2g 3g 4g 5g 6g 7g 8g 9g 10g 11g 12g 0 1 2 3 4 5 6 7 8 9 10 11 12 How do you output the result ACGT A GA Follow the backtrack pointers starting from entry n m If you follow a diagonal pointer add both characters to the alignment If you follow a left pointer add a gap to the y axis string and the x axis character If you follow a down pointer add a gap to the x axis string and the y axis character Recasting as a Graph b4 m n b3 b2 b1 edge from i 1 j 1 to i j has weight cost ai bj gap 0 0 a1 a2 gap a3 a4 a5 Recasting as a Graph Theorem Let f i j be the cost of the shortest path from 0 0 to i j Then OPT i j f i j for all i j General DP Principles Main Idea of Dynamic Programming solve the subproblems in an order that makes sure when you need an answer it s already been computed 1 Optimal value of the original problem can be computed easily from some subproblems 2 There are only a polynomial of subproblems 3 There is a natural ordering of the subproblems from smallest to largest such that you can obtain the solution for a subproblem by only looking at smaller subproblems Longest Common Subsequence LCS Longest Common Subsequence LCS Another example of the use of dynamic programming Problem Given 2 strings of DNA a1 an and b1 bm find the longest subsequence that is shared by both strings Example aact tc gga a tatctg acac ttcgaa Note the letters need not be consecutive in either string LCS Subproblems LCS i 1 j 1 LCS i j 1 LCS i 1 j LCS Recurrence Let LCS i j be the length of the longest common subsequence of of a1 ai and b1 bj Then LCS i 1 j 1 1 LCS i 1 j 1 LCS i j max LCS i 1 j LCS i j 1 Answer to original …
View Full Document
Unlocking...