Dynamic Programming:Sequence alignmentCS 498 SSSaurabh SinhaDNA Sequence Comparison: FirstSuccess Story• Finding sequence similarities with genes ofknown function is a common approach toinfer a newly sequenced gene’s function• In 1984 Russell Doolittle and colleaguesfound similarities between cancer-causinggene and normal growth factor (PDGF) gene• A normal growth gene switched on at thewrong time causes cancer !Cystic Fibrosis• Cystic fibrosis (CF) is a chronic and frequently fatal geneticdisease of the body's mucus glands. CF primarily affects therespiratory systems in children.• If a high % of cystic fibrosis (CF) patients have a certainmutation in the gene and the normal patients don’t, then thatcould be an indicator of a mutation that is related to CF• A certain mutation was found in 70% of CF patients,convincing evidence that it is a predominant geneticdiagnostics marker for CFCystic Fibrosis and CFTR Gene :Bring in the Bioinformaticians• Gene similarities between two genes with knownand unknown function alert biologists to somepossibilities• Computing a similarity score between two genestells how likely it is that they have similarfunctions• Dynamic programming is a technique forrevealing similarities between genesMotivating DynamicProgrammingDynamic programming example:Manhattan Tourist ProblemImagine seeking apath (from sourceto sink) to travel(only eastward andsouthward) with themost number ofattractions (*) in theManhattan gridSink******** ***Source*Imagine seeking apath (from sourceto sink) to travel(only eastward andsouthward) with themost number ofattractions (*) in theManhattan gridSink******** ***Source*Dynamic programming example:Manhattan Tourist ProblemManhattan Tourist Problem: FormulationGoal: Find the longest path in a weightedgrid.Input: A weighted grid G with two distinctvertices, one labeled “source” and the otherlabeled “sink”Output: A longest path in G from “source” to“sink”MTP: An Example3 2 40 7 33 301 3 24456465582250 1 2 30123j coordinatei coordinate13sourcesink43 2 4 01 024 331122241995152302034MTP: Greedy Algorithm Is Not Optimal1 2 5 2152 340 0 05303501035512promising start,but leads tobad choices!sourcesink1822MTP: Simple RecursiveProgramMT(n,m) if n=0 or m=0 return MT(n,m) x MT(n-1,m)+ length of the edge from (n- 1,m) to (n,m) y MT(n,m-1)+ length of the edge from (n,m-1) to (n,m) return max{x,y}What’s wrong with this approach?Here’s what’s wrong• M(n,m) needs M(n, m-1) and M(n-1, m)• Both of these need M(n-1, m-1)• So M(n-1, m-1) will be computed atleast twice• Dynamic programming: the same ideaas this recursive algorithm, but keep allintermediate results in a table and reuse150 101isource15S1,0 = 5S0,1 = 1• Calculate optimal path score for each vertex in the graph• Each vertex’s score is the maximum of the prior verticesscore plus the weight of the respective edge in betweenMTP: Dynamic ProgrammingjMTP: Dynamic Programming(cont’d)1 2530 1 2012source1 3584S2,0 = 8iS1,1 = 4S0,2 = 33-5jMTP: Dynamic Programming(cont’d)1 2530 1 2 30123isource1 358840581035-59131-5S3,0 = 8S2,1 = 9S1,2 = 13S3,0 = 8jMTP: Dynamic Programming(cont’d)greedy alg. fails!1 2 5-5 1 -5-53053035010-3-50 1 2 30123isource1 3 85884913 8912S3,1 = 9S2,2 = 12S1,3 = 8jMTP: Dynamic Programming(cont’d)1 2 5-5 1 -5-53 30 053035010-3-5-520 1 2 30123isource1 3 85884913 8129159jS3,2 = 9S2,3 = 15MTP: Dynamic Programming(cont’d)1 2 5-5 1 -5-53 30 053035010-3-5-520 1 2 30123isource1 3 85884913 8129159j0116S3,3 = 16(showing all back-traces)Done!MTP: RecurrenceComputing the score for a point (i,j) by therecurrence relation:si, j =maxsi-1, j + weight of the edge between (i-1, j) and (i, j)si, j-1 + weight of the edge between (i, j-1) and (i, j)The running time is n x m for a n by m grid(n = # of rows, m = # of columns)Manhattan Is Not A Perfect GridWhat about diagonals?• The score at point B is given by:sB =maxofsA1 + weight of the edge (A1, B)sA2 + weight of the edge (A2, B)sA3 + weight of the edge (A3, B)BA3A1A2Manhattan Is Not A Perfect Grid (cont’d)Computing the score for point x is given by therecurrence relation:sx =maxofsy + weight of vertex (y, x) where y є Predecessors(x)• Predecessors (x) – set of vertices that have edgesleading to x•The running time for a graph G(V, E)(V is the set of all vertices and E is the set of all edges)is O(E) since each edge is evaluated onceTraveling in the Grid•By the time the vertex x is analyzed, the valuessy for all its predecessors y should be computed– otherwise we are in trouble.•We need to traverse the vertices in some order•For a grid, can traverse vertices row by row,column by column, or diagonal by diagonal•If we had, instead of a (directed) grid, anarbitrary DAG (directed acyclic graph) ?Traversing the Manhattan Grid• 3 different strategies:– a) Column by column– b) Row by row– c) Along diagonalsa) b)c)Topological Ordering• A numbering of vertices of the graph iscalled topological ordering of the DAG ifevery edge of the DAG connects a vertexwith a smaller label to a vertex with alarger label• In other words, if vertices are positioned ona line in an increasing order of labels thenall edges go from left to right.• How to obtain a topological sort (???)Longest Path in DAG Problem• Goal: Find a longest path between twovertices in a weighted DAG• Input: A weighted DAG G with source andsink vertices• Output: A longest path in G from source tosinkLongest Path in DAG: Dynamic Programming• Suppose vertex v has indegree 3 andpredecessors {u1, u2, u3}• Longest path to v from source is:In General:sv = maxu (su + weight of edge from u to v)sv =maxofsu1 + weightofedgefrom u1 tovsu2 + weightofedgefrom u2 tovsu3 + weightofedgefrom u3 tovAlignmentAligning DNA SequencesV = ATCTGATGW = TGCATACn = 8m = 7CATACGTGTAGTCTAV W matchdeletioninsertionmismatchindels4122 matches mismatches insertions deletionsAlignment : 2 x k matrix ( k ≥ m, n )Longest Common Subsequence (LCS) –Alignment without Mismatches• Given two sequences v = v1 v2…vm and w = w1 w2…wn• The LCS of v and w is a sequence of positions inv: 1 < i1 < i2 < … < it < mand a sequence of positions inw: 1 < j1 < j2 < … < jt < nsuch that it -th letter of v equals to jt-letter of w and tis maximalLCS: ExampleA T--C T G A T C--T G C T--A--Celements of velements of w--A12012233435455666778j coords:i
View Full Document