Sequence Alignment, Cont'd

Example alignment:
  -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
  TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition
Given two strings x = x1x2...xM and y = y1y2...yN, an alignment is an assignment of gaps to positions 0, ..., M in x and 0, ..., N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

The ungapped sequences above:
  AGGCTATCACCTGACCTCCAGGCCGATGCCC
  TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Scoring Function

• Sequence edits:  AGGCCTC
    Mutations:     AGGACTC
    Insertions:    AGGGCCTC
    Deletions:     AGG.CTC

Scoring function:
  Match:    +m
  Mismatch: -s
  Gap:      -d
Score F = (# matches)·m - (# mismatches)·s - (# gaps)·d

The Needleman-Wunsch Algorithm

1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = -j·d
   c. F(i, 0) = -i·d
2. Main iteration. Filling in partial alignments:
   For each i = 1…M, for each j = 1…N:
     F(i, j) = max of:
       F(i-1, j-1) + s(xi, yj)   [case 1]
       F(i-1, j) - d             [case 2]
       F(i, j-1) - d             [case 3]
     Ptr(i, j) = DIAG if case 1, LEFT if case 2, UP if case 3
3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment.
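The recurrence and traceback above can be sketched in Python. The default scoring values (m = 1, s = 1, d = 1) and the `(di, dj)` pointer encoding are illustrative choices of mine, not from the slides:

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment: +m per match, -s per mismatch, -d per gap symbol."""
    M, N = len(x), len(y)
    # F[i][j] = best score aligning x[:i] with y[:j]; Ptr stores the
    # predecessor offset (di, dj) chosen at each cell.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    Ptr = [[None] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0], Ptr[i][0] = -i * d, (1, 0)
    for j in range(1, N + 1):
        F[0][j], Ptr[0][j] = -j * d, (0, 1)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i-1] == y[j-1] else -s
            cases = [
                (F[i-1][j-1] + sub, (1, 1)),  # case 1: x_i aligns with y_j
                (F[i-1][j] - d, (1, 0)),      # case 2: x_i aligns with a gap
                (F[i][j-1] - d, (0, 1)),      # case 3: y_j aligns with a gap
            ]
            F[i][j], Ptr[i][j] = max(cases)   # ties prefer the diagonal
    # Termination: trace back from (M, N) to recover the alignment.
    ax, ay, i, j = [], [], M, N
    while i > 0 or j > 0:
        di, dj = Ptr[i][j]
        ax.append(x[i-1] if di else "-")
        ay.append(y[j-1] if dj else "-")
        i, j = i - di, j - dj
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))
```

For example, `needleman_wunsch("GATTACA", "GCATGCU")` with these unit scores yields an optimal score of 0.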
The Smith-Waterman Algorithm

Idea: ignore badly aligning regions.
Modifications to Needleman-Wunsch:

Initialization: F(0, j) = F(i, 0) = 0
Iteration:
  F(i, j) = max of:
    0
    F(i-1, j) - d
    F(i, j-1) - d
    F(i-1, j-1) + s(xi, yj)

Scoring the Gaps More Accurately

Simple, linear gap model: a gap of length n incurs penalty n·d. However, gaps usually occur in bunches.

Convex gap penalty function γ(n): for all n, γ(n+1) - γ(n) ≤ γ(n) - γ(n-1).
Algorithm for a general convex γ: O(N³) time, O(N²) space.

Compromise: Affine Gaps

  γ(n) = d + (n-1)·e
         |        |
     gap open  gap extend

To compute the optimal alignment, at position (i, j) we need to "remember" the best score both when a gap is open and when it is not:
  F(i, j): score of the alignment of x1…xi to y1…yj, if xi aligns to yj
  G(i, j): score, if xi, or yj, aligns to a gap

Needleman-Wunsch with Affine Gaps

Why do we need two matrices?
1. xi aligns to yj:
     x1……xi-1  xi  xi+1
     y1……yj-1  yj  -        (opening a gap next adds -d)
2. xi aligns to a gap:
     x1……xi-1  xi  xi+1
     y1……yj    -   -        (extending the gap adds -e)

Initialization:
  F(i, 0) = -d - (i-1)·e
  F(0, j) = -d - (j-1)·e
Iteration:
  F(i, j) = max of:
    F(i-1, j-1) + s(xi, yj)
    G(i-1, j-1) + s(xi, yj)
  G(i, j) = max of:
    F(i-1, j) - d
    F(i, j-1) - d
    G(i, j-1) - e
    G(i-1, j) - e
Termination: same as before.

To Generalize a Little…

…think of how you would compute the optimal alignment with an arbitrary gap function γ(n) in time O(MN).

Bounded Dynamic Programming

Assume we know that x and y are very similar.
Assumption: # gaps(x, y) < k(N)   (say N > M)
Then xi aligned with yj implies |i - j| < k(N).
We can align x and y more efficiently:
  Time, Space: O(N·k(N)) << O(N²)

Initialization: F(i, 0), F(0, j) undefined for i, j > k.
Iteration:
  For i = 1…M:
    For j = max(1, i-k)…min(N, i+k):
      F(i, j) = max of:
        F(i-1, j-1) + s(xi, yj)
        F(i, j-1) - d,  if j > i - k(N)
        F(i-1, j) - d,  if j < i + k(N)
Termination: same.
Easy to extend to the affine gap case.
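The banded iteration can be sketched as follows. This is a minimal sketch under the slides' linear gap model; the function name, the dict-based storage of in-band cells, and the use of -infinity for out-of-band predecessors are my own scaffolding:

```python
NEG_INF = float("-inf")

def banded_score(x, y, k, m=1, s=1, d=1):
    """Best global alignment score restricted to the band |i - j| <= k.

    Only the O(N * k) in-band cells are stored (here in a dict keyed by
    (i, j)); out-of-band predecessors are treated as -infinity.  If the
    optimal alignment uses fewer than k gaps, this equals the full
    Needleman-Wunsch score while inspecting far fewer cells.
    """
    M, N = len(x), len(y)
    if abs(M - N) > k:
        return NEG_INF  # the corner (M, N) itself lies outside the band
    F = {(0, 0): 0}
    for j in range(1, min(N, k) + 1):   # F(0, j) defined only for j <= k
        F[(0, j)] = -j * d
    for i in range(1, M + 1):
        if i <= k:                      # F(i, 0) defined only for i <= k
            F[(i, 0)] = -i * d
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sub = m if x[i-1] == y[j-1] else -s
            best = F.get((i-1, j-1), NEG_INF) + sub
            if j > i - k:               # predecessor (i, j-1) is in the band
                best = max(best, F.get((i, j-1), NEG_INF) - d)
            if j < i + k:               # predecessor (i-1, j) is in the band
                best = max(best, F.get((i-1, j), NEG_INF) - d)
            F[(i, j)] = best
    return F[(M, N)]
```

With a band wide enough to contain the optimal path, the result matches unrestricted Needleman-Wunsch; with a narrow band it may be lower, since good paths outside the band are never considered.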
Linear-Space Alignment: Hirschberg's Algorithm

• Longest common subsequence (LCS): given sequences s = s1 s2 … sm and t = t1 t2 … tn, find the longest common subsequence u = u1 … uk.
• Recurrence:
    F(i, j) = max of:
      F(i-1, j)
      F(i, j-1)
      F(i-1, j-1) + [1 if si = tj; 0 otherwise]
• Hirschberg's algorithm solves this in linear space.

Introduction: Computing the Optimal Score

It is easy to compute F(M, N) in linear space, keeping only two columns at a time:

  Allocate column[1]
  Allocate column[2]
  For i = 1…M:
    If i > 1: Free column[i-2]; Allocate column[i]
    For j = 1…N:
      F(i, j) = …

Linear-Space Alignment

To compute both the optimal score and the optimal alignment: a divide & conquer approach.

Notation:
  xr, yr: the reverses of x and y (e.g. x = accgg, xr = ggcca)
  Fr(i, j): optimal score of aligning xr1…xri with yr1…yrj; this is the score of aligning the suffixes xM-i+1…xM and yN-j+1…yN.

Lemma: F(M, N) = max over k = 0…N of ( F(M/2, k) + Fr(M/2, N-k) )

• Using two columns of space, we can compute F(M/2, k) and Fr(M/2, N-k) for k = 0…N, plus the backpointers.
• We can then find k* maximizing F(M/2, k*) + Fr(M/2, N-k*), and trace the path exiting column M/2 at k*.
• Iterate this procedure to the left and right!

Hirschberg's linear-space algorithm:

MEMALIGN(l, l', r, r'):   (aligns xl…xl' with yr…yr')
1. Let h = (l' - l)/2
2. Find, in time O((l' - l)·(r' - r)) and space O(r' - r), the optimal path Lh entering column h-1 and exiting column h.
   Let k1 = position at column h-2 where Lh enters
       k2 = position at column h+1 where Lh exits
3. MEMALIGN(l, h-2, r, k1)
4. Output Lh
5. MEMALIGN(h+1, l', k2, r')
Top-level call: MEMALIGN(1, M, 1, N)

Time and space analysis of Hirschberg's algorithm:
To compute the optimal path at the middle column, for a box of size M×N:
  Space: 2N
  Time: cMN, for some constant c
The left and right recursive calls then cost c(M/2·k* + M/2·(N-k*)) = cMN/2, so all the recursive calls together cost:
  Total time: cMN + cMN/2 + cMN/4 + … ≤ 2cMN = O(MN)
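The divide & conquer idea can be sketched compactly, assuming match +1, mismatch -1, gap -1. Here `nw_score_last_row` plays the role of the two-column score computation, the chosen split index is the k* of the Lemma, and all function names are my own; for simplicity the recursion splits x at its midpoint and falls back to a tiny quadratic alignment for one-character base cases:

```python
def nw_score_last_row(x, y, m=1, s=1, d=1):
    """Last row of the Needleman-Wunsch matrix for x vs y, in O(len(y)) space."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = m if x[i-1] == y[j-1] else -s
            cur[j] = max(prev[j-1] + sub, prev[j] - d, cur[j-1] - d)
        prev = cur
    return prev

def _nw_align(x, y, m, s, d):
    """Plain quadratic Needleman-Wunsch; only used for tiny base cases."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i-1] == y[j-1] else -s
            F[i][j] = max(F[i-1][j-1] + sub, F[i-1][j] - d, F[i][j-1] - d)
    ax, ay, i, j = [], [], M, N
    while i > 0 and j > 0:
        sub = m if x[i-1] == y[j-1] else -s
        if F[i][j] == F[i-1][j-1] + sub:
            ax.append(x[i-1]); ay.append(y[j-1]); i -= 1; j -= 1
        elif F[i][j] == F[i-1][j] - d:
            ax.append(x[i-1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j-1]); j -= 1
    while i > 0:
        ax.append(x[i-1]); ay.append("-"); i -= 1
    while j > 0:
        ax.append("-"); ay.append(y[j-1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))

def hirschberg(x, y, m=1, s=1, d=1):
    """Optimal global alignment in linear space via divide & conquer."""
    if not x or not y or len(x) == 1:
        return _nw_align(x, y, m, s, d)
    h = len(x) // 2
    # Score x[:h] against every prefix of y (forward pass) and x[h:]
    # against every suffix of y (reverse pass); k* maximizes their sum.
    left = nw_score_last_row(x[:h], y, m, s, d)
    right = nw_score_last_row(x[h:][::-1], y[::-1], m, s, d)
    k = max(range(len(y) + 1), key=lambda j: left[j] + right[len(y) - j])
    ax1, ay1 = hirschberg(x[:h], y[:k], m, s, d)
    ax2, ay2 = hirschberg(x[h:], y[k:], m, s, d)
    return ax1 + ax2, ay1 + ay2
```

The score of the alignment returned by `hirschberg` equals the score computed by the linear-space forward pass alone, while never materializing the full M×N matrix.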
  Total space: O(N) for the computation, O(N+M) to store the optimal alignment.

The Four-Russians Algorithm
A useful speedup of Dynamic Programming

Main Observation
Within a rectangle of the DP matrix, the values in region D depend only on the values in the boundary regions A, B, C and on the substrings xl…xl', yr…yr'.

Definition: a t-block is a t×t square of the DP matrix.
Idea: divide the matrix into t-blocks, and precompute all possible t-blocks.
Speedup: O(t).

The Four-Russians Algorithm
Main structure of the algorithm:
• Divide the N×N DP matrix into a K×K grid of (log2N)-blocks that overlap by 1 column and 1 row (so K ≈ N/log2N)
• For i = 1…K
•   For j = 1…K
      …
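Precomputing all possible t-blocks is only feasible because DP values change slowly across the matrix: for unit-cost edit distance, horizontally or vertically adjacent cells differ by at most 1, so a t-block is determined by its boundary offset vectors plus the two substrings, and there are few enough distinct inputs to tabulate. A quick check of that offset property (my own illustration, using plain edit distance rather than the match/mismatch scoring above):

```python
def edit_distance_matrix(x, y):
    """Full unit-cost edit-distance DP matrix D, where D[i][j] is the
    distance between x[:i] and y[:j]."""
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        D[i][0] = i
    for j in range(len(y) + 1):
        D[0][j] = j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = 0 if x[i-1] == y[j-1] else 1
            D[i][j] = min(D[i-1][j-1] + cost, D[i-1][j] + 1, D[i][j-1] + 1)
    return D

def offsets_are_bounded(D):
    """True iff every horizontal and vertical step in D changes the value
    by at most 1, which is what lets t-blocks be encoded by small offset
    vectors instead of unbounded integers."""
    for i in range(len(D)):
        for j in range(len(D[0])):
            if i > 0 and abs(D[i][j] - D[i-1][j]) > 1:
                return False
            if j > 0 and abs(D[i][j] - D[i][j-1]) > 1:
                return False
    return True
```

Because the offsets lie in {-1, 0, +1}, a t-block's input can be encoded in O(t) symbols, which is what bounds the size of the precomputed lookup table.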