Sequence Alignment, Cont'd

Outline:
• Evolution; scoring functions
• Sequence alignment; the Needleman-Wunsch algorithm
• The Smith-Waterman algorithm
• Scoring the gaps more accurately; compromise: affine gaps
• Bounded dynamic programming
• Linear-space alignment (Hirschberg's algorithm)
• The Four-Russian algorithm: a useful speedup of dynamic programming [Arlazarov, Dinic, Kronrod, Faradzev 1970]
• Indexing-based local aligners: BLAST, WU-BLAST, BlastZ, MegaBLAST, BLAT, PatternHunter, …
• Some useful applications of alignments
• Next lecture

Evolution / Scoring Function

• Sequence edits, relative to AGGCCTC:
    Mutations:  AGGACTC
    Insertions: AGGGCCTC
    Deletions:  AGG.CTC

Scoring function:
    Match:    +m
    Mismatch: -s
    Gap:      -d

Score F = (# matches) m - (# mismatches) s - (# gaps) d

Sequence Alignment

    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x_1 x_2 ... x_M and y = y_1 y_2 ... y_N, an alignment is an assignment of gaps to positions 0, …, M in x and 0, …, N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

    x: AGGCTATCACCTGACCTCCAGGCCGATGCCC
    y: TAGCTATCACGACCGCGGTCGATTTGCCCGAC

The Needleman-Wunsch Algorithm

1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = -j d
   c. F(i, 0) = -i d
2. Main iteration: filling in partial alignments.
   For each i = 1…M, for each j = 1…N:

    F(i, j) = max { F(i-1, j-1) + s(x_i, y_j)   [case 1]
                    F(i-1, j) - d               [case 2]
                    F(i, j-1) - d               [case 3] }

    Ptr(i, j) = DIAG in case 1, LEFT in case 2, UP in case 3

3. Termination.
   F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment.

The Smith-Waterman Algorithm

Idea: ignore badly aligning regions.
Modifications to Needleman-Wunsch:

Initialization: F(0, j) = F(i, 0) = 0

Iteration:
    F(i, j) = max { 0
                    F(i-1, j) - d
                    F(i, j-1) - d
                    F(i-1, j-1) + s(x_i, y_j) }

Scoring the Gaps More Accurately

Simple, linear gap model: a gap of length n incurs penalty n·d.
However, gaps usually occur in bunches.
Convex gap penalty function γ(n): for all n, γ(n+1) - γ(n) <= γ(n) - γ(n-1).
Algorithm for a general gap function γ(n): O(N^3) time, O(N^2) space.

Compromise: Affine Gaps

    γ(n) = d + (n-1)e
           |         |
           gap open  gap extend

To compute the optimal alignment, at position (i, j) we need to "remember" the best score both when a gap is open and when it is not:

    F(i, j): score of alignment x_1…x_i to y_1…y_j if x_i aligns to y_j
    G(i, j): score if x_i aligns to a gap after y_j
    H(i, j): score if y_j aligns to a gap after x_i
    V(i, j): best score of alignment x_1…x_i to y_1…y_j

Needleman-Wunsch with Affine Gaps

Why do we need two matrices?
1. x_i aligns to y_j:
    x_1……x_i-1  x_i  x_i+1
    y_1……y_j-1  y_j  -
2.
   x_i aligns to a gap:
    x_1……x_i-1  x_i  x_i+1
    y_1……y_j    -    -
   Opening the gap adds -d; extending it adds -e:
    G(i+1, j) = V(i, j) - d    or    G(i+1, j) = G(i, j) - e

Needleman-Wunsch with Affine Gaps

Initialization:
    V(i, 0) = -(d + (i-1)e)
    V(0, j) = -(d + (j-1)e)
Iteration:
    V(i, j) = max { F(i, j), G(i, j), H(i, j) }
    F(i, j) = V(i-1, j-1) + s(x_i, y_j)
    G(i, j) = max { V(i-1, j) - d
                    G(i-1, j) - e }
    H(i, j) = max { V(i, j-1) - d
                    H(i, j-1) - e }
Termination: similar.
Time? Space? (Still O(MN) each; we just keep three matrices instead of one.)

To generalize a little… think of how you would compute an optimal alignment with a general gap function γ(n) in time O(MN).

Bounded Dynamic Programming

Assume we know that x and y are very similar.
Assumption: # gaps(x, y) < k(N)   (say N > M)
Then x_i aligned to y_j implies |i - j| < k(N).
We can align x and y more efficiently:
    Time, space: O(N k(N)) << O(N^2)

Bounded Dynamic Programming

Initialization: F(i, 0), F(0, j) undefined for i, j > k
Iteration:
For i = 1…M
  For j = max(1, i-k)…min(N, i+k)
    F(i, j) = max { F(i-1, j-1) + s(x_i, y_j)
                    F(i, j-1) - d,  if j > i - k(N)
                    F(i-1, j) - d,  if j < i + k(N) }
Termination: same.
Easy to extend to the affine gap case.

Linear-Space Alignment: Hirschberg's Algorithm

• Longest common subsequence:
  Given sequences s = s_1 s_2 … s_m and t = t_1 t_2 … t_n, find the longest common subsequence u = u_1 … u_k.
• Algorithm:
    F(i, j) = max { F(i-1, j)
                    F(i, j-1)
                    F(i-1, j-1) + [1 if s_i = t_j; 0 otherwise] }
• Hirschberg's algorithm solves this in linear space.

Introduction: Compute the Optimal Score

It is easy to compute F(M, N) in linear space:
    Allocate( column[1] )
    Allocate( column[2] )
    For i = 1…M
      If i > 1, then:
        Free( column[i-2] )
        Allocate( column[i] )
      For j = 1…N
        F(i, j) = …

Linear-Space Alignment

To compute both the optimal score and the optimal alignment: divide & conquer approach.
Notation: x^r, y^r denote the reverses of x and y. E.g.,
   x = accgg; x^r = ggcca.
F^r(i, j): optimal score of aligning x^r_1…x^r_i with y^r_1…y^r_j -- the same as aligning x_M-i+1…x_M with y_N-j+1…y_N.

Linear-Space Alignment

Lemma: F(M, N) = max_{k=0…N} ( F(M/2, k) + F^r(M/2, N-k) )
(The optimal path must cross column M/2 at some row k*.)

• Now, using two columns of space, we can compute F(M/2, k) and F^r(M/2, N-k) for k = 0…N, plus the backpointers.
• Now we can find k* maximizing F(M/2, k) + F^r(M/2, N-k).
• Also, we can trace the path exiting column M/2 from k* (entering at k*, exiting at k*+1).
• Iterate this procedure to the left and right!

Linear-Space Alignment

Hirschberg's linear-space algorithm:
MEMALIGN(l, l', r, r'):   (aligns x_l…x_l' with y_r…y_r')
1. Let h = (l'-l)/2
2. Find (in time O((l'-l)(r'-r)), space O(r'-r)) the optimal path L_h entering column h-1 and exiting column h.
   Let k1 = position at column h-2 where L_h enters
       k2 = position at column h+1 where L_h exits
3. MEMALIGN(l, h-2, r, k1)
4. Output L_h
5. MEMALIGN(h+1, l', k2, r')
Top-level call: MEMALIGN(1, M, 1, N)

Linear-Space Alignment

Time and space analysis of Hirschberg's algorithm:
To compute the optimal path at the middle column, for a box of size M × N:
    Space: 2N
    Time: cMN, for some constant c
Then the left and right calls cost c( M/2 · k* + M/2 · (N-k*) ) = cMN/2.
All recursive calls together cost
    Total time: cMN + cMN/2 + cMN/4 + … = 2cMN = O(MN)
    Total space: O(N) for the computation, O(N+M) to store the optimal alignment.

The Four-Russian Algorithm
A useful speedup of dynamic programming [Arlazarov, Dinic, Kronrod, Faradzev 1970]
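As a concrete recap of the basic recurrences from earlier in the lecture, here is a minimal Python sketch (not from the slides themselves; the parameters m, s, d are the match, mismatch, and gap scores defined above, defaulting to 1 each):

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Optimal global alignment score F(M, N), linear gap penalty d."""
    M, N = len(x), len(y)
    # F[i][j] = best score aligning x[:i] with y[:j]
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d            # initialization: x-prefix vs. gaps
    for j in range(1, N + 1):
        F[0][j] = -j * d            # initialization: y-prefix vs. gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            match = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + match,  # case 1: x_i ~ y_j
                          F[i - 1][j] - d,          # case 2: x_i ~ gap
                          F[i][j - 1] - d)          # case 3: y_j ~ gap
    return F[M][N]

def smith_waterman(x, y, m=1, s=1, d=1):
    """Best local alignment score: same recurrence plus a 0 option."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]   # first row/column stay 0
    best = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            match = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(0,                        # restart: ignore bad regions
                          F[i - 1][j - 1] + match,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            best = max(best, F[i][j])               # best over all cells
    return best
```

Note how the only changes from global to local alignment are the zeroed first row and column, the extra 0 in the max, and taking the best score anywhere in the matrix rather than at F(M, N).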
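The linear-space ideas behind Hirschberg's algorithm can likewise be sketched. This is an illustrative sketch, not the slides' own code (function names are my own; scoring is match +1, mismatch -1, gap -1): it shows the two-column score computation and a check of the lemma F(M, N) = max_k ( F(M/2, k) + F^r(M/2, N-k) ).

```python
def nw_score_row(x, y, m=1, s=1, d=1):
    """Needleman-Wunsch scores F(len(x), k) for k = 0..len(y), in O(N) space.

    Only the previous and current rows are kept, mirroring the slides'
    two-column trick (rows here play the role of the slides' columns).
    """
    prev = [-j * d for j in range(len(y) + 1)]        # row i = 0
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            match = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev[j - 1] + match,         # x_i ~ y_j
                         prev[j] - d,                 # x_i ~ gap
                         cur[j - 1] - d)              # y_j ~ gap
        prev = cur
    return prev

def hirschberg_split(x, y):
    """Find k* maximizing F(M/2, k) + F^r(M/2, N-k), and that maximum."""
    h = len(x) // 2
    left = nw_score_row(x[:h], y)                     # F(M/2, k), k = 0..N
    right = nw_score_row(x[h:][::-1], y[::-1])        # reverse scores
    # right[N-k] aligns x[h:] with y[k:], so reverse to index by k.
    total = [l + r for l, r in zip(left, reversed(right))]
    k_star = total.index(max(total))
    return k_star, total[k_star]
```

By the lemma, the maximum returned by `hirschberg_split(x, y)` equals the full global score F(M, N), and k* is where the optimal path crosses the middle column; recursing on the two resulting sub-boxes yields the full alignment in O(MN) time and O(N+M) space.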