Sequence AlignmentSlide 2How do we compute the best alignment?Alignment is additiveDynamic ProgrammingDynamic Programming (cont’d)Slide 7ExampleThe Needleman-Wunsch MatrixThe Needleman-Wunsch AlgorithmPerformanceA variant of the basic algorithm:Different types of overlapsThe Overlap Detection variantThe local alignment problemWhy local alignment – examplesCross-species genome similarityThe Smith-Waterman algorithmSlide 19Scoring the gaps more accuratelyConvex gap dynamic programmingCompromise: affine gapsNeedleman-Wunsch with affine gapsSlide 24To generalize a bit…Bounded Dynamic ProgrammingSlide 27Linear-Space AlignmentSubsequences and SubstringsHirschberg’s algortihmIntroduction: Compute optimal scoreLinear-space alignmentSlide 33Slide 34Slide 35Slide 36Sequence AlignmentSequence Alignment-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGACGiven two strings x = x1x2...xM, y = y1y2…yN,Find the alignment with maximum scoreF = (# matches) m - (# mismatches) s – (#gaps) dAGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGACHow do we compute the best alignment?AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGAAGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTCToo many possible alignments:>> 2N(exercise)Alignment is additiveObservation:The score of aligning x1……xMy1……yNis additiveSay that x1…xixi+1…xM aligns to y1…yjyj+1…yNThe two scores add up:F(x1…xM, y1…yN) = F(x1…xi, y1...yj) + F(xi+1…xm, yj+1…yN)Dynamic Programming•There are only a polynomial number of subproblemsAlign x1…xi to y1…yj•Original problem is one of the subproblemsAlign x1…xM to y1…yN•Each subproblem is easily solved from smaller subproblemsWe will show next•Then, we can apply Dynamic Programming!!!Dynamic Programming!!!Let F(i, j) = optimal score of aligningx1……xiy1……yjF is the DP “Matrix” or “Table”“Memoization”Dynamic Programming (cont’d)Notice three possible cases:1. xi aligns to yjx1……xi-1 xiy1……yj-1 yj2. xi aligns to a gapx1……xi-1 xiy1……yj -3. yj aligns to a gapx1……xi -y1……yj-1 yj m, if xi = yjF(i, j) = F(i – 1, j – 1) + -s, if not F(i, j) = F(i – 1, j) – d F(i, j) = F(i, j – 1) – dDynamic Programming (cont’d)How do we know which case is correct?Inductive assumption:F(i, j – 1), F(i – 1, j), F(i – 1, j – 1) are optimalThen, F(i – 1, j – 1) + s(xi, yj)F(i, j) = max F(i – 1, j) – d F(i, j – 1) – dWhere s(xi, yj) = m, if xi = yj; -s, if notG -A G T A0 -1 -2 -3 -4A -1 1 0 -1 -2T -2 0 0 1 0A -3 -1 -1 0 2F(i,j) i = 0 1 2 3 4Examplex = AGTA m = 1y = ATA s = -1d = -1j = 0123F(1, 1) = max{F(0,0) + s(A, A), F(0, 1) – d, F(1, 0) – d} =max{0 + 1, -1 – 1, -1 – 1} = 1AATTAAProcedure to output Alignment• Follow the backpointers• When diagonal,OUTPUT xi, yj• When up,OUTPUT yj• When left,OUTPUT xiThe Needleman-Wunsch Matrixx1 ……………………………… xMy1 ……………………………… yNEvery nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequencesAn optimal alignment is composed of optimal subalignmentsThe Needleman-Wunsch Algorithm1. Initialization.a. F(0, 0) = 0b. F(0, j) = - j dc. F(i, 0) = - i d2. Main Iteration. Filling-in partial alignmentsa. For each i = 1……M For each j = 1……N F(i – 1,j – 1) + s(xi, yj) [case 1]F(i, j) = max F(i – 1, j) – d [case 2] F(i, j – 1) – d [case 3]DIAG, if [case 1]Ptr(i, j) = LEFT, if [case 2]UP, if [case 3]3. Termination. F(M, N) is the optimal score, andfrom Ptr(M, N) can trace back optimal alignmentPerformance•Time:O(NM)•Space:O(NM)•Later we will cover more efficient methodsA variant of the basic algorithm:•Maybe it is OK to have an unlimited # of gaps in the beginning and end:----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC ||||||| |||| | || ||GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------•Then, we don’t want to penalize gaps in the endsDifferent types of overlapsExample:2 overlapping“reads” from a sequencing project – recall Lecture 1Example:Search for a mouse genewithin a human chromosomeThe Overlap Detection variantChanges:1. InitializationFor all i, j,F(i, 0) = 0F(0, j) = 02. Termination maxi F(i, N)FOPT = max maxj F(M, j)x1 ……………………………… xMy1 ……………………………… yNThe local alignment problemGiven two strings x = x1……xM, y = y1……yNFind substrings x’, y’ whose similarity (optimal global alignment value)is maximumx = aaaacccccggggttay = ttcccgggaaccaaccWhy local alignment – examples •Genes are shuffled between genomes•Portions of proteins (domains) are often conservedCross-species genome similarity•98% of genes are conserved between any two mammals•>70% average similarity in protein sequencehum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174 hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174 hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174 hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174“atoh” enhancer in human, mouse, rat, fugu fishThe Smith-Waterman algorithmIdea: Ignore badly aligning regionsModifications to Needleman-Wunsch:Initialization: F(0, j) = F(i, 0) = 00Iteration: F(i, j) = max F(i – 1, j) – dF(i, j – 1) – dF(i – 1, j – 1) + s(xi, yj)The Smith-Waterman algorithmTermination:1. If we want the best local alignment…FOPT = maxi,j F(i, j)Find FOPT and
View Full Document