Genomic Sequence Alignment

Overview
• Dynamic programming & the Needleman-Wunsch algorithm
• Local alignment: BLAST
• Fast global alignment
• Multiple sequence alignment
• Rearrangements in genomic sequences

Biology in One Slide – Twentieth Century
…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…
…and today

Complete DNA Sequences
About 300 complete genomes have been sequenced

Evolution

Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
SEQUENCE EDITS: Mutation, Deletion
REARRANGEMENTS: Inversion, Translocation, Duplication

Evolutionary Rates
[Figure: changes passed on to the next generation; labels: OK, OK, OK, X, X, Still OK?]

Sequence conservation implies function
Alignment is the key to
• Finding important regions
• Determining function
• Uncovering the evolutionary forces

Sequence Alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition
Given two strings x = x1x2…xM, y = y1y2…yN, an
alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

What is a good alignment?
Alignment: The “best” way to match the letters of one sequence with those of the other
How do we define “best”?
Alignment: A hypothesis that the two sequences come from a common ancestor through sequence edits
Parsimonious explanation: Find the minimum number of edits that transform one sequence into the other

Scoring Function
• Sequence edits: AGGCCTC
   Mutations:  AGGACTC
   Insertions: AGGGCCTC
   Deletions:  AGG.CTC
Scoring Function:
   Match: +m
   Mismatch: -s
   Gap: -d
Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d

How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments: O(2^N)

Dynamic Programming
• Given two sequences x = x1……xM and y = y1……yN
• Let F(i, j) = score of the best alignment of x1……xi to y1……yj
• Then F(M, N) = score of the best alignment
Idea: Compute F(i, j) for all i and j
Do this by using F(i-1, j), F(i, j-1), F(i-1, j-1)

Dynamic Programming (cont’d)
Notice three possible cases:
1. xi aligns to yj
      x1……xi-1 xi
      y1……yj-1 yj
   F(i, j) = F(i-1, j-1) + m, if xi = yj
   F(i, j) = F(i-1, j-1) - s, if not
2. xi aligns to a gap
      x1……xi-1 xi
      y1……yj   -
   F(i, j) = F(i-1, j) - d
3. yj aligns to a gap
      x1……xi   -
      y1……yj-1 yj
   F(i, j) = F(i, j-1) - d

Dynamic Programming (cont’d)
• How do we know which case is correct?
Inductive assumption: F(i, j-1), F(i-1, j), F(i-1, j-1) are optimal
Then,
   F(i, j) = max { F(i-1, j-1) + s(xi, yj),
                   F(i-1, j) - d,
                   F(i, j-1) - d }
where s(xi, yj) = m, if xi = yj; -s, if not
[Figure: cell (i, j) and its neighbors (i-1, j-1), (i-1, j), (i, j-1)]

Example
x = AGTA
y = ATA
m = 1, s = 1, d = 1 (a match scores +1; a mismatch or a gap scores -1)

F(i, j)    i = 0    1    2    3    4
                    A    G    T    A
j = 0          0   -1   -2   -3   -4
j = 1    A    -1    1    0   -1   -2
j = 2    T    -2    0    0    1    0
j = 3    A    -3   -1   -1    0    2

Optimal Alignment: F(4, 3) = 2
AGTA
A-TA

The Needleman-Wunsch Algorithm
1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = - j d
   c. F(i, 0) = - i d
2. Main Iteration.
Filling in partial alignments
   a. For each i = 1……M
        For each j = 1……N
           F(i, j) = max { F(i-1, j) - d              [case 1],
                           F(i, j-1) - d              [case 2],
                           F(i-1, j-1) + s(xi, yj)    [case 3] }
           Ptr(i, j) = UP if [case 1]; LEFT if [case 2]; DIAG if [case 3]
3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment

Alignment on a Large Scale
• Given a gene that we care about, how can we compare it to all existing DNA?
• Assume we use Dynamic Programming:
[Figure: gene of interest (~10^5) against the entire genomic database (~10^11)]

Index-based Local Alignment
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running Time:
Theoretical worst case: O(MN)
Fast in practice
[Figure labels: query, DB]

Index-based Local Alignment: BLAST
Dictionary: All words of length k (~11)
Alignment initiated between exact-matching words (more generally, between words of alignment score T)
Alignment: Ungapped extensions until score below statistical threshold
Output: All local alignments with score > statistical threshold
[Figure labels: query, DB, query, scan]

Index-based Local Alignment: BLAST
A C G A A G T A A G G T C C A G T
C C C T T C C T G G A T T G C G A
Example: k = 4, T = 4
The matching word GGTC initiates an alignment
Extension to the left and right with no gaps until alignment falls < 50%
Output:
GTAAGGTCC
GTTAGGTCC

Gapped BLAST
A C G A A G T A A G G T C C A G T
C T G A T C C T G G A T T G C G A
Added features:
• Pairs of words can initiate alignment
• Nearby alignments are merged
• Extensions with gaps until score < T below best score so far
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT

Example
Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
1,726,556 sequences; 8,074,398,388 total letters
>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
Length = 144487
Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125138
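
The Needleman-Wunsch steps listed earlier (initialization, main iteration with pointers, traceback from Ptr(M, N)) can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `needleman_wunsch`, the default values m = s = d = 1 (taken from the AGTA/ATA example), and the tie-breaking order (DIAG, then UP, then LEFT) are assumptions made here.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment of x and y: match +m, mismatch -s, gap -d.

    Returns (optimal score, aligned x, aligned y).
    """
    M, N = len(x), len(y)
    # F[i][j] = score of the best alignment of x[:i] to y[:j]
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization: a prefix aligned against nothing is all gaps.
    for i in range(1, M + 1):
        F[i][0] = -i * d
        ptr[i][0] = "UP"
    for j in range(1, N + 1):
        F[0][j] = -j * d
        ptr[0][j] = "LEFT"

    # 2. Main iteration: fill in partial alignments.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            diag = F[i - 1][j - 1] + (m if x[i - 1] == y[j - 1] else -s)
            up = F[i - 1][j] - d      # x[i-1] aligned to a gap
            left = F[i][j - 1] - d    # y[j-1] aligned to a gap
            best = max(diag, up, left)
            F[i][j] = best
            # Tie-breaking order is an arbitrary choice of this sketch.
            if best == diag:
                ptr[i][j] = "DIAG"
            elif best == up:
                ptr[i][j] = "UP"
            else:
                ptr[i][j] = "LEFT"

    # 3. Termination: trace back from (M, N) to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = needleman_wunsch("AGTA", "ATA")
print(score)
print(ax)
print(ay)
```

On the slide example (x = AGTA, y = ATA) this reproduces the score F(4, 3) = 2 and the alignment AGTA / A-TA. Time and memory are both O(MN), which is why the later slides turn to index-based methods for database-scale searches.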
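
The index-based idea behind BLAST (build a dictionary of query words, scan the database for exact word matches, then extend each match without gaps) can be sketched as below. This is a toy illustration under stated assumptions: the function names, the database string `db`, and the `x_drop` stopping rule (stop once the running score falls `x_drop` below the best score seen) are all hypothetical choices for this sketch. It does not reproduce the statistical threshold used by real BLAST or the "< 50%" rule of the slide example.

```python
from collections import defaultdict


def build_word_index(query, k):
    """Map every length-k word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_seeds(query, db, k):
    """Scan db once; return (query_pos, db_pos) for each exact k-word match."""
    index = build_word_index(query, k)
    seeds = []
    for j in range(len(db) - k + 1):
        for i in index.get(db[j:j + k], ()):
            seeds.append((i, j))
    return seeds


def extend_ungapped(query, db, qi, dj, k, match=1, mismatch=-1, x_drop=2):
    """Extend a seed right, then left, without gaps.

    Each direction stops when the running score is x_drop below the best
    score so far (assumed rule); the extension is trimmed to the best point.
    """
    best = score = k * match
    right = e = 0
    while qi + k + e < len(query) and dj + k + e < len(db):
        score += match if query[qi + k + e] == db[dj + k + e] else mismatch
        e += 1
        if score > best:
            best, right = score, e
        elif best - score >= x_drop:
            break
    score = best
    left = e = 0
    while qi - 1 - e >= 0 and dj - 1 - e >= 0:
        score += match if query[qi - 1 - e] == db[dj - 1 - e] else mismatch
        e += 1
        if score > best:
            best, left = score, e
        elif best - score >= x_drop:
            break
    return (query[qi - left:qi + k + right],
            db[dj - left:dj + k + right],
            best)


query = "ACGAAGTAAGGTCCAGT"   # query row from the slide example
db = "TTGCGTTAGGTCCTTA"       # made-up database string for this sketch
for qi, dj in find_seeds(query, db, k=4):
    print(qi, dj, extend_ungapped(query, db, qi, dj, k=4))
```

With this toy database, the seed for the word GGTC extends to the pair GTAAGGTCC / GTTAGGTCC, the same ungapped output shown in the slide example. Overlapping seeds on the same diagonal extend to the same region; a fuller implementation would merge them.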