Sequence Alignment Cont’d

Outline
• Linear-space alignment
• The Four-Russian Algorithm
• Heuristic Local Aligners: BLAST, WU-BLAST, BlastZ, MegaBLAST, BLAT, PatternHunter, ……
• State of biological databases
• Some useful applications of alignments
• BLAST
• BLAST Original Version
• Gapped BLAST
• Variants of BLAST
• Example
• BLAT: Blast-Like Alignment Tool
• PatternHunter
• Advantage of Non-Consecutive Words
• Hidden Markov Models
• Outline for our next topic
• Example: The Dishonest Casino
• Question # 1 – Evaluation
• Question # 2 – Decoding
• Question # 3 – Learning
• The dishonest casino model
• Definition of a hidden Markov model
• An HMM is memory-less
• A parse of a sequence
• Likelihood of a parse
• Example: the dishonest casino
• The three main questions on HMMs
• Let’s not be confused by notation

Linear-space alignment
• Iterate this procedure to the left and right!
[Figure labels: N-k*, M/2, M/2, k*]

The Four-Russian Algorithm
Main structure of the algorithm:
1. Divide the N×N DP matrix into K×K log_2 N-blocks that overlap by 1 column & 1 row
2. For i = 1……K
3.     For j = 1……K
4.         Compute D_{i,j} as a function of A_{i,j}, B_{i,j}, C_{i,j}, x[l_i…l’_i], y[r_j…r’_j]
Time: O(N^2 / log^2 N) times the cost of step 4

Heuristic Local Aligners
BLAST, WU-BLAST, BlastZ, MegaBLAST, BLAT, PatternHunter, ……

State of biological databases
Sequenced Genomes (size in bp):
    Human            3×10^9
    Mouse            2.7×10^9
    Rat              2.6×10^9
    Fugu fish        3.3×10^8
    Tetraodon        3×10^8
    Mosquito         2.8×10^8
    Drosophila       1.2×10^8
    Worm             1.0×10^8
    2 sea squirts    1.6×10^8
    Rice             1.0×10^9
    Arabidopsis      1.2×10^8
    Yeast            1.2×10^7
    12 different strains
    Neurospora       4×10^7
    14 more fungi within next year
    ~250 bacteria/viruses
Next year: Dog, Chimpanzee, Chicken
Current rate of sequencing:
    4 big labs: 3×10^9 bp/year/lab
    10s small labs

State of biological databases
• Number of genes in these genomes:
    Vertebrate: ~30,000
    Insects: ~14,000
    Worm: ~17,000
    Fungi: ~6,000-10,000
    Small organisms: 100s-1,000s
• Each known or predicted gene has an associated protein sequence
• >1,000,000 known / predicted protein sequences

Some useful applications of alignments
• Given a newly discovered gene,
    Does it occur in other species?
    How fast does it evolve?
• Assume we try Smith-Waterman:
    Our new gene (10^4) against the entire genomic database (10^10 - 10^11)

Some useful applications of alignments
• Given a newly sequenced organism,
• Which subregions align with other organisms?
    Potential genes
    Other biological characteristics
• Assume we try Smith-Waterman:
    Our newly sequenced mammal (3×10^9) against the entire genomic database (10^10 - 10^11)

BLAST (Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running Time: O(MN)
However, orders of magnitude faster than Smith-Waterman
[Figure: query against DB]

BLAST Original Version
Dictionary:
    All words of length k (~11)
    Alignment initiated between words of alignment score ≥ T (typically T = k)
Alignment:
    Ungapped extensions until score below statistical threshold
Output:
    All local alignments with score > statistical threshold
[Figure: query words scanned against DB]

BLAST Original Version
Example:
    A C G A A G T A A G G T C C A G T
    C C C T T C C T G G A T T G C G A
k = 4, T = 4
The matching word GGTC initiates an alignment
Extension to the left and right with no gaps until alignment falls < 50%
Output:
    GTAAGGTCC
    GTTAGGTCC

Gapped BLAST
    A C G A A G T A A G G T C C A G T
    C T G A T C C T G G A T T G C G A
Added features:
• Pairs of words can initiate alignment
• Nearby alignments are merged
• Extensions with gaps in a band around anchor
• Extensions with gaps until score < T below best score so far
Output:
    GTAAGGTCCAGT
    GTTAGGTC-AGT

Variants of BLAST
• MEGABLAST:
    Optimized to align very similar sequences
    • Works best when k = 4i ≥ 16
    • Linear gap penalty
• PSI-BLAST:
    BLAST produces many hits
    Those are aligned, and a pattern is extracted
    Pattern is used for next search; above steps iterated
• WU-BLAST: (Wash U BLAST)
    Optimized, added features
• BlastZ
    Combines BLAST/PatternHunter methodology

Example
Query:
gattacaccccgattacaccccgattaca (29 letters) [2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
          1,726,556 sequences; 8,074,398,388 total letters

>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
 Length = 144487

 Score = 34.2 bits (17), Expect = 4.5
 Identities = 20/21 (95%)
 Strand = Plus / Plus

Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125138 tacacccagattacaccccga 125158

 Score = 34.2 bits (17), Expect = 4.5
 Identities = 20/21 (95%)
 Strand = Plus / Plus

Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125104 tacacccagattacaccccga 125124

>gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
 Length = 139823

 Score = 34.2 bits (17), Expect = 4.5
 Identities = 20/21 (95%)
 Strand = Plus / Plus

Query: 4    tacaccccgattacaccccga 24
            ||||||| |||||||||||||
Sbjct: 3891 tacacccagattacaccccga 3911

Example
Query: Human atoh enhancer, 179 letters [1.5 min]
Result: 57 blast hits
                                                                     Score (bits)  E value
1. gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc...       355        1e-95
2. gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch...       264        4e-68
3. gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc...       256        9e-66
4. gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene...        78        5e-12
5. gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo...        54        7e-05
6. gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O...        44        0.068
7. gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ...        42        0.27
8. gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres...        42        0.27

gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence
 Length = 1517

 Score = 256 bits (129), Expect = 9e-66
 Identities = 167/177 (94%), Gaps = 2/177 (1%)
 Strand = Plus / Plus

Query: 3    tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62
            ||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 1144
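
Sketch: seed-and-extend in the style of BLAST Original Version

The two steps on the BLAST slides (build a dictionary of the query's length-k words, then start an ungapped extension at every word match in the DB) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual BLAST implementation: words must match exactly (so each seed scores k, matching "typically T = k"), scoring is +1 per match and -1 per mismatch, and the extension stops with a simple drop-off rule (`xdrop`) in place of the statistical threshold mentioned on the slide. The function names, the scoring values, and the database string below are all hypothetical; the database string was chosen so that the result reproduces the GTAAGGTCC / GTTAGGTCC output shown on the example slide.

```python
from collections import defaultdict


def build_word_index(query, k):
    """Dictionary step: map each length-k word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def extend(query, db, qi, di, step, match, mismatch, xdrop):
    """Ungapped extension from (qi, di) in direction step (+1 right, -1 left).

    Returns (best_score, best_length) of the extension. Stops when the running
    score falls more than xdrop below the best score seen (an assumed rule).
    """
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= di < len(db):
        score += match if query[qi] == db[di] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > xdrop:
            break
        qi += step
        di += step
    return best, best_len


def seed_and_extend(query, db, k=4, match=1, mismatch=-1, xdrop=2):
    """Scan db for exact query words and extend each hit without gaps.

    Returns a sorted list of (score, query_start, db_start, length) tuples.
    Seeds on the same diagonal that extend to the same segment are reported once.
    """
    index = build_word_index(query, k)
    hits = set()
    for d in range(len(db) - k + 1):
        for q in index.get(db[d:d + k], ()):
            r_score, r_len = extend(query, db, q + k, d + k, +1, match, mismatch, xdrop)
            l_score, l_len = extend(query, db, q - 1, d - 1, -1, match, mismatch, xdrop)
            score = k * match + l_score + r_score
            hits.add((score, q - l_len, d - l_len, k + l_len + r_len))
    return sorted(hits, reverse=True)


# Query from the example slide; the DB string is a made-up illustration.
query = "ACGAAGTAAGGTCCAGT"
db = "TTCCGTTAGGTCCTTGA"
for score, qs, ds, n in seed_and_extend(query, db, k=4):
    print(score)
    print(query[qs:qs + n])
    print(db[ds:ds + n])
```

Three words (AGGT, GGTC, GTCC) hit on the same diagonal here and all extend to the same segment, which is why collecting hits in a set is enough to report it once. A real implementation would also need the score threshold T for non-identical words and a proper significance cutoff, neither of which is modeled in this sketch.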