Heuristic Local Aligners
Indexing-based local alignment
Indexing-based local alignment: Extensions
Sensitivity-Speed Tradeoff
Slide 5
Measured improvement
Non-consecutive words: Patterns
Advantage of Patterns
Multiple patterns
Variants of BLAST
Example
Slide 12
The Four-Russian Algorithm: brief overview. A (not so useful) speedup of Dynamic Programming [Arlazarov, Dinic, Kronrod, Faradzev 1970]
Main Observation
Main Observation: Shifts
The Four-Russian Algorithm
Slide 17
Hidden Markov Models
Outline for our next topic
Example: The Dishonest Casino
Question # 1 – Evaluation
Question # 2 – Decoding
Question # 3 – Learning
The dishonest casino model

CS262 Lecture 4, Win07, Batzoglou

Heuristic Local Aligners
1. The basic indexing & extension technique
2. Indexing: techniques to improve sensitivity
   Pairs of Words, Patterns
3. Systems for local alignment

Indexing-based local alignment
Dictionary:
   All words of length k (~10)
   Alignment initiated between words of alignment score ≥ T (typically T = k)
Alignment:
   Ungapped extensions until score below statistical threshold
Output:
   All local alignments with score > statistical threshold
[Figure: the query is scanned against the DB; labels: query, DB, scan]

Indexing-based local alignment: Extensions
[Figure: dot matrix of A C G A A G T A A G G T C C A G T against C T G A T C C T G G A T T G C G A]
Gapped extensions until threshold
• Extensions with gaps until score < C below best score so far
Output:
   GTAAGGTCCAGT
   GTTAGGTC-AGT

Sensitivity-Speed Tradeoff
[Figure: sensitivity and speed compared for long words (k = 15) and short words (k = 7); plot of Sens. and Speed, axis label X%. Kent WJ, Genome Research 2002]

Sensitivity-Speed Tradeoff
Methods to improve sensitivity/speed
1. Using pairs of words
2. Using inexact words
3. Patterns: non-consecutive positions
……ATAACGGACGACTGATTACACTGATTCTTAC……
……GGCACGGACCAGTGACTACTCTGATTCCCAG……
……ATAACGGACGACTGATTACACTGATTCTTAC……
……GGCGCCGACGAGTGATTACACAGATTGCCAG……
TTTGATTACACAGAT
T G TT CAC G

Measured improvement
[Figure: Kent WJ, Genome Research 2002]

Non-consecutive words: Patterns
Patterns increase the likelihood of at least one match within a long conserved region
[Figure: Consecutive Positions versus Non-Consecutive Positions; labels: 3 common, 5 common, 7 common, 6 common]
On a 100-long 70% conserved region:
                          Consecutive   Non-consecutive
Expected # hits:          1.07          0.97
Prob[at least one hit]:   0.30          0.47

Advantage of Patterns
[Figure labels: 11 positions, 11 positions, 10 positions]

Multiple patterns
• K patterns
   Takes K times longer to scan
   Patterns can complement one another
• Computational problem:
   Given: a model (prob distribution) for homology between two regions
   Find: best set of K patterns that maximizes Prob(at least one match)
[Figure: TTTGATTACACAGAT with patterns T G TT CAC G, T G T C CAG, TTGATT A G]
Buhler et al. RECOMB 2003
Sun & Buhler RECOMB 2004
How long does it take to search the query?

Variants of BLAST
• NCBI BLAST: search the universe http://www.ncbi.nlm.nih.gov/BLAST/
• MEGABLAST: http://genopole.toulouse.inra.fr/blast/megablast.html
   Optimized to align very similar sequences
   • Works best when k = 4i ≥ 16
   • Linear gap penalty
• WU-BLAST: (Wash U BLAST) http://blast.wustl.edu/
   Very good optimizations
   Good set of features & command line arguments
• BLAT http://genome.ucsc.edu/cgi-bin/hgBlat
   Faster, less sensitive than BLAST
   Good for aligning huge numbers of queries
• CHAOS http://www.cs.berkeley.edu/~brudno/chaos
   Uses inexact k-mers, sensitive
• PatternHunter http://www.bioinformaticssolutions.com/products/ph/index.php
   Uses patterns instead of k-mers
• BlastZ http://www.psc.edu/general/software/packages/blastz/
   Uses patterns, good for finding genes
• Typhon http://typhon.stanford.edu
   Uses multiple alignments to improve sensitivity/speed tradeoff

Example
Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
   1,726,556 sequences; 8,074,398,388 total letters

>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
Length = 144487

Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125138 tacacccagattacaccccga 125158

Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4      tacaccccgattacaccccga 24
              ||||||| |||||||||||||
Sbjct: 125104 tacacccagattacaccccga 125124

>gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
Length = 139823

Score = 34.2 bits (17), Expect = 4.5
Identities = 20/21 (95%)
Strand = Plus / Plus
Query: 4    tacaccccgattacaccccga 24
            ||||||| |||||||||||||
Sbjct: 3891 tacacccagattacaccccga 3911

Example
Query: Human atoh enhancer, 179 letters [1.5 min]
Result: 57 blast hits
1. gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc...    355   1e-95
2. gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch...    264   4e-68
3. gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc...    256   9e-66
4. gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene...    78    5e-12
5. gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo...    54    7e-05
6. gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O...    44    0.068
7. gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ...    42    0.27
8. gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres...    42    0.27

gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence
Length = 1517

Score = 256 bits (129), Expect = 9e-66
Identities = 167/177 (94%), Gaps = 2/177 (1%)
Strand = Plus / Plus
Query: 3    tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62
            ||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203

Query: 63   cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122
            ||||||||||||||||||||||||||  ||||||||| |||||||||||||||| |||||
Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262

Query: 123
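
The dictionary, scan, and ungapped-extension steps from the "Indexing-based local alignment" slide can be sketched in a few lines of Python. This is a minimal illustration, not how BLAST or any of the listed tools is implemented. The function names, the +1/-1 scoring, the drop-off rule (stop once the score falls more than `xdrop` below the best so far, borrowed from the gapped-extension slide), and the fixed `min_score` standing in for the statistical threshold are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(query, k):
    """Dictionary step: map each length-k word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def extend_ungapped(query, db, qi, di, k, match=1, mismatch=-1, xdrop=2):
    """Extend an exact k-mer seed left and right without gaps.

    Extension in a direction stops when the running score drops more than
    `xdrop` below the best score seen so far (an assumed stopping rule).
    Returns (score, query_start, db_start, length) of the best segment.
    """
    best = score = k * match
    right = left = 0

    # Extend to the right of the seed.
    q, d, steps = qi + k, di + k, 0
    while q < len(query) and d < len(db):
        score += match if query[q] == db[d] else mismatch
        steps += 1
        if score > best:
            best, right = score, steps
        elif best - score > xdrop:
            break
        q, d = q + 1, d + 1

    # Extend to the left of the seed, starting from the best score so far.
    score = best
    q, d, steps = qi - 1, di - 1, 0
    while q >= 0 and d >= 0:
        score += match if query[q] == db[d] else mismatch
        steps += 1
        if score > best:
            best, left = score, steps
        elif best - score > xdrop:
            break
        q, d = q - 1, d - 1

    return best, qi - left, di - left, left + k + right


def find_local_alignments(query, db, k=4, min_score=8):
    """Scan the database for seed words and keep extensions scoring >= min_score."""
    index = build_index(query, k)
    found = set()  # several seeds on one diagonal extend to the same segment
    for di in range(len(db) - k + 1):
        for qi in index.get(db[di:di + k], ()):
            hit = extend_ungapped(query, db, qi, di, k)
            if hit[0] >= min_score:
                found.add(hit)
    return sorted(found, reverse=True)


if __name__ == "__main__":
    for score, q_start, d_start, length in find_local_alignments(
            "GTAAGGTCCAGT", "CCGTTAGGTCCAGTCC"):
        print(score, q_start, d_start, length)
```

Indexing the query and scanning the database keeps the dictionary small. A real system would use a statistical significance threshold rather than a fixed `min_score`, and would follow the ungapped phase with gapped extension.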
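
The claim on the "Non-consecutive words: Patterns" slide, that a pattern gives a similar expected number of hits but a higher chance of at least one hit, can be explored with a small simulation. The sketch below assumes a simple model in which each position of the conserved region matches independently with probability `identity`. The function names, the trial count, and the specific spaced pattern are assumptions made for illustration: the pattern is the weight-11, span-18 one I recall being published with PatternHunter, and any other 0/1 string can be substituted. Estimates from this toy model need not reproduce the figures on the slide.

```python
import random


def expected_hits(pattern, region_len=100, identity=0.7):
    """Expected number of pattern hits: (#placements) * identity^(#required positions)."""
    weight = pattern.count("1")
    placements = region_len - len(pattern) + 1
    return placements * identity ** weight


def hit_probability(pattern, region_len=100, identity=0.7, trials=2000, seed=0):
    """Monte Carlo estimate of Prob[at least one hit] in a random conserved region."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    required = [i for i, c in enumerate(pattern) if c == "1"]
    placements = region_len - len(pattern) + 1
    hits = 0
    for _ in range(trials):
        # True where the two aligned sequences agree at that position.
        agree = [rng.random() < identity for _ in range(region_len)]
        if any(all(agree[start + offset] for offset in required)
               for start in range(placements)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    consecutive = "1" * 11                 # 11 consecutive positions
    spaced = "111010010100110111"          # 11 required positions over a span of 18
    for name, pattern in (("consecutive", consecutive), ("spaced", spaced)):
        print(name, expected_hits(pattern), hit_probability(pattern))
```

Hits of a consecutive word overlap heavily, so they tend to arrive in clumps within the same few regions. Hits of a spaced pattern share fewer positions, so a comparable expected count is spread over more regions.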