Stanford CS 262 - PatternHunter - faster and more sensitive homology search - D1728349

Home> Schools> Stanford University> Computer Science (CS) > CS 262> PatternHunter - faster and more sensitive homology search

DOC PREVIEW

Stanford CS 262 - PatternHunter - faster and more sensitive homology search

School name Stanford University

Course Cs 262- Computational Genomics

Pages 6

This preview shows page 1-2 out of 6 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 6 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 6 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 6 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

BIOINFORMATICSVol. 18 no. 3 2002Pages 440–445PatternHunter: faster and more sensitivehomology searchBin Ma1, John Tromp2and Ming Li31Computer Science Department, University of Western Ontario, London N6A 5B8,Canada,2Bioinformatics Solutions Inc., 145 Columbia Street West, Waterloo,Ont. N2L 3L2, Canada and3Computer Science Department, University of Waterloo,Waterloo, Ont. N2L 3G1, Canada and Bioinformatics Lab, Computer ScienceDepartment, University of California, Santa Barbara, CA 93106, USAReceived on August 24, 2001; revised on October 10, 2001; accepted on October 15, 2001ABSTRACTMotivation: Genomics and proteomics studies routinelydepend on homology searches based on the strategy offinding short seed matches which are then extended. Theexploding genomic data growth presents a dilemma forDNA homology search techniques: increasing seed sizedecreases sensitivity whereas decreasing seed size slowsdown computation.Results: We present a new homology search algorithm‘PatternHunter’ that uses a novel seed model for in-creased sensitivity and new hit-processing techniques forsignificantly increased speed. At Blast levels of sensitivity,PatternHunter is able to find homologies between se-quences as large as human chromosomes, in mere hourson a desktop.Availability: PatternHunter is available at http://www.bioinformaticssolutions.com, as a commercial package. Itruns on all platforms that support Java. PatternHuntertechnology is being patented; commercial use requires alicense from BSI, while non-commercial use will be free.Contact: [email protected] are interested in faster and more sensitive methodsfor finding all approximate repeats or homologies in oneDNA sequence or between two DNA sequences, as per-formed by the popular Blastn (Altschul et al., 1990) pro-gram. One particular application of this task is in com-parative genomics where large genomes or chromosomessuch as the human one (International Human Genome Se-quencing Consortium, 2001; Venter et al., 2001) need tobe compared.Many programs have been developed for the task. Theseinclude FASTA (Lipman and Pearson, 1985), SIM (Huangand Miller, 1991), the Blast family (Altschul et al., 1990;Gish, 2001; Altschul et al., 1997; Zhang et al., 2000;Tatusova and Madden, 1999), SENSEI (States, 2000),MUMmer (Delcher et al., 1999), QUASAR (Burkhardt etal., 1999), and REPuter (Kurtz and Schleiermacher, 1999).Smith–Waterman alignment which compares all basesagainst all bases is clearly too slow. Two lines of approachlead to improvements. The first is exemplified by Blast,which is used routinely by thousands of scientists. Thisapproach finds short exact ‘seed’ matches (hits), whichare then extended into longer alignments. However, whencomparing two very long sequences, FASTA, SIM, Blastn(BL2SEQ), WU-Blast, and Psi-Blast run very slow andneed large amounts of memory. SENSEI is somewhatfaster and uses much less memory than the above pro-grams, but is currently limited to ungapped alignments.MegaBlast runs quite efficiently with its default gap scoresand large seed length of 28 but turns out to have worseoutput quality and doesn’t scale as well to huge sequences.Another line of approach, exemplified by MUMmer,QUASAR and REPuter, uses suffix trees. Suffix treessuffer from two problems: they are meant to deal withprecise matches and are limited to comparison of highlysimilar sequences (Delcher et al., 1999; Burkhardt et al.,1999; Kurtz and Schleiermacher, 1999). They are veryawkward in handling mismatches. The second problemwith suffix trees is that they have an intrinsic large spacerequirement.We introduce novel seeding schemes and hit-processingmethods, which are implemented in our program Pattern-Hunter. On a modern desktop, its running time rangesfrom seconds for prokaryotic genomes to minutes forArabidopsis chromosomes to hours for human chromo-somes, with very modest memory use, and at provablyhigher sensitivity than the default Blastn.SELECTING GOOD SEEDS: EXPECT LESS TOGET MOREA dilemma for a Blast type of search is that large seedslose distant homologies while small ones creates too many440c Oxford University Press 2002PatternHunterrandom hits which slow down the computation. We use anew idea that allows us to have a higher probability of ahit in a homologous region, even while having somewhatlower expected number of random hits.Blast looks for matches of k (default k = 11 in Blastnand k = 28 in MegaBlast) consecutive letters as seeds.Instead we propose to use nonconsecutive k letters asseeds. We call the relative positions of the k letters amodel, and k its weight.This seemingly simple change has a surprisingly largeeffect on sensitivity. An appropriately chosen model canhave a significantly higher probability of having at leastone hit in a homologous region, compared to Blast’sconsecutive seed model, even while having a lowerexpected number of hits†. For example, in a region oflength 64 with 70% identity, Blast’s consecutive weight 11model has a 0.30 probability of having at least onehit in the range, while a nonconsecutive model of thesame weight has a 0.466 probability of getting a hit,see Figure 1. On the other hand, the expected numberof hits in that region by the Blast consecutive modelis 1.07, while the nonconsecutive model expects 0.93hits. This is because the length 11 model can shift over54 places within the length 64 window, while the length18 model has only 47 places to fit. The reason for theincreased sensitivity is that the events, of having a match atdifferent positions, become more independent for spacedmodels. If a model and a shifted copy share many 1s inthe same position, then a base mismatch in any of theseshared positions will make both matches fail, hence thecorresponding matching events are far from independent.Independent events are better at pooling their successprobabilities together. Generally, the fewer bases sharedby a model and any of its shifted copies, the higher itssensitivity is. Clearly, by this measure, consecutive modelsare the worst, since shift of 1 shares all but one bases.For convenience, we denote a model by a 0–1 string,where the 1-positions represent required matches, whilethe 0s are ‘don’t cares’. For example, if we use a weightsix model 1110111, thenactgact versus acttact is aseed match, as well as actgact versus actgact. So Blastuses models of the form 1k. Blast actually matches twoor three bytes, each containing four bases, simultaneously,and

View Full Document