DOC PREVIEW
Stanford CS 374 - Study Notes

This preview shows page 1-2-3 out of 9 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Designing Multiple Simultaneous Seeds forDNA Similarity SearchYanni [email protected] of Computer Scienceand EngineeringWashington UniversitySt. Louis, MO 63130Jeremy [email protected] of Computer Scienceand EngineeringWashington UniversitySt. Louis, MO 63130ABSTRACTThe challenge of similarity search in massive DNA sequencedatabases has inspired major changes in BLAST-style align-ment tools, which accelerate search by inspecting only pairsof sequences sharing a common short “seed,” or pattern ofmatching residues. Some of these changes raise the possi-bility of improving search performance by probing sequencepairs with several distinct seeds, any one of which is suf-ficient for a seed match. However, designing a set of seedsto maximize their combined sensitivity to biologically mean-ingful sequence alignments is computationally difficult, evengiven recent advances [16, 6] in designing single seeds.This work describes algorithmic improvements to seed de-sign that address the problem of designing a set of n seeds tobe used simultaneously. We give a new lo cal search methodto optimize the sensitivity of seed sets. The method relieson efficient incremental computation of the probability thatan alignment contains a match to a seed π, given that ithas already failed to match any of the seeds in a set Π. Wedemonstrate experimentally that multi-seed designs, evenwith relatively few seeds, can be significantly more sensitivethan even optimized single-seed designs.Categories and Subject DescriptorsF.2.2 [Analysis of Algorithms and Problem Complex-ity]: Nonnumerical Algorithms and Problems—patternmatching; H.3.3 [Database Management]: InformationSearch and Retrieval—information filtering, search processGeneral TermsAlgorithms, Design, PerformanceKeywordsbiosequence comparison, similarity search, s eed design,genomic DNA, MandalaPermission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.RECOMB’04, March 27–31, 2004, San Diego, California, USA.Copyright 2004 ACM 1-58113-755-9/04/0003 ...$5.00.1. INTRODUCTIONSimilarity search using large DNA databases is criticalto identifying biologically meaningful features in genomes.Comparing a database of known features to an unannotatedgenome can help identify genes and remains the method ofchoice for finding non-gene features such as transposable ele-ments [23] and putative regulatory regions [10, 15]. Pairwisecomparison of entire mammalian genomes has been used tosupport gene-structure prediction [14], long-range orthologydetection [20], and inference of genome rearrangements [19].Mo de rn biosequence databases place substantial compu-tational demands on tools for fast detection of similarity.These demands, whose growth parallels the exponentiallyincreasing size of the databases [17], have revived innovationin heuristics for fast similarity search in DNA [21, 12, 16, 6].All of these heuristics, like their ancestors BLASTN [1] andFASTN [18], work by first identifying short seed matchesbetween DNA sequences, then more carefully aligning thosepairs of sequences that exhibit one or more seed matches.In this work, a seed π is defined to be an ordered list ofindices {x1. . . xw}. To ensure that each possible seed has aunique representation in this notation, we fix x1= 0 for allseeds. We say that two sequences S and T exhibit a seedmatch at offsets i and j if, for 1 ≤ k ≤ w, S[i + xk] = T [j +xk]. The number of inspected positions w is the weight of π,while the distance s = xw− x1+ 1 is its span. Traditionally,seed matching heuristics have used the contiguous seed πc={0, 1, . . . w−1}, but more recent work [16, 6, 3] has shown thedesirability of utilizing seeds which inspect a discontiguousset of sequence positions.In a large genomic DNA comparison, the seed match-ing phase, which entails a complete scan of the sequencedatabase, can easily take 50% or more of total computa-tion time. One way to reduce this cost is to move it offlineby preprocessing the database. Preprocessing-based searchengines like FLASH [7] and BLAT [12] index a databaseD so that all seed matches between D and some query se-quence q can be found in time proportional to q’s length plusthe number of matches. Dep e nding on the density of seedmatches, prepro ces sed searches may cost an order of mag-nitude less [12] than a linear scan of D. A second way toaccelerate seed matching is to delegate it to specialized hard-ware, such as a field-programmable gate array (FPGA) [9] ornetwork processor [22]. Such hardware, combined with high-speed I/O from the database’s storage, could greatly reducethe time needed to scan D while freeing the hos t processor76to investigate any seed matches in parallel. Preprocessingand hardware-accelerated seeded alignment are therefore at-tractive options for designing new, faster similarity searchto ols.Preprocessing- and hardware-based search share one fea-ture of critical importance for algorithm design: both ap-proaches can efficiently utilize multiple seeds at once. LetΠ = {π1. . . πn} be a set of seeds, and suppose we seek allseed matches induced by at least one seed from Π. Pre-processing produces one index for each seed π ∈ Π, requir-ing n lookups instead of one; however, for reasonable n,this cost is still negligible compared to a linear scan of thedatabase. Hardware-accelerated search engines can exploitthe fine-grained parallelism and wide data paths of special-ized hardware to perform n single-seed searches in parallelon the database as it streams through the device.Alternative search technologies open up an algorithmic de-sign space of multi-seed heuristics, but navigating this spaceis nontrivial. For single seeds, the PatternHunter search en-gine [16] and our own work on the Mandala seed designsoftware [6] have shown the utility of designing a seed tooptimize sensitivity in a probabilistic model M of the align-ments being sought. However, Mandala’s design methods,while applicable to multi-seed design, scale poorly with thesize n of the seed set. When faced with a need for multi-ple seeds, previous work has generally resorted to randomseed


View Full Document

Stanford CS 374 - Study Notes

Documents in this Course
Probcons

Probcons

42 pages

ProtoMap

ProtoMap

19 pages

Lecture 3

Lecture 3

16 pages

Load more
Download Study Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Study Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Study Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?