Stanford CS 374 - Lecture Notes - D727224

Using Multiple Alignments to Improve Seeded Local Alignment Algorithms

Jason Flannick 1,2 and Serafim Batzoglou 1,2

1 Department of Computer Science, Stanford University, Stanford, CA 94304
2 Email [email protected], or [email protected]

Abstract

Multiple alignments among genomes are becoming increasingly prevalent. This trend motivates the development of tools for efficient homology search between a query sequence and a database of multiple alignments. In this paper, we present an algorithm that uses the information implicit in a multiple alignment to dynamically build an index that is weighted most heavily toward promising regions of the multiple alignment. We have implemented Typhon, a local alignment tool that incorporates our indexing algorithm, which our test results show to be more sensitive than algorithms that index only a sequence. This suggests that when applied on a whole-genome scale, Typhon should provide improved homology searches in time comparable to existing algorithms.

Introduction

Sequence alignment is one of the most well-developed and pervasive topics of computational molecular biology. Algorithms in this vein are widely used for tasks varying from the comparative analysis of rodent (1-5) and chicken (6) genomes to the construction of networks of protein interactions (7). With the current sequencing of many genomes (8), fast and sensitive sequence alignment algorithms will likely maintain or increase their role in biological research.

As more and more genomic data has become available, algorithms for locally aligning query sequences to genomic databases have become increasingly important (9-13). Because the exact Smith-Waterman (14) algorithm is impractical for large sequences, database search techniques are almost always based upon the paradigm of seeded alignments. The BLAST algorithm (10) was pivotal in popularizing such a technique, and it has since been incorporated into many tools, a few of which are BLASTZ (4), BLAT (13), and Exonerate (15).
In such algorithms, a set of seeds is first generated between the database and the query. Each seed is then extended to determine whether it is part of a high-scoring local alignment. Extensions typically consist of two phases: first the seed is extended into an un-gapped alignment, and if this alignment scores above a threshold, the seed is then extended with the allowance of gaps. An enhancement to this simple model is to extend only pairs of seeds close to each other (11). Seeds for the BLAST algorithm are traditionally fixed-length words present in both the database and the query, with the word length referred to as the seed's weight. This leads to an inevitable speed/sensitivity trade-off: heavier seeds prune a larger fraction of the search space but miss more alignments than do seeds with a smaller weight.

In recent years, the introduction of spaced seeds has led to significantly improved local alignment algorithms (12, 16-20). Spaced seeds allow non-contiguous patterns of matching nucleotides to initiate a local alignment, and algorithms have been developed (17-22) to compute the probability that a seed will be found within an un-gapped alignment of a given length between two sequences. The optimal seed can then be chosen as the seed that maximizes this probability. It is useful to think of un-gapped alignments of homologous regions as being generated by a probabilistic model that specifies a distribution over matches and mismatches (17, 20, 22). The model outputs a bit string where each position corresponds to a position in the alignment; the bit is 1 if there is a match in the alignment and 0 if there is a mismatch.
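
The bit-string view of an un-gapped alignment, and the probability that a seed is found within it, can be illustrated with a small sketch. This is not the method of any of the cited tools: the function names (`match_bits`, `seed_hits`, `hit_probability`), the seed notation ('1' for a required match, '0' for a position that is ignored), and the example seeds are all assumptions made for illustration. The sketch also assumes the simplest model, in which every position is a match independently with a fixed probability `p`, and it computes the probability by brute-force enumeration, which is only feasible for short alignments.

```python
from itertools import product


def match_bits(a: str, b: str) -> str:
    """Bit string of an un-gapped alignment: '1' for a match, '0' for a mismatch."""
    if len(a) != len(b):
        raise ValueError("an un-gapped alignment needs equal-length sequences")
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


def seed_hits(bits: str, seed: str) -> bool:
    """Return True if the seed fits at some offset of the bit string.

    In the seed, '1' marks a position that must be a match and '0' marks a
    position that is ignored (hypothetical notation for this sketch).
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    for start in range(len(bits) - len(seed) + 1):
        if all(bits[start + i] == "1" for i in required):
            return True
    return False


def hit_probability(seed: str, length: int, p: float) -> float:
    """Exact probability that the seed hits a random bit string of given length.

    Assumes each bit is 1 independently with probability p. Enumerates all
    2**length bit strings, so it is meant for small lengths only.
    """
    total = 0.0
    for combo in product("01", repeat=length):
        bits = "".join(combo)
        if seed_hits(bits, seed):
            ones = bits.count("1")
            total += p ** ones * (1.0 - p) ** (length - ones)
    return total


if __name__ == "__main__":
    consecutive = "1111"   # contiguous seed of weight 4
    spaced = "110101"      # spaced seed of weight 4 spanning 6 positions
    print(match_bits("ACGTAC", "ACCTAC"))
    for p in (0.5, 0.7, 0.9):
        print(p,
              hit_probability(consecutive, 12, p),
              hit_probability(spaced, 12, p))
```

Comparing the two columns printed for each value of `p` shows how seed shape alone, at equal weight, changes the chance of initiating an alignment. Practical seed-design algorithms replace the enumeration with dynamic programming so that realistic alignment lengths can be handled.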
While higher-order models are possible (19), in this paper we will focus on models that output a 1 independently in each position with a fixed probability, which is called the similarity level (12).

In addition to being provably more sensitive than consecutive seeds in some cases (21), spaced seeds allow an important new speed/sensitivity trade-off. Rather than lowering the weight of a seed to boost sensitivity, one can index multiple seeds per position and obtain a linear, rather than exponential, rise in the size of the search space (18, 20). Spaced seed design operates under a resource-constrained paradigm (17), where the weight and number of seeds are specified and the goal is to design an optimal set of seeds that fits these constraints.

In this paper, we seek to build on these developments by taking advantage of increasing amounts of available genomic data as well as rapidly improving global multiple sequence alignment algorithms (23-26). We predict that in the near future, these trends will lead to the proliferation of genomic databases consisting of multiple alignments. Information implicit in an alignment has been used to aid in a variety of bioinformatics tasks (27-30), and, similarly, one can hope that a multiple alignment can be utilized to improve database search algorithms.

Previous research on searching between multiple alignments has concentrated on position-specific scoring schemes (11, 31, 32). PSI-BLAST (11) is the most popular such program; given a query sequence, it builds a multiple alignment, or profile, from a set of high-scoring alignments of the query to the database. It then uses the constructed profile to iterate searches for improved sensitivity. Approaches in this vein have been successful, but in this paper, our focus is orthogonal to such techniques.

The problem we tackle is to align a query sequence to a fixed multiple alignment database.
As an example, it may be desirable to augment a multiple alignment of mammalian genomes with a newly sequenced mammalian or vertebrate genome. Our approach uses the multiple alignment database to improve search sensitivity over that obtained using only a sequence database. To do this, we extend the resource-constrained paradigm to apply not only to seed design but also to seed allocation; we allow different positions in the database to index different sets of seeds and determine the best way to do so based on the information implicit in the multiple alignment. We have implemented a local alignment tool, Typhon, which incorporates our indexing algorithm. Tests on real-world data show that Typhon is substantially more sensitive than standard sequence indexing algorithms as well as algorithms that index multiple alignments without using our dynamic indexing methodology. The performance improvement is most dramatic for
