CORNELL CS 726 - Multiple Sequence Alignment Using Partial Order Graphs

Unformatted text preview:

ABSTRACTINTRODUCTIONProgressive Alignment and the MSA Representation ProblemALGORITHMIMPLEMENTATIONPartial Order Structure in Biological Sequence AlignmentsDISCUSSIONCONCLUSIONACKNOWLEDGEMENTSREFERENCESFigure LegendsTABLE 11Multiple Sequence Alignment Using Partial OrderGraphsChristopher Lee*, Catherine Grasso, Mark SharlowDepartment of Chemistry and BiochemistryUniversity of California, Los AngelesLos Angeles, CA 90095-1570TEL: 310-825-7374FAX: 310-267-0248EMAIL: [email protected]*Author to whom correspondence should be addressed.Draft September 24, 2001Printed September 24, 20012ABSTRACTMotivation: Progressive multiple sequence alignment (MSA) methods depend on reducing an MSA to a linear profile for each alignment step. However, this leads to loss of information needed for accurate alignment, and gap scoring artifacts. Results: We present a graph representation of an MSA that can itself be aligned directly by pairwise dynamic programming, eliminating the need to reduce the MSA to a profile. This enables our algorithm (POA: partial order alignment) to guarantee that the optimal alignment of each new sequence versus each sequence in the MSA will be considered. Moreover, this algorithm introduces a new edit operator, homologous recombination, important for multidomain sequences. The algorithm has improved speed (linear time complexity) over existing MSA algorithms, enabling construction of massive and complex alignments (e.g. an alignment of 5000 sequences in 4 hours on a Pentium II). We demonstrate the utility of this algorithm on a family of multidomain SH2 proteins, and on EST assemblies containing alternative splicing and polymorphism. Availability: The partial order alignment program POA is available at http://www.bioinformatics.ucla.edu/poa. Contact: Christopher Lee, [email protected] sequence alignment (MSA) is one of the most important tools in bioinformatics, and has a long history of study in computer science and bioinformatics (Needleman & Wunsch, 1970; Smith & Waterman, 1981; Taylor, 1986; Barton & Sternberg, 1987; Higgins & Sharp, 1988; Altschul, 1989; Lipman et al., 1989; Subbiah & Harrison, 1989; Gotoh, 1993; Thompson et al., 1994; Gotoh, 1996; Notredame et al., 2000; Gordon et al., 2001). It continues to be an active field of research, because it poses computational challenges that are only partially solved by existing algorithms. For pairwise sequence alignment (of just two sequences), a globally optimal solution can be found in O(L2) time by dynamic programming, where L is the length of the sequences. This algorithm can be extended to align N sequences optimally, but requires O(LN) time. Whereas the pairwise alignment time of O(L2) is acceptable for typical biological sequences of genes and proteins (L<10,000), the exponential time required for aligning larger numbers of sequences by dynamic programming is impractical. Thus optimal alignment of pairs of sequences is performed routinely, but optimal alignment of multiple sequences is generally not possible.Instead, a variety of heuristic MSA algorithms have been developed, nearly all of them based on progressive application of pairwise sequence alignment to build up alignments of larger numbers of sequences as proposed by Feng and Doolittle (Feng & Doolittle, 1987). An excellent example of the progressive alignment approach is CLUSTAL, initially released by Higgins, et al. in 1988 (Higgins & Sharp, 1988). This algorithm builds a multiple sequence alignment through a series of pair-wise alignments. Initially, all of the sequences are aligned pair-wise resulting in N(N-1)/2 alignments. The scores of these alignments are then used to construct a binary tree of their evolutionary relationships. Finally, the algorithm builds a multiple sequence alignment in the order dictated by the evolutionary tree: the most recently diverged sequences are aligned first resulting in N/2 alignment profiles; the N/2 alignment profiles are aligned to each other resulting in N/4 alignment profiles; and so forth, until all of the sequences have been aligned, resulting in a single alignment profile of the multiple sequence alignment of all N sequences. Finding the scores and constructing the binary evolutionary tree is O(NL2).4Using the binary tree to build the multiple sequence alignment is O(L2logN). Consequently, the overall algorithm runs in polynomial time and can be used to align over a hundred sequences in a reasonable amount of time. As pointed out by Gibson and co-workers (Thompson et al., 1994), this algorithm is greedy by nature and may find a local minimum either because the guide tree is not correct or because alignment errors that happen early on in the process of building the multiple sequence alignment get locked in. As a consequence, this algorithm does not guarantee an optimal multiple sequence alignment. Many MSA methods have been developed with different advantages and disadvantages (Taylor, 1986; Barton & Sternberg, 1987; Altschul, 1989; Lipman et al., 1989; Subbiah & Harrison, 1989; Gotoh, 1993; Thompson et al., 1994; Gotoh, 1996; Notredame et al., 2000; Gordon et al., 2001), but globally optimal alignment of larger numbers of sequences remains impractical.Gibson and coworkers (Thompson et al., 1994) highlighted two major problems with the progressive multiple sequence alignment approach: the local minimum problem, mentioned above, and the choice of appropriate alignment parameters. Their solution was to weight sequences according to their degree of similarity, and to vary gap penalties and substitution matrices during not only the dynamic programming but also the course of building up the full alignment. This strategy was implemented in CLUSTALW, which has been one of the most commonly used multiple sequence alignment programs since 1994 (Thompson et al., 1994). While we agree with this analysis, we believe there are additional problems in progressive alignment, whose resolution can yield further improvements. Specifically, we view handling of gaps and insertions as one of the most troublesome and least rigorously formulated aspects of multiple sequence alignment. We wish to emphasize that this is not a criticism of any existing MSA method, but rather a way of highlighting opportunities for fundamental rethinking of the problem. We will use the term “progressive alignment” to designate the general class of MSA build-up algorithms, and our usages of this term


View Full Document

CORNELL CS 726 - Multiple Sequence Alignment Using Partial Order Graphs

Download Multiple Sequence Alignment Using Partial Order Graphs
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Multiple Sequence Alignment Using Partial Order Graphs and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Multiple Sequence Alignment Using Partial Order Graphs 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?