Stanford CS 262 - T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment - D2036053

School name Stanford University

Course CS 262 - Computational Genomics

Pages 13

T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment

Cédric Notredame1,2,3*, Desmond G. Higgins4 and Jaap Heringa1

1 National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK
2 ISREC, 155, Ch. des Boveresses, CH, 1066 Epalinges/s Lausanne, Switzerland
3 Information Genetique et Structurale, CNRS-UMR 1889, 31 Ch. Joseph Aiguier, 13402 Marseille, France
4 Department of Biochemistry, University College, Cork, Ireland

We describe a new method (T-Coffee) for multiple sequence alignment that provides a dramatic improvement in accuracy with a modest sacrifice in speed as compared to the most commonly used alternatives. The method is broadly based on the popular progressive approach to multiple alignment but avoids the most serious pitfalls caused by the greedy nature of this algorithm. With T-Coffee we pre-process a data set of all pair-wise alignments between the sequences. This provides us with a library of alignment information that can be used to guide the progressive alignment. Intermediate alignments are then based not only on the sequences to be aligned next but also on how all of the sequences align with each other. This alignment information can be derived from heterogeneous sources such as a mixture of alignment programs and/or structure superposition. Here, we illustrate the power of the approach by using a combination of local and global pair-wise alignments to generate the library.
The resulting alignments are significantly more reliable, as determined by comparison with a set of 141 test cases, than any of the popular alternatives that we tried. The improvement, especially clear with the more difficult test cases, is always visible, regardless of the phylogenetic spread of the sequences in the tests.

© 2000 Academic Press

Keywords: pair-wise alignment; progressive alignment; local alignment; global alignment; multiple sequence alignment

*Corresponding author

Introduction

The simultaneous alignment of three or more nucleotide or amino acid sequences is one of the commonest tasks in bioinformatics. Multiple alignments are an essential pre-requisite to many further analyses of protein families such as homology modeling or phylogenetic reconstruction, or are simply used to illustrate conserved and variable sites within a family. These alignments may be further used to derive profiles (Gribskov et al., 1987) or hidden Markov models (Bucher et al., 1996; Haussler et al., 1993) that can be used to scour databases for distantly related members of the family.

The automatic generation of an accurate multiple alignment is potentially a daunting task. Ideally, one would make use of an in-depth knowledge of the evolutionary and structural relationships within the family, but this information is often lacking or difficult to use. General empirical models of protein evolution (Benner et al., 1992; Dayhoff, 1978; Henikoff & Henikoff, 1992) are widely used instead, but these can be difficult to apply when the sequences are less than 30 % identical (Sander & Schneider, 1991). Further, mathematically sound methods for carrying out alignments, using these models, can be extremely demanding in computer resources for more than a handful of sequences (Carrillo & Lipman, 1988; Wang & Jiang, 1994).
In practice, heuristic methods are used for all but the smallest data sets.

The most commonly used heuristic methods are based on the progressive-alignment strategy (Feng & Doolittle, 1987; Hogeweg & Hesper, 1984; Taylor, 1988), with ClustalW (Thompson et al., 1994) being the most widely used implementation. The idea is to take an initial, approximate, phylogenetic tree between the sequences and to gradually build up the alignment, following the order in the tree. Although successful in a wide variety of cases, this method suffers from its greediness. Errors made in the first alignments cannot be rectified later as the rest of the sequences are added in. T-Coffee is an attempt to minimize that effect, and although the strategy we propose here is also a greedy progressive method, it allows for much better use of information in the early stages, as we will see below.

E-mail address of the corresponding author: [email protected]
Abbreviations used: 3D, three-dimensional; SP, sum-of-pairs; NBS(s), nucleotide-binding site(s).
doi:10.1006/jmbi.2000.4042, available online at http://www.idealibrary.com
J. Mol. Biol. (2000) 302, 205-217
0022-2836/00/010205-13 $35.00/0 © 2000 Academic Press

The main alternative to progressive alignment is the simultaneous alignment of all the sequences. Two such packages exist (MSA (Lipman et al., 1989) and DCA (Stoye et al., 1997)), based on the Carrillo and Lipman (1988) algorithm, but they remain an extremely CPU and memory-intensive approach. Iterative strategies (Gotoh, 1996; Notredame & Higgins, 1996) are another interesting alternative. They do not provide any guarantees about finding optimal solutions but are reasonably robust and much less sensitive to the number of sequences than their deterministic counterparts.

All of these methods attempt to carry out global alignments, where one tries to align the full lengths of the sequences with each other. Alternatively, one might wish to consider local similarity, as occurs when two proteins share only a domain or motif.
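
The difference between global and local similarity can be made concrete with a small sketch. The code below is not T-Coffee, nor any program named in the text: it uses the textbook pair-wise dynamic-programming recurrences (a Needleman-Wunsch-style global score and a Smith-Waterman-style local score), and the function name alignment_scores and the toy scoring scheme (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen purely for illustration.

```python
def alignment_scores(a, b, match=2, mismatch=-1, gap=-2):
    """Return (global_score, local_score) for two sequences.

    The scoring values are illustrative assumptions, not parameters
    taken from the paper; the gap penalty is linear (per position).
    """
    rows, cols = len(a) + 1, len(b) + 1

    # Global matrix: end gaps are penalised, so the borders go negative.
    glob = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        glob[i][0] = i * gap
    for j in range(1, cols):
        glob[0][j] = j * gap

    # Local matrix: every cell is floored at 0, so the borders stay 0.
    loc = [[0] * cols for _ in range(rows)]
    best_local = 0

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + s,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])

    # Global score: full lengths aligned (bottom-right cell).
    # Local score: best-scoring pair of subsequences (maximum cell).
    return glob[-1][-1], best_local


# Two toy sequences that share only the motif "WHEEL".
print(alignment_scores("TTTTWHEELTTTT", "GGGGWHEELGGGG"))  # → (2, 10)
```

With these toy sequences the global score is dragged down by the eight forced mismatches in the flanking regions, while the local score reflects the shared motif alone, which is the situation described above for proteins that share only a domain or motif.
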
For two-sequence comparisons, there is the well-known Smith and Waterman (1981) algorithm. Here we use Lalign (Huang & Miller, 1991), from the FASTA package (Pearson & Lipman, 1988), which is a variant of the Smith and Waterman method. It produces sets of non-overlapping local alignments from the comparison of two sequences. For multiple sequences, the Gibbs sampler (Lawrence et al., 1993) and Dialign2 (Morgenstern, 1999) are the main automatic methods. These programs often perform well when there is a clear block of ungapped alignment shared by all of the sequences. They perform poorly, however, on general sets of test cases when compared with global methods (Thompson et
