Comparative Analysis of Methods for Multiple Sequence Alignment

Yong Yang

Abstract

Five multiple sequence alignment methods were analyzed in this study. The purpose of the study is to evaluate each method's ability to correctly align highly conserved motifs and secondary structure elements (helices and strands) in the input sequences, and in particular to avoid placing gaps in those regions. Forty-six protein sequences from the thymidylate kinase family were used to test the methods. Among the methods tested, the Hidden Markov Model (HMM) method generated the best alignment.

Introduction

The simultaneous alignment of a number of nucleotide or amino acid sequences to produce an optimal global or local alignment is of great utility in molecular biology. Multiple alignments are used to find motifs or conserved domains and to characterize families of sequences. They can also be used to identify new members of existing protein families and to assign function to new sequences. If the structure of some members of a family is known, multiple sequence alignments can help predict the secondary and tertiary structure of other family members. Multiple sequence alignment can also be used for phylogenetic or evolutionary analysis.

Automatic multiple sequence alignment is an important topic in computational biology. The importance of multiple sequence alignment algorithms is evidenced by the large number of programs available for this task. They can be divided into four categories:

1. Progressive global alignment of the sequences, starting with the alignment of the most similar pair of sequences and then adding more sequences by aligning them to the consensus of the alignment already generated. ClustalW [41] and Pileup [6] belong to this category.
2. Iterative alignment methods that make an initial alignment of the input sequences and then iteratively refine it to achieve better results. DIALIGN [25, 26] and SAGA [30] belong to this category.
3.
Optimal local alignments of common subsequences among the input sequences. BLOCKMAKER [9, 11], MACAW [39], eMotif [13, 29], MOTIF, and Profile Maker belong to this category.
4. Use of statistical methods and probabilistic models of the sequences. Hidden Markov Models [14, 17, 19], MEME [2, 4, 8], and the Gibbs sampler [23, 28] belong to this category.

Multiple sequence alignments are usually inferred from primary sequence alone. In a multiple sequence alignment, residues from a set of sequences are aligned together in columns. The residues in one column are supposed to be structurally or evolutionarily related. Ideally, the residues in a column occupy similar three-dimensional structural positions, or all diverge from a common ancestral residue. The gold standard for multiple sequence alignments is the structural alignment. One difficulty in making sequence alignments reflect structural similarity is that gaps in the alignment should be confined to regions outside the core. Alignments that reflect structure should have few, if any, gaps in the core regions. The core region consists of secondary structure elements, namely helices and strands. In a good multiple alignment there should therefore be few gaps in helices and strands, since gaps there will usually disturb the structure. There should also be few gaps in the highly conserved motifs or domains of a protein family. In this study, five different multiple sequence alignment methods were analyzed to evaluate their ability to correctly place gaps with respect to helices, strands, and highly conserved motifs.

Materials and Methods

Test Data Set

Forty-six thymidylate kinases from prokaryotes and eukaryotes were chosen as the data set for testing the multiple alignment methods. The Swiss-Prot IDs of these forty-six proteins are as follows:

KTHY_AQUAE, KTHY_THEMA, KTHY_BACHD, KTHY_BACSU, KTHY_LACLA, KTHY_HALN1, KTHY_ECOLI, KTHY_YERPE, KTHY_HAEIN, KTHY_PASMU, KTHY_VIBCH, KTHY_BUCAI, KTHY_CAUCR, KTHY_DEIRA, KTHY_NEIMA, KTHY_NEIMB, KTHY_PSEAE, KTHY_CHLMU, KTHY_CHLTR, KTHY_CHLPN, KTHY_MYCGE, KTHY_MYCPN, KTHY_UREPA, KTHY_RICPR, KTH1_SULSO, KTHY_ARCFU, KTHY_THEAC, KTHY_XYLFA, KTHY_AERPE, KTHY_METTH, KTHY_METJA, KTHY_HELPJ, KTHY_HELPY, KTHY_CAMJE, KTHY_SYNY3, KTHY_PYRAB, KTHY_PYRHO, KTHY_CAEEL, KTHY_SCHPO, KTHY_HUMAN, KTHY_MOUSE, KTHY_YEAST, KTHY_VACCV, KTHY_VARV, KTH2_SULSO, KTHY_ASFB7

The Prosite motif for this family is PS01331: [LIV]-[LIVMGSTC]-[DET]-[RH]-[FYHCS]-x(2)-S-[GSTNP]-x-[AVC]-[FY]-[STANQ]

The X-ray structures of Escherichia coli and yeast thymidylate kinase have been solved. The PDB IDs for these two structures are 4TMK and 3TMK, respectively. The secondary structures of Escherichia coli thymidylate kinase (Fig. 1) and yeast thymidylate kinase (Fig. 2) are very similar to each other. As can be seen from the superimposition of the 3D structures of these two proteins (Fig. 3), the 3D structures of the two kinases are also very similar, which suggests that thymidylate kinases in prokaryotes and eukaryotes may adopt similar structures.

Multiple alignment programs tested

Five different multiple alignment programs were tested in this study: ClustalW, PSI-BLAST, HMM, BLOCKMAKER, and Pileup. Please see Table 1 for details about these five programs.

Fig. 1. Secondary structure of Escherichia coli thymidylate kinase (PDB ID 4TMK).

Fig. 2. Secondary structure of yeast thymidylate kinase (PDB ID 3TMK).

Fig. 3. 3D superimposition of Escherichia coli thymidylate kinase and yeast thymidylate kinase.

Table 1. Multiple alignment programs used for comparative analysis in this study

Program Name: ClustalW
Source: ClustalW 1.81, installed on the Eli Lilly and Company Bioinfo server
Reference: [41]

Program Name: HMM
Source: http://www.cse.ucsc.edu/research/compbio/HMMapps/T99-tuneup.html
Reference: [14, 17, 19]

Program Name: Pileup
Source: Pileup in the GCG package, installed on the Eli Lilly and Company Bioinfo server
Reference: A part of the GCG package of sequence analysis programs. Derived from [6]

Program Name: BLOCKMAKER
Source: http://www.blocks.fhcrc.org/blockmkr/make_blocks.html
Reference: [9, 11]

Program Name: PSI-BLAST
Source: http://www.ncbi.nlm.nih.gov/BLAST/
Reference: [1]

Results

The five multiple alignment programs listed in Table 1 were analyzed to evaluate each program's ability to
correctly place gaps with respect to helices, strands, and the highly conserved motifs identified by eMOTIF SEARCH and eMATRIX SEARCH. The structure of Escherichia coli thymidylate kinase was used as the gold standard for evaluating the output alignment from each of the five multiple alignment programs. There are eight helices and four strands in total in Escherichia coli thymidylate kinase. Four statistically significant motifs were identified in Escherichia coli thymidylate kinase by both eMOTIF SEARCH and eMATRIX
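
The evaluation criterion used in this study (few or no gaps inside the helices and strands of a reference structure) can be illustrated with a short sketch. The Python below is only a minimal illustration under stated assumptions, not the procedure actually used here: the function names, the toy alignment, and the region coordinates are hypothetical, and it assumes that every gap character falling in the alignment columns spanned by a reference secondary structure element counts against the alignment.

```python
def reference_columns(aligned_ref, regions):
    """Map 1-based residue ranges of the ungapped reference sequence
    to ranges of 0-based alignment columns."""
    # residue_to_col[i] is the column holding the (i + 1)-th reference residue.
    residue_to_col = [col for col, ch in enumerate(aligned_ref) if ch != "-"]
    columns = {}
    for name, (start, end) in regions.items():
        first = residue_to_col[start - 1]
        last = residue_to_col[end - 1]
        # Columns between the first and last residue are all included, so
        # insertions relative to the reference inside a region are counted.
        columns[name] = range(first, last + 1)
    return columns


def count_gaps(alignment, ref_id, regions):
    """Count gap characters per region, summed over all aligned sequences."""
    cols = reference_columns(alignment[ref_id], regions)
    return {
        name: sum(seq[c] == "-" for seq in alignment.values() for c in col_range)
        for name, col_range in cols.items()
    }


# Hypothetical toy alignment; "REF" plays the role of the sequence with a
# known structure, and the region coordinates are invented for illustration.
alignment = {
    "REF":  "MKA-LIVDEG",
    "SEQ1": "MKAQLI-DEG",
    "SEQ2": "M-A-LIVDEG",
}
regions = {"helix1": (1, 3), "strand1": (4, 6)}

print(count_gaps(alignment, "REF", regions))
# prints {'helix1': 1, 'strand1': 1}
```

Under this assumed scoring, a lower count per helix or strand indicates an alignment that better respects the core of the reference structure. The same counting could be applied to the columns of a conserved motif instead of a secondary structure element.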