Ross Metzger
June 4, 2004
Biochemistry 218

Multiple Alignment of Genomic Sequences

Genomic sequence is currently available from ENTREZ for more than 40 eukaryotic and 157 prokaryotic organisms. As part of the ongoing NIH Intramural Sequencing Center's Comparative Vertebrate Sequencing project, genomic sequences will soon be available from 50 vertebrates for regions orthologous to defined regions of the human genome. Managing and interpreting these sequence data requires new computational tools, including programs designed to align multiple genomic sequences. Biologists can use such alignments to identify functional elements (coding regions and transcription factor binding sites, as well as highly conserved elements whose exact function(s) remain to be determined, e.g., the recently described ultraconserved elements; Bejerano et al., 2004), to understand the evolution of genome sequence and structure, and for phylogenetic analysis (for reviews see Boffelli et al., 2004; Dubchak and Frazer, 2003; Frazer et al., 2003; Ureta-Vidal et al., 2003).

The goal of an alignment program is to align orthologous positions, i.e., positions in the sequences to be aligned that descend from the same position in the ancestral sequence. Programs should be as sensitive as possible (aligning as much orthologous sequence as possible) but should also be as precise as possible (only orthologous sequences should be aligned). Non-orthologous sequences should either not be aligned or matched to a gap.

Alignment programs can be used to align multiple whole genomes or to align multiple large genomic sequences. Genomes evolve by rearrangements, inversions, and duplications and contain repetitive elements, all of which can pose problems for alignment tools. Programs that use a global alignment strategy assume that orthologous regions are found in the same order in all the sequences to be aligned. For whole genomes, this assumption is then false. Local alignment programs can detect transpositions, inversions, and duplications but may do worse than
global aligners at detecting orthologous regions in widely diverged sequences. I will discuss five programs (MultiPipMaker, Multi-LAGAN, CHAOS/DIALIGN, MAVID, and TBA) designed to align multiple genomic sequences that produce local or global multiple alignments. Those programs that produce global multiple alignments (all except MultiPipMaker) assume that order within the orthologous sequences to be aligned is conserved. Global alignment programs can be used to align whole genomes if the genomes are first broken down into chunks (e.g., by local aligners) in which conservation of order is assumed (see, for example, Brudno et al., 2004). This may not always be the case, however, because small-scale rearrangements can occur (Kent et al., 2003).

Repetitive elements are dealt with either by removing them before aligning the sequences or by masking them initially so that they are allowed to be aligned only if they are adjacent to aligned non-repeat regions. Both of these approaches require that species-specific repetitive sequences can be identified.

Programs for aligning multiple genomic sequences

MultiPipMaker

MultiPipMaker, available as a web-based server (http://bio.cse.psu.edu/pipmaker), generates true multiple alignments of long DNA sequences (Schwartz et al., 2003a). It returns all local alignments that score above a specified threshold. MultiPipMaker begins by generating a multiple alignment using local pairwise alignments between a reference sequence and each of the other sequences, computed by the BLASTZ program (Schwartz et al., 2003b). This initial crude multiple alignment is then refined to generate a true multiple alignment.

BLASTZ is a local alignment tool which generates a set of local alignments using a Gapped BLAST-like strategy. BLASTZ finds short near-exact matches: sequences must match at 12 specific positions within runs of 19 nucleotides, and a transition is allowed at any one of the 12 positions. These matches are then extended in both directions, not allowing gaps, until the score drops below some threshold.
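The seeding and ungapped extension steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BLASTZ itself: the text gives the counts (12 of 19) but not the positions, so the seed pattern below is an assumed 12-of-19 template; the +1/-1 scores, the drop-off parameter `xdrop`, and the function names `seed_hit` and `extend_ungapped` are hypothetical choices made for the example.

```python
# Assumed 12-of-19 spaced seed: '1' marks a position that must match.
# The source text does not give the pattern; this one is illustrative.
SEED = "1110100110010101111"
CARE = [k for k, c in enumerate(SEED) if c == "1"]

# Purine<->purine and pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_hit(a, b, i, j):
    """True if a[i:i+19] and b[j:j+19] agree at the 12 seed positions,
    tolerating a transition at no more than one of them."""
    if i < 0 or j < 0 or i + len(SEED) > len(a) or j + len(SEED) > len(b):
        return False
    transitions = 0
    for k in CARE:
        x, y = a[i + k], b[j + k]
        if x == y:
            continue
        if (x, y) not in TRANSITIONS:
            return False          # transversion at a required position
        transitions += 1
        if transitions > 1:
            return False          # only one transition is tolerated
    return True


def _walk(a, b, i, j, step, match, mismatch, xdrop):
    """Extend without gaps from (i, j) in direction `step`.
    Returns (best score gained, number of columns kept)."""
    score = best = best_len = n = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        score += match if a[i] == b[j] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > xdrop:
            break                 # score fell too far below the best seen
        i += step
        j += step
    return best, best_len


def extend_ungapped(a, b, i, j, match=1, mismatch=-1, xdrop=3):
    """Extend a seed hit at (i, j) in both directions, no gaps allowed.
    Returns (start_a, start_b, length, score). Scores are toy values,
    not the BLASTZ substitution matrix."""
    w = len(SEED)
    core = sum(match if a[i + k] == b[j + k] else mismatch for k in range(w))
    right, right_len = _walk(a, b, i + w, j + w, +1, match, mismatch, xdrop)
    left, left_len = _walk(a, b, i - 1, j - 1, -1, match, mismatch, xdrop)
    return i - left_len, j - left_len, left_len + w + right_len, core + left + right
```

A caller would scan candidate position pairs with `seed_hit` and pass each hit to `extend_ungapped`, keeping only extensions whose score clears a chosen threshold. A real implementation would index seed words in a hash table rather than test position pairs one at a time.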
The scores of low-complexity sequence matches are downweighted. Ungapped matches that score above a certain threshold are then extended using a dynamic programming method that allows for gaps. BLASTZ then searches in between each pair of adjacent alignments for 7-mer exact matches and allows a lower threshold to determine which ungapped matches to extend. The idea is to use less strict criteria to align sequences in between those that align based on stricter criteria.

The local pairwise alignments are pruned to eliminate any overlaps and then strung together. These strung-together pairwise alignments then contain gaps within the local alignments, which are penalized using affine gap penalties, and gaps between local alignments. In constructing and refining the multiple alignment, gaps between these local alignments are not penalized. The crude multiple alignment is assembled from these strung-together pairwise alignments with the common reference sequence and then refined using an iterative procedure designed to produce an optimal multiple alignment score. Each sequence within defined sub-regions (a sub-region in which there is no internal gap in that sequence) is removed from the alignment, the alignment is adjusted to close any internal gap found in all the remaining sequences, and the removed sequence is realigned.

MultiPipMaker uses the BLASTZ alignment scoring system to score nucleotide substitutions for all pairwise and multiple alignments. This matrix was optimized for human-mouse comparisons and so may not be optimal for comparisons of sequence from other organisms (Chiaromonte et al., 2002). Most of the programs use the same scoring matrix and gap penalties for all organisms, though species-specific ones can be implemented (Brudno et al., 2004). As more analysis of available genomes is done, more realistic scoring tools (modeling gap distribution, e.g.) can be developed (Kent et al., 2003).

MultiPipMaker requires that only the reference sequence be finished quality. All other sequence can be
provided as draft quality, in unordered or unoriented contigs. MultiPipMaker projects the other sequences onto the reference sequence, so which sequence is chosen as the reference will affect the resulting alignment.

Multi-LAGAN

MLAGAN (Multi-Limited Area Global Alignment of Nucleotides), accessible as a web-based server (http://lagan.stanford.edu), uses a progressive alignment strategy to construct a multiple alignment, which can then be improved using an iterative refinement strategy (Brudno et al., 2003b). MLAGAN begins by