Contents 1 Comparisons of Long Genomic Sequences Algorithms and Applications Michael Brudno and Inna Dubchak 1 1 1 Comparisons of Long Genomic Sequences Algorithms and Applications 1 1 Introduction 1 2 Local Alignment Seed generation Joining of neighboring seeds Ungapped and gapped extension 1 1 1 2 1 3 Global alignment 1 5 Finding potential anchors Building a consistent set of anchors Filling in the gaps 1 4 Multiple Global Alignment 1 7 Scoring a multiple alignment MGA and DIALIGN alignment algorithms Progressive alignment 1 5 Alignment of Sequences with Rearrangements Pairwise alignment with Shuffle LAGAN alignment with Mauve 1 9 Multiple 1 6 Whole genome alignment 1 11 Local alignment on a whole genome scale tandem approach Local global 1 7 Visualization 1 13 Michael Brudno University of Toronto Inna Dubchak Lawrence Berkeley National Laboratory 1 1 Visualization of pairwise alignments Visualization of multiple alignments Whole genome visualization of alignments 1 8 Applications of alignments 1 17 1 9 Conclusion 1 18 1 10 Acknowledgments 1 19 Introduction Comparing genomic sequences across related species is a fruitful source of biological insight because functional elements such as exons tend to exhibit significant sequence similarity due to purifying selection whereas regions that are not functional tend to be less conserved The first step in comparing genomic sequences is to align them that is to map the letters of one sequence to those of the others There are several categories of alignments local alignments identify local similarities between regions of each sequence global alignments find a mapping between all the letters of the sequences Alignments can be either pairwise between two sequences or multiple that compare several sequences The main challenge in developing algorithms for genomic alignment is that these must be fast enough to deal with megabase 0 8493 8597 0 01 0 00 1 50 c 2001 by CRC Press LLC 1 1 1 2 long sequences and gigabase long genomes but also accurately map individual base pairs While generating alignments is difficult computationally visualization of alignments also presents challenges such as how to enable users to interact with the data and the processing programs in the context of enormous datasets Visualization frameworks should be easy to understand by a biologist and provide insight into the mutations that a particular region has undergone Finally alignments are useful only if they help shed light on the important functional elements in the genomic sequence In this chapter after a detailed discussion of algorithms used to construct genomic alignments and methods to visualize them we give a short overview of several algorithms that use an alignment to improve predictions of transcription factor binding sites 1 2 Local Alignment Local alignment is the basic problem of finding similar fragments in two sequences regardless of the order and location of these similarities Consequently local alignments allow one to identify rearrangements between two sequences and are suitable for aligning draft sequences The original local alignment algorithm is the Smith Waterman 62 dynamic programming approach This algorithm however runs in time proportional to the product of the lengths of the sequences As this is impractical for comparing two long genomic intervals there has been extensive work since the mid 1980s on development of fast approaches for local alignment of genomic sequences While the details of all of these approaches are different most share some overriding paradigms In particular almost all algorithms start with seed generation the location of short exact or nearly exact matches between two sequences Because this can be accomplished quickly by indexing one of the two sequences in an appropriate data structure such as a lookup table or some variant of a suffix tree these seeds help to reduce the search area of the local alignment algorithm to just the regions that are likely to be similar Once generated nearby seeds may be joined together the presence of several seeds close to each other is a stronger evidence of homology than a single seed Finally the individual seeds or groups of seeds are extended to find regions that did not match exactly but are still conserved These three steps form the basis of most local alignment algorithms for DNA sequences the individual algorithms differ in how they solve each of the steps 1 2 1 Seed generation Perhaps the simplest way to generate seeds between two sequences is a straight lookup table technique all k long words k mers of one sequence the database are indexed in a table and the k mers of the other sequence the query are used to retrieve from the lookup table the locations at which the particular k mer of the query is present in the database sequence This approach was used in the two first and perhaps the best known local aligners for long sequences FASTA 53 and BLAST 1 2 An alternative approach for seed generation is to use a suffix tree or one of its variants such as Aho Corasick automaton to search for the seeds This approach suffers from higher constant overhead in terms of both the memory requirements and the running time but allows for the use of longer k mer matches as it takes advantage of sparseness of longer seeds there are 4k possible DNA words of length k and only a small fraction of them may be present in a particular sequence This makes suffix tree approaches preferable when one is searching for longer seeds while direct lookup tables are preferable for shorter ones Instead of searching for k long words that match exactly between the two sequences it is possible to use degenerate words as seeds that is words Comparisons of Long Genomic Sequences Algorithms and Applications 1 3 where a certain number of the letters are allowed not to match These seeds allow for a higher sensitivity than exact matching k mer seeds but are more computationally intensive to generate These seeds can be found in one of several ways all of which however lead to an exponential running time increase in the number of degeneracies it is possible to create indices where all of the possible degenerate positions are absent i e index all possible 7 bases within an 8 mer This leads to each word being represented only once in an index but the number of indices grows as nk where n is the number of degeneracies allowed Alternatively one can index each word at all of its degenerate locations This way there is only one index
View Full Document