UMD CMSC 858W - Multiple sequence alignment - D2164251

Home> Schools> University of Maryland, College Park> Computer Science (CMSC) > CMSC 858W> Multiple sequence alignment

UMD CMSC 858W - Multiple sequence alignment

School name University of Maryland, College Park

Course Cmsc 858w- Advanced Topics In Theory Of Computing: Algorithms For Biosequence Analysis

Pages 6

Download Save

Unformatted text preview:

Multiple sequence alignmentIn biology we are frequently faced with the problem of aligning multiple sequences together, e.g. we have managed to extract the sequence of a gene of interest from multiple organisms, and want to find out how these different versions of the sequence relate to each other. Specifically, we want to create a matrix where each sequence is represented as a row, and each column represents the alignment of characters from the original sequences, also including gap characters as necessary. Note: we generally require that at least one character in each column is not a gap.Seq1 ACC-AGTGA-CTSeq2 -CGTAGCGA-CTSeq3 ACC-AGTGT-CTSeq4 ACGTAGTCATCTUnlike the pairwise sequence alignment problem, the multiple alignment problem is a bit harder to formalize. A commonly used theoretical framework states the problem as follows:Multiple Alignment Problem. Given a set of n sequences s1 .. sn, find the multiple alignment M that minimizes the sum of the pairwise distances between these sequences: ∑si, sj∈SdM si, sjdM(si, sj) is the distance between sequences si and sj as implied by the multiple alignment M, i.e. the sum of the distances between the characters of the two sequences that are aligned to each other in the multiple alignment. Generally, alignments between gap characters are assigned score 0. The score of alignments between individual nucleotides or amino-acids depends on the specific context: in protein alignments the scores can be derived from the BLOSUM matrix scores, while in phylogenetic analysis of nucleotide sequences, the alignment scores are determined by an appropriately defined evolutionary model.Note that dM(si, sj) is generally larger than D(si,sj) - the optimal pairwise distance between two sequences (as defined by the pairwise alignment problem discussed earlier in the class).The Multiple Alignment Problem can be "easily" solved through an extension of the dynamic programming algorithm for pairwise sequence alignment. The intuition is to extend the V(i,j) values from the 2-sequence context, to V(i, j, k) for 3 sequences, etc. The resulting algorithm runs in O(Ln) where L is the length of the aligned sequences, and n is the number of sequences. Clearly, this algorithm is impractical, though some run-time improvements can be obtained by carefully pruning the dynamic programming table. In general, the Multiple Alignment Problem is NP-hard, leading to the need for heuristic approximation algorithms.One such algorithm is called Star Alignment - deriving its name from the fact that the multiple alignment is constructed by aligning all the sequences to a same "center" sequence. The alignment process is called "progressive alignment" and proceeds as follows:1. Construct alignment of sequence 1 with the center sequence (using standard pairwise alignment)2. Construct alignment of sequence 2 with center sequence (using standard pairwise alignment). Any gaps inserted within the center are propagated to the already constructed alignment with sequence 1.3. repeat 2 until all sequences have been alignedNote: the implementation can be a bit tricky as you need to avoid having spurious gaps inserted in the alignment - e.g. if two sequences both lead to the insertion of a gap at the same location in the center sequence, the alignment algorithm should ensure a single gap is inserted (rather than two adjacent gaps).Choice of the center sequence. For reasons that will become apparent below, the center sequence sc is chosen to be the sequence that minimizes the equation∑i≠cD si, scover all possible choices of sc.Theorem: If the distance between sequences satisfies the triangle inequality, the star alignment has a score at most 2 * OPT, where OPT is the score of the optimal multiple alignment.Proof:Let S* be the "sum of pairs" score of the star alignment.Sstar=∑i ≠ jdstarsi, sj≤∑i≠ j D si, scD sj, sc=2 n−1∑iDsi, scwhere n is the number of sequences and dstar is the distance implied by the star alignment.Let OPT be the score of the optimal alignment.OPT =∑i≠ jdOPTsi, sj=∑j∑idOPT si, sj≥∑j∑iDsi, sc=n∑iD si, scCombining the two equations above we get SstarOPT≤2−2n2Tree-guided alignmentNeither the optimization problem implied in the definition of the multiple alignment problem, nor the approximation provided by the star alignment have a biological interpretation. As a corollary, the resulting multiple alignments may not capture interesting biological phenomena. A more "biologically meaningful" approach to multiple alignment relies on the construction of a guide tree that, intuitively, captures the evolutionary relationship between the sequences being aligned. The guide tree stores the sequences at its leaves, and the progressive alignment approach described in the context of the star alignment can be modified to allow sequences to be merged together in a bottom-up fashion into a single multiple alignment. Virtually all "popular" multiple alignment programs rely on this paradigm. Two heavily usedprograms are ClustalW and Muscle (papers linked from the class website).Multiple alignment of genomesThe problem: so far we've discussed global multiple alignment - the sequences have to be aligned end-to-end. This approach simply doesn't scale to whole genomes. Also, whole genomes have rearrangements, events that are not captured in the typical edit-distance alignment algorithms. Example: the Mauve system - http://asap.ahabs.wisc.edu/mauve/Basic idea: 1. Find a collection of MUMs shared by 2 or more genomes (multi-MUMs). Can think of a variety of ways of doing this, but in principle the basic idea is to first find exact k-mer matches across multiple genomes (making sure none is a repeat in any of the genomes) then extend these matches along groups of genomes as long as the corresponding sequences agree.2. Find locally collinear blocks (LCB) among the MUMs - these are sets of MUMs that occur in the same order in all genomes. Note: the LCB algorithm requires all MUMs to occur in all genomes so it's only applied to those MUMs. The idea of the LCB algorithm is to repeatedly sort the MUMs based on their occurrence in each genome, and mark breakpoints in this sorting - places when the MUMs sorted by genome i occur out of order in genome j. 3. A global guide-tree is constructed from all genomes, using the weight of the shared MUMs as a similarity/distance

View Full Document


School:
Email:
New Password:
Confirm Password:

UMD CMSC 858W - Multiple sequence alignment

Sign up for free to view:

Please select your school