Unformatted text preview:

1Multiple Sequence Alignment  Generalization of two sequence similarity problem, the problem of determining the similarity among multiple sequences.  The purpose is to discover ‘faint but widely dispersed’ common sequences which might represent biologically important information.  These common sequences might reveal evolutionary history, conserved motifs in the genome of divergent species, common chemical structure that give rise to similar folding or 3-D structures of proteins giving rise to similar functions. Biology Applications An example is the notion of protein familywhich is a collection of proteins having  similar 3-D structure,  similar functions, and similar evolutionary history.  If a new protein is discovered and if one is interested in classifying which family it belongs, comparison with individual members in the family might produce conflicting or confusing results.2Multiple Alignment of Several Amino Acid Sequences of Globin Proteins The example below shows how common features are dispersed faintly among a group of proteins which may not be apparent when two sequences in the family are compared. The abbreviations on the left denote the organisms that the globin sequences are from. The sequences are displayed in several rows since they are longer than a page can accommodate. Columns containing highly similar residues in regions of known secondary structures are marked by “v” and columns with identical residues are marked by *. Two residues are considered similar if they are from any one of the folowing classes: (F,Y), (M,L,I,V), (A,G),(T,S),(Q,N),(K,R) and (E,D).34Family MembershipIf the faint similarity of the members in the family can be represented by what is called a ‘consensus sequence’, it will be more efficient to find an alignment of the new protein with this consensus sequence to determine whether it belongs to this family.DefinitionGiven sequences , a multiple(global) alignment maps them to sequences that may contain spaces, whereand the removal of all spaces from leaves , for kSSS ,....,21kSSS''2'1,....,|,|,....||||''2'1kSSS ==='iSiS.1 ki ≤≤5Multiple Alignment Although the generalization of definition from two sequences to multiple sequences seems straightforward, it is not that obvious how to scoreor assign value to a multiple alignment.  There are various scoring methods such as sum-of –pairs (SP) functions, consensus functions, and tree functions.  For the sake of mathematical ease, SP functions have been widely used and good approximation algorithms have also been developed. Multiple alignment Objective score: Sum-of-pairs (SP) Sum of objective score for alignment of each pair of sequencesSEQVENCESDQVE-CRTEQVEACESP( )=SEQVENCESDQVE-CRScore() +SEQVENCETEQVEACEScore() +SDQVE-CRTEQVEACEScore()6Multiple Alignment with Sum-of-Pairs Definition:  Given a global multiple alignment A of k sequences, the sum-of-products value of A is the sum of the values of all pairwisealignments induced by A. Definition:  The score of an induced pairwise alignment could be any chosen scoring scheme for two string alignment in the standard manner.⎟⎟⎠⎞⎜⎜⎝⎛2kDefinition We also assume that the pairwise scoring function is symmetric. We will not consider gap penalty for this discussion.  We use the edit distance function of two sequences as the induced pairwise metric. We use the symbol δ(x,y) to denote the distancebetween two characters x and y which may include space characters.  For two strings, our objective will be to minimize where .))(,)(('1'qSqSjlqi∑=δ||||''jiSSl ==7Example Consider the following alignment S1: A C -- C T G -- S2: -- C -- A T G T S3: A -- G C T A T Using the distance function δ(x,x)=0, and δ(x,y)=1 for x≠y, we have 3, 4 and 5, giving a total of sum-of-pair value 12. =),(2,1SSd=),(3,1SSd=),(32SSd Definition:  An optimal SP global alignment ofsequences S1,S2…,Skis an alignment that has the minimum possible sum-of-pairs value for these k sequences.8Dynamic Programming Formulation to Compute Optimal Multiple Alignment The dynamic programming method for two sequences has a natural generalization for the multiple sequence case.  Instead of a 2-dimensional matrix, we need a k-dimensional matrix with n+1 ‘rows’ in each dimension, giving a total of (n+1)kentries, each entry depending on adjacent 2k-1 entries.  This neighborhood corresponds to the possibilities for the last match in an optimal alignment: any of2k-1 non-empty subsets of the k sequences can participate in that match. For two sequences, we had three possibilities: both the last characters were actual characters from the two sequences, or one space, and the other an actual character (two possibilities).  Gusfield gives a complete description of the algorithm for three sequences. The details for an arbitrary number of sequences is simply an exercise in developing appropriate notations and is left out.9Optimize SP for N sequences Similarity matrices become N-dimensional E.g., for 3 sequences are cubesM[i,j,k] = score of best alignment offirst i letters in Afirst j letters in Bfirst k letters in CijkVery slow Because each of the (n+1)kentries can be computed in time proportional to 2k, the running time of the algorithm is O((2n)k). If n=200, the algorithm may be practical for only up to k=3 or 4. We want the algorithm to run for k=100 or more. Is NP-complete Wang, L. and Jiang, T. (1994) On the complexity of multiple sequence alignment. J Comput Biol 1(4): 337-48. Totally impractical for most biologically interesting problems. Faster methods needed.10Center Star Alignment Algorithm Since the optimal SP alignment problem is NP-complete, we need, approximate algorithms.  Gusfield proposed such an algorithm, called Center Star Alignment Algorithm whose SP values are less than twice the optimal value. We sketch this algorithm now.Center Star Alignment Algorithm We make the following assumptions about the distance function induced by an alignment obtained by the algorithm,d(S1,S2):δ(x,x)=0, for all characters x. Symmetric: δ(x,y)= δ(y,x), and d(S1,S2) = d(S2,S1) Triangle inequality:for all characters x, y and z. Cost of x to z is no more than cost of x to y and then y to z. Consequently, d(X,Y) ≤ d(X,Z) + d(Z,Y) for all sequences X,Y and Z We have used the symbol to denote the


View Full Document

UCF CAP 5937 - Multiple Sequence Alignment

Documents in this Course
Load more
Download Multiple Sequence Alignment
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Multiple Sequence Alignment and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Multiple Sequence Alignment 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?