Stanford CS 374 - Multiple Sequence Alignment - D3007518

Home> Schools> Stanford University> Computer Science (CS) > CS 374> Multiple Sequence Alignment

DOC PREVIEW

Stanford CS 374 - Multiple Sequence Alignment

School name Stanford University

Course Cs 374- Algorithms in Biology

Pages 10

This preview shows page 1-2-3 out of 10 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Multiple Sequence AlignmentIntroductionBackgroundResearch presented on the papersMultiple Sequence Alignment CS374 Fall 2006 Lecture 25, 12/7/06 Lecturer: Sarah Aerni Scribe: Bahman Bahmani Multiple Sequence Alignment Based on the following papers: 1. Chuong B. Do, Mahathi S.P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou, “Prob-Cons: Probabilistic consistency-based multiple sequence alignment”, Genome Research, 15:330-340, 2005. 2. Ari Loytynoja, and Nick Goldman, “An algorithm for progressive multiple alignment of se-quences with insertions” , PNAS 30:10557-10562, 2005. Additional references: 1. R. Durbin, S. Eddy, A. Krogh, G. Mitchison, “Biological sequence analysis”, Cambridge Univer-sity Press, Cambridge, UK, 1998. 2. http://bioalgorithms.info 1. Introduction The problem of sequence alignment is to find the patterns of sequence conservation and similarity between pairs or sets of given sequences. In biological contexts, similarity be-tween biological sequences usually amounts to either functional or structural similarities or divergence from a common ancestor. Therefore, it is not strange that sequence align-ment is one of the central tools in molecular biology.Multiple Sequence Alignment CS374 Fall 2006 Lecture 25, 12/7/06 Lecturer: Sarah Aerni Scribe: Bahman Bahmani 2. Background 2.1. Pair-wise Alignment The simplest case of sequence alignment is aligning a pair of sequences (the so-called pair-wise alignment). The classic methods to solve this problem utilize dynamic pro-gramming to optimize a score for the alignment calculated by adding the substitution costs for aligned pairs of positions and gap penalties for unaligned residues. The first such algorithm was presented by Needleman and Wunsch in 1970. The early methods used gap penalties proportional to the length of the gap. However, the results were not satisfactory because either the gap costs were so high that long gaps never appeared in the alignment or they were so small that the alignment became frag-mented by many short gaps. But, in real-life there gaps appear with very different lengths. Consequently, affine gap penalties were proposed where gap opening and gap extension costs were separated. Indeed, gap opening is usually much more costly than gap exten-sion, so it is assigned a higher penalty. As mentioned above, the classic methods use dynamic programming to find the pair-wise alignment between two sequences. The following is a simple instance of such an algo-rithm: Where si,j is the score of aligning the sequences v1,…,vi and w1,…,wj. Indeed, the formula above recursively finds of the best alignment between the above sequences obtained by either the extension of sub-alignment v1,…,vi-1 and w1,…,wj-1 by a match or the extension of v1,…,vi-1 and w1,…,wj or v1,…,vi and w1,…,wj-1 by a gap. The result can also be graphically represented as follows:Multiple Sequence Alignment CS374 Fall 2006 Lecture 25, 12/7/06 Lecturer: Sarah Aerni Scribe: Bahman Bahmani Where a diagonal movement represents a match, a horizontal movement represents an in-sertion in the horizontal sequence, and a vertical movement represents an insertion in the vertical sequence. Finally, it can be shown that the time and space complexity of such algorithms are O (L2). 2.2. Multiple Alignments Multiple sequence alignment can be considered as a generalization of the pair-wise alignment where instead of just two sequences, we have a set of sequences (possibly with more than two members) to align. However, this problem is much harder both conceptu-ally and computationally. Indeed, similar to the case with two sequences where the complexity of finding the solu-tion was O (L2), in the more general case of multiple alignment with N sequences the complexity of finding the solution is O (LN) which is exponential in N, and hence it is not a computationally feasible task when N is a large number. For instance, if the computa-tion of the solution takes 102N-4 seconds, then while it takes respectively only 1 and 10 seconds for the cases with 2 and 3 sequences, it actually takes 3 trillion years for the case with 12 sequences!! Therefore, heuristic methods have been proposed to approximately solve this problem. The most popular such method so far is the progressive multiple alignment. Briefly speaking, in the progressive multiple alignment method, first pair-wise align-ments are computed for all possible pairs of sequences in the given set. Then these pair-wise alignments are utilized to construct a binary tree called guide tree somehow repre-senting the phylogenetic tree connecting the sequences. Finally, using this tree, first the most similar sequences (which are in the leaves of the tree) are aligned, and then the ob-tained profiles are aligned sequentially along the inner nodes until an alignment of all the sequences is reached in the root of tree. The following figures show a sample of such process:Multiple Sequence Alignment CS374 Fall 2006 Lecture 25, 12/7/06 Lecturer: Sarah Aerni Scribe: Bahman Bahmani 3. Research presented on the papers 3.1. PronCons: Probabilistic consistency-based multiple sequence alignment The first paper called was written by Chuong B. Do et al. It presents a method called ProbCons which is a protein progressive multiple alignment tool. Fundamentally, it is a pair-hidden Markov model-based progressive alignment algorithm that primarily differs from most typical approaches in its use of maximum expected accuracy rather than Viterbi alignment, and of the probabilistic consistency transformation to incorporate mul-tiple sequence conservation information during pair-wise alignment. The method can be described as follows: Step1. Posterior probability matrices ProbCons uses the HMM shown below to specify the probability distribution over all alignments between any pair of sequences: Let x and y be two proteins represented as two sequences where xi is the i-th amino acid of x. Assume A is the set of all possible x-y alignments. An alignment a uniquely corre-sponds to a sequence of state-emission pairs <s1,o1>, …, <sn,on>. Then, the probability of a is:Multiple Sequence Alignment CS374 Fall 2006 Lecture 25, 12/7/06 Lecturer: Sarah Aerni Scribe: Bahman Bahmani −nna1|(P Where: Assume a* is the (unknown) alignment in A that best represents the biologically true lignment of x and y. Then we interpret P(a|x,y) as the probability (or

View Full Document