Integrative Biology 200A “PRINCIPLES OF PHYLOGENETICS” Spring 2006
University of California, Berkeley
Kipling Will - 23 Feb 2006

Alignment

Similarity: Two or more sequences (bases, amino acids, proteins, etc.) are matched in a pairwise alignment either globally (two sequences matched over their whole length) or locally (some subset of the sequences matched while other regions are not expected to match). Sequence similarity can simply be a mathematical distance between two sequences given events such as insertions, deletions and substitutions.

In the simplest model this is the “Edit distance”, the minimal number of events required to transform one sequence into another.

Example, to go from acctga to agcta:
acctga <<[substitution]>> agctga <<[deletion]>> agcta
The edit distance = 2.

BLAST (Altschul, SF, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local alignment search tool. J Mol Biol 215(3):403-10, 1990).

For example, if a gene is newly identified and its function understood in Drosophila, a researcher can BLAST the database of the human genome to look for similar gene sequences.

Very basic description of BLAST
1. Uses short segments of sequence to find other sequences that contain the same segments.
2. Does “ungapped” alignment extending from the matched subsequence regions to find high-scoring matches.
3. Does a rapid gapped alignment to select and rank close matches.

Homology: Establishing an initial estimate of homology (basically similarity) is essential. Unaligned sequence data has no a priori base homology. As a consequence, the fixed alignment, achieved by one method or another, is treated as prior, or background, knowledge. Recall the hierarchy of characters and states, and that only the states are really tested in the analyses.

The outcome of optimality-based tree searching, especially parsimony, is strongly influenced by the alignment.
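The edit distance defined under Similarity can be computed by filling in a table of sub-problems, the same dynamic programming idea that global alignment relies on. The following is a minimal Python sketch, not code from the course: it assumes that substitutions, insertions and deletions each cost 1, and the function name edit_distance is purely illustrative.

```python
def edit_distance(a, b):
    """Minimal number of substitutions, insertions and deletions turning a into b.

    Assumption: every event costs 1 (the simplest model in the notes).
    """
    # prev[j] = distance between the characters of a processed so far
    # and the first j characters of b; initially a is empty, so j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i characters of a into an empty string takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (ca != cb)  # match costs 0, mismatch costs 1
            delete = prev[j] + 1                   # drop ca from a
            insert = curr[j - 1] + 1               # add cb to a
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


# The worked example from the notes: acctga -> agctga -> agcta
print(edit_distance("acctga", "agcta"))  # prints 2
```

Only two rows of the table are kept at a time, which is enough to get the distance. Recovering the alignment itself would require keeping the whole table and tracing back through it.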
Practical issues:

Dynamic programming and global alignment (Needleman-Wunsch) underlies or is part of most alignment methods. Check out the tutorial at http://www.sbc.su.se/~pjk/molbioinfo2001/dynprog/dynamic.html

[Figure: dynamic programming grid for aligning ACT (across) with AGC (down), starting at cell S, with three candidate moves labelled 1, 2 and 3, shown beside the cost matrix.]

Cost matrix:
      A  G  C  T  X
  A   0  1  1  1  2
  G   1  0  1  1  2
  C   1  1  0  1  2
  T   1  1  1  0  2
  X   2  2  2  2  0

Moves:
  1 = cost 2    A-C....
                AGC
  2 = cost 2    ACT
                A-G....
  3 = cost 1    ACT
                AGC

For two sequences of length n, i.e. pairwise alignment, if no gaps are allowed then there is one optimal alignment. If gaps are allowed, i.e. there is sequence length variation, then there are (2n)!/(n!)^2 alignments, e.g. for n = 50 about 10^29 alignments. Enumeration is not an option!

Two problems: how to find alignments and how to choose among them.

Taxon1 ACTTCCGAATTTGGCT
Taxon2 ACTCGATTGCCT

Minimize substitutions:
    ACTTCCGAATTTGG-CT
    |||  |||  |||  ||
    ACT--CGA--TTG-CCT

Minimize indels:
    ACTTCCGAATTTGGCT
    |||*    **|||*||
    ACTC----GATTGCCT

We need heuristic searches based on optimality and scoring.

Alignment really attempts to balance the number of indels against the number of base substitutions, normally based on some cost differential. Of course it is possible to account for all differences by inserting enough gaps (the trivial alignment).

For phylogenies, pairwise comparison is not sufficient. What must be done is multiple sequence alignment, a global solution for the whole data matrix, or primary homology for the characters (columns) in the matrix.

Various methods have been used to do this. Here are some.

Manual or by eye: For very simple data this may be sufficient; however, it violates any criterion of repeatability, as there is no obvious cost matrix. The counterargument is that the aligned matrix can be made available. However, what if I want to add or subtract OTUs? This would influence the alignment, but how? This is subject to individual pattern recognition abilities for thousands of bases and hundreds of sequences.
It is also likely to increase the number of editing errors because of additional "handling" of sequences.

>>Manual alignments informed by consideration of secondary structure:
1. Does not solve the problem of nucleotide homology. At best it places constraints on changes by establishing putative limits between loop and stem regions. Nucleotides within each of those units must still be homologized and all the problems still apply.
2. Determination of secondary structure is neither simple nor unambiguous. Generally the actual pattern of bonding is probabilistic and depends on the minimization of free energy and the thermodynamic stability of the resulting structure. Programs explicitly designed to model secondary structure are not very realistic (yet) in terms of the actual cell environment and might find multiple, equally probable models. In phylogenetic studies, secondary structure is typically inferred by aligning with a sequence of “known” secondary structure, although the basis of that knowledge remains uncertain and its applicability to the study taxa is unclear in many cases. Still, this is heading in the right direction.
3. Although it might be reasonable to expect selective pressures to apply to secondary structure interactions (that is, requirements of compensatory changes), it is unclear just how relevant those interactions are compared to selective pressures applied at other structural levels.

>>Purging "bad" data or scoring variable regions as single characters.
Another method frequently used to get around problems in hard-to-align sections is the elimination of gap-heavy regions in alignments. Exactly which columns should be eliminated (left-right boundaries) is subjective, and obviously the choice may have an impact on the results (otherwise why bother).

Alternatively, the variable region can be converted into a character in each taxon and scored.
This has all the problems above and adds another layer of difficulty in determining how to code the states.

Simultaneous alignment: Simultaneous multiple alignments synchronise the information of all input sequences in a hyperspace lattice, e.g. so-called exact alignment algorithms using the divide-and-conquer (DCA) strategy (Tönges, U., Perrey, S.W., Stoye, J. and Dress, A.W.M. 1996. A general method for fast multiple sequence alignment. Gene 172: GC33-GC41). In part it cuts down the input sequences at carefully chosen positions to align them in segments. Current algorithms cannot handle large/complex data sets.

Progressive alignment: As in Clustal W(X), the most prominent program for progressive alignment strategies.
1. All sequences are compared to each other (pairwise alignments).
2. A dendrogram is constructed, describing the approximate groupings of the sequences by