Berkeley INTEGBI 200A - Alignment- an issue of homology - D1959377

School name University of California, Berkeley

Course Integbi 200a- Principles of Phylogenetics: Systematics

Integrative Biology 200A “PRINCIPLES OF PHYLOGENETICS” Spring 2010
University of California, Berkeley
Will- 16 Feb 2010

Alignment- an issue of homology:

As we have discussed previously, establishing an initial estimate of homology (i.e. primary homology or conjectural homology) is essential for all types of characters. Primary homology involves an assessment of similarity that includes evaluation of topographical identity and state identity. However, I disagree with the Brower and Schawaroch paper (1996, one of your readings) that it is necessary or even possible to separate the implication of homology from the evaluation of these two aspects of similarity. If phylogenetic analysis is your intention, then there is no other reason to bother making a statement of similarity. On the other hand, it is important to recognize the difference between establishing the columns in the matrix (i.e. characters) via topographical identity, which are assumed background knowledge based on previous analyses, and establishing state identity (character states), which are subsequently tested by congruence.

It is interesting and important to note that topographical identity for morphological characters is often (but certainly not always!) uncontroversial and state identity much more frequently problematic. For sequence data the opposite is usually the case, i.e. state identity (A, C, G, T) is a given, but deciding what the columns are (character or topographical identity) can be problematic.

The methodology of PCR and sequencing helps to establish broad-scale topographic identity through presumed primer specificity, which results in a single product that is assumed to be homologous and orthologous. When this fails, often (we hope) alignment or phylogenetic analysis will show symptoms of this. Nuclear pseudogenes or other non-functional paralogs are often degenerated (for example, they may include stop codons in an open reading frame) or are otherwise highly differentiated and difficult to align.
But we certainly can be fooled. Assuming homology of the gene or region bounded by conserved primer binding sites is not usually too problematic; however, in variable-length regions, particularly non-coding regions, establishing an alignment is very problematic. Typically a fixed alignment, achieved by one method or another, is treated as prior, or background, knowledge for the phylogenetic analysis. In most cases the outcome of the phylogenetic analysis is influenced by the alignment in terms of topology and/or support values.

Note: In the earlier molecular literature you may see the term “percent homology”, which is an incorrect use of the term homology. The correct way to refer to the difference/similarity of two sequences is percent similarity or percent identity.

Alignment, Pairwise, local and global:

Two sequences (strings of bases, amino acids, proteins, etc.) are matched in a pairwise alignment either globally (two sequences matched over their whole length) or locally (some subset of the sequences matched while other regions are not expected to match).

Local pairwise comparison is very useful for finding highly similar regions within a larger query sequence. The sequences and the local residues compared may or may not prove to be homologous. This is a strategy often used in bioinformatic applications such as database searches. A common and powerful example of local pairwise alignment is BLAST (Altschul, SF, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local alignment search tool. J Mol Biol 215(3):403-10, 1990). This is a very fast way to sift through a very large database of sequences. If, for instance, a gene is newly identified and its function understood in Drosophila, a researcher can BLAST the database of the human genome to look for similar gene sequences.

Very basic description of BLAST
1. Uses short segments (“words”) of sequence to find other sequences that contain the same set.
2. Does “ungapped” alignment extending from the matched subsequence regions to find high-scoring matches.
3. Does a rapid gapped alignment to select and rank close matches.

Global pairwise alignment establishes overall sequence similarity, usually by calculating a mathematical distance, i.e. minimum edit distance, between the two sequences being compared. The alignment attempts to balance the number of indels (gaps) with the amount of base substitution, normally using some cost differential. It is possible to account for all differences between the pair by inserting enough gaps (trivial alignment), but this would be uninformative and unrealistic.

In the simplest model the “Edit distance” is the minimal number of events required to transform one sequence into another using some scheme of insertions, deletions and substitutions.

Go from acctga to agcta:
    acctga -> [substitution] -> agctga -> [deletion] -> agcta
The edit distance = 2.

OTU1  ACTTCCGAATTTGGCT
OTU2  ACTCGATTGCCT

    Minimize indels          Minimize substitutions
    ACTTCCGAATTTGGCT         ACTTCCGAATTTGG-CT
    |||*    **|||*||         |||  |||  |||  ||
    ACTC----GATTGCCT         ACT--CGA--TTG-CCT

Dynamic Programming and global alignment:

Dynamic programming (Needleman-Wunsch) underlies or is part of most alignment methods. Check out the tutorial at http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html

Global multiple alignment:

Two problems- how to find alignments and how to choose. For phylogenies, pairwise comparison is not sufficient. What must be done is multiple sequence alignment, a global solution for the whole data matrix, or primary homology for the characters (columns) in the entire matrix.

Practical issues. For two sequences, i.e. pairwise alignment, of length n, if no gaps are allowed then there is one or few optimal alignment(s). If gaps are allowed, i.e. there is sequence length variation, then there are (2n)!/(n!)^2 possible alignments, e.g. if n=50 then about 10^29 alignments. For global multiple alignment, where N = the number of sequences, an N-dimensional matrix implementing the dynamic programming is needed.
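
As a rough illustration of the two ideas above, the short Python sketch below computes the simplest edit distance for the acctga/agcta example by dynamic programming, and evaluates (2n)!/(n!)^2 for n=50. The function names edit_distance and count_alignments are invented for this sketch. Unit costs for substitutions and indels are an assumption; alignment programs normally apply some cost differential.

```python
from math import comb


def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b.

    Assumption for this sketch: every event costs 1.
    """
    # prev[j] = distance between the first i-1 characters of a
    # and the first j characters of b (one row of the DP matrix).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i characters of a into an empty string: i deletions
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (ca != cb)  # match costs 0, mismatch 1
            delete = prev[j] + 1                   # gap placed in b
            insert = curr[j - 1] + 1               # gap placed in a
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def count_alignments(n: int) -> int:
    """(2n)!/(n!)^2, which equals the binomial coefficient C(2n, n)."""
    return comb(2 * n, n)


print(edit_distance("acctga", "agcta"))
print(f"{count_alignments(50):.2e}")
```

The first value printed should agree with the edit distance of 2 worked out by hand above. The second shows how quickly the count grows even for a single pair of short sequences.
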
Enumeration is not an option! We need heuristic searches based on optimality and scoring. Various methods and programs have been/are used to tackle this problem. Here are some:

>>Manual, by hand or by eye- For very simple cases it may be sufficient to simply look at the matrix and make adjustments. This is not problematic for aligning the ends of coding sequences or when a “known” reference sequence is used. However, for complex
