Unformatted text preview:

MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, EvolutionFall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.6.047/6.878 Lecture 2 - Sequence Alignment and Dynamic Programming September 15, 2008 1 Introduction Evolution has preserved functional elements in the genome. Such elements often manifest themselves as homologues, or similar sequences in descendants of a common ancestor. The two main types of homologous sequences are orthologous and paralogous sequences. Orthologous sequences typically have similar functions in both the ancestor and the descendant and arise through speciation events, while paralagous sequences arise from common ancestors through gene duplication. Furthermore, paralagous genes imply a common ancestor but, due to mutation and evolution the functionality of that particular gene has shifted considerably. We will mostly be interested in studying orthologous gene sequences. Aligning sequences is one of the main ways of discovering such similarities between different ancestors. And in solving sequence alignment problems, the primary computational tool will be Dynamic Programming. 1.1 An Example Alignment Within orthologus gene sequences, there are islands of conservation which are relatively large stretches of nucleotides that are preserved between generations. These conserved elements typically imply functional elements and vice versa. Although, note that conservation is sometimes just random chance. As an example, we considered the alignment of the Gal10-Gal1 intergenic region for four different yeast species, the first whole genome alignment for crosspieces (Page 1 Slide 5). As we look at this alignment, we note some areas that are more conserved than others. In particular, we note some small conserved motifs such as CGG and CGC, which in fact are functional elements in the binding of Gal4. This example illustrates how we can read evolution to find functional elements. Figure 1: Sequence alignment of Gal10-Gal1. 11.2 Genome Changes are Symmetric The genome changes over time. In studying these changes, we’ll confine ourselves to the level of nucleotide mutations, such as substitutions, insertions and deletions of bases. Lacking a time machine, we cannot compare genomes of living species with their ancestors, so we’re limited to just comparing the genomes of living descendants. For our purposes, time becomes a reversible concept. This means that we can consider the events that change a genome from one species to another as occurring in reverse order (tracing up an evolutionary tree towards a common ancestor) or forward order (tracing downwards from a common ancestor). We consider all DNA changes to be symmetric in time: an insertion in one direction is equivalent to a deletion in the other. Furthermore, since mutations and genomic changes are independent of each other, it is irrelevant to consider the “direction” of the branches in the evolutionary tree. Therefore to study the common ancestor of two different species, we need only consider the intermediate stages in a mutation process from one of the descendants to the other. In doing so, we can avoid the problem of not having the ancestral nucleotide sequence. 2 Problem Formulations: Biology to CS We need to translate the biological problem of sequence alignment into a computer science problem that can be attacked by computational tools. In creating a model to represent the problem, we need to balance biological relevance with computational tractability. 2.1 Goal of Alignment The goal of alignment is to infer the ’edit operations’ that change a genome by looking only at the endpoints. We define the set of operations evolutionary operations to contain insertions, deletions, and mutations. And since there are many possible sequences of events that could change one genome to the other, we also need to establish an optimality criterion, e.g. minimum number of events or minimum cost. In reality, insertions and deletions are much less likely than substitutions, and so we would prefer to attribute a different cost to each of those events to make solution sequences prefer substitutions. The algorithm we show will assume equal costs for each operation; however, it can be easily generalized for varying cost. 1 2 2.2 Brute-Force Solution Given a metric to score a given alignment, the simple brute-force algorithm just enumerates all possible alignments, computes the score of each one, and picks the alignment with the maximum score. How many possible alignments are there? If you consider only NBA3 and n ≈ m, the number of alignments is 4 � n + m � (n + m)! (2n)! � 4πn) (2en)2n 22n m = n!m! ≈ (n!)2 ≈ (√2πn nn 2)n 2 = √πn (1) en For some small values of n such as 100, the number of alignments is already too big (≈ 1060) for this enumerating strategy to be feasible. 1Even different substitution events have different likelihood since specific polymerases are more or less likely to make specific mistakes. 2There was a note about evolutionary pressure to keep the genome compact. For example, the genome of viruses is very small since they have the evolutionary pressure of fitting it inside the small virus cell. Similarly, bacteria are pressured to keep their genome small to save energy in replication and to allow easy exchange of genomic material between cells. On the other hand, there is little pressure for human cells to maintain a small genome, so it is no surprise that the human genome is so large. 3Non Boring Alignments, an alignment that doesn’t match gaps on both sequences n4We use the stirling approximation: n! ≈√2πn n en 22.3 Formulation 1: Longest Common Substring As a first attempt, suppose we treat the nucleotide sequences as strings over the alphabet A, C, G, and T. Given two such strings, R and S, we might try to align them by finding the longest common substring between them. In particular, these substrings cannot have gaps in them. As an example, if R = ACGTCATCA and S = TAGTGTCA, the longest common substring between them is GTCA. So in this formulation, we could align R and S along their longest common substring, GTCA, to get the most matches. A simple algorithm


View Full Document
Download Sequence Alignment and Dynamic Programming
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Sequence Alignment and Dynamic Programming and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Sequence Alignment and Dynamic Programming 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?