MIT 6 047 - Sequence Alignment and Dynamic Programming - D630936

Home> Schools> Massachusetts Institute of Technology> Electrical Engineering and Computer Science (6) > 6 047> Sequence Alignment and Dynamic Programming

MIT 6 047 - Sequence Alignment and Dynamic Programming

School name Massachusetts Institute of Technology

Course 6 047- Computational Biology

Pages 8

Download Save

Unformatted text preview:

MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, EvolutionFall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.6.047/6.878 Lecture 2 - Sequence Alignment and Dynamic Programming September 15, 2008 1 Introduction Evolution has preserved functional elements in the genome. Such elements often manifest themselves as homologues, or similar sequences in descendants of a common ancestor. The two main types of homologous sequences are orthologous and paralogous sequences. Orthologous sequences typically have similar functions in both the ancestor and the descendant and arise through speciation events, while paralagous sequences arise from common ancestors through gene duplication. Furthermore, paralagous genes imply a common ancestor but, due to mutation and evolution the functionality of that particular gene has shifted considerably. We will mostly be interested in studying orthologous gene sequences. Aligning sequences is one of the main ways of discovering such similarities between diﬀerent ancestors. And in solving sequence alignment problems, the primary computational tool will be Dynamic Programming. 1.1 An Example Alignment Within orthologus gene sequences, there are islands of conservation which are relatively large stretches of nucleotides that are preserved between generations. These conserved elements typically imply functional elements and vice versa. Although, note that conservation is sometimes just random chance. As an example, we considered the alignment of the Gal10-Gal1 intergenic region for four diﬀerent yeast species, the ﬁrst whole genome alignment for crosspieces (Page 1 Slide 5). As we look at this alignment, we note some areas that are more conserved than others. In particular, we note some small conserved motifs such as CGG and CGC, which in fact are functional elements in the binding of Gal4. This example illustrates how we can read evolution to ﬁnd functional elements. Figure 1: Sequence alignment of Gal10-Gal1. 11.2 Genome Changes are Symmetric The genome changes over time. In studying these changes, we’ll conﬁne ourselves to the level of nucleotide mutations, such as substitutions, insertions and deletions of bases. Lacking a time machine, we cannot compare genomes of living species with their ancestors, so we’re limited to just comparing the genomes of living descendants. For our purposes, time becomes a reversible concept. This means that we can consider the events that change a genome from one species to another as occurring in reverse order (tracing up an evolutionary tree towards a common ancestor) or forward order (tracing downwards from a common ancestor). We consider all DNA changes to be symmetric in time: an insertion in one direction is equivalent to a deletion in the other. Furthermore, since mutations and genomic changes are independent of each other, it is irrelevant to consider the “direction” of the branches in the evolutionary tree. Therefore to study the common ancestor of two diﬀerent species, we need only consider the intermediate stages in a mutation process from one of the descendants to the other. In doing so, we can avoid the problem of not having the ancestral nucleotide sequence. 2 Problem Formulations: Biology to CS We need to translate the biological problem of sequence alignment into a computer science problem that can be attacked by computational tools. In creating a model to represent the problem, we need to balance biological relevance with computational tractability. 2.1 Goal of Alignment The goal of alignment is to infer the ’edit operations’ that change a genome by looking only at the endpoints. We deﬁne the set of operations evolutionary operations to contain insertions, deletions, and mutations. And since there are many possible sequences of events that could change one genome to the other, we also need to establish an optimality criterion, e.g. minimum number of events or minimum cost. In reality, insertions and deletions are much less likely than substitutions, and so we would prefer to attribute a diﬀerent cost to each of those events to make solution sequences prefer substitutions. The algorithm we show will assume equal costs for each operation; however, it can be easily generalized for varying cost. 1 2 2.2 Brute-Force Solution Given a metric to score a given alignment, the simple brute-force algorithm just enumerates all possible alignments, computes the score of each one, and picks the alignment with the maximum score. How many possible alignments are there? If you consider only NBA3 and n ≈ m, the number of alignments is 4 � n + m � (n + m)! (2n)! � 4πn) (2en)2n 22n m = n!m! ≈ (n!)2 ≈ (√2πn nn 2)n 2 = √πn (1) en For some small values of n such as 100, the number of alignments is already too big (≈ 1060) for this enumerating strategy to be feasible. 1Even diﬀerent substitution events have diﬀerent likelihood since speciﬁc polymerases are more or less likely to make speciﬁc mistakes. 2There was a note about evolutionary pressure to keep the genome compact. For example, the genome of viruses is very small since they have the evolutionary pressure of ﬁtting it inside the small virus cell. Similarly, bacteria are pressured to keep their genome small to save energy in replication and to allow easy exchange of genomic material between cells. On the other hand, there is little pressure for human cells to maintain a small genome, so it is no surprise that the human genome is so large. 3Non Boring Alignments, an alignment that doesn’t match gaps on both sequences n4We use the stirling approximation: n! ≈√2πn n en 22.3 Formulation 1: Longest Common Substring As a ﬁrst attempt, suppose we treat the nucleotide sequences as strings over the alphabet A, C, G, and T. Given two such strings, R and S, we might try to align them by ﬁnding the longest common substring between them. In particular, these substrings cannot have gaps in them. As an example, if R = ACGTCATCA and S = TAGTGTCA, the longest common substring between them is GTCA. So in this formulation, we could align R and S along their longest common substring, GTCA, to get the most matches. A simple algorithm

View Full Document


School:
Email:
New Password:
Confirm Password:

MIT 6 047 - Sequence Alignment and Dynamic Programming

Sign up for free to view:

Please select your school