Stanford CS 262 - Lecture 15 - Multiple Sequence Alignments - D367956

School name Stanford University

Course CS 262 - Computational Genomics

Lecture 15: Multiple Sequence Alignments
Scribed by Yuliya Sarkisyan

Methods to CHAIN Local Alignments

When we have two long genomic regions to align, we first use a fast local alignment algorithm such as BLAST to find all highly significant local alignments between the two regions. We then look for the chain of local alignments that produces the highest score (based on the number of matching nucleotides or some more sophisticated measure). Once we have found this chain, we define a restricted area of dynamic programming around its local alignments, within which the local alignments may be modified by a small amount to produce a good global alignment. Restricting the dynamic programming in this way saves a great deal of computation, reducing the problem to linear time in practice.

Sparse Dynamic Programming - O(N log N)

THE PROBLEM

Given the local alignments of two sequences, we want to find a chain of local alignments that has high weight. This is difficult because, owing to repeats and other factors, an algorithm such as BLAST can return thousands or even millions of local alignment hits. Moreover, we want to find the chain in time less than quadratic in the number of local alignments.

    (x, y) → (x', y') requires x < x' and y < y'
    Each local alignment has a weight
    FIND the chain with the highest total weight

SPARSE DYNAMIC PROGRAMMING

A related problem is Longest Common Subsequence (LCS), in which we are given two sequences, x = x1, …, xm and y = y1, …, yn, between which the matches are sparse. For example, in the accompanying figure, each shaded square represents a match between the two sequences.

This problem can be solved in quadratic time by running a global alignment algorithm in which a match scores 1 and a gap or a mismatch scores 0. However, if the total number of matches is small, we can do better than that.

THE ALGORITHM

Recall the algorithm for finding the longest increasing subsequence (LIS) in O(N log N) time.
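A minimal Python sketch of this LIS routine is given here for concreteness. It is an illustration, not the lecture's own code: the function name `longest_increasing_subsequence` and the arrays `tails`, `tail_index`, and `prev` are assumed names that play roughly the roles of L, B, and P in the lecture's pseudocode, and the binary search is done with the standard-library `bisect_left`.

```python
from bisect import bisect_left


def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w in O(n log n)."""
    tails = []            # tails[j]: smallest last value of any increasing run of length j + 1
    tail_index = []       # tail_index[j]: position in w of the value stored in tails[j]
    prev = [-1] * len(w)  # prev[i]: position of the element that precedes w[i] in its run

    for i, value in enumerate(w):
        # First slot whose tail is >= value; overwriting it keeps tails sorted.
        j = bisect_left(tails, value)
        if j == len(tails):
            tails.append(value)
            tail_index.append(i)
        else:
            tails[j] = value
            tail_index[j] = i
        prev[i] = tail_index[j - 1] if j > 0 else -1

    # Follow the backpointers from the end of the longest run.
    result = []
    i = tail_index[-1] if tail_index else -1
    while i != -1:
        result.append(w[i])
        i = prev[i]
    return result[::-1]


print(longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 4, 5, 6]
```

Using `bisect_left` rather than `bisect_right` makes the subsequence strictly increasing, which matches the condition L[j – 1] < w[i] ≤ L[j] in the pseudocode.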
Let the input be w: w1, …, wn

INITIALIZATION:
    L: array of last LIS elements
        L[0] = -inf
        L[1] = w1
        L[2…n] = +inf
    B: array of indices of the elements held in L; B[0] = 0, B[1] = 1
    P: array of backpointers; P[1] = 0
    // L[j]: smallest jth element wi of any j-long increasing subsequence seen so far

ITERATION:
    for i = 2 to n {
        Find j such that L[j – 1] < w[i] ≤ L[j]    // binary search over L
        L[j] ← w[i]
        B[j] ← i
        P[i] ← B[j – 1]
    }

We can use the LIS algorithm to solve the Sparse LCS problem in O(N log N) time, where N is the number of matches. We reduce LCS to LIS as follows: we create a sequence w in which every element is a pair of coordinates in x and y that marks a match between the two sequences. For example, in the figure below each square (pair of coordinates) is labeled in the order in which it appears in w, i.e.

    w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

[Figure: match grid for sequences x and y, with each match labeled by its position in w]

We insert the matches into w by increasing column number and, within a column, by decreasing row number. Two elements a = (y, x) and b = (y', x') can then be chained iff a comes before b in w and y < y'. Listing the matches of a column in decreasing row order ensures that two matches from the same column are never chained. If we now compare elements by their y-coordinate, an increasing subsequence of w is a common subsequence of x and y.

EXAMPLE

Running the LIS algorithm on the w constructed above, we get the following result:

    w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

    Step   L[1]    L[2]    L[3]    L[4]    L[5]
    1.     (4,2)
    2.     (3,3)
    3.     (3,3)   (10,5)
    4.     (2,5)   (10,5)
    5.     (2,5)   (8,6)
    6.     (1,6)   (8,6)
    7.     (1,6)   (3,7)
    8.     (1,6)   (3,7)   (4,8)
    9.     (1,6)   (3,7)   (4,8)   (7,9)
    10.    (1,6)   (3,7)   (4,8)   (5,9)
    11.    (1,6)   (3,7)   (4,8)   (5,9)   (9,10)

Longest common subsequence: s = 4, 24, 3, 11, 18

SPARSE DP FOR RECTANGLE CHAINING

Now that we know how to find the longest common subsequence in O(N log N) time, let's get back to the problem of local alignment chaining. The differences between the two problems are as follows:

1. Whereas in the longest common subsequence problem the score of a match between two letters was 1, in the problem of local alignment chaining the score of an alignment is its weight.
2. In local alignment chaining, the coordinates describe rectangles, which can therefore overlap.

Therefore, we now keep track of the following values:

1. 1, …, N: the rectangles
2. (hj, lj): y-coordinates of rectangle j
3. w(j): weight of rectangle j
4. V(j): optimal score of a chain ending in j
5. L: list of triplets (lj, V(j), j)

L is sorted by lj, from the smallest (North) to the largest (South) value.

The main idea of the algorithm is to sweep through the x-coordinates (just as before) until we encounter the leftmost coordinate of some rectangle. We then have a choice of chaining this rectangle to any rectangle whose rightmost coordinate we have already seen and whose l-coordinate is smaller than the h-coordinate of the encountered rectangle. Among all the valid choices, we choose the rectangle j with the highest V(j) score.

[Figure: rectangles a and b with their h and l coordinates and scores V(a), V(b)]

THE ALGORITHM

Go through the rectangle x-coordinates, from lowest to highest:

1. When on the leftmost end of rectangle i:
    a. j: rectangle in L with the largest lj < hi
    b. V(i) = w(i) + V(j), or V(i) = w(i) if no such j exists
2. When on the rightmost end of rectangle i:
    a. k: rectangle in L with the largest lk ≤ li
    b. If V(i) > V(k):
        i. INSERT (li, V(i), i) in L
        ii. REMOVE all (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li

Because step 2 keeps V(j) increasing with lj throughout L, the rectangle with the largest lj < hi found in step 1 is also the valid choice with the highest V(j).

L is stored as a balanced binary search tree, so each insertion and deletion can be done in O(log N) time.

RUNNING TIME

We first sort the x-coordinates of the local alignments, which takes O(N log N) time. We then go through all of the x-coordinates, which takes O(N) steps. Each step requires O(log N) time because searching, inserting, and deleting in L each take O(log N). Each element is deleted at most once, so all of the deletions together take O(N log N) time.
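The sweep described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the lecture's implementation: the function name `chain_rectangles` and the tuple layout `(x_left, x_right, h, l, weight)` are hypothetical, y is assumed to grow southward so that h ≤ l, and L is held as two parallel sorted Python lists (`ls` and `entries`) rather than a binary tree.

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, weight) with h <= l.
    Returns (best_score, chain), where chain lists rectangle indices in order."""
    n = len(rects)
    if n == 0:
        return 0, []
    V = [0] * n      # V[i]: best score of a chain ending in rectangle i
    back = [-1] * n  # back[i]: predecessor of i in that chain, or -1
    ls = []          # l-coordinates of the triplets in L, kept sorted
    entries = []     # entries[p] = (V(j), j) for the triplet whose l is ls[p]

    # Left ends (kind 0) sort before right ends (kind 1) at equal x,
    # so a rectangle is only chained to ones that end strictly to its left.
    events = sorted([(r[0], 0, i) for i, r in enumerate(rects)] +
                    [(r[1], 1, i) for i, r in enumerate(rects)])

    for _, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            p = bisect_left(ls, h) - 1        # triplet with the largest l_j < h_i
            if p >= 0:
                V[i] = weight + entries[p][0]
                back[i] = entries[p][1]
            else:
                V[i] = weight                 # nothing to chain to
        else:
            p = bisect_right(ls, l) - 1       # triplet with the largest l_k <= l_i
            if p < 0 or V[i] > entries[p][0]:
                start = bisect_left(ls, l)
                end = start
                # Dominated triplets: l_j >= l_i and V(j) <= V(i).
                while end < len(ls) and entries[end][0] <= V[i]:
                    end += 1
                ls[start:end] = [l]           # remove them and insert (l_i, V(i), i)
                entries[start:end] = [(V[i], i)]

    best = max(range(n), key=V.__getitem__)
    chain = []
    i = best
    while i != -1:
        chain.append(i)
        i = back[i]
    return V[best], chain[::-1]


rects = [(1, 3, 1, 3, 5), (4, 6, 4, 6, 4), (2, 8, 5, 9, 7), (9, 10, 7, 8, 2)]
print(chain_rectangles(rects))  # (11, [0, 1, 3])
```

The sorted Python lists keep the sketch short, but slice assignment on a list can take O(N). With a balanced search tree in their place, each step costs O(log N), as in the analysis above.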
Thus the total running time is O(N log N).

WHOLE-GENOME ALIGNMENTS

Given N species for which we know the phylogenetic tree, a multiple alignment is produced by first finding local alignments between all pairs of species and then performing progressive alignment in the order given by the tree. There is one complication, however: we first have to perform synteny mapping, which means finding long regions with many collinear alignments. Then, for each syntenic region, we perform chaining and global alignment.

Gene Recognition

THE CENTRAL DOGMA

The basic unit of expression in an organism is a gene. DNA information is encoded into gene structures that produce proteins, which are in turn the structural and functional units of
