DOC PREVIEW
Stanford CS 262 - Study Notes

This preview shows page 1-2 out of 5 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 5 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 5 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 5 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CS262: Computational Genomics - 1 - Problem Set 4 Due at the beginning of class on Tuesday, March 14. Collaboration is allowed in groups of at most three students, but you must submit separate write-ups. Please also write the names of all your collaborators on your submissions. If you are working alone, we will drop the problem with the lowest score. Problem 1. (25 points) Chaining Local Alignments (a) (10 points) Consider the following problem: Let S be a sequence of k numbers n1, ..., nk. Let each number have an associated weight w1, ...,wk. We want to find the heaviest increasing subsequence of S; that is, a subsequence m1iin...n << (where i1 < ... < im), such that m1iiw...w ++ is maximum. Show how to solve this problem in O(k log k) time by reducing it to the chaining problem presented in class, for which a sparse dynamic programming algorithm was described. (b) Let X and Y be DNA sequences of lengths LX and LY respectively. Let AX and AY be two sequences of (n − 1) anchors that will be used to align X and Y. Anchors in AX and AY are ordered from left to right (anchor i of AX will match anchor i of AY , and is located to the left of anchor j of AX if and only if i < j). Throughout this problem, assume the lengths of the anchors themselves are negligible. Let xi be the distance between anchors i−1 and i in AX, and similarly for yi. We also include the distances between the beginning of the sequence and the first anchor, and between the last anchor and the end; therefore, the sets Xd = x1, . . . , xn and Yd both have cardinality n, one greater than the number of anchors. We will globally align X and Y by fixing the alignments of corresponding anchors to each other, and using Needleman-Wunsch to align the portions of X and Y between neighboring anchors. We want to calculate the running time of this anchored alignment in terms of xi and yi. For the following two questions, consult the list of definitions at the end of this problem. … 1x2x1nx− nx3x2y1ny−ny3y1yCS262: Computational Genomics - 2 - (i) (7 points) Suppose LX = L = LY and the anchors in X and Y are spaced in exactly the same way; that is, xi = yi for all i. Write down the running time as a function of only those terms: L, n, and Var(Xd). (ii) (8 points) Without making any assumptions about the relationship between the anchors in X and Y, write down the running time as a function of only those terms: LX, LY, n, and Cov(Xd, Yd). Definitions: Mean: ∑==n1iixn1x Variance: ( )∑=−=n1i2idxxn1)X(Var Covariance: ( )( )( )∑=−−=n1iiiddyyxxn1)Y,X(Cov Problem 2. (25 points) Multiple Sequence Alignments (a) (12 points) Consensus multiple alignment versus sum-of-pairs multiple alignment. Definitions: (adapted from Gusfield, p. 352) (1) Given a multiple alignment M of a set of strings S, the consensus character of column i of M is the character that maximizes the summed score between the character and all the characters in column i. (In case of ties, say by convention that we prefer A over C over G over T over ‘gap’). The score of (gap, gap) is 0. Let d(i) denote that maximum summed score in column i. (2) The consensus string SM derived from alignment M is the concatenation of the consensus characters for each column of M. (3) The alignment score of SM equals to the sum of column scores d(SM) = d(1) + … + d(m), where M has m columns. (4) The optimal consensus multiple alignment is a multiple alignment M for input string set S whose consensus string SM has largest alignment score over all possible multiple alignments of S. Example: S = { AGCC, ACC, TCC }, and match, mismatch, gap = +2, -2, -3. Consider the following alignments: M1: { AGCC, A-CC, T-CC }; SM1 = A-CC, and d(SM1) = (+2) + (-3) + (+6) + (+6) = 11. M2: { AGCC, A-CC, -TCC }; SM2 = AGCC, and d(SM2) = (+1) + (-3) + (+6) + (+6) = 10. Show an example with three or more sequences where all optimal multiple alignments according to the above model are different from all optimal alignments according to the Sum-Of-Pairs model. In other words, since there may be several equally-scoring optimal alignments, theCS262: Computational Genomics - 3 - set of optimal alignments for the consensus model must be disjoint from the set of optimal alignments for the Sum-Of-Pairs model. Assume either a match, mismatch, and gap penalty of (+2, -2, -3) or (+2, -1, -1) (you may find the second set of scoring parameters easier to prove). Let the alphabet be {A,C,G,T}. (b) (13 points) Phylogenetic-tree–based alignment. Definitions: (1) Given an input rooted binary tree T with a distinct string (from a set of strings S) written at each leaf, a phylogenetic alignment for T is an assignment of one string to each internal node of T, subject to the additional constraint described below. Note that the strings assigned to internal nodes need not be distinct and need not be from the input strings S. (2) If strings S and S’ are assigned to the endpoints of an edge (i, j), then that edge has edge distance D(S, S’), which is simply the maximum pair-wise alignment score between the two strings S and S’. The distance of a phylogenetic alignment is the total of all edge distances in the tree. (3) The phylogenetic alignment problem for T is to find an assignment of strings to internal nodes of T (one string to each node) that maximizes the distance of the alignment. Additional constraint: In the maximum pair-wise alignment between a leaf node x and an internal node S, a letter in x must match or mismatch some letter in S, and not be gapped. Also, the maximum pair-wise alignment between two adjacent internal nodes S and S’ may not have any gaps. Then, constructing the multiple alignment of the input sequences given the phylogenetic alignment is straight-forward. Note: it is also possible to define the problem where T is unrooted. If you prefer that definition, please go ahead and use it instead. Example: S = {x = ACC, y = AGCC, z = TCC}, and match, mismatch, gap = +2, -2, -3. Let T = [Nodes = {x, y, z, vyz, vxyz}; Edges = {(x, vxyz), (vxyz, vyz), (vyz, y), (vyz, z)}, Root = vxyz], with leafs {x, y, z} and root vxyz. Here is a phylogenetic alignment, which is just a labeling of the internal nodes: Label vyz with “AGCC”, and vxyz with “AGCC”. Then, the alignment score is D(x, vxyz) + D(vxyz, vyz) + D(vyz,


View Full Document

Stanford CS 262 - Study Notes

Documents in this Course
Lecture 8

Lecture 8

38 pages

Lecture 7

Lecture 7

27 pages

Lecture 4

Lecture 4

12 pages

Lecture 1

Lecture 1

11 pages

Biology

Biology

54 pages

Lecture 7

Lecture 7

45 pages

Load more
Download Study Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Study Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Study Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?