DOC PREVIEW
Stanford CS 262 - Computational Genomics

This preview shows page 1 out of 4 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 4 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 4 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CS262: Computational genomicsCS 262 – Problem Set 3(due at the beginning of class on Thursday, Feb 26)Collaboration is allowed, but you must submit separate writeups. Please alsowrite the names of all your collaborators on your submissions.1. Sequence assembly(a) Let F be a collection of fragments. The overlap multigraph of F, denoted as OM(F) is a directed, weighted multigraph. The set V of nodes of this structure is just F itself. A directed edge from a ε F to a different fragment b ε F with weight t >= 0 exists if the suffix of a with t characters is a prefix of b.i. Explain how directed paths in this graph give rise to a multiple alignment of sequences belonging to this path. Also explain how a consensus sequence can be derived, providing a common superstring of the involved sequences. ii. Let P be a path in OM(F) that goes through every vertex (P is any Hamiltonian path). Let S(P) be the common superstring derived from P. Let w(P) be the weight of P. Prove that minimizing |S(P)| is equivalent tomaximizing w(P). (Note: S(P) is the sequence of the target DNA molecule to be assembled)iii. A collection of fragments F is said to be substring-free if there are no two distinct strings a and b in F such that a is a substring of b. Provethat if S is a shortest common superstring of F, there is a Hamiltonian path P such that S = S(P).iv. If F is a collection of strings, prove that there is a unique substring-free collection G equivalent to F (ie, having the same superstring). Howdoes this result help you?v. Prof. Kotovsky suggests a greedy approach to solve the sequence assembly problem formulated as a shortest common superstring problem. He says, “We know that looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph. To maximize the weight, we can simplify the multigraph and consider only the heaviest edge between every pair of nodes, discarding other parallel edges of smaller weight. To compute the heaviest path, continuously add the heaviest available edge, which is one that does not upset the construction of a Hamiltonian path given the previously chosen edges. Because the graph is complete (zero-weight edges are also assumed to be present),this process stops only when a path containing all vertices is formed.” Prove that this greedy strategy does not always produce the best result.2. Chaining Local Alignments(a) Consider the following problem:1CS262: Computational genomicsLet S be a sequence of numbers, n1, . . . , nk. Let each number have an associated weight, w1, . . . ,wk. Find the heaviest increasing subsequence of S, that is, a subsequence ni1 , . . . , nim such that i1 < . . . < im, ni1 < . . . < nim, andwi1 + . . . + wim is maximum. Show how to solve this problem in O(k log k) time by reducing it to thechaining problem presented in class, for which a sparse dynamic programming algorithm was presented.(b) Let X be a sequence of length L1 and Y be a sequence of length L2. Let AXand AY be two sequences of (n − 1) anchors that will be used to align XandY. Anchors in AX and AY are ordered from left to right (anchor i of AX willmatch anchor i of AY , and is located to the left of anchor j of AX if and only if i < j). Throughout this problem, assume the lengths of the anchors themselves arenegligible. Let xi be the distance between anchors i−1 and i in AX, and similarly for yi. We also include the distances between the beginning of the sequence and the first anchor, and between the last anchor and the end; therefore, the sets Xd = x1, . . . , xn and Yd both have cardinality n, 1 greater than the number of anchors.We will globally align X and Y by fixing the alignments of correspondinganchors to each other, and using NW to align the portions of X and Y between neighboring anchors.i. Write down a formula for the running time of this anchored alignmentinterms of xi and yi.ii. Suppose L1 = L = L2 and the anchors in X and Y align to each otherexactly; that is, xi = yi for all i. Write down the running time as a functionof L and n, and certain fundamental statistical quantities of the set Xd.What is the running time if we assume xi is distributed according to aPoisson distribution? Discuss the validity of such an assumption.2CS262: Computational genomicsiii. Without making any assumptions about the relationship between the anchors in X and Y , write down the running time as a function of L1, L2, n,and certain fundamental statistical quantities of the sets Xd and Yd. Whatis the running time if we assume xi and yi are distributed according to independent Poisson distributions? Discuss the validity of such an assumption.3. Multiple Alignments and Sequence Assembly(a) The recurrences in (6.7) do not penalize opening a gap. Write recurrences that do so. That is, the score of an alignment is the one implied in (6.7), except for an additional −kd where k is the number of opened gaps, and −d the constant penalty for opening a gap.(b) Consider the following scoring model:For each column, the most common letter (or gap) is the “consensus”. Then,each letter in the column that is different from the consensus incurs a constant penalty. Show an example with three or more sequences where their optimal multiplealignment according to the above model is different from the one according to a SP model. Assume a match, mismatch, and gap penalty of +1, -1, -3. Let the alphabet be {A,C,G,T}.(c) Assume that the human genome has a length G, and that there are N fragments to be assembled in total, each of length L. The coverage ais given by NL/G. Assume that G is much larger than L. The fragments are assumed to be taken at random from the original full-length sequence, so that ignoring end-effects, the position of the left-hand end of any fragment is uniformly distributed in (0,G). i. What is the mean proportion of the genome covered by contigs? Howmuch should the coverage be for 99% of the genome to be covered? How much should the coverage be for 99.9% of the genome to be covered?ii. What is the mean number of contigs, in terms of N,L and G?iii. What is the mean contig size?4. Suffix Trees(a) What is the number of nodes (including leaf nodes) and edges in the suffix tree for the following strings:i. ABABCCABCAB$ii. ABCDDCBA$(b) Suppose you have a string of length n over an alphabet of m letters(as usual, we append a $, which we assume is not in the alphabet, to our string to ensure m unique leaves). You may assume m,


View Full Document

Stanford CS 262 - Computational Genomics

Documents in this Course
Lecture 8

Lecture 8

38 pages

Lecture 7

Lecture 7

27 pages

Lecture 4

Lecture 4

12 pages

Lecture 1

Lecture 1

11 pages

Biology

Biology

54 pages

Lecture 7

Lecture 7

45 pages

Load more
Download Computational Genomics
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Computational Genomics and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Computational Genomics 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?