CS262: Computational genomics

CS 262 – Problem Set 2
(due at the beginning of class on Thursday, Jan 29)

Collaboration is allowed, but you must submit separate writeups. Please also write the names of all your collaborators on your submissions.

1. HMM Learning

Let H be an HMM with K states, and let X1 . . . XN be N training sequences, each of length L.

(a) What is the computational complexity of a single revision round for the HMM parameters akl and ek(b) during

i. Baum-Welch training
ii. Viterbi training

(For concreteness, a sketch of one Baum-Welch revision round appears after Problem 2.)

(b) Suppose we have a large number of sequences emitted by an HMM that has a particular transition probability akl = 0, for some k and l. Say that we now use these emitted sequences to train (using Baum-Welch) a new HMM with the same architecture, one that happens to start with akl = 0. Prove that the parameter akl will remain 0 after the training.

(c) Initial parameters play an important role in the eventual result of HMM training. Both Baum-Welch and Viterbi training are, unfortunately, local optimizations. Prove this, using the following construction:

i. Choose one algorithm: Baum-Welch or Viterbi.
ii. Create an HMM which is the "true" model of a random process.
iii. Create a set of (one or more) training sequences. Their number and length are your choice, but they should satisfy one requirement: Akl and Ek(b), the counts of each transition and emission in all the sequences, should be exactly consistent with akl and ek(b), the true model parameters.
iv. Choose a set of initial nonzero parameters for a new HMM which has the correct architecture, such that your chosen learning algorithm converges to the true model.
v. Choose another set of initial nonzero parameters, such that your chosen learning algorithm converges to a wrong local maximum.

(d) Given an observed sequence X, we have seen an efficient method in class for calculating argmaxπ Prob(π | X). One might ask why we are interested in this π and not argmaxπ Prob(X | π) instead. The latter state sequence, after all, is the one that, if given, has the highest probability of producing X. Illustrate why this is the wrong π to find, with the help of an example HMM.

2. Pair HMMs for Alignments

(a) What is the smallest pair HMM you can build that closely corresponds to running Needleman-Wunsch with a constant (not affine) gap penalty?

(b) The pair HMM for alignments in the Durbin book (Figure 4.2) does not allow a gap in x followed immediately by a gap in y. Modify the state diagram to allow for this, and make it such that the "gap open" penalty of the second gap (the gap in y) is an arbitrary value between the regular gap-open penalty and the gap-extend penalty. Write down the modified Viterbi algorithm, using log-odds equations. (A sketch of the book's unmodified recurrences appears after part (d).)

(c) Construct a pair HMM for overlap matches. Show the state diagram together with the transition probabilities (as functions of the parameters), but not necessarily the Viterbi recurrences. What is the running time of the Viterbi algorithm in this HMM? Try to include the constants in your estimate.

(d) Construct a pair HMM for alignments bound to fall within k of the diagonal (as in bounded Needleman-Wunsch). How many states does your HMM have? What is your best alternative suggestion for running Viterbi efficiently for k-bounded alignments?
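For reference on Problem 1, the following is a minimal sketch of one Baum-Welch revision round for a discrete-alphabet HMM. It is written in Python for this writeup and is not code from the course; the array names a, e, A, E mirror the notation akl, ek(b), Akl, Ek(b) above, and the uniform start distribution is a simplifying assumption. Real implementations work in log space or rescale the forward/backward values to avoid underflow, and typically add pseudocounts before re-estimating.

import numpy as np

def baum_welch_round(a, e, seqs):
    # a[k, l]: transition probability from state k to state l  (K x K)
    # e[k, b]: probability that state k emits symbol b         (K x B)
    # seqs:    training sequences, each a list of integer symbols
    K = a.shape[0]
    A = np.zeros_like(a)          # expected transition counts Akl
    E = np.zeros_like(e)          # expected emission counts Ek(b)
    for x in seqs:
        L = len(x)
        # Forward pass: f[i, k] = P(x1..xi, state at i is k).
        f = np.zeros((L, K))
        f[0] = e[:, x[0]] / K                      # uniform start (assumption)
        for i in range(1, L):
            f[i] = e[:, x[i]] * (f[i - 1] @ a)
        # Backward pass: b[i, k] = P(x(i+1)..xL | state at i is k).
        b = np.zeros((L, K))
        b[L - 1] = 1.0
        for i in range(L - 2, -1, -1):
            b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
        px = f[L - 1].sum()                        # P(x)
        # Expected counts. Each transition term carries a factor a[k, l],
        # so if a[k, l] = 0 then A[k, l] accumulates nothing (compare part (b)).
        for i in range(L - 1):
            A += a * np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) / px
        for i in range(L):
            E[:, x[i]] += f[i] * b[i] / px
    # Re-estimation: akl = Akl / sum over l' of Akl',
    #                ek(b) = Ek(b) / sum over b' of Ek(b').
    # (Assumes every state is reached; pseudocounts omitted for brevity.)
    a_new = A / A.sum(axis=1, keepdims=True)
    e_new = E / E.sum(axis=1, keepdims=True)
    return a_new, e_new

Counting the work done per position in each loop here is a good starting point for part (a); Viterbi training differs in that it replaces the expected counts with counts along the single best path.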
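For Problem 2, the following sketch gives the standard log-odds Viterbi recurrences for the book's three-state pair HMM of Figure 4.2 (match state M; state X emitting a symbol of x against a gap; state Y emitting a symbol of y against a gap), before any modification. Here d and e denote the gap-open and gap-extend penalties and the scoring function s is assumed given; the code is illustrative only, since parts (a), (b), and (d) all ask for models that differ from this one.

import numpy as np

NEG_INF = float("-inf")

def pair_hmm_viterbi(x, y, s, d, e):
    # Log-odds Viterbi for the unmodified pair HMM of Durbin et al., Fig. 4.2.
    # s(a, b): log-odds substitution score; d: gap open; e: gap extend.
    n, m = len(x), len(y)
    M = np.full((n + 1, m + 1), NEG_INF)   # x[i] aligned to y[j]
    X = np.full((n + 1, m + 1), NEG_INF)   # x[i] aligned to a gap
    Y = np.full((n + 1, m + 1), NEG_INF)   # y[j] aligned to a gap
    M[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                M[i, j] = s(x[i - 1], y[j - 1]) + max(
                    M[i - 1, j - 1], X[i - 1, j - 1], Y[i - 1, j - 1])
            if i > 0:
                # no Y -> X transition in the unmodified model
                X[i, j] = max(M[i - 1, j] - d, X[i - 1, j] - e)
            if j > 0:
                # no X -> Y transition in the unmodified model
                Y[i, j] = max(M[i, j - 1] - d, Y[i, j - 1] - e)
    return max(M[n, m], X[n, m], Y[n, m])

Note that X and Y never reach each other in these recurrences; that is exactly the restriction part (b) asks you to lift.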
3. Pair HMMs, continued

(a) Recall the example of the CpG islands. When aligning genomic sequences, one may want to account for the different dinucleotide frequencies in CpG versus non-CpG regions, and also for the potentially different rates of mutation of different dinucleotides between two organisms, in the two kinds of regions. Design an HMM (perhaps a second-order one) for alignment that tries to incorporate a model of CpG islands. Show just the state diagram, together with an explanation of what each state means and what the possible emissions are. Try to explain your design choices, and how they may improve the overall quality of alignment. In this problem you may wish to combine some of these ideas: substitution matrices, pair HMMs, and higher-order HMMs. You are free to make your own design choices, and any reasonable answer, with a brief explanation, will receive full credit.

(b) Define a global alignment objective function as follows: given parameters m, s, d, and e, a substring xi . . . xj is locally aligned to yi' . . . yj' with score S(i, j, i', j') equal to the optimal Needleman-Wunsch score for the substrings xi . . . xj and yi' . . . yj'. Two such local alignments with scores S(l, r, l', r') and S(i, j, i', j'), with l < i, can be chained if r < i and r' < i'. The penalty for such a chaining is a gap penalty of d' + e'(i − r + i' − r'). The optimal global alignment score is the maximal score s1 + . . . + sk + p1 + . . . + pk−1 that can be obtained by chaining k local alignments T1, . . . , Tk, i.e., such that every pair of consecutive local alignments Ti and Ti+1 can be chained. In this formula, si represents the score of the ith alignment, and pi the penalty of the gap between Ti and Ti+1. Build a pair HMM that finds the optimal global alignment under this objective function, in time and memory O(MN).

(c) By drawing a pair HMM, show how you can obtain an algorithm for alignment with a gap scoring function that approximates any given function γ(n) with k linear segments. (See page 18 of the scribed notes for Lecture 2.) The Viterbi algorithm for that HMM should run in time O(MNk^2).

(d) As an extension of pair HMMs, consider k-HMMs, which align k sequences simultaneously, each of length at most N. At any given position, any number of the sequences could be gapped. Describe a k-HMM in the same spirit as a pair HMM. Note that there may be a few design choices in terms of transitions between different states, so somewhat different answers may be acceptable for this part. How many parameters does your model have? What would be the running times of Viterbi, Forward, and Backward in your HMM, in terms of k and N, where you are aligning k sequences of length N each?

4. Gene Prediction (Note: this problem is open-ended)

Prediction of genes in DNA is one of the most important annotation problems in bioinformatics. Genes are considerably more conserved than the rest of the DNA sequence. In this problem, you will be examining a very simplified version of gene recognition using comparative genomics.

You are given the full genomes of two exotic animals, X and Y. The genes in these animals are conserved at 80% on average. As it turns out,

