DOC PREVIEW
CMU BSC 03711 - Homework

This preview shows page 1-2-3-4 out of 13 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

03-511/711 Computational Genomics and Molecular Biology, Fall 20101Problem Set 3 Due November 18thCollaboration is allowed on this homework. You must hand in homeworks individually and list thenames of the people you worked with.Turn in your handwritten answers on the attached sheets.1. The Kimura 2-parameter model of sequence evolution distinguishes between transitions (purine-purine and pyrimidine-pyrimidine replacements) and transversions (purine-pyrimidine and pyrimidine-purine replacements). Under the Kimura model, the expected number of sites at which a substitutionoccurred can be estimated from the number of mismatching sites byd = −N(0.5ln(1− 2 ˆp1− ˆp2) + 0.25ln(1− 2 ˆp2)),where N is the length of the alignment and ˆp1and ˆp2are the number of transitions per site and thenumber of transversions per site, respectively.(a) Suppose you are given an alignment of two sequences, each of length 1400, with 110 transitionsand 30 transversions. What is the estimated substitution distance between the two sequences?(b) Calculate the estimated substitution distance between the same two sequences using the Jukes-Cantor model. Does the Jukes-Cantor model underestimate or overestimate the distance com-pared with the Kimura 2-parameter model?03-511/711 Computational Genomics and Molecular Biology, Fall 201022. Define an HMM H with three states {S1, S2, S3}, alphabet {a, b, c}, initial state probabilities π1=0.25, π2= 0.75, π3= 0 and the following transition and emission probabilities:S1S2S3a b cS10.0 0.5 0.5 0.5 0.5 0.0S21.0 0.0 0.0 0.3 0.3 0.4fg S30.0 1.0 0.0 0.25 0.0 0.75(a) Draw the state diagram of this HMM and show the transition probabilities.(b) Write down all of the possible state paths for the sequence O = a, c, a.03-511/711 Computational Genomics and Molecular Biology, Fall 20103(c) What is P(O)?(d) What is the most probable path, Q∗? What is P(O|Q∗), the probability of O for this path?(e) The Forward algorithm calculates P(O), the probability that the model will emit a given se-quence, O, over all possible paths. One might consider approximating this probability by calcu-lating P(O|Q∗) using the Viterbi algorithm. For this particular HMM, would P(O|Q∗) be a goodapproximation P(O)? Explain your reasoning.03-511/711 Computational Genomics and Molecular Biology, Fall 201043. We are given the following aligned sequences for the fabA transcription factor binding site in severalbacterial species and wish to construct a PSSM.HI AGAAAHI AGTTTPA ACAAAEC TGTTCVC TGTTTYP TGCTC(a) Compute the frequency and the log-odds scoring matrices for A, C, G, and T for all columnsin the alignment (remember to use log base 2). Correct for the zero frequency case using pseu-docounts with b = 1. Assume that the background probability for each base is 0.25. Show thefrequency and log-odds matrices.(b) Calculate the log-odds score for each of the following sequences using your PSSM.•AGATC•AGAAA•TGTTC03-511/711 Computational Genomics and Molecular Biology, Fall 201054. Consider the following HMM to recognize the same family of the fabA transcription factor bindingsites. All paths begin in the start state, 0, and end in the final state, f. States 0 and f emit no token.The transition and emission probabilities for this HMM are:0 S1S2S3S4S5S6S7S8f A C G T0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0S10.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.47 0.03 0.03 0.47S20.0 0.0 0.0 0.33 0.0 0.0 0.67 0.0 0.0 0.0 0.03 0.14 0.80 0.03S30.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.91 0.03 0.03 0.03S40.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.91 0.03 0.03 0.03S50.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.91 0.03 0.03 0.03S60.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.03 0.22 0.03 0.72S70.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.03 0.03 0.03 0.91S80.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.03 0.47 0.03 0.47(a) Draw the state diagram and the transition probabilities for this HMM.03-511/711 Computational Genomics and Molecular Biology, Fall 20106(b) Determine the probability of observing each of the following sequences. Do you need the For-ward algorithm to do this? Why or why not?•AGATC•AGAAA•TGTTC(c) Notice that there are a lot of correlations in the multiple alignment of fabA binding sites. If thethird position is a ’T’, then the fourth position is also a ’T’ and the fifth position usually is, too.Similarly, if the third position is an ’A’, the fourth and fifth positions are also ’A’. You can seeother correlations as well.Compare your results in part (c) with your results from Problem 3. How do HMMs and PSSMscompare in their ability to recognize correlations?03-511/711 Computational Genomics and Molecular Biology, Fall 201075. Coiled coils have a repeated a pattern that recurs every seven positions. This pattern results from thebiophysical constraints imposed by the coiled coil structure. For example, positions 1 and 4 are onthe inside of the coiled coil structure and tend to be hydrophobic.(a) You want to build an HMM that exploits this repeated pattern to recognize sequences that par-ticipate in coiled coils. As training data, you have a set of known coiled coil sequences in whichthe starting and ending positions of the coiled coil region, as well as the seven positions of thepattern, are labeled.Once you have constructed your topology, what method should you use to determine the tran-sition and emission probabilities? Give the name of the method and any equations you mightneed.(b) You have trained the above HMM and now you want to determine the location of the coiled coilregion in a test sequence that is known to contain a coiled coil. Give the names of two methodsyou could use to label the data? In each case, how will the output tell you the location?03-511/711 Computational Genomics and Molecular Biology, Fall 20108(c) All coiled coils have the length 7 repeated pattern, but the pattern is somewhat different in coiledcoils that participate in 4-helix bundles and those that participate in coiled coil pairs. You wishto construct separate HMM’s, one for that recognizes coiled coil pairs and one that recognizes4-helix bundles. Both models will have the same topology, but the parameters will be different.As training data, you are given two sets of unlabeled sequences, one containing coiled coilsequences that participate in pairwise interactions and the other containing sequences that form4-helix bundles. In this case, what method will you use to determine the transition and


View Full Document

CMU BSC 03711 - Homework

Documents in this Course
lecture

lecture

8 pages

Lecture

Lecture

3 pages

Homework

Homework

10 pages

Lecture

Lecture

17 pages

Delsuc05

Delsuc05

15 pages

hmwk1

hmwk1

2 pages

lecture

lecture

6 pages

Lecture

Lecture

10 pages

barnacle4

barnacle4

15 pages

review

review

10 pages

Homework

Homework

10 pages

Midterm

Midterm

12 pages

lecture

lecture

11 pages

lecture

lecture

32 pages

Lecture

Lecture

7 pages

Lecture

Lecture

17 pages

Lecture

Lecture

12 pages

Lecture

Lecture

21 pages

Lecture

Lecture

11 pages

Lecture

Lecture

28 pages

Logistics

Logistics

11 pages

lecture

lecture

11 pages

Lecture

Lecture

8 pages

Lecture

Lecture

9 pages

lecture

lecture

8 pages

Problem

Problem

6 pages

Homework

Homework

10 pages

Lecture

Lecture

9 pages

Problem

Problem

7 pages

hmwk4

hmwk4

7 pages

Problem

Problem

6 pages

lecture

lecture

16 pages

Problem

Problem

8 pages

Problem

Problem

6 pages

Problem

Problem

13 pages

lecture

lecture

9 pages

Problem

Problem

11 pages

Notes

Notes

7 pages

Lecture

Lecture

7 pages

Lecture

Lecture

10 pages

Lecture

Lecture

9 pages

Homework

Homework

15 pages

Lecture

Lecture

16 pages

Problem

Problem

15 pages

Load more
Download Homework
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Homework and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Homework 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?