03-511/711 Computational Genomics and Molecular Biology, Fall 2010 1Problem Set 2b Due October 13th, 20111. Define an HMM H with three states {A, B, C} and alphabet {0, 1, 2} and the following tran-sition and emission probabilities:A B C 0 1 2A 0.2 0.8 0.0 0.8 0.2 0.0B 0.0 0.8 0.2 0.0 0.6 0.4C 0.4 0.0 0.6 0.2 0.0 0.8(a) Draw the state diagram of this HMM and show the transition probabilities.(b) Assuming that initial state is A, give all of the possible state paths for the sequenceO = 0, 1, 2, 0.03-511/711 Computational Genomics and Molecular Biology, Fall 2010 2(c) For each possible state path, Qi, calculate the probability of emitting O from the statesin that path. What is the total probability of the sequence, P (O)?(d) What is the most probable path, Q∗? What is the probability of emitting O via themost probable path?(e) The Forward algorithm calculates P (O), the probability that the model will emit a givensequence, O, over all possible paths. One might consider approximating this probabilityby calculating P (O |Q∗) using the Viterbi algorithm. For this particular HMM, wouldP (O|Q∗) be a goo d approximation of P (O)? Explain your reasoning.03-511/711 Computational Genomics and Molecular Biology, Fall 2010 32. Consider the following HMM that emits sequences, drawn from a letter alphabet, Σ = {H, L},that participate in coiled coil structures. All paths begin in the initial state, Si, andend in the termination state, St. Note that thes e are not silent states. The transitionand emission probabilities for this HMM are:Transition probabilities Emission probabilitiesSiA B C D E F G StH LSi0.5 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.5A 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9 0.1B 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.2 0.8C 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.2 0.8D 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.9 0.1E 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.2 0.8F 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.2 0.8G 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.2 0.8St0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.5 0.5Coiled coils are protein secondary structures formed by two intertwined alpha helices. Se-quences that participate in coiled coils have a characteristic seven-fold repeat. The sevenpositions in this patter are typically denoted by the letters A-G. The residues in the A andD p os itions are at the interface of the two helices and, hence, are hydrophobic. The residuesin the B, C, E, F and G positions are exposed and typically hydrophilic.(a) Draw the state diagram for this HMM. Label the transitions with their probabilities.03-511/711 Computational Genomics and Molecular Biology, Fall 2010 4(b) What is the length of the shortest sequence that this HMM can generate? What is thethe length of the longest longest sequence that this HMM can recognize?(c) Coiled coils sequences typically have two or more consecutive copies of the seven-foldrepeat. How does the probability of coiled coil sequences generated by this HMM varywith copy numb e r? Are sequences with two copies of the s even-fold repeat more likelythan sequences with just one copy?(d) Coiled coils sometimes contain an “offset,” i.e., insertions of a few amino acids betweencopies of the seven-fold repeat. Can this HMM emit coiled coil sequences with an offset?Explain your answer. If not, how would you modify the HMM to emit sequences withan offset between the copies?03-511/711 Computational Genomics and Molecular Biology, Fall 2010 53. There are three stop codons: TAG, TAA, and TGA. In the extremophile bacterium Deinococ-cus radiodurans, the relative frequency of these stop codons is 5%, 12%, and 83%. Constructan HMM that emits these stop codons, and only these codons, using the smallest number ofstates possible. (Hint: the m inimum number of states re quired is less than nine.)Give the topology of model and the initial, transition, and emission probabilities. Your modelshould emit the three alternate stop codons in the correct frequencies. There is more thanone solution; just give one.03-511/711 Computational Genomics and Molecular Biology, Fall 2010 64. Consider the following three-state HMM for transmembrane proteins:The parameters for this model arei E M Cπi0 0 1ei(H) 0.2 0.9 0.3ei(L) 0.8 0.1 0.7(a) Complete the partial calculation on the next page by by filling in the blank boxes in thematrix shown.(b) What algorithm is being used to generate this matrix?(c) What is, Q∗(O), the most probable path given O = LHHL?(d) Give the most probable state for each time step, t = 0, 1, 2, 3, 4(e) For how many time steps is the state on the most probable path also the most probablestate. For which times, if any, are the most probable state and the s tate on the mostprobable path not the same?π( C) e( C) π( M) e( M) π( E) e( E)L t=0 1 0.7 0.7000 0 0.1 0.0000 0 0.8 0.0000j δ(t,j) a(j,C) e( C) δ(t,j) a(j,M) e( M) δ(t,j) a(j,E) e( E)H t=1 C 0.7000 0.85 0.3 0.1785 0.7000 0.15 0.9 0.0945 0.7000 0 0.2 0.0000M 0 0.2 0.3 0.0000 0 0.6 0.9 0.0000 0 0.2 0.2 0.0000E 0 0 0.3 0.0000 0 0.25 0.9 0.0000 0 0.75 0.2 0.00000.1785 0.0945 0.0000H t=2 C 0.1785 0.85 0.3 0.0455 0.1785 0.15 0.9 0.0241 0.1785 0 0.2 0.0000M 0.0945 0.2 0.3 0.0057 0.0945 0.6 0.9 0.0510 0.0945 0.2 0.2 0.0038E 0.0000 0 0.3 0.0000 0.0000 0.25 0.9 0.0000 0.0000 0.75 0.2 0.00000.0455 0.0510 0.0038L t=3 C 0.85 0.7 0.15 0.1 0 0.8M 0.2 0.7 0.6 0.1 0.2 0.8E 0 0.7 0.25 0.1 0.75 0.8Final probability:CM E03-511/711 Computational Genomics and Molecular Biology, Fall 2010 75. Consider the same TM-HMM as in the previous question:with model parameters:i E M Cπi0 0 1ei(H) 0.2 0.9 0.3ei(L) 0.8 0.1 0.7(a) Calculate P (O|TM-HMM), the total probability of the sequence O = LHHL with re-spect to the transmembrane model. Use the worksheet on the next page. What isP (LHHL|TM-HMM)?(b) What algorithm did you use to generate this matrix?π( C) e( C) π( M) e( M) π( E) e( E)L t=0j α(t-1,j) a(j,C) e( C) α(t-1,j) a(j,M) e( M) α(t-1,j) a(j,E) e( E)H t=1 CMEα(t,j) α(t,j) α(t,j)H t=2 CMEL t=3 CMEFinal probability:CM
View Full Document