03-511/711 Computational Genomics and Molecular Biology, Fall 20101Problem Set 3 Due November 18thCollaboration is allowed on this homework. You must hand in homeworks individually and list thenames of the people you worked with.Turn in your handwritten answers on the attached sheets.1. The Kimura 2-parameter model of sequence evolution distinguishes between transitions (purine-purine and pyrimidine-pyrimidine replacements) and transversions (purine-pyrimidine and pyrimidine-purine replacements). Under the Kimura model, the expected number of sites at which a substitutionoccurred can be estimated from the number of mismatching sites byd = −N(0.5ln(1− 2 ˆp1− ˆp2) + 0.25ln(1− 2 ˆp2)),where N is the length of the alignment and ˆp1and ˆp2are the number of transitions per site and thenumber of transversions per site, respectively.(a) Suppose you are given an alignment of two sequences, each of length 1400, with 110 transitionsand 30 transversions. What is the estimated substitution distance between the two sequences?(b) Calculate the estimated substitution distance between the same two sequences using the Jukes-Cantor model. Does the Jukes-Cantor model underestimate or overestimate the distance com-pared with the Kimura 2-parameter model?03-511/711 Computational Genomics and Molecular Biology, Fall 201022. Define an HMM H with three states {S1, S2, S3}, alphabet {a, b, c}, initial state probabilities π1=0.25, π2= 0.75, π3= 0 and the following transition and emission probabilities:S1S2S3a b cS10.0 0.5 0.5 0.5 0.5 0.0S21.0 0.0 0.0 0.3 0.3 0.4fg S30.0 1.0 0.0 0.25 0.0 0.75(a) Draw the state diagram of this HMM and show the transition probabilities.(b) Write down all of the possible state paths for the sequence O = a, c, a.03-511/711 Computational Genomics and Molecular Biology, Fall 20103(c) What is P(O)?(d) What is the most probable path, Q∗? What is P(O|Q∗), the probability of O for this path?(e) The Forward algorithm calculates P(O), the probability that the model will emit a given se-quence, O, over all possible paths. One might consider approximating this probability by calcu-lating P(O|Q∗) using the Viterbi algorithm. For this particular HMM, would P(O|Q∗) be a goodapproximation P(O)? Explain your reasoning.03-511/711 Computational Genomics and Molecular Biology, Fall 201043. We are given the following aligned sequences for the fabA transcription factor binding site in severalbacterial species and wish to construct a PSSM.HI AGAAAHI AGTTTPA ACAAAEC TGTTCVC TGTTTYP TGCTC(a) Compute the frequency and the log-odds scoring matrices for A, C, G, and T for all columnsin the alignment (remember to use log base 2). Correct for the zero frequency case using pseu-docounts with b = 1. Assume that the background probability for each base is 0.25. Show thefrequency and log-odds matrices.(b) Calculate the log-odds score for each of the following sequences using your PSSM.•AGATC•AGAAA•TGTTC03-511/711 Computational Genomics and Molecular Biology, Fall 201054. Consider the following HMM to recognize the same family of the fabA transcription factor bindingsites. All paths begin in the start state, 0, and end in the final state, f. States 0 and f emit no token.The transition and emission probabilities for this HMM are:0 S1S2S3S4S5S6S7S8f A C G T0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0S10.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.47 0.03 0.03 0.47S20.0 0.0 0.0 0.33 0.0 0.0 0.67 0.0 0.0 0.0 0.03 0.14 0.80 0.03S30.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.91 0.03 0.03 0.03S40.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.91 0.03 0.03 0.03S50.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.91 0.03 0.03 0.03S60.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.03 0.22 0.03 0.72S70.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.03 0.03 0.03 0.91S80.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.03 0.47 0.03 0.47(a) Draw the state diagram and the transition probabilities for this HMM.03-511/711 Computational Genomics and Molecular Biology, Fall 20106(b) Determine the probability of observing each of the following sequences. Do you need the For-ward algorithm to do this? Why or why not?•AGATC•AGAAA•TGTTC(c) Notice that there are a lot of correlations in the multiple alignment of fabA binding sites. If thethird position is a ’T’, then the fourth position is also a ’T’ and the fifth position usually is, too.Similarly, if the third position is an ’A’, the fourth and fifth positions are also ’A’. You can seeother correlations as well.Compare your results in part (c) with your results from Problem 3. How do HMMs and PSSMscompare in their ability to recognize correlations?03-511/711 Computational Genomics and Molecular Biology, Fall 201075. Coiled coils have a repeated a pattern that recurs every seven positions. This pattern results from thebiophysical constraints imposed by the coiled coil structure. For example, positions 1 and 4 are onthe inside of the coiled coil structure and tend to be hydrophobic.(a) You want to build an HMM that exploits this repeated pattern to recognize sequences that par-ticipate in coiled coils. As training data, you have a set of known coiled coil sequences in whichthe starting and ending positions of the coiled coil region, as well as the seven positions of thepattern, are labeled.Once you have constructed your topology, what method should you use to determine the tran-sition and emission probabilities? Give the name of the method and any equations you mightneed.(b) You have trained the above HMM and now you want to determine the location of the coiled coilregion in a test sequence that is known to contain a coiled coil. Give the names of two methodsyou could use to label the data? In each case, how will the output tell you the location?03-511/711 Computational Genomics and Molecular Biology, Fall 20108(c) All coiled coils have the length 7 repeated pattern, but the pattern is somewhat different in coiledcoils that participate in 4-helix bundles and those that participate in coiled coil pairs. You wishto construct separate HMM’s, one for that recognizes coiled coil pairs and one that recognizes4-helix bundles. Both models will have the same topology, but the parameters will be different.As training data, you are given two sets of unlabeled sequences, one containing coiled coilsequences that participate in pairwise interactions and the other containing sequences that form4-helix bundles. In this case, what method will you use to determine the transition and
View Full Document