03-511/711 Computational Genomics and Molecular Biology, Fall 20111Study questionsThese study problems are intended to help you to review for the final exam. This is not an exhaustive listof the topics covered in the class and there is no guarantee that these questions are representative of thequestions on the final exam. You should also review your class notes, the syllabus and your home workassignments.1. Pairwise sequence alignment(a) Compare, dynamic programming for local, global and semiglobal alignment in terms of whenthey should be used; differences in initialization and termination condition; recursion relation,which scoring functions are suitable.2. Multiple sequence alignment(a) What are the differences between the exact dynamic programming for multiple sequence align-ment and the progressive alignment heuristic?(b) Score the following multiple sequence alignment using sum-of-pairs. Compare the scores andexplain why they differ?AGGAT03-511/711 Computational Genomics and Molecular Biology, Fall 201123. Phylogeny reconstruction(a) Concepts:i. Taxa, leaf nodes, internal nodes, root, branch lengths.ii. Rooted versus unrooted trees; what methods are used to root a tree? For each method, underwhat circumstances can they be applied?iii. Gene trees versus species treesiv. How many trees with k leaves?v. How to enumerate trees and/or traverse tree space.vi. Models of sequence substitution, what are they how are they used?vii. Character data: characters and character statesviii. Distance data: how distances are derived from sequence data; why it is necesary to correctsuch distances; additive and ultrametric distances.(b) For each of the phylogeny reconstruction methods listed below,i. Which types of information can they infer?ii. If a method cannot infer all of the types listed, what biological or algorithmic assumptionsimpose these limitations?iii. Which reconstruction methods are appropriate for each of the types of data listed above?Why?Phylogeny reconstruction methods• UPGMA• Neighbor Joining• Maximum parsimony• Maximum likelihoodInformation inferred by such methods• unrooted tree topology• rooted tree topology• branch lengths,• labels on internal nodesSequences• nucleotide• amino acid03-511/711 Computational Genomics and Molecular Biology, Fall 20113• evolving rapidly• under selective pressure• evolving at a constant rate in all lineages4. Review Fitch’s algorithm.(a) What does it do?(b) How does it work?(c) What can’t it do?5. Tree reconstructionBelow you see trees constructed using 5S small subunit ribosomal RNA genes from four bacterialspecies:M10815 Bacillus subtilis (BS)X02713 Lactobacillus viridescene (LV)M58416 Clostridium tyrobutyicum (CT)K02683 Micrococcus luteus (ML)UPGMA Tree:-------------------| || || || 13.14| || || |28.74 ---- ----| | || | 7.45| | || | -- --| 15.6 | || | 8.15 8.15| | | |ML LV BS CT03-511/711 Computational Genomics and Molecular Biology, Fall 20114Neighbor-Joining tree:CT//BS /10.872.08\ _ 1.86_ _// \/ \ 43.75/ \24.55 / \/ \/ \/ \/ \LV \\\\\\\\MT(a) Are the topologies the same? If not, how do they differ?03-511/711 Computational Genomics and Molecular Biology, Fall 20115(b) In which tree , is the distance from Lactobacillis viridescens to Micrococcus luteus most dis-torted with respect to the original distance matrix derived from the multiple sequence alignment?DISTANCE MATRIX FROM OUTPUTKey for column and row indices:1 BSubtilis2 CTyro3 LViridescens4 MLuteusMatrix 1: Part 11 2 3 4--------------------------------------------------| 1 | 0.00 16.31 26.63 46.18| 2 | 0.00 35.77 54.62| 3 | 0.00 71.66| 4 | 0.00Distance derived from the data: 71.66Distance derived from the NJ tree: 43.75 + 1.86 + 24.55 = 70.16Distance derived from the UPGMA: 15.60 + 13.14 + 28.74 = 57.48(c) Why might UPGMA give a different topology than NJ for these sequences?03-511/711 Computational Genomics and Molecular Biology, Fall 201166. Short questions(a) Using a PSSM instead of a single query sequence in a database search can result in improvedretrieval of distantly related motifs. What might this be the case?(b) Explain why we use pseudocounts to correct for the zero frequency case when building profiles.(c) What are the properties that HMM’s can capture that PSSM’s cannot?(d) How would you use an HMM to calculate a global multiple alignment?(e) What is meant by “unlabeled” data? By “labeled” data?(f) Generally, it is believed that the BLOSUM matrices give better results than the PAM matri-ces. Nevertheless, PAM30 and PAM70 are still used when searching for very closely relatedsequences. Why?(g) What is model surgery? How would you determine whether model surgery is required? If it isrequired, what steps would you take to modify your HMM?(h) Consider the profile HMM shown below:i. M0and M6are silent start and end states. Since these states do not emit a symbol, whatpurpose do they serve?ii. What is the purpose of the I0state?03-511/711 Computational Genomics and Molecular Biology, Fall 201177. Define an HMM H with three states {A, B, C}, and alphabet {1, 2, 3}, initial state probabilities πA=1, πB= 0 and πC= 0 and the following transition and emission probabilities:A B C 1 2 3A 0.25 0.25 0.5 0.5 0.25 0.25B 0.5 0.00 0.5 0.5 0.0 0.5C 0.33 0.33 0.33 0.0 0.5 0.5(a) Draw the state diagram of this HMM and show the transition probabilities.(b) Give all the possible state paths for the sequence O = 1, 3, 1.(c) What is P(O|H )?(d) What is the most probable path? Give the probability of O for this path.(e) For this HMM, would the Viterbi algorithm be a good approximation for Forward algorithm.Why or why not?03-511/711 Computational Genomics and Molecular Biology, Fall 201188. Signal peptides control the entry of proteins into the secretory pathways. These peptides are typi-cally less than 40 residues long, contain a charged N-terminal region, a central stretch of hydrophobicresidues and a cleavage site preceded by three to seven polar residues. The following HMM to rec-ognize signal peptides, proposed by Nielsen and Krogh, contains an n-region module, an h-regionmodule and a c-region module:(a) Which of the three modules, if any, recognize subsequences with geometric length distributions?(b) What is the minimum length of the n-region sequences this HMM will recognize?(c) What is the minimum length of the h-region sequences this HMM will recognize?(d) What is the maximum length of the h-region sequences this HMM will recognize?03-511/711
View Full Document