Fall 2009 Computational Genomics and Molecular Biology

Problem Set 4

Collaboration is allowed on this homework. You must hand in homeworks individually and list the names of the people you worked with.

Due Thursday, December 3rd

1. (a) Verify that the rows of the PAM1 transition matrix sum to one.
(b) Verify that $\sum_i p_i P1[i, i] = 0.99$.
(c) Verify that S[j, k] = S[k, j], where S[·, ·] is the PAM1 log odds scoring matrix.

2. In this problem, you will construct a BLOSUM60 substitution matrix from the following aligned block:

1: DSDQQD
2: DSSQQD
3: SSQQDD
4: DDQQDD

(a) Determine the percent identity between all possible pairs of sequences.
(b) Cluster the sequences such that each sequence in the cluster is at least 60% identical to some other sequence in the cluster.
(c) Calculate the observed frequencies ($a_{xy}$) for the clustered sequences, using the BLOSUM method for adjusting for cluster size.
(d) Calculate the expected frequencies ($a_{xy}$) for the clustered sequences, using the BLOSUM method for adjusting for cluster size.
(e) Use these frequencies to obtain the log odds matrix, as defined by Henikoff and Henikoff.

3. Substitution matrices:
(a) Both the PAM and the BLOSUM substitution matrix families are parametrized by evolutionary divergence. Which represents a greater degree of divergence, BLOSUM80 or BLOSUM62? Why?
(b) Which represents a greater degree of divergence, BLOSUM62 or PAM40? Why?
(c) What is the interpretation of a positive value in Sx[i, j], the PAM x log odds scoring matrix, for a given pair of amino acids i, j?
(d) What is the interpretation of a negative value in Sx[i, j]?
(e) Consider the PAM30 and PAM250 matrices (shown on the web site).
What is the average value on the diagonal of the PAM30 matrix (i.e., the average of S30[i, i] over all values of i)?
(f) What is the average value on the diagonal of the PAM250 matrix?
(g) Which average diagonal value is larger? How would you explain this in terms of the evolutionary divergence associated with each of the matrices?
(h) Which specific diagonal values are larger in PAM250 than in PAM30? That is, for which amino acids, i, is S250[i, i] > S30[i, i]? What does that suggest about the functional or structural properties of i?

4. Serine and threonine (S and T) are small, hydrophilic amino acids; asparagine, aspartic acid, glutamic acid, and glutamine (N, D, E, and Q) are large, hydrophilic amino acids; and methionine, isoleucine, leucine and valine (M, I, L, and V) are small, hydrophobic amino acids. Based on the entries in the PAM250 matrix, which of the following substitutions are you more likely to observe in highly diverged sequences? Show the evidence on which you base your answer. Which property do you think is more important to protein structure: size or hydrophobicity?
(a) The replacement of a small, hydrophilic amino acid with a small, hydrophobic amino acid.
(b) The replacement of a small, hydrophilic amino acid with a large, hydrophilic amino acid.

5. For ungapped alignments, the expected number of high scoring pairs (HSP’s) with score at least S found in the alignment of two random sequences of length m and n is

$$E = K m n e^{-\lambda S}$$

where K and λ are constants that can be derived from the theory and depend on the substitution matrix. We can define a “normalized” score

$$S' = \frac{\lambda S - \ln K}{\ln 2}.$$

(a) Show that the expected number of HSP’s with normalized score at least S′ is

$$E = m n 2^{-S'}$$

(b) Derive an expression for S′ in terms of E.

6. BLAST problem 1: For this problem, we will search with the sequence of Keratin 18, which is a member of the Intermediate Filament family. You will perform three BLAST searches with different parameter settings and compare the results.

These are the basic steps for all three searches:

(i) Go to the BLASTP web site. The BLAST home page is linked off the course syllabus site. Follow the links to find protein-protein BLAST.
(ii) The accession ID for the Keratin 18 amino acid sequence in this problem is NP_000215.1. Enter the accession ID in the search box.
(iii) For all searches, set the following parameters:
• Under “Choose search set”, select “Non-redundant protein sequences (nr)”.
• Under “Organism”, select “Lagomorpha (taxid:9975)”.
• Under “Algorithm Parameters”, set “Expect” to 1.
• Uncheck “Automatically adjust parameters for short input sequences”.
• Set max target sequences to 250.
• Set “Compositional adjustments” to “No adjustment”.
• Uncheck “Filter for low complexity regions”.
• Check “Show results in a new window” so that you can use the same query page for all three searches.
• Use the default for all other parameters, except as specified below.
(iv) Run each of the three searches specified below.
(v) Once each search is completed, click on “formatting options” at the top of the results window. Select “Use old BLAST report format”. Set “Graphical overview” to 250 and “Alignments” to 0. (“Descriptions” should already be set to 250.) Click “Reformat”. If you do not set these formatting options correctly, you will get incorrect information or some of the information you need may not be reported.
(vi) For each search, print out the results page and hand it in with your problem set. To reduce the amount of output you need to print, make sure that “Alignments” is set to zero under the “Format” options.
(vii) In the reformatted output, you’ll see a color diagram entitled “Distribution of XXX Blast Hits on the Query Sequence.” XXX is the number of matches you obtained.
(Note that the website uses the word “hits” ambiguously. I use “matches” to refer to sequences reported in the final output of the search and “hits” to refer to word pairs.)

Below that, you’ll see a list of “Sequences producing significant alignments”. For each protein matched, you will see a link to the Entrez database record describing this protein, a short one-line description of the protein, the normalized bit score for the match (i.e., the normalized score S′ from problem 5) and the E-value for the match.

At the bottom of the results page, you will see a summary of the BLAST parameters used for this