03-511/711 Computational Genomics and Molecular Biology, Fall 2011

Problem Set 3b
Due December 3rd

Collaboration is allowed on this homework. You must hand in homeworks individually and list the names of the people you worked with. Turn in your handwritten answers on the attached sheets.

Monellin is one of several intensely sweet proteins that have been discovered in nature. It is made up of two short peptide chains (A and B) and is found in the fruit of Dioscoreophyllum cumminsii, a West African plant also known as the serendipity berry. Monellin appears to be distantly related to cystatins, a family of cysteine protease inhibitors.

Monellin is a challenging query (1) because it is short and (2) because the sequence divergence between monellin and the cystatins is substantial. You will conduct four searches with this query using different parameter values.

These are the basic steps for all four searches:

1. Go to the BLASTP web site. The BLAST home page is linked from the course syllabus site. Follow the links to find protein-protein BLAST.

2. Enter P02882.2, the accession ID for Monellin Chain B, in the search box.

3. For all searches, set the following parameters:
   • Under “Choose search set”, select “Non-redundant protein sequences (nr)” (the default).
   • Under “Algorithm parameters”, set “Expect threshold” to 200 to make sure you don’t miss any matches.
   • Uncheck “Automatically adjust parameters for short input sequences”.
   • Set “Max Target Sequences” to 500 to make sure you don’t miss any matches.
   • Set “Compositional adjustments” to “No adjustment”.
   • Uncheck “Filter for low complexity regions”.
   • Check “Show results in a new window” so that you can use the same query page for all four searches.
   • Use the default for all other parameters, except as specified below.

4. Run each of the four searches specified below.

5. Once each search is completed, click on “formatting options” at the top of the results window. On the first line, change “HTML” to “Plain text” in the second pull-down menu and check “Use old BLAST report format”. Set “Alignments” to 0. Click “Reformat”. If you do not set these formatting options correctly, you will get incorrect information, or some of the information you need may not be reported.

6. For each search, print out the results page and hand it in with your problem set. To reduce the amount of output you need to print, make sure that “Alignments” is set to zero under the “Format” options.

7. In the reformatted output, you’ll see a list of “Sequences producing significant alignments”. For each sequence matched, you will see the database ID assigned to this protein, a short one-line description of the protein, the normalized bit score for the match, and the E value for the match.

   At the bottom of the results page, you will see a summary of the BLAST parameters used for this search (beginning with “Database: All non-redundant ...”). You will compare this information for the four searches.

Search 1  Use the default for all parameters not specified above.

Search 2  Under “Algorithm parameters”, change the matrix to BLOSUM80. Otherwise, use the same parameters as in Search 1.

Search 3  Under “Algorithm parameters”, set the matrix to PAM30. Otherwise, use the same parameters as in Search 1.

Search 4  Under “Choose search set”, enter Plants in the “Organism” box. Reset the substitution matrix to BLOSUM62. Otherwise, use the same parameters as in Search 1.

1. For each search, make a table containing the following values:
   • The matrix used.
   • The length of the database. (Careful, this is not the same as the effective length of the database.)
   • The length of the query. (Again, not the effective length.)
   • The bit score and the E value for Monellin Chain B (P02882.2); i.e., for the query matching with itself.
   • The bit score and the E value for the match to sequence identifier Q10Q47.1, which is a cystatin. Search the results for this identifier to find the match.

2. Information content:

   (a) For each search, calculate the minimum number of bits needed to distinguish a significant alignment from chance.

   (b) For each search, estimate the minimum query length needed to achieve the number of bits you calculated in (a).

   (c) For Searches 2, 3 and 4, is the minimum number of bits required different from the minimum number of bits required for Search 1? In each case, explain why (or why not).

   (d) For which searches, if any, is the query sequence long enough to find significant matches, according to the theory? What characteristic of these searches is responsible for this? Explain your reasoning.

3. Factors that influence bit score and E value

   (a) Compare the bit score of sequence Q10Q47.1 in Searches 2, 3 and 4 with the bit score of Q10Q47.1 in Search 1. Did it increase, decrease or remain unchanged? In each case, explain what you observe in terms of the parameters of the search and what you know about the properties of the bit score.

   (b) Compare the E value of sequence Q10Q47.1 in Searches 2, 3 and 4 with the E value of Q10Q47.1 in Search 1. Did it increase, decrease or remain unchanged? What is the relationship between changes (or lack thereof) in bit score and E value? In each case, explain what you observe in terms of the parameters of the search and what you know about the properties of bit scores and E values.

   (c) Compare the bit score of sequence P02882.2 in Searches 2, 3 and 4 with the bit score of P02882.2 in Search 1. Did it increase, decrease or remain unchanged? In each case, explain what you observe in terms of the parameters of the search and what you know about the properties of the bit score.

   (d) Compare the E value of sequence P02882.2 in Searches 2, 3 and 4 with the E value of P02882.2 in Search 1. Did it increase, decrease or remain unchanged? What is the relationship between changes (or lack thereof) in bit score and E value? In each case, explain what you observe in terms of the parameters of the search and what you know about the properties of bit scores and E values.

   (e) How many matches rank higher (are more significant) than Q10Q47.1 in Search 2? Do you think these higher ranking matches are all true positives? Why or why not?

   (f) How many matches rank higher (are more