MSU ECE 6990 - CS 4990/6990 Bioinformatics

Name: _________________________________

CS 4990/6990 Bioinformatics
Exam 3
Fall 2003

Directions: This is a take-home exam. It is an INDIVIDUAL exam. You must not discuss the contents of this exam with anyone except Dr. Bridges. You are not allowed to receive assistance from any other student in the class or any individual not in the class including discussions via email, chat rooms, etc. You are allowed to use written archived sources such as books, web pages, etc. in answering questions. If you use any outside documentation other than the notes given in class or your textbook, you must list these sources at the end of the exam and cite them in your answer to the question. If errors or ambiguities are discovered in the test by one student, Dr. Bridges will send email to the entire class to correct the problem. Do not be reluctant to ask for help. I will be available all day on Thursday. Call me before you come by (325-7505) to make sure I will be in my office when you come. I will also be happy to try to answer questions by email. I will check email beginning Wed. afternoon at about 4:00.

Due Date: The test is due at 4:30 P.M. on Friday, November 21, 2003 in the CS main office (Butler 300). You must submit a hard copy of the completed exam and you MUST sign the statement at the end of the exam indicating that you did not receive any help that is not allowed. Submit a copy of the cover sheet given at the end of this exam with your examination paper.

You may answer these questions using paper and pencil (or ink if done neatly) or with a word processor. In all cases, explain how you derived your answers, why you used the methods you did, and show your work. It is impossible to give partial credit based on a few numbers.

1. (35 points) Phylogenies

a. Distance based

A simple distance based method for building phylogenetic trees is the UPGMA algorithm (Durbin et al. 1998).
Although it has an ugly name standing for Unweighted Pair Group Method using Arithmetic Averages, it is very straightforward. The algorithm is given below:

__________________________________________________________________
Initialization:
    Assign each sequence i to its own cluster $C_i$.
    Define one leaf of T for each sequence, and place at height 0.
Iteration:
    Determine the two clusters i, j for which $d_{ij}$ is minimal. (If there are several equidistant minimal pairs, pick one randomly.)
    Define a new cluster k by $C_k = C_i \cup C_j$, and define $d_{kl}$ for all l by

        $d_{kl} = \frac{d_{il}|C_i| + d_{jl}|C_j|}{|C_i| + |C_j|}$

    Define a node k with daughter nodes i and j, and place it at height $d_{ij}/2$.
    Add k to the current cluster list and remove i and j.
Termination:
    When only two clusters i, j remain, place the root at height $d_{ij}/2$.
__________________________________________________________________

UPGMA Algorithm Applied to 4 Taxa

Suppose that we have 4 taxa, A, B, C, D, for which we wish to build a phylogenetic tree. The pairwise distances between the taxa are listed below:

Pair    Distance
A B     1.0
A C     2.0
A D     4.0
B C     2.0
B D     5.0
C D     6.0

Execution of UPGMA algorithm:

Initialization:

Set of clusters: C1 = {A}   C2 = {B}   C3 = {C}   C4 = {D}

Note that $|C_i|$ is the number of elements in set $C_i$.

Distances between clusters are the same as the distances of their individual members.
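For readers who want to check a hand computation, here is a minimal Python sketch of the algorithm as stated above. It is an illustration under assumptions, not a required method: the function name `upgma`, the nested-tuple tree representation, and the `merges` list are all choices made for this sketch, and ties are broken by iteration order rather than randomly.

```python
from itertools import combinations


def upgma(dist):
    """Build a UPGMA tree from pairwise leaf distances.

    `dist` maps a pair of leaf names, e.g. ("A", "B"), to their distance.
    Returns (tree, merges): `tree` is a nested tuple of leaf names and
    `merges` lists (cluster_i, cluster_j, height) in the order joined.
    """
    # Distances are symmetric, so key them by an unordered pair.
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    # size[c] is |C| for every current cluster; each leaf starts at size 1.
    size = {leaf: 1 for pair in dist for leaf in pair}
    merges = []
    while len(size) > 1:
        # Pick the closest pair (assumption: first minimal pair wins a tie).
        i, j = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        k = (i, j)  # the new cluster is the union of i and j
        merges.append((i, j, d[frozenset((i, j))] / 2))
        # Size-weighted average distance from k to every other cluster.
        for other in size:
            if other not in (i, j):
                d[frozenset((k, other))] = (
                    d[frozenset((i, other))] * size[i]
                    + d[frozenset((j, other))] * size[j]
                ) / (size[i] + size[j])
        # Add k to the cluster list and remove i and j.
        size[k] = size.pop(i) + size.pop(j)
    return next(iter(size)), merges


# The 4-taxon distances from the table above.
distances = {
    ("A", "B"): 1.0, ("A", "C"): 2.0, ("A", "D"): 4.0,
    ("B", "C"): 2.0, ("B", "D"): 5.0, ("C", "D"): 6.0,
}
tree, merges = upgma(distances)
for left, right, height in merges:
    print(left, right, height)
```

On the 4-taxon distances this joins A with B at height 0.5, then C with that cluster at height 1.0, and finally D at the root at height 2.5. The loop runs until one cluster is left, so the final pass performs the termination step (joining the last two clusters at half their distance).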
Note that $d_{ij} = d_{ji}$ in all cases.

Current Clusters (C1, C2, C3, C4)

Pair     Distance
C1 C2    1.0
C1 C3    2.0
C1 C4    4.0
C2 C3    2.0
C2 C4    5.0
C3 C4    6.0

Iteration 1: The two clusters at a minimum distance are C1 and C2.

Define new cluster C5 = C1 ∪ C2

Compute $d_{53}$ and $d_{54}$:

$d_{53} = \frac{d_{13}|C_1| + d_{23}|C_2|}{|C_1| + |C_2|} = \frac{(2.0)(1) + (2.0)(1)}{(1) + (1)} = 2.0$

$d_{54} = \frac{d_{14}|C_1| + d_{24}|C_2|}{|C_1| + |C_2|} = \frac{(4.0)(1) + (5.0)(1)}{(1) + (1)} = 4.5$

[Figure: Initial Tree (leaves C1, C2, C3, C4) and New Tree (C5 joins C1 and C2)]

Current Clusters (C3, C4, C5)

Pair     Distance
C3 C4    6.0
C5 C3    2.0
C5 C4    4.5

Iteration 2: The two clusters at a minimum distance are C5 and C3.

Define new cluster C6 = C5 ∪ C3

Compute $d_{64}$:

$d_{64} = \frac{d_{34}|C_3| + d_{54}|C_5|}{|C_3| + |C_5|} = \frac{(6.0)(1) + (4.5)(2)}{(1) + (2)} = 5.0$

Current Clusters (C4, C6)

Termination: The algorithm terminates when only two clusters are left.

[Figure: New Tree (leaves C1, C2, C3, C4; internal nodes C5, C6) and Final Tree (leaves C1, C2, C3, C4; internal nodes C5, C6)]

Your assignment is to apply the UPGMA algorithm to the 5 taxa that Dr. Fishbein used in his presentation. Use the distances he gives on slide 46 for 5 taxa: bird, alligator, lizard, tortoise, snake. Show me how you derive the tree using the algorithm in a manner similar to that I have used above in the example.

b. Character-based (see pages 248 to 250 in your text)

Maximum parsimony is a character-based method for creating phylogenetic trees. The first step in the use of this method is a multiple alignment of the sequences that are being used to represent each taxon in the tree. Suppose we have the following alignment of sequences from 4 taxa:

Taxa    Positions
        1 2 3 4 5 6 7 8 9
1       A A G A G T G C A
2       G A C G T T C C A
3       T A C A T T G C A
4       C A G G G T G T T

Which of the sites are informative?
Draw and compute the length of the three possible trees for each informative position.
Determine the total cost for each tree from the costs for all positions.
Which is the lowest cost (maximum parsimony) tree?

2. (35 points) Markov chains

Genes from eukaryotes often have areas upstream from the gene known as CpG islands.
These regions are important in gene regulation. CpG islands are just regions in the sequence where there is a larger than usual number of repeats of the dinucleotide pair CG (don't worry about what the p means). A sequence that contains a CpG island might look like the sequence below:

……ATCCTCGCGACGCGCGCGTTCGCGCGCGTTGCA……

Markov chains can be used to distinguish a sequence that comes from a CpG island from a sequence that does not come from a CpG island.

a. Suppose that you are going to build a 0th order model of a CpG island using the sequence in red above. (Note that this is NOT sufficient data to build a reliable model; it is an artificial example that we are using for illustration.) Show the 0th order model that would be built based on the subsequence in red.

b. Using the model that you developed above, determine the probability that each of the following subsequences came from a CpG island. Make sure you show your work. I am
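As a rough illustration of what a 0th order model is, the Python sketch below estimates one probability per nucleotide from a training string and scores a sequence as the product of those probabilities, treating every position as independent. The function names and the training string `ACGCGCGT` are assumptions made up for this sketch; the training string is not the subsequence marked in red.

```python
from collections import Counter
from math import prod


def zeroth_order_model(training_seq):
    """Estimate P(base) as the relative frequency of each base."""
    counts = Counter(training_seq)
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


def sequence_probability(model, seq):
    """P(seq | model): positions are independent in a 0th order model."""
    return prod(model[base] for base in seq)


# Hypothetical training data, chosen only to keep the arithmetic simple.
model = zeroth_order_model("ACGCGCGT")
print(model)
p = sequence_probability(model, "CG")
print(p)
```

With this made-up training string the counts are A = 1, C = 3, G = 3, T = 1 out of 8, so P(C) = P(G) = 0.375 and the dinucleotide CG scores 0.375 × 0.375 = 0.140625. The value is the probability of the sequence under the model, which is the quantity a comparison between a CpG-island model and a background model would start from.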

