CMU CS 10810 - Homework - D2765218

Home> Schools> Carnegie Mellon University> Computer Science (CS) > CS 10810> Homework

DOC PREVIEW

CMU CS 10810 - Homework

School name Carnegie Mellon University

Course Cs 10810- Computational Genomics

Pages 4

This preview shows page 1 out of 4 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 4 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 4 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

10-810, Computational Molecular Biology: a machinelearning approach: Problem Set 2This problem set is due on Monday, 03/06 in class.Bipartite graphs and whole genome alignmentGiven genomes from two species, A and B we discussed in class a bipartite graph based approachto whole genome alignment. In this problem we will explore some of the computational issuesinvolved in this algorithm.1. a. Given an adjacency matrix for an undirected bipartite graph suggest an algorithm for deter-mining the set of connected components in that graph. What is the running time of the algorithm?1. b. Our goal is to find a mapping for the set of genes in both genomes. Given a bipartite graph asdiscussed above, we denote by ’graph mapping’ an algorithm that assigns to a some (or all) of thegenes in genome A at most one gene in genome B (note that a genes a ∈ A can be assigned a geneb ∈ B only if the graph contains an edge between a and b). ’maximal mapping’ is a graph mappingthat results in the highest number of genes in A being assigned a gene in B. In other words, weare trying to find as many pairs as we can between the two sides of the graph so that no gene isassigned to more than one pair. Present an algorithm for solving this problem (Hint: it is oftenreferred to as ’matching’). What is the running time of the algorithm you presented?NormalizationIn class,we discussed locally weighted linear regression and mentioned that we can use a Gaussiancentered at x to determine the weight that should be assigned to points (genes) around x. Here wewill explore issues related to this weight.2. a. What is the effect of having a large variance for such a Gaussian? A small variance?2. b. Which variance (large or small) would be appropriate to each of the two figures below?Explain.2. c. In gene expression experiments we measure thousands of genes. However, most of thesegenes are expressed at relatively low levels and only a few are expressed at high levels (high R ∗ Gvalues). How can we accommodate such a dataset with the Gaussian weighting method? Explain.1(a) (b)Bi-ClusteringIn this problem you will develop and implement a bi-clustering algorithm. A Bi-cluster is a clustercontaining a subset of the experiments and a subset of the genes. In this problem we will not allowoverlap between the Bi-clusters, though other methods allow such overlap.We will once again rely on bipartite graphs. By answering the questions belowyou will develop(and implement) a method that uses bipartite graphs for bi-clustering.3. a. Assume you are given a time series expression dataset where rows represent genes andcolumns represent time points. How can these be represented using a bipartite graph ? What doedges in this graph correspond to?Since gene expression data contains non-discrete values we will first discretize the data. Everyvalue above (log ratio) 0.9 will be set to 1 and every value below (log ratio) -0.9 will be set to -1.Values between -0.9 and 0.9 will be set to 0.3. b. Using an unweighted bipartite graph (that is, all edges have the same weight of 1) as youhave described above, how can you represent both activation (1) and repression (-1) (remember,we would like to cluster activated genes in a different cluster than the repressed ones)?Assume the graph has a bounded out degree on the left (that is, no node on the left side has morethan d outgoing edges). Also, assume that we are looking for complete subgraphs, that is a subsetof the nodes on the left (l ⊆ A) and a subset of the nodes on the right (r ⊆ B) where each node inl is connected to all nodes in r and vice versa. 3. c. What is the largest possible size of r?3. d. Present a O(n2d) algorithm for finding the maximal complete subgraph (where maximalmeans that it has the most number of nodes from B).While the algorithm you presented in d. is useful, the running time may be too large. Instead,we will use heuristic search to find large subgraph (which will correspond to a bi-cluster). BelowI suggest a possible (simple) heuristic. If you have a better idea you are more than welcomed toimplement your own heuristic, however you will need to explain what exactly you did and why(there will be 5 points bonus for useful heuristics different from the one discussed below).For each of the nodes in v ∈ B we will first determine the set of nodes connected to it in A. Denotethis set by l. Next we determine the set of nodes in B that are connected to nodes in l, denote this2Figure 1: A bipartite graph. The dashed circle contains a complete subgraph in this graph and is agood candidate for a bi-cluster.3set by r (note that this set includes v). We next compute a score for the l, r subgraph (see below).This is process is repeated for all nodes in B and we select the highest scoring subgraph as our firstbi-cluster, remove all edges in this subgraph and repeat this process to find the second bi-clusterand so on. The only problem left is to determine a score for a subgraph.3. e. How can we use binomial distribution to compute a score for a subgraph? Present the formulaand the meaning of each of the parameters you are using.Download the time series dataset from the course website (alphaCycle.txt). This file can be up-loaded directly to Matlab. Also download and the list of gene names (alphaGenes.txt). Each rowin the time series file corresponds to a gene in the gene name file (that is, the first row is for thefirst gene, the second for the second and so forth). Each column in the time series file representsone time point.3. f. Implement the above algorithm (or your own heuristic). Select the first five bi-clusters. Foreach one hand in a plot of the average expression value for genes in that cluster over all time points,the set of time points selected for this bi-cluster and the top 5 GO categories enriched for genes inthat bi-cluster. For the GO enrichment, go to:http://llama.med.harvard.edu/cgi/func/funcassociatePaste the names on the genes in each of your top five clusters (one at a time) and select S. cerevisiaas the organism. For each cluster copy the top five GO categories and their p-values.To summarize, in addition to your answers to the questions in problem 3, you will need to hand inthe following:1. Create a directory with your program, the input files you used and a README file that explainshow to perform f using your program. Email me ([email protected]) a zipped version of this di-rectory.2. For f plots of the average expression for each of the top 5

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1 out of 4 pages.

CMU CS 10810 - Homework

Sign up for free to view:

Please select your school