Unformatted text preview:

Examples of Clustering Biological Examples of Clustering Biological DataData6.892 / 7.93 Spring Term 20026.892 / 7.93 Spring Term 2002March 12, 2002March 12, 2002David GiffordDavid GiffordStudent ProjectsStudent Projects••Ideally groups of four Ideally groups of four ––Two biologistsTwo biologists––Two computer scientists or quantitative expertsTwo computer scientists or quantitative experts••Pick a research project related to the classPick a research project related to the class••You will have access to data, You will have access to data, MatlabMatlab, etc., etc.••Work productWork product––Preliminary outline of plans (5 minutes in class)Preliminary outline of plans (5 minutes in class)––Final presentation (30 minutes)Final presentation (30 minutes)––Talk and slides are gradedTalk and slides are gradedOrdering effectOrdering effectPermutedHierarchical clusteringInitial structureThe problemThe problem“There are 2n-1linear orderings consistent with the structure of the tree. …An optimal linear ordering, one that maximizes the similarity of adjacent elements in the ordering, is impractical to compute.” [Eisen et al, PNAS 98]Problem definitionProblem definitionDenote by Φ the space of the possible linear orderings consistent with the tree.Denote by v1…vnthe tree leaves.Our goal is to find an ordering that maximizes the similarity of adjacent elements:where S is the similarity matrix),(111maxvviniiSφφφ+−=Φ∈∑Computing the optimal similarityComputing the optimal similarityRecursively compute the optimal similarity OT(u,w) for any pair of leaves (u,w) which could be on different corners(leftmost and rightmost) of T. For a leaf u∈T, CT(u) is the set of all possible corner leaves of T when u is on one corner of T.TT2T1wukmOT(u,w) = maxm∈CT1(u),k∈CT2(w)OT1(u,m) + OT2(k,w) + S(m,k)ImprovementImprovementworst time is still O(n4) but …24800 (7 hours)4000 (66 min)200220ordering time4201021020with improvement3210001151500350013000100clustering timenum of leavesRunning time Running time ––biological datasetsbiological datasets3684800800979num of genesEnvironment response* (Yang)Cell cycle – cdc15Cell cycle (Spellman)Different sources (Eisen)type of dataset257(4 min)1802750optimalordering910(15 min)45152415592679clustering timenum of experiments*Results obtained on 700 Mhz Pentium pc with 512M memory running LinuxDoes it help ?Does it help ?Recall the statement we started with -“An optimal linear ordering, one that maximizes the similarity of adjacent elements in the ordering,is impractical to compute.” ?Input Optimal orderingHierarchical clusteringInputHierarchical clustering Optimal orderingResults Results ––hand generated datahand generated dataBiological resultsBiological results• Spellman identified 800 genes as cell cycle regulated in Saccharomyces cerevisiae.• Genes were assigned to five groups termed G1,S,S/G2,G2/M and M/G1which approximate the commonly used cell cycle groups in the literature.• This assignment was performed using a ‘phasing’ method which is a supervised classification algorithm.• In addition to the phasing method, the authors clustered these genes using hierarchical clustering, noting:“There is no simple relationship between these two [phasing and clustering] methods, although there are common features in the results.”Cell Cycle – 24 experiments of cdc15 temperature sensitive mutantHierarchical clusteringOptimal ordering24 experiments of cdc15 temperature sensitive mutant59 experiments, combining cdc15, cdc28 and α factor arrestClustering of the 79 experiments in Eisen’s paper. The numbers to the right of each gene represents the complex to which it belongs according to the MIPS database. Hierarchical clusteringOptimal orderingUsing optimal ordering to identify the different clusters. 24 experiments of cdc15 mutant from Spellman’s paper.0101Clustering DemosClustering Demos••KK--meansmeans––Demo 1: 3 clusters in data, k = 3Demo 1: 3 clusters in data, k = 3––Demo 2: 1 cluster in data, k = 2Demo 2: 1 cluster in data, k = 2••Mixture ModelsMixture Models––Demo 3: 1 cluster in data, k = 2 (same data as Demo 2)Demo 3: 1 cluster in data, k = 2 (same data as Demo 2)––Demo 4: 3 clusters in data, k = 2Demo 4: 3 clusters in data, k = 2––Demo 5: 3 clusters in data, k = 3Demo 5: 3 clusters in data, k = 3••I’ll put the code on the web siteI’ll put the code on the web siteClustering SummaryClustering Summary••Clustering allows you to organize data and see Clustering allows you to organize data and see patternspatterns••Can reduce the dimensionality of data (e.g. PCA)Can reduce the dimensionality of data (e.g. PCA)••Ultimately, we would like to use clusters to explain Ultimately, we would like to use clusters to explain biological phenomenonbiological phenomenon••The first step is classificationThe first step is


View Full Document
Download Examples of Clustering Biological Data
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Examples of Clustering Biological Data and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Examples of Clustering Biological Data 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?