Unformatted text preview:

BIOINFORMATICSVol. 19 no. 14 2003, pages 1787–1799DOI: 10.1093/bioinformatics/btg232CLICK and EXPANDER: a system for clusteringand visualizing gene expression dataRoded Sharan1,∗, Adi Maron-Katz2and Ron Shamir21International Computer Science Institute, 1947 Center St., Suite 600, Berkeley,CA 94704-1198, USA and2School of Computer Science, Tel-Aviv University,Tel-Aviv 69978, IsraelReceived on October 30, 2002; revised on January 28, 2003; accepted on March 28, 2003ABSTRACTMotivation: Microarrays have become a central tool in bio-logical research. Their applications range from functionalannotation to tissue classification and genetic network infer-ence. A key step in the analysis of gene expression data is theidentification of groups of genes that manifest similar expres-sion patterns. This translates to the algorithmic problem ofclustering genes based on their expression patterns.Results: We present a novel clustering algorithm, calledCLICK, and its applications to gene expression analysis. Thealgorithm utilizes graph-theoretic and statistical techniques toidentify tight groups (kernels) of highly similar elements, whichare likely to belong to the same true cluster. Several heur-istic procedures are then used to expand the kernels into thefull clusters. We report on the application of CLICK to a vari-ety of gene expression data sets. In all those applications itoutperformed extant algorithms according to several commonfigures of merit. We also point out that CLICK can be success-fully used for the identification of common regulatory motifs inthe upstream regions of co-regulated genes. Furthermore, wedemonstrate how CLICK can be used to accurately classifytissue samples into disease types, based on their expressionprofiles. Finally, we present a new java-based graphical tool,called EXPANDER, for gene expression analysis and visuali-zation, which incorporates CLICK and several other popularclustering algorithms.Availability: http://www.cs.tau.ac.il/~rshamir/expander/expander.htmlContact: [email protected] INTRODUCTIONMicroarray technologyhas become a central tool in biologicaland biomedical research. This technology provides a global,simultaneous view on the transcription levels of many or allgenes of an organism under a range of conditions or processes.The information obtained by monitoring gene expressionlevels in different developmental stages, tissue types, clinical∗To whom correspondence should be addressed.conditions and different organisms can help in understandinggene function and gene networks, assist in the diagnosticof disease conditions and reveal the effects of medicaltreatments.A key step in the analysis of gene expression data is theidentification of groups of genes that manifest similar expres-sion patterns. This translates to the algorithmic problem ofclustering gene expression data. A clustering problem usu-ally consists of elements and a characteristic vector for eachelement. A measure of similarity is defined between pairsof such vectors. (In gene expression, elements are usu-ally genes, the vector of each gene contains its expressionlevels under each of the monitored conditions, and sim-ilarity can be measured, for example, by the correlationcoefficient between vectors.) The goal is to partition the ele-ments into subsets, which are called clusters, so that twocriteria are satisfied: Homogeneity—elements in the samecluster are highly similar to each other; and separation—elements from different clusters have low similarity toeach other.There is a very rich literature on cluster analysis going backover three decades [cf. (Hartigan, 1975; Everitt, 1993; Mirkin,1996; Hansen and Jaumard, 1997)]. Several algorithmic tech-niques were previously used in clustering gene expressiondata, including hierarchical clustering (Eisen et al., 1998),self-organizingmaps(Tamayo etal.,1999), K-means (Herwiget al., 1999), simulated annealing (Alon et al., 1999), andgraph theoretic approaches: HCS (Hartuv and Shamir, 2000)and CAST (Ben-Dor et al., 1999).We have developed a novel clustering algorithm that wecallCLICK (CLuster IdentificationviaConnectivityKernels).The algorithm does not make any prior assumptions on thenumber of clusters or their structure. At the heart of thealgorithm is a process of recursively partitioning a weightedgraphintocomponentsusingminimum cut computations. Theedge weights and the stopping criterion of the recursion areassigned probabilistic meaning, which gives the algorithmhigh accuracy. The speed of the algorithm is achieved bya variety of experimentally tested heuristic procedures thatshortcut, prepend and append the main process.Bioinformatics 19(14) © Oxford University Press 2003; all rights reserved. 1787R.Sharan et al.CLICK was implemented and tested on a variety of biolo-gical data sets. On three large-scale gene expression data setsthe algorithm outperformed previously published results, thatutilized hierarchical clustering and self organizing maps. Wealso show the utility of CLICK in more advanced biologicalanalyses: the identification of common regulatory motifs inthe promoters of co-regulated genes, and the classification ofsamples into disease types based on their expression profiles.In the latter problem CLICK achieved success ratios of over90% on two real data sets.We present a new java-based graphical tool, calledEXPANDER (EXPression ANalyzer and DisplayER), forgeneexpressionanalysisandvisualization. This softwarecon-tains several clustering methods including CLICK, K-Means,hierarchical clustering and self-organizing maps, all con-trolled via a graphical user interface. It enables visualizing theraw expression data and the clustered data in several ways, aswell as single-cluster and all-clusters evaluations via fitnessscores and functional enrichment tests.A preliminary version of this manuscript, containing anearly version of CLICK and some initial tests, has appearedin Sharan and Shamir (2000).2 PRELIMINARIESLet N ={e1,..., en} be a set of n elements, and let C =(C1,..., Cl) be a partition of N into subsets. Each subsetis called a cluster, and C is called a clustering solution,orsimply a clustering. Two elements eiand ejare called mateswith respect to C if they are members of the same cluster in C.In the gene expression context, the elements are the genes andwe often assume that there exists some correct partition of thegenes into ‘true’ clusters. When C is the true clustering of N ,elements that belong to the same


View Full Document

Princeton COS 557 - CLICK and EXPANDER

Documents in this Course
Load more
Download CLICK and EXPANDER
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view CLICK and EXPANDER and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view CLICK and EXPANDER 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?