Princeton COS 557 - CLICK and EXPANDER (13 pages)


CLICK and EXPANDER





Pages: 13
School: Princeton University
Course: COS 557 - Analysis & Vis of Lg-Scale Genomic Data

BIOINFORMATICS, Vol. 19 no. 14 2003, pages 1787-1799
DOI: 10.1093/bioinformatics/btg232

CLICK and EXPANDER: a system for clustering and visualizing gene expression data

Roded Sharan (1,*), Adi Maron-Katz (2) and Ron Shamir (2)

(1) International Computer Science Institute, 1947 Center St., Suite 600, Berkeley, CA 94704-1198, USA and (2) School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel

Received on October 30, 2002; revised on January 28, 2003; accepted on March 28, 2003

ABSTRACT

Motivation: Microarrays have become a central tool in biological research. Their applications range from functional annotation to tissue classification and genetic network inference. A key step in the analysis of gene expression data is the identification of groups of genes that manifest similar expression patterns. This translates to the algorithmic problem of clustering genes based on their expression patterns.

Results: We present a novel clustering algorithm, called CLICK, and its applications to gene expression analysis. The algorithm utilizes graph-theoretic and statistical techniques to identify tight groups (kernels) of highly similar elements, which are likely to belong to the same true cluster. Several heuristic procedures are then used to expand the kernels into the full clusters. We report on the application of CLICK to a variety of gene expression data sets. In all those applications it outperformed extant algorithms according to several common figures of merit. We also point out that CLICK can be successfully used for the identification of common regulatory motifs in the upstream regions of co-regulated genes. Furthermore, we demonstrate how CLICK can be used to accurately classify tissue samples into disease types, based on their expression profiles. Finally, we present a new java-based graphical tool, called EXPANDER, for gene expression analysis and visualization, which incorporates CLICK and several other popular clustering algorithms.

Availability: http://www.cs.tau.ac.il/~rshamir/expander/expander.html

Contact:
roded@icsi.berkeley.edu

* To whom correspondence should be addressed.

Bioinformatics 19(14) © Oxford University Press 2003; all rights reserved.

1 INTRODUCTION

Microarray technology has become a central tool in biological and biomedical research. This technology provides a global, simultaneous view on the transcription levels of many or all genes of an organism under a range of conditions or processes. The information obtained by monitoring gene expression levels in different developmental stages, tissue types, clinical conditions and different organisms can help in understanding gene function and gene networks, assist in the diagnostic of disease conditions and reveal the effects of medical treatments.

A key step in the analysis of gene expression data is the identification of groups of genes that manifest similar expression patterns. This translates to the algorithmic problem of clustering gene expression data. A clustering problem usually consists of elements and a characteristic vector for each element. A measure of similarity is defined between pairs of such vectors. (In gene expression, elements are usually genes, the vector of each gene contains its expression levels under each of the monitored conditions, and similarity can be measured, for example, by the correlation coefficient between vectors.) The goal is to partition the elements into subsets, which are called clusters, so that two criteria are satisfied: Homogeneity: elements in the same cluster are highly similar to each other; and separation: elements from different clusters have low similarity to each other.

There is a very rich literature on cluster analysis going back over three decades (cf. Hartigan, 1975; Everitt, 1993; Mirkin, 1996; Hansen and Jaumard, 1997). Several algorithmic techniques were previously used in clustering gene expression data, including hierarchical clustering (Eisen et al., 1998), self-organizing maps (Tamayo et al., 1999), K-means (Herwig et al., 1999), simulated annealing (Alon et al., 1999), and graph-theoretic approaches: HCS (Hartuv
and Shamir, 2000) and CAST (Ben-Dor et al., 1999).

We have developed a novel clustering algorithm that we call CLICK (CLuster Identification via Connectivity Kernels). The algorithm does not make any prior assumptions on the number of clusters or their structure. At the heart of the algorithm is a process of recursively partitioning a weighted graph into components using minimum cut computations. The edge weights and the stopping criterion of the recursion are assigned probabilistic meaning, which gives the algorithm high accuracy. The speed of the algorithm is achieved by a variety of experimentally tested heuristic procedures that shortcut, prepend and append the main process.

CLICK was implemented and tested on a variety of biological data sets. On three large-scale gene expression data sets the algorithm outperformed previously published results that utilized hierarchical clustering and self-organizing maps. We also show the utility of CLICK in more advanced biological analyses: the identification of common regulatory motifs in the promoters of co-regulated genes, and the classification of samples into disease types based on their expression profiles. In the latter problem CLICK achieved success ratios of over 90% on two real data sets.

We present a new java-based graphical tool, called EXPANDER (EXPression ANalyzer and DisplayER), for gene expression analysis and visualization. This software contains several clustering methods, including CLICK, K-Means, hierarchical clustering and self-organizing maps, all controlled via a graphical user interface. It enables visualizing the raw expression data and the clustered data in several ways, as well as single-cluster and all-clusters evaluations via fitness scores and functional enrichment tests.

A preliminary version of this manuscript, containing an early version of CLICK and some initial tests, has appeared in Sharan and Shamir (2000).

2 PRELIMINARIES

Let $N = \{e_1, \ldots, e_n\}$ be a set of $n$ elements, and let $C = \{C_1, \ldots, C_l\}$ be a partition of $N$ into subsets.
Each subset is called a cluster, and $C$ is called a clustering solution, or simply a clustering. Two elements $e_i$ and $e_j$ are called mates with respect to $C$ if they are members of the same cluster in $C$. In the gene expression context, the elements are the genes, and we often assume that there exists some correct partition of the genes into "true" clusters. When $C$ is the true clustering of $N$, elements that belong to the same true cluster are simply called mates. The input data for a clustering problem is typically
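
The definitions of similarity, mates, homogeneity and separation given in the text can be illustrated with a small Python sketch. It is not taken from the paper. The gene names, the expression values, and the choice to score homogeneity and separation as average pairwise Pearson correlation over mates and non-mates are assumptions made for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def are_mates(clustering, i, j):
    """True if elements i and j are members of the same cluster."""
    return any(i in cluster and j in cluster for cluster in clustering)


def homogeneity_and_separation(vectors, clustering):
    """Average similarity over mate pairs and over non-mate pairs.

    This pairwise averaging is an illustrative choice, not the paper's
    definition. It assumes at least one mate pair and one non-mate pair.
    """
    mate_sims, non_mate_sims = [], []
    for i, j in combinations(sorted(vectors), 2):
        s = pearson(vectors[i], vectors[j])
        if are_mates(clustering, i, j):
            mate_sims.append(s)
        else:
            non_mate_sims.append(s)
    return (sum(mate_sims) / len(mate_sims),
            sum(non_mate_sims) / len(non_mate_sims))


# Hypothetical expression levels of four genes under four conditions.
vectors = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.5, 2.0],
}
clustering = [{"g1", "g2"}, {"g3", "g4"}]

homogeneity, separation = homogeneity_and_separation(vectors, clustering)
print(f"homogeneity={homogeneity:.3f} separation={separation:.3f}")
```

A good clustering under these scores has high average mate similarity and low average non-mate similarity, which is the case for the toy data above.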

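The introduction describes the core of CLICK as recursively partitioning a weighted graph using minimum cut computations. A rough Python sketch of that general idea follows. It is not the authors' implementation. The brute-force minimum cut, the fixed `threshold` stopping rule and the toy edge weights are all assumptions for illustration; the excerpt does not give the paper's probabilistic edge weights or stopping criterion, and a practical implementation would use a polynomial-time minimum cut algorithm.

```python
from itertools import combinations


def cut_weight(weights, side_a, side_b):
    """Total weight of edges crossing between side_a and side_b."""
    return sum(weights.get(frozenset((u, v)), 0.0)
               for u in side_a for v in side_b)


def brute_force_min_cut(nodes, weights):
    """Minimum-weight bipartition by exhaustive search (tiny graphs only).

    Exponential in the number of nodes; a stand-in for a real min-cut
    algorithm. Requires at least two nodes.
    """
    nodes = sorted(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    # Side A always contains `first`; side B is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            side_a = {first, *extra}
            side_b = set(nodes) - side_a
            w = cut_weight(weights, side_a, side_b)
            if best is None or w < best[0]:
                best = (w, side_a, side_b)
    return best


def recursive_partition(nodes, weights, threshold):
    """Split along minimum cuts until every component is tightly connected.

    `threshold` is a hypothetical stopping rule: a component is kept whole
    when its minimum cut weight exceeds it.
    """
    if len(nodes) == 1:
        return [set(nodes)]
    w, side_a, side_b = brute_force_min_cut(nodes, weights)
    if w > threshold:
        return [set(nodes)]
    return (recursive_partition(side_a, weights, threshold)
            + recursive_partition(side_b, weights, threshold))


# Toy graph: two tight triangles joined by one weak edge (c-d).
weights = {
    frozenset(("a", "b")): 5.0, frozenset(("a", "c")): 5.0,
    frozenset(("b", "c")): 5.0, frozenset(("d", "e")): 5.0,
    frozenset(("d", "f")): 5.0, frozenset(("e", "f")): 5.0,
    frozenset(("c", "d")): 1.0,
}
parts = recursive_partition({"a", "b", "c", "d", "e", "f"}, weights, 2.0)
print(parts)
```

On this toy graph the weak edge is the minimum cut, so the recursion separates the two triangles and then stops, because cutting either triangle further would cost more than the threshold.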
