UMD CMSC 838T - Gene Clustering Using Self-Organizing Maps and Particle Swarm Optimization - D1778035

School name University of Maryland, College Park

Course Cmsc 838t- Advanced Topics in Programming Languages

Pages 10

Gene Clustering Using Self-Organizing Maps and Particle Swarm Optimization

Xiang Xiao, Electrical and Computer Eng., Indiana University Purdue University Indianapolis, Indianapolis, IN 46202
Ernst R. Dow, Eli Lilly and Company, Eli Lilly Corporate Center, Indianapolis, IN 46285
Russell Eberhart, Electrical and Computer Eng., Indiana University Purdue University Indianapolis, Indianapolis, IN 46202
Zina Ben Miled, Electrical and Computer Eng., Indiana University Purdue University Indianapolis, Indianapolis, IN 46202
Robert J. Oppelt, Eli Lilly and Company, Eli Lilly Corporate Center, Indianapolis, IN 46285

Abstract

Gene clustering, the process of grouping related genes in the same cluster, is at the foundation of genomic studies that aim to analyze the function of genes. Microarray technologies have made it possible to measure gene expression levels for thousands of genes simultaneously. For knowledge to be extracted from the datasets generated by these technologies, the datasets have to be presented to a scientist in a meaningful way. Gene clustering methods serve this purpose. In this paper, a hybrid clustering approach based on Self-Organizing Maps and Particle Swarm Optimization is proposed. In the proposed algorithm, the rate of convergence is improved by adding a conscience factor to the Self-Organizing Maps algorithm. The robustness of the result is measured by using a resampling technique. The algorithm is implemented on a cluster of workstations.

1. Introduction

Gene clustering methods are essential in the analysis of gene expression data collected over time and under different experimental conditions. Microarray expression data for thousands of genes can now be collected efficiently and at a relatively low cost. Clustering was used, for example, to understand the functional differences in cultured primary hepatocytes relative to the intact liver [1].
In another study, clustering techniques were used on gene expression data for tumor and normal colon tissue probed by oligonucleotide arrays [2]. This latter study was based on the expression data of a collection of 40 tumor and 22 normal colon tissue samples. A two-way clustering algorithm was applied to the data and resulted in a separation of cancerous from non-cancerous tissue and a strong indication that coregulated genes cluster together. Gene clustering techniques were also used to analyze temporal gene expression data during rat central nervous system development [3]. This study was based on the temporal mRNA expression of 112 genes, and it established that most of the expression profiles fall into five basic expression patterns.

Clustering aggregates similar input patterns into distinct, mutually exclusive subsets referred to as clusters. As stated by Anderberg [4], “the objective is to group the data units or the variables into clusters such that elements within a cluster have a high degree of ‘natural association’ among themselves while the clusters are ‘relatively distinct’ from one another.” Clustering is a two-phase process. In the first phase, the number of clusters in the data is determined or assumed. The second phase assigns each data point (pattern) to a single cluster.

Several clustering algorithms, such as hierarchical clustering [3,5], Principal Component Analysis based (PCA-based) methods [6,7], genetic algorithms [8], and artificial neural networks [1,9,10,11], have been used to cluster gene expression data. Particle Swarm Optimization [12] is another computational intelligence method that, to the best of our knowledge, has not been used to cluster gene expression data.

In this paper, Particle Swarm Optimization is used with Self-Organizing Maps to cluster genes. In Section 2, Self-Organizing Maps and Particle Swarm Optimization are reviewed and the proposed hybrid clustering approach that uses both of these algorithms is discussed.
The rat [1] and yeast [13] benchmark data sets that are used in this paper are described in Section 3. Section 4 includes the clustering result for the proposed hybrid algorithm and a comparison of this result to that of clustering using either Self-Organizing Maps or Particle Swarm Optimization alone. Related work is the subject of Section 5. Section 6 summarizes the major findings of this study.

2. Hybrid SOM and PSO Algorithm

In this section, the advantages and disadvantages of the Self-Organizing Maps (SOM) and Particle Swarm Optimization (PSO) clustering algorithms are investigated. Several aspects of these algorithms are discussed, including issues dealing with large gene expression data sets and execution time. In addition, two techniques that complement clustering algorithms, namely conscience and resampling, are discussed. Finally, the proposed hybrid SOM/PSO algorithm is introduced.

2.1 Self-Organizing Maps

Self-Organizing Maps (SOM) were introduced in [14]. The foundation of this algorithm comes from the orderly mapping of information in the cerebral cortex. With SOM, high-dimensional datasets are projected onto a one- or two-dimensional space. Typically, a SOM has a two-dimensional lattice of neurons and each neuron represents a cluster. The adaptation (learning) process of SOM is unsupervised. All neurons compete for each input pattern; the neuron that is chosen for the input pattern wins it. Only the winning neuron is activated (winner-takes-all). The winning neuron updates itself and its neighbor neurons to approximate the distribution of the patterns in the input dataset. After the adaptation process is complete, similar clusters will be close to each other (i.e., topological ordering of clusters). Topological ordering helps in detecting both distinct and similar clusters quickly. The SOM algorithm is very efficient in handling large datasets. It is also robust even when the data set is noisy [15].
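
The adaptation process described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper. The Gaussian neighborhood function, the linear decay of the learning rate and radius, the initialization from randomly chosen input patterns, the toy data, and all names (train_som, best_matching_unit, assign_clusters, lr0, radius0) are assumptions made for illustration only, and the conscience factor is omitted.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, pattern):
    """Index of the winning neuron: the one whose weights are closest to the pattern."""
    return min(range(len(weights)),
               key=lambda i: squared_distance(weights[i], pattern))


def train_som(patterns, rows, cols, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Train a rows x cols SOM and return one weight vector per neuron (row-major)."""
    rng = random.Random(seed)
    if radius0 is None:
        radius0 = max(rows, cols) / 2.0
    # Assumption: initialise every neuron at a randomly chosen input pattern.
    weights = [list(rng.choice(patterns)) for _ in range(rows * cols)]
    # Lattice coordinates of each neuron, used to decide who counts as a neighbor.
    grid = [(r, c) for r in range(rows) for c in range(cols)]

    for epoch in range(epochs):
        # Assumption: learning rate and radius decay linearly over the epochs.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        order = list(patterns)
        rng.shuffle(order)
        for pattern in order:
            winner = best_matching_unit(weights, pattern)
            wr, wc = grid[winner]
            for i, (r, c) in enumerate(grid):
                lattice_d2 = (r - wr) ** 2 + (c - wc) ** 2
                # Gaussian neighborhood: 1 at the winner, smaller for distant neurons.
                h = math.exp(-lattice_d2 / (2.0 * radius ** 2))
                weights[i] = [w + lr * h * (x - w)
                              for w, x in zip(weights[i], pattern)]
    return weights


def assign_clusters(weights, patterns):
    """Assign each pattern to the cluster (neuron) that wins it."""
    return [best_matching_unit(weights, p) for p in patterns]


# Toy data (hypothetical): two well-separated groups of 2-D patterns.
patterns = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
weights = train_som(patterns, rows=1, cols=2)
labels = assign_clusters(weights, patterns)
print(labels)
```

With a 1 x 2 lattice each neuron stands for one cluster, so the two groups of toy patterns would be expected to be won by different neurons. Larger lattices would be used in the same way, with neighboring neurons ending up with similar weight vectors.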
However, SOM has a few disadvantages. For example, the number of clusters needs to be specified. As previously mentioned, clustering is a two-phase process: determining the number of clusters and clustering the data. Determining the number of clusters is not trivial since the characteristics of the data set are usually not known a priori. As suggested in [16], this difficulty can be overcome by running the algorithm with varying numbers of clusters and selecting the most appropriate clustering result according to a figure of merit.

SOM has been widely used as a clustering method. In [10], SOM was used to
