UMD CMSC 838T - Knowledge-based analysis of microarray gene expression data - D1632130



School name University of Maryland, College Park

Course CMSC 838T - Advanced Topics in Programming Languages

Pages 6


Knowledge-based analysis of microarray gene expression data by using support vector machines

Michael P. S. Brown*, William Noble Grundy†‡, David Lin*, Nello Cristianini§, Charles Walsh Sugnet¶, Terrence S. Furey*, Manuel Ares, Jr.¶, and David Haussler*

*Department of Computer Science and ¶Center for Molecular Biology of RNA, Department of Biology, University of California, Santa Cruz, Santa Cruz, CA 95064; †Department of Computer Science, Columbia University, New York, NY 10025; §Department of Engineering Mathematics, University of Bristol, Bristol BS8 1TR, United Kingdom

Edited by David Botstein, Stanford University School of Medicine, Stanford, CA, and approved November 15, 1999 (received for review August 31, 1999)

We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.
DNA microarray technology provides biologists with the ability to measure the expression levels of thousands of genes in a single experiment. Initial experiments (1) suggest that genes of similar function yield similar expression patterns in microarray hybridization experiments. As data from such experiments accumulates, it will be essential to have accurate means for extracting biological significance and using the data to assign functions to genes.

Currently, most approaches to the computational analysis of gene expression data attempt to learn functionally significant classifications of genes in an unsupervised fashion. A learning method is considered unsupervised if it learns in the absence of a teacher signal. Unsupervised gene expression analysis methods begin with a definition of similarity (or a measure of distance) between expression patterns, but with no prior knowledge of the true functional classes of the genes. Genes are then grouped by using a clustering algorithm such as hierarchical clustering (1, 2) or self-organizing maps (3).

Support vector machines (SVMs) (4-6) and other supervised learning techniques use a training set to specify in advance which data should cluster together. As applied to gene expression data, an SVM would begin with a set of genes that have a common function: for example, genes coding for ribosomal proteins or genes coding for components of the proteasome. In addition, a separate set of genes that are known not to be members of the functional class is specified. These two sets of genes are combined to form a set of training examples in which the genes are labeled positively if they are in the functional class and are labeled negatively if they are known not to be in the functional class. A set of training examples can easily be assembled from literature and database sources.
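The labeling scheme just described can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the gene names, the expression values, and the helper `build_training_set` are hypothetical, and a real expression vector would hold one entry per microarray experiment.

```python
from typing import Dict, List, Sequence, Set, Tuple


def build_training_set(
    expression: Dict[str, Sequence[float]],
    positives: Set[str],
    negatives: Set[str],
) -> Tuple[List[Sequence[float]], List[int]]:
    """Return (vectors, labels): +1 for class members, -1 for known non-members.

    Genes that appear in neither set are left out of the training data.
    """
    overlap = positives & negatives
    if overlap:
        raise ValueError(f"genes labeled both ways: {sorted(overlap)}")
    vectors: List[Sequence[float]] = []
    labels: List[int] = []
    for gene in sorted(expression):  # sorted for a reproducible order
        if gene in positives:
            vectors.append(expression[gene])
            labels.append(1)
        elif gene in negatives:
            vectors.append(expression[gene])
            labels.append(-1)
    return vectors, labels


# Hypothetical toy data: three measurements per gene.
expression = {
    "geneA": [0.9, 1.1, -0.2],
    "geneB": [1.0, 0.8, -0.1],
    "geneC": [-0.7, -0.9, 0.5],
    "geneD": [0.1, 0.0, 0.3],  # function unknown, so not used for training
}
vectors, labels = build_training_set(
    expression, positives={"geneA", "geneB"}, negatives={"geneC"}
)
```

Genes of unknown function, such as the hypothetical `geneD`, stay outside the training set; they are the ones a trained classifier would later be asked about.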
Using this training set, an SVM would learn to discriminate between the members and non-members of a given functional class based on expression data. Having learned the expression features of the class, the SVM could recognize new genes as members or as non-members of the class based on their expression data. The SVM could also be reapplied to the training examples to identify outliers that may have previously been assigned to the incorrect class in the training set. Thus, an SVM would use the biological information in the investigator's training set to determine what expression features are characteristic of a given functional group and use this information to decide whether any given gene is likely to be a member of the group.

262-267 | PNAS | January 4, 2000 | vol. 97 | no. 1

SVMs offer two primary advantages with respect to previously proposed methods such as hierarchical clustering and self-organizing maps. First, although all three methods employ distance (or similarity) functions to compare gene expression measurements, SVMs are capable of using a larger variety of such functions. Specifically, SVMs can employ distance functions that operate in extremely high-dimensional feature spaces, as described in more detail below. This ability allows the SVMs implicitly to take into account correlations between gene expression measurements. Second, supervised methods like SVMs take advantage of prior knowledge (in the form of training data labels) in making distinctions between one type of gene and another. In an unsupervised method, when related genes end up far apart according to the distance function, the method has no way to know that the genes are related.

We describe here the use of SVMs to classify genes based on gene expression. We analyze expression data from 2,467 genes from the budding yeast Saccharomyces cerevisiae measured in 79 different DNA microarray hybridization experiments (1).
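Two ideas from the preceding discussion, a similarity function that works in an implicit high-dimensional feature space and the reapplication of a trained classifier to its own training examples, can be made concrete with a small Python sketch. It is not the authors' method: for brevity it trains a kernel perceptron, a much simpler stand-in for an SVM, and the polynomial kernel form, the toy two-experiment data, and all function names are assumptions made for illustration.

```python
from functools import partial
from typing import Callable, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def poly_kernel(x: Vector, y: Vector, degree: int = 2) -> float:
    """(x . y + 1) ** degree, an assumed kernel form for this sketch.

    For degree 2 this equals a dot product in a feature space containing all
    pairwise products of measurements, without ever building that space.
    """
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** degree


def train_kernel_perceptron(
    X: List[Vector], y: List[int], kernel: Kernel, epochs: int = 20
) -> List[int]:
    """Return alpha, the number of mistakes made on each training example."""
    n = len(X)
    gram = [[kernel(a, b) for b in X] for a in X]
    alpha = [0] * n
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * y[j] * gram[j][i] for j in range(n))
            if y[i] * score <= 0:  # wrong side (or undecided): update
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return alpha


def decision(x: Vector, X: List[Vector], y: List[int],
             alpha: List[int], kernel: Kernel) -> float:
    """Signed score for x: positive means 'member of the class'."""
    return sum(a * label * kernel(xi, x) for a, label, xi in zip(alpha, y, X))


def misclassified_training_examples(X: List[Vector], y: List[int],
                                    alpha: List[int], kernel: Kernel) -> List[int]:
    """Reapply the trained classifier to its own training set."""
    return [i for i, (xi, yi) in enumerate(zip(X, y))
            if yi * decision(xi, X, y, alpha, kernel) <= 0]


# Toy data: a gene is a class member when its two measurements agree in sign,
# so membership depends on the correlation between the two experiments.
X = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
y = [1, 1, -1, -1]

quadratic = partial(poly_kernel, degree=2)
linear = partial(poly_kernel, degree=1)

alpha_quad = train_kernel_perceptron(X, y, quadratic)
alpha_lin = train_kernel_perceptron(X, y, linear)

misfits_quad = misclassified_training_examples(X, y, alpha_quad, quadratic)
misfits_lin = misclassified_training_examples(X, y, alpha_lin, linear)
new_gene_score = decision((0.9, 1.1), X, y, alpha_quad, quadratic)
```

On this toy set the degree-2 kernel fits every training example, while the degree-1 kernel cannot, because no linear function of the two measurements separates the classes. Training examples that a classifier still places on the wrong side after training are the kind of candidates the text describes as possible mislabeled genes, although a soft-margin SVM handles that case in a more principled way than this stand-in.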
From these data, we learn to recognize five functional classes from the Munich Information Center for Protein Sequences Yeast Genome Database (MYGD) (http://www.mips.biochem.mpg.de/proj/yeast). In addition to SVM classification, we subject these data to analyses by four competing machine learning techniques, including Fisher's linear discriminant (7), Parzen windows (8), and two decision tree learners (9, 10). The SVM method outperforms all other methods investigated here. We then use SVMs developed for these functional groups to predict functional associations for 15 yeast ORFs of
