A statistical framework for genomic data fusion

BIOINFORMATICS Vol. 20 no. 16 2004, pages 2626–2635
doi:10.1093/bioinformatics/bth294

Gert R. G. Lanckriet¹, Tijl De Bie³, Nello Cristianini⁴, Michael I. Jordan² and William Stafford Noble⁵,∗

¹Department of Electrical Engineering and Computer Science, ²Division of Computer Science, Department of Statistics, University of California, Berkeley 94720, USA, ³Department of Electrical Engineering, ESAT-SCD, Katholieke Universiteit Leuven 3001, Belgium, ⁴Department of Statistics, University of California, Davis 95618, USA and ⁵Department of Genome Sciences, University of Washington, Seattle 98195, USA

Received on January 29, 2004; revised on April 7, 2004; accepted on April 23, 2004
Advance Access publication May 6, 2004

ABSTRACT

Motivation: During the past decade, the new focus on genomics has highlighted a particular challenge: to integrate the different views of the genome that are provided by various types of experimental data.

Results: This paper describes a computational framework for integrating and drawing inferences from a collection of genome-wide measurements. Each dataset is represented via a kernel function, which defines generalized similarity relationships between pairs of entities, such as genes or proteins. The kernel representation is both flexible and efficient, and can be applied to many different types of data. Furthermore, kernel functions derived from different types of data can be combined in a straightforward fashion. Recent advances in the theory of kernel methods have provided efficient algorithms to perform such combinations in a way that minimizes a statistical loss function.
These methods exploit semidefinite programming techniques to reduce the problem of finding optimizing kernel combinations to a convex optimization problem. Computational experiments performed using yeast genome-wide datasets, including amino acid sequences, hydropathy profiles, gene expression data and known protein–protein interactions, demonstrate the utility of this approach. A statistical learning algorithm trained from all of these data to recognize particular classes of proteins (membrane proteins and ribosomal proteins) performs significantly better than the same algorithm trained on any single type of data.

Availability: Supplementary data at http://noble.gs.washington.edu/proj/sdp-svm

Contact: [email protected]

∗To whom correspondence should be addressed at: Health Sciences Center, Box 357730, 1705 NE Pacific Street, Seattle, WA 98195, USA.

INTRODUCTION

The recent availability of multiple types of genome-wide data provides biologists with complementary views of a single genome and highlights the need for algorithms capable of unifying these views. In yeast, for example, for a given gene we typically know the protein it encodes, that protein's similarity to other proteins, its hydrophobicity profile, the mRNA expression levels associated with the given gene under hundreds of experimental conditions, the occurrences of known or inferred transcription factor binding sites in the upstream region of that gene and the identities of many of the proteins that interact with the given gene's protein product. Each of these distinct data types provides one view of the molecular machinery of the cell. In the near future, research in bioinformatics will focus more and more heavily on methods of data fusion.

Different data sources are likely to contain different and thus partly independent information about the task at hand. Combining those complementary pieces of information can be expected to enhance the total information about the problem at hand.
One problem with this approach, however, is that genomic data come in a wide variety of data formats: expression data are expressed as vectors or time series; protein sequence data as strings from a 20-symbol alphabet; gene sequences are strings from a different (4-symbol) alphabet; protein–protein interactions are best expressed as graphs and so on.

Bioinformatics vol. 20 issue 16 © Oxford University Press 2004; all rights reserved.

This paper presents a computational and statistical framework for integrating heterogeneous descriptions of the same set of genes. The approach relies on the use of kernel-based statistical learning methods that have already proven to be very useful tools in bioinformatics (Noble, 2004). These methods represent the data by means of a kernel function, which defines similarities between pairs of genes, proteins and so on. Such similarities can be quite complex relations, implicitly capturing aspects of the underlying biological machinery. One reason for the success of kernel methods is that the kernel function takes relationships that are implicit in the data and makes them explicit, so that it is easier to detect patterns. Each kernel function thus extracts a specific type of information from a given dataset, thereby providing a partial description or view of the data. Our goal is to find a kernel that best represents all the information available for a given statistical learning task. Given many partial descriptions of the data, we solve the mathematical problem of combining them using a convex optimization method known as semidefinite programming (SDP) (Nesterov and Nemirovsky, 1994; Vandenberghe and Boyd, 1996).
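
As a rough illustration of the kernel view described above, the Python sketch below builds one kernel matrix from expression-like vectors and one from protein-like strings, then combines them as a nonnegative weighted sum. The toy data, the function names, the choice of a 2-mer spectrum kernel and the fixed equal weights are all assumptions made for illustration. In the framework described here the combination would instead be chosen by solving an SDP, which this sketch does not attempt.

```python
from collections import Counter


def linear_kernel(x, y):
    """Inner product between two numeric vectors (e.g. expression profiles)."""
    return sum(a * b for a, b in zip(x, y))


def spectrum_kernel(s, t, k=2):
    """Count shared k-mers between two strings (a simple string kernel)."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[m] * ct[m] for m in cs)


def gram_matrix(items, kernel):
    """Evaluate the kernel on every pair of items."""
    return [[kernel(a, b) for b in items] for a in items]


def normalize(K):
    """Rescale so every diagonal entry is 1, putting kernels on a common scale."""
    n = len(K)
    return [[K[i][j] / (K[i][i] * K[j][j]) ** 0.5 for j in range(n)]
            for i in range(n)]


def combine(kernels, weights):
    """Nonnegative weighted sum of kernel matrices over the same items."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]


# Hypothetical toy data: three "genes", each with two views.
expression = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1], [-1.0, 0.2, 0.8]]
sequences = ["MKVLAAGIV", "MKVLSAGIV", "GDEEKRPLS"]

K_expr = normalize(gram_matrix(expression, linear_kernel))
K_seq = normalize(gram_matrix(sequences, spectrum_kernel))

# Fixed equal weights are an assumption; they are not learned here.
K = combine([K_expr, K_seq], [0.5, 0.5])

for row in K:
    print([round(v, 3) for v in row])
```

The combined matrix can be passed to any learner that accepts a precomputed kernel. Normalizing each matrix first is a design choice of this sketch, so that neither view dominates the sum merely because of its scale.
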
This SDP-based approach (Lanckriet et al., 2004) yields a general methodology for combining many partial descriptions of data that is statistically sound, as well as computationally efficient and robust.

In order to demonstrate the feasibility of these methods, we apply them to the recognition of two important groups of proteins in yeast: ribosomal proteins and membrane proteins. The ribosome is a universal protein complex that is responsible for the translation of mRNA into the corresponding amino acid sequence via the universal genetic code. The structure of the ribosome has been solved (Schluenzen et al., 2000; Harms et al., 2001), although the precise roles of many auxiliary factors are not completely understood. Proteins that participate in the ribosome share similar sequence features and correlated mRNA expression patterns (Brown et al., 2000).

Membrane proteins are proteins that anchor in one of the various membranes in the cell, including the plasma, ER, Golgi, peroxisomal, vacuolar, cellular and mitochondrial inner and outer membranes. Many membrane proteins serve important communicative functions between cellular compartments and between the inside and the outside of the cell (Alberts et al., 1998). Classifying a

