UMD CMSC 828G - Quantitative assessment of protein function prediction

Quantitative assessment of protein function prediction from metagenomics shotgun sequences

E. D. Harrington*, A. H. Singh*, T. Doerks*, I. Letunic*, C. von Mering*†, L. J. Jensen*, J. Raes*, and P. Bork*‡§

*Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany; and ‡Max Delbrück Centre for Molecular Medicine, D-13092 Berlin, Germany

Edited by Michael S. Waterman, University of Southern California, Los Angeles, CA, and approved July 17, 2007 (received for review March 23, 2007)

To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation.
Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.

fatty acid | heme | neighborhood | environmental genomics | metagenome annotation

Recent years have seen an explosion in the amount of shotgun sequence data gathered from diverse natural environments. Since 2004, almost 2 billion base pairs resulting from published large-scale metagenomics sequencing projects have been deposited [as of January of 2007 (1–8)], eclipsing the entire 764 Mbp of previously sequenced microbial genomes (9). Large-scale environmental sequencing efforts have the potential to considerably enhance our understanding of cellular processes, identify ubiquitous as well as unique biological functions in each environment, and close the gaps in our knowledge between genotype, phenotype, and environment. Until the identified ORFs are correctly annotated with biological functions, however, we are simply left with a vast amount of information but no contextual knowledge, analogous to the early days of genome sequencing.

Currently, characterizing an unknown sequence involves comparing it to sequences or protein domains of known function in public databases, usually by using BLAST (10) or other homology search tools (11). By applying BLAST-based annotation methods to newly sequenced genomes, functions can typically be assigned to ≈70% of the gene products (11–13). Unfortunately, these predictions have been estimated to include 13–15% database propagation errors (14) and are only possible if the unknown sequence has at least one BLAST hit. To complement homology-based function prediction, particularly in prokaryotes, additional information from genomic neighborhood (15, 16), phylogenetic profiles (17), gene coexpression (18), and gene fusion (19, 20) has been used and combined (18, 21).
As yet, only the exploitation of genomic neighborhood (including gene fusions) is feasible in the context of metagenomic shotgun data.

In the first large-scale shotgun metagenomics projects from four diverse and complex environments [tropical surface water from the Sargasso Sea near Bermuda (2), farm soil from Minnesota (4), an acidophilic biofilm from an iron ore mine in northern California (1), and three samples from "whale fall" carcasses on the deep Pacific and Antarctic ocean floor (4)], functions have been predicted based on sequence similarity for only 27–48% of the 1.4 million genes in the different samples [see supporting information (SI) Table 1]. This implies that for the majority of proteins in the environment, functions remain unknown, and no attempt has yet been made to discover novel functionality. Furthermore, for each project, different methods, parameters, and even definitions of function were used, which are often not easily accessible to the community, making a comparison of the different samples difficult. To be able to comprehensively predict functions from various metagenomics samples and to get a consistent overview of function in different environments, we developed a sensitive prediction protocol that complements BLAST- and domain-based function predictions with newly developed and adapted gene neighborhood methods. Applying this protocol to the samples revealed a considerable predictive power, indicating that function can be inferred for most of the genes on earth; yet the majority of functions appear to reside in numerous rare, small protein families that remain largely unexplored.

Results and Discussion

An Operational Definition of Protein Function. Biological function is a fuzzy term summarizing a complex concept applicable to different spatial scales (22, 23). At the molecular and (sub-)cellular level, an operational framework with clearly defined terms and thresholds is therefore required when attempting to quantify protein function.
To infer specific function from existing database annotations by using homology, we require similarity to an environmental (partial) ORF >60 bits, corresponding roughly to an e-value of 10⁻⁸ in Uniref90 searches (4). This level of sequence similarity is rather strict in terms of homology identification but without further analysis may be insufficient to distinguish between paralogs and orthologs, thus not capturing all functional features such as enzyme substrate specificity. It is, however, sufficient to capture basic functionality.

We used a hierarchical classification scheme, favoring manual annotation, to divide environmental ORFs and, for comparison, 124 prokaryotic proteomes into four categories based on the level of functional annotation possible: (i) those with strong similarity to, or in the genomic neighborhood of, a gene with specific functional annotation; (ii) those with strong similarity to

Author contributions: E.D.H., A.H.S., C.v.M., and P.B. designed research; E.D.H., A.H.S., T.D., L.J.J., and J.R. performed research; E.D.H.,
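
The stated correspondence between a 60-bit threshold and an e-value of roughly 10⁻⁸ can be illustrated with the standard Karlin-Altschul relation used by BLAST, E = m·n·2^(−S′), where S′ is the bit score and m·n is the effective search space. The following is a minimal sketch only: the function names are hypothetical, and the search-space value is back-calculated from the two numbers quoted in the text rather than taken from the paper.

```python
import math


def evalue_from_bits(bit_score: float, search_space: float) -> float:
    """Expected number of chance hits: E = (m * n) * 2 ** (-S')."""
    return search_space * 2.0 ** (-bit_score)


def implied_search_space(bit_score: float, evalue: float) -> float:
    """Invert the same relation to get the effective search space m * n."""
    return evalue * 2.0 ** bit_score


# Assumption: 60 bits and E = 1e-8 are treated as an exact pair, although
# the text only says they correspond "roughly".
space = implied_search_space(60.0, 1e-8)
print(f"implied effective search space: {space:.3e}")

# Each additional bit halves the e-value for a fixed search space.
for bits in (50, 60, 70):
    print(bits, f"{evalue_from_bits(bits, space):.2e}")
```

Because the e-value depends on the database size while the bit score does not, a fixed bit-score cutoff keeps the same meaning as the database grows, whereas the matching e-value shifts.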

