Data growth and its impact on the SCOP database

Published online 13 November 2007
Nucleic Acids Research, 2008, Vol. 36, Database issue D419–D425
doi:10.1093/nar/gkm993

Data growth and its impact on the SCOP database: new developments

Antonina Andreeva1,*, Dave Howorth1, John-Marc Chandonia2,3, Steven E. Brenner2, Tim J. P. Hubbard4, Cyrus Chothia5 and Alexey G. Murzin1

1MRC Centre for Protein Engineering, Hills Road, Cambridge CB2 0QH, UK, 2Department of Plant and Microbial Biology, 461A Koshland Hall 3102, University of California, Berkeley, CA 94720-3102, 3Physical Biosciences Division, Berkeley National Laboratory, 1 Cyclotron Rd, Mail Stop Donner, Berkeley, CA 94720, USA, 4Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA and 5MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, UK

Received September 14, 2007; Revised October 19, 2007; Accepted October 22, 2007

ABSTRACT

The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. The SCOP hierarchy comprises the following levels: Species, Protein, Family, Superfamily, Fold and Class. While keeping the original classification scheme intact, we have changed the production of SCOP in order to cope with a rapid growth of new structural data and to facilitate the discovery of new protein relationships. We describe ongoing developments and new features implemented in SCOP. A new update protocol supports batch classification of new protein structures by their detected relationships at Family and Superfamily levels in contrast to our previous sequential handling of new structural data by release date. We introduce pre-SCOP, a preview of the SCOP developmental version that enables earlier access to the information on new relationships. We also discuss the impact of worldwide Structural Genomics initiatives, which are producing new protein structures at an increasing rate, on the rates of discovery and growth of protein families and superfamilies.
SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.

BACKGROUND

The Structural Classification of Proteins (SCOP) is a database of known structural and evolutionary relationships amongst proteins of known structure (1). By analogy with taxonomy, it has been created as a hierarchy of several obligatory levels. The fundamental unit of classification is a domain in the experimentally determined protein structure. Protein domains are grouped at different levels according to their sequence, structural and functional relationships. Proceeding from bottom to top, the SCOP hierarchy comprises the following levels: protein Species, representing a distinct protein sequence and its naturally occurring or artificially created variants; Protein, grouping together similar sequences of essentially the same functions that either originate from different biological species or represent different isoforms within the same organism; Family containing proteins with related sequences but typically distinct functions; and Superfamily bridging together protein families with common functional and structural features inferred to be from a common evolutionary ancestor. Near the root, the basis of classification is purely structural: structurally similar superfamilies with different characteristic features are grouped into Folds, which are further arranged into Classes based mainly on their secondary structure content and organization. The seven main classes in the latest release (1.73, forthcoming) contain 92 927 domains organized into 3464 families, 1777 superfamilies and 1086 folds. The SCOP domains correspond to 34 495 entries in the Protein Data Bank (PDB) (2). Statistics of the current and previous releases, summaries and full histories of changes and other information are available from the SCOP website (http://scop.mrc-lmb.cam.ac.uk/scop/) together with parseable files encoding all SCOP data (3).
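
As an illustration only, the ordering of levels described above and the counts quoted for release 1.73 can be written down in a few lines of Python. This is a minimal sketch, not part of SCOP or its distributed files: the names `Level`, `RELEASE_1_73`, `levels_above` and `mean_children` are hypothetical and introduced here purely for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """SCOP levels, numbered from the bottom of the hierarchy to the top.

    The ordering follows the text; the numeric values are an assumption
    made for this sketch and carry no meaning in SCOP itself.
    """
    SPECIES = 1
    PROTEIN = 2
    FAMILY = 3
    SUPERFAMILY = 4
    FOLD = 5
    CLASS = 6


# Counts quoted in the text for the seven main classes of release 1.73.
RELEASE_1_73 = {
    "domains": 92927,
    "families": 3464,
    "superfamilies": 1777,
    "folds": 1086,
}


def levels_above(level):
    """Return every level that sits above `level`, ordered bottom to top."""
    return [candidate for candidate in Level if candidate > level]


def mean_children(counts, child, parent):
    """Average number of `child` entries per `parent` entry."""
    return counts[child] / counts[parent]


if __name__ == "__main__":
    print([lvl.name for lvl in levels_above(Level.FAMILY)])
    for child, parent in [
        ("domains", "families"),
        ("families", "superfamilies"),
        ("superfamilies", "folds"),
    ]:
        ratio = mean_children(RELEASE_1_73, child, parent)
        print(f"{child} per {parent[:-1] if parent != 'families' else 'family'}: {ratio:.2f}")
```

The ratios printed are simple averages over the quoted totals; they say nothing about how unevenly domains are distributed, which matters for the large, growing superfamilies discussed later in this section.
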
The sequences and structures of SCOP domains are available from the ASTRAL compendium (4), and hidden Markov models of SCOP domains are available from the SUPERFAMILY database (5).

*To whom correspondence should be addressed. Tel: +44 1223 402132; Fax: +44 1223 402140; Email: [email protected]
Correspondence may also be addressed to Alexey G. Murzin. Tel: +44 402132; Fax: +44 402140; Email: [email protected]
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Since the creation of SCOP in 1994, the number of known protein structures has grown more than 20-fold, whereas the numbers of SCOP folds, superfamilies and families have increased 4-fold, 5-fold and 7-fold, respectively. Besides an increased workload caused by the rapid data growth, processing these data for SCOP classification revealed more subtleties of protein relationships as well as new types of such relationships. It has become increasingly difficult to update the database while maintaining its original design. Accommodation of large numbers of new structures and their relationships within the SCOP hierarchy required some adjustments of the original classification scheme. In particular, there are large superfamilies, which continue to grow, accumulating many more new families and proteins. The division of these most populous superfamilies into families departed from the original SCOP scheme: their families consist of proteins of very similar structures that may or may not have a significant global sequence similarity.
The proteins in these families are presumably more closely related to each other than to proteins in other more structurally divergent families.

A large proportion of new structures come from worldwide Structural Genomics initiatives, which are producing them at an increasing rate (6,7). Generally these structures are functionally uncharacterized which complicates their classifications at the Protein and Superfamily levels. Therefore, the initial classifications of such structures may be provisional. Discoveries of new relationships may either confirm these classifications or revise them. Other complications for a hierarchical classification come from the discoveries of probable remote homologies between superfamilies of distinct protein folds and the non-trivial structural relationships within sequence families (8,9).

While keeping the

