Stanford CS 374 - ProtoMap

ProtoMap: Automatic Classification of Protein Sequences, a Hierarchy of Protein Families, and Local Maps of the Protein Space

Golan Yona,1* Nathan Linial,1 and Michal Linial2
1Institute of Computer Science, Hebrew University, Jerusalem, Israel
2Department of Biological Chemistry, Institute of Life Sciences, Hebrew University, Jerusalem, Israel

Grant sponsor: Israel Academy of Sciences; Grant sponsor: Horovitz Foundation of Yissum, The Hebrew University; Grant sponsor: Program in Mathematics and Molecular Biology.
*Correspondence to: Golan Yona, Department of Structural Biology, Fairchild Building D-109, Stanford University, CA 94305. E-mail: [email protected]
Received 9 March 1999; Accepted 28 June 1999
PROTEINS: Structure, Function, and Genetics 37:360–378 (1999)
© 1999 WILEY-LISS, INC.

ABSTRACT

We investigate the space of all protein sequences in search of clusters of related proteins. Our aim is to automatically detect these sets, and thus obtain a classification of all protein sequences. Our analysis, which uses standard measures of sequence similarity as applied to an all-vs.-all comparison of SWISSPROT, gives a very conservative initial classification based on the highest scoring pairs. The many classes in this classification correspond to protein subfamilies. Subsequently we merge the subclasses using the weaker pairs in a two-phase clustering algorithm. The algorithm makes use of transitivity to identify homologous proteins; however, transitivity is applied restrictively in an attempt to prevent unrelated proteins from clustering together. This process is repeated at varying levels of statistical significance. Consequently, a hierarchical organization of all proteins is obtained.

The resulting classification splits the protein space into well-defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Different indices of validity were applied to assess the quality of our classification and compare it with the protein families in the PROSITE and Pfam databases. Our classification agrees with these domain-based classifications for between 64.8% and 88.5% of the proteins. It also finds many new clusters of protein sequences which were not classified by these databases. The hierarchical organization suggested by our analysis reveals finer subfamilies in families of known proteins as well as many novel relations between protein families. Proteins 1999;37:360–378. © 1999 Wiley-Liss, Inc.

Key words: clustering; protein families; protein classification; sequence alignment; homologous proteins

INTRODUCTION

In recent years we have witnessed a massive flow of new biological data. Large-scale, world-wide sequencing projects reveal new sequences, and many sequences that are added to the databases are unannotated and await analysis. Currently, 15 complete genomes (yeast, C. elegans, Escherichia coli, other eubacteria, and several archaea) are known. Between 35% and 50% of their proteins have not been assigned a function yet [1,2]. In the absence of structural data, analysis necessarily starts with the sequence. The most effective analyses compare the sequence under study with all known sequences, in search for close relatives that may have been assigned a function. In this way, properties of a new protein sequence are extrapolated from those of its neighbors.

Since the early 1970s, algorithms were developed for comparing protein sequences efficiently and reliably [3–7]. But even with the best alignment of two protein sequences, the basic question remains: Do they share the same biological function or not? It is generally accepted that two sequences with over 30% identity along much of the sequences are likely to have the same three-dimensional structure or fold [8–11]. Proteins of the same fold often have similar biological functions. Nevertheless, one encounters many cases of high similarity both in fold and function that is not reflected in sequence similarity [11–13]. Such cases are missed by current search methods that just compare sequences.

Detecting homology may often help in determining the function of new proteins. By definition, homologous proteins have evolved from the same ancestor protein. The degree of sequence conservation varies among protein families. However, homologous proteins almost always have the same fold [14]. Homology is, by definition, a transitive relation: If A is homologous to B, and B is homologous to C, then A is homologous to C. This simple observation can be very effective in discovering homology. However, when applied simple-mindedly, this observation leads to many pitfalls. Although the common evolutionary origin of two proteins is almost never directly observed, we can deduce homology, with a high statistical confidence, given that the sequence similarity is significant. This is particularly useful in the so-called twilight zone [15], where sequences are identical with, say, 10–25%. Transitivity can be used to detect related proteins, beyond the power of a direct search.

Although transitivity is an attractive concept, it has its perils: Similarity is not transitive, and similarity does not necessarily imply homology.† Therefore, similarity should be used carefully in attempting to deduce homology. Multidomain proteins make the deduction of homology particularly difficult: If protein 1 contains domains A and B, protein 2 contains domains B and C, protein 3 contains domains C and D, then should proteins 1 and 3 be considered homologous? This simple example indicates the inadequacy of single-linkage clustering for the purpose of identifying protein families within the sequence space.

Expert biologists can distinguish significant from insignificant similarities. However, the sheer size of current databases rules out an exhaustive manual examination of all potential homologies. Our goal here is to develop an automatic method for classification of protein sequences based on sequence similarity, through the detection of groups of homologous proteins (clusters) and high-level structures (groups of related clusters that are connected by weak but consistent sequence similarities) within the sequence space. Such organization would reveal relationships among protein families and yield deeper insights into the nature of newly discovered sequences.
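The multidomain example in the introduction (protein 1 with domains A and B, protein 2 with B and C, protein 3 with C and D) can be made concrete with a small Python sketch. This is not the ProtoMap algorithm. It is only a toy illustration of naive single-linkage clustering, under the assumption that two proteins count as "similar" whenever they share at least one domain. The names `proteins`, `single_linkage`, and `shared_1_3` are hypothetical and chosen for this illustration.

```python
from itertools import combinations

# Hypothetical toy data mirroring the multidomain example in the text.
proteins = {
    "protein1": {"A", "B"},
    "protein2": {"B", "C"},
    "protein3": {"C", "D"},
}


def single_linkage(domains_by_protein):
    """Cluster proteins by the transitive closure of 'shares a domain'.

    Assumption for this sketch: sharing any domain stands in for a
    significant pairwise sequence similarity.
    """
    # Union-find forest: every protein starts as its own cluster.
    parent = {name: name for name in domains_by_protein}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every pair that shares at least one domain.
    for p, q in combinations(domains_by_protein, 2):
        if domains_by_protein[p] & domains_by_protein[q]:
            parent[find(p)] = find(q)

    # Collect members by root and return them in a stable, sorted form.
    clusters = {}
    for name in domains_by_protein:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


clusters = single_linkage(proteins)
shared_1_3 = proteins["protein1"] & proteins["protein3"]
```

Here `clusters` holds a single cluster containing all three proteins, while `shared_1_3` is the empty set: proteins 1 and 3 are grouped together purely through protein 2, although they have no domain in common. This chaining effect is the kind of pitfall that motivates applying transitivity restrictively.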