CORNELL CS 726 - Study Notes

Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5913–5920, May 1998
Colloquium Paper

This paper was presented at the colloquium "Computational Biomolecular Science," organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

A unified statistical framework for sequence comparison and structure comparison

(sequence analysis/structure analysis/fold family/database statistics/protein evolution)

MICHAEL LEVITT*† AND MARK GERSTEIN‡

*Department of Structural Biology, Stanford University, Stanford, CA 94305; and ‡Molecular Biophysics and Biochemistry Department, P.O. Box 208114, Yale University, New Haven, CT 06520-8114

ABSTRACT We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (scop) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. The agreement, moreover, between the P values that we derive from this distribution and those reported by standard programs (e.g., BLAST and FASTA) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties).
We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be related distantly shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure whereas many pairs have significant similarity in terms of structure but not sequence.

Comparison is a most fundamental operation in biology. Measuring the similarities between "things" enables us to group them in families, cluster them in trees, and infer common ancestors and an evolutionary progression. Biological comparisons can take place at many levels, from that of whole organisms to that of individual molecules. We are concerned here with the comparison on the latter level, specifically, with comparisons of individual protein sequences and structures. (For an example of systematic comparison applied to whole organisms, see refs. 1 and 2.)

Our overall aim is to describe these two types of comparisons in a self-consistent, unified framework. For sequence or structure comparison, each act of comparing one "entity" to another (that is, either comparing two sequences or two structures) involves two steps. First, the two objects are aligned optimally through the introduction of gaps in such a way as to maximize their residue-by-residue similarity. This operation generates some form of total similarity score for the number of residues matched (traditionally, a percent identity for sequences or an rms for structures, although we will use other measures).
Second, one has to assess the significance of this score in the context of what is known about the proteins currently in the database.

In earlier papers, Gerstein and Levitt (3, 30) extended the work of Subbiah et al. (4) and Laurents et al. (5) and described an approach for structural alignment in an analogous fashion to the traditional approach for sequence alignment (6–9). Like sequence alignment, this method involves applying dynamic programming to a matrix of similarities between individual residues to optimize their overall correspondence through the introduction of gaps.

In this paper, we tackle the second of the two steps in protein comparison: assessing significance. We developed a simple empirical approach for calculating the significance of an alignment score based on doing an all-vs.-all comparison of the database and then curve fitting to the distribution of scores of true negatives. This allows us to express the significance of a given alignment score in terms of a P value, which is the chance that an alignment of two randomly selected proteins would obtain this score. We applied our approach consistently to both sequences and structures. For sequences, we could compare our fit-based P values with the differently derived statistical score from commonly used programs such as BLAST and FASTA (10–13). The agreement we found validated our approach. For structure alignment, we followed a parallel route to derive an expression for the P value of a given alignment in terms of the structural alignment score.

Our work followed on much that recently has been done assessing the significance of sequence and structure comparison. One of the major developments in the past few years has been the implementation of probabilistic scoring schemes (13–16). These give the significance of a match in terms of a P value rather than an absolute, "raw" score (such as percent identity).
This places scores from very different programs in a common framework and provides an obvious way to set a significance cutoff (that is, at P ≤ 0.0001 or 0.01%). P values were first used in the BLAST family of programs, where they are derived from an analytic model for the chance of an arbitrary ungapped alignment (10, 17). P values subsequently have been implemented in other programs, such as FASTA and gapped BLAST by using a somewhat different formalism (13, 18, 19).

© 1998 by The National Academy of Sciences 0027-8424/98/955913-8$2.00/0
PNAS is available online at http://www.pnas.org.
Abbreviation: scop, Structural Classification of Proteins.
†To whom reprint requests should be addressed. e-mail: [email protected]

There are currently many methods for structural alignment (20–31). Some of these are associated with probabilistic scoring schemes. In particular, one method (VAST)
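
The empirical procedure outlined in the introduction (collect the scores of true negatives, fit an extreme-value distribution to them, then read off a P value for any new score) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the function names `fit_gumbel`, `p_value`, and `sample_gumbel` are hypothetical, the "true-negative" scores are synthetic, the fit uses the method of moments rather than curve fitting to the observed score histogram, and no dependence of the score on protein length is modeled.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(scores):
    """Estimate Gumbel (type I extreme-value) location a and scale b.

    Assumption: method of moments is used here as a simple stand-in
    for curve fitting. For a Gumbel distribution,
    mean = a + gamma * b and variance = (pi * b)**2 / 6.
    """
    mean = statistics.fmean(scores)
    stdev = statistics.stdev(scores)
    b = stdev * math.sqrt(6.0) / math.pi
    a = mean - EULER_GAMMA * b
    return a, b


def p_value(score, a, b):
    """Probability that a chance score is at least `score`.

    Uses the Gumbel survival function 1 - exp(-exp(-z)), z = (score - a) / b.
    expm1 keeps precision when the P value is very small.
    """
    z = (score - a) / b
    # Clamp the exponent so extremely low scores do not overflow exp().
    return -math.expm1(-math.exp(min(-z, 700.0)))


def sample_gumbel(rng, a, b):
    """Draw one synthetic score by inverting the Gumbel CDF (illustration only)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return a - b * math.log(-math.log(u))


def demo():
    rng = random.Random(0)
    # Hypothetical stand-in for all-vs.-all scores between unrelated domains.
    negatives = [sample_gumbel(rng, 20.0, 5.0) for _ in range(20000)]
    a, b = fit_gumbel(negatives)
    print(f"fitted location a = {a:.2f}, scale b = {b:.2f}")
    for score in (30.0, 50.0, 70.0):
        print(f"score {score:5.1f} -> P value {p_value(score, a, b):.3g}")


if __name__ == "__main__":
    demo()
```

In this sketch a score is called significant when its P value falls below a chosen cutoff, such as the 0.0001 level mentioned above. With real data, the list of negatives would come from pairs known to be unrelated, and the fit would need to be checked against the observed score distribution before the P values are trusted.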

