BIOINFORMATICS Vol. 20 no. 10 2004, pages 1603–1611
doi:10.1093/bioinformatics/bth132

Protein secondary structure: entropy, correlations and prediction

Gavin E. Crooks* and Steven E. Brenner
Department of Plant and Microbial Biology, University of California, 111 Koshland Hall No. 3102, Berkeley, CA 94720-3102, USA

Received on November 4, 2003; revised and accepted on November 26, 2003
Advance Access publication February 26, 2004

*To whom correspondence should be addressed.

ABSTRACT
Motivation: Is protein secondary structure primarily determined by local interactions between residues closely spaced along the amino acid backbone, or by non-local tertiary interactions? To answer this question, we measure the entropy densities of primary and secondary structure sequences, and the local inter-sequence mutual information density.
Results: We find that the important inter-sequence interactions are short ranged, that correlations between neighboring amino acids are essentially uninformative, and that only one-fourth of the total information needed to determine the secondary structure is available from local inter-sequence correlations. These observations support the view that the majority of most proteins fold via a cooperative process where secondary and tertiary structure form concurrently. Moreover, existing single-sequence secondary structure prediction algorithms are almost optimal, and we should not expect a dramatic improvement in prediction accuracy.
Availability: Both the data sets and analysis code are freely available from our Web site at http://compbio.berkeley.edu/
Contact: [email protected]

INTRODUCTION
The secondary structure of a protein is a summary of the general conformation and hydrogen bonding pattern of the amino acid backbone (Frishman and Argos, 1995). This structure is frequently simplified to a sequence (one element per residue) of helixes (H), extended strands (E) and unstructured loops (L). It has long been recognized that each residue's secondary structure is appreciably correlated with the local amino acid sequence (Szent-Gyorgyi and Cohen, 1957), and that these correlations may be used to predict the secondary structure (Rost, 2001; Przybylski and Rost, 2002) or as a contribution to threading potentials (Alexandrov et al., 1996; McGuffin and Jones, 2003) and other tertiary structure prediction algorithms (Bowie et al., 1991). The effectiveness of local secondary structure prediction, and the utility of secondary structure potentials, depends upon the extent to which a protein's structure, particularly the secondary structure, is determined by local, short-ranged interactions between residues closely spaced along the backbone, as opposed to non-local or long-ranged tertiary interactions.

Primary:   YDPEEHHKLSHEAESLPSVVISSQAAGNAVMMGAGYFSP
Secondary: LLHHHHHHHHHHHHLLLEEELLHHHHHHHHHHHHLLLLL

Fig. 1. A protein's amino acid sequence is correlated with the corresponding secondary structure sequence, represented here by a sequence of helixes (H), extended strands (E) and unstructured loops (L). For example, alanines (A) are typically associated with helixes, while glycines (G) are often located near helix breaks. Also note that secondary structure is strongly persistent. Helixes, e.g., are on average about 10 residues long (Schmidler et al., 2000).

The strength, organization and relative importance of local sequence–structure interactions can be determined with a statistical analysis of the corpus of known protein structures. We treat the primary and secondary structures of a protein as random sequences comprising either the 20 letter amino acid or the three letter extended strand/helix/other (EHL) structure alphabets, as shown in Figure 1. These sequences contain substantial local sequence and inter-sequence correlations that can be quantified using entropic measures. To ensure accurate results, we employ a large, carefully curated collection of protein structures derived from the Structural Classification of Proteins (SCOP) database (Murzin et al., 1995; Lo Conte et al., 2002), which contains 2853 sequences.

Sequence information
Entropy is a measure of the information needed to describe a random variable. Specifically, the entropy, H(X), of a discrete random variable X, measured in bits, is defined as

    H(X) = -\mathrm{E}\,\log_2 P(X) = -\sum_{x \in \mathcal{X}} P(x) \log_2 P(x),    (1)

where \mathcal{X} is the alphabet, the set of allowed states, x is an element of \mathcal{X}, E is the expectation and P(x) is the probability of state x. The maximum achievable entropy is \log_2 |\mathcal{X}| bits, where |\mathcal{X}| is the size of the alphabet. For a modern introduction to information theory, see the book by Cover and Thomas (1991).
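As a concrete illustration of Equation (1), the following minimal sketch computes a plug-in (empirical frequency) entropy estimate for a symbol sequence. This is illustrative code, not the analysis code released by the authors; the function name and toy input are our own, and the plug-in estimator is biased low for small samples.

```python
from collections import Counter
from math import log2

def plugin_entropy(seq):
    """Plug-in estimate of H(X), Eq. (1), in bits: observed symbol
    frequencies stand in for the true probabilities P(x)."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())

# Toy EHL string from Fig. 1. The estimate cannot exceed
# log2(3) ≈ 1.585 bits, the maximum entropy of a 3-letter alphabet.
secondary = "LLHHHHHHHHHHHHLLLEEELLHHHHHHHHHHHHLLLLL"
print(f"H ≈ {plugin_entropy(secondary):.3f} bits")
```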
When considering the entropy of a collection of variables, it is important to take into account inter-variable correlations. For a statistically homogeneous random sequence with local correlations, the appropriate information measure is the entropy density, h_\mu, the rate at which the entropy of the sequence increases with length:

    h_\mu = \lim_{L \to \infty} \frac{H(X_L) - E_h}{L}.    (2)

Here, H(X_L) is the entropy of sequence fragments, X_L, of length L. The non-extensive excess entropy, E_h, is the quantity of information explained away by taking account of inter-site correlations. The entropy density is also referred to as the entropy rate or metric entropy (Cover and Thomas, 1991).

A convenient measure of correlation between two discrete random variables X and Y is the mutual information, I(X; Y), defined as

    I(X; Y) = H(X) + H(Y) - H(X, Y)    (3)
            = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P(x, y) \log_2 \frac{P(x, y)}{P(x) P(y)},    (4)

where P(x, y) is the joint probability of observing states x and y. If the random variables are independent [P(x, y) = P(x)P(y)], then the mutual information achieves its lower bound of zero. Mutual information cannot exceed the entropy of either variable, and this upper bound is reached when the variables are perfectly correlated [P(x, y) = P(x) = P(y)].

The appropriate entropic correlation measure for a pair of statistically homogeneous random sequences is the mutual information density, i_\mu:

    i_\mu = \lim_{L \to \infty} \frac{I(X_L; Y_L) - E_i}{L}.    (5)

Here, E_i is the excess mutual information.

When we consider the correlations between three random variables, it is often useful to consider I(X; Y | Z), the conditional mutual information (Cover and Thomas, 1991) of X and Y, given a third variable Z. This quantity can be conveniently defined in terms of mutual information:

    I(X; Y \mid Z) = I(X; Y) + I(X, Y; Z) - I(X; Z) - I(Y; Z).    (6)

Conditioning …
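To make Equations (3) and (6) concrete, the sketch below estimates mutual information and conditional mutual information from aligned symbol sequences, using plug-in entropies throughout. Again, this is our own illustration rather than the authors' released implementation, and the Figure 1 pair is far too short for a reliable estimate.

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Plug-in estimate of H(X) in bits (cf. Eq. (1))."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), Eq. (3), from aligned samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def conditional_mutual_information(xs, ys, zs):
    """I(X; Y | Z) via the identity of Eq. (6):
    I(X;Y|Z) = I(X;Y) + I((X,Y); Z) - I(X;Z) - I(Y;Z)."""
    xy = list(zip(xs, ys))
    return (mutual_information(xs, ys) + mutual_information(xy, zs)
            - mutual_information(xs, zs) - mutual_information(ys, zs))

# Mutual information between the aligned primary and secondary
# sequences of Fig. 1: how much a residue's identity tells us about
# its own secondary structure state.
primary   = "YDPEEHHKLSHEAESLPSVVISSQAAGNAVMMGAGYFSP"
secondary = "LLHHHHHHHHHHHHLLLEEELLHHHHHHHHHHHHLLLLL"
print(f"I ≈ {mutual_information(primary, secondary):.3f} bits")
```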


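The limit in Equation (2) can also be approximated from data by computing block entropies H(X_L) at small L and taking the discrete derivative H(X_L) − H(X_{L−1}), which for a stationary sequence decreases toward h_µ as L grows. The sketch below only illustrates the idea on toy data; it is not the paper's estimator, and the plug-in block entropies become severely biased once the number of possible blocks rivals the sample size, so only small L are usable in practice.

```python
from collections import Counter
from math import log2

def block_entropy(seq, L):
    """Plug-in entropy, in bits, of the length-L fragments X_L obtained
    by sliding a window over the sequence."""
    blocks = [seq[i:i + L] for i in range(len(seq) - L + 1)]
    n = len(blocks)
    return -sum((c / n) * log2(c / n) for c in Counter(blocks).values())

def entropy_density_estimates(seq, max_L=4):
    """Finite-L estimates of h_mu, Eq. (2), via H(X_L) - H(X_{L-1})."""
    return [block_entropy(seq, L) - block_entropy(seq, L - 1)
            for L in range(2, max_L + 1)]

# Repeating the Fig. 1 string merely stands in for a real corpus;
# actual estimates would use the SCOP-derived data set described above.
secondary = "LLHHHHHHHHHHHHLLLEEELLHHHHHHHHHHHHLLLLL" * 50
for L, h in enumerate(entropy_density_estimates(secondary), start=2):
    print(f"L={L}: H(L) - H(L-1) ≈ {h:.3f} bits/residue")
```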