CORNELL CS 726 - Enriching the Sequence Substitution Matrix by Structural Information


Enriching the Sequence Substitution Matrix by Structural Information

Octavian Teodorescu,1 Tamara Galor,1 Jaroslaw Pillardy,2 and Ron Elber1*
1 Department of Computer Science, Cornell University, Upson Hall 4130, Ithaca, New York 14853
2 Cornell Theory Center, Cornell University, Upson Hall 4130, Ithaca, New York 14853

ABSTRACT

A fundamental step in homology modeling is the comparison of two protein sequences: a probe sequence with an unknown structure and function and a template sequence for which the structure and function are known. The detection of protein similarities relies on a substitution matrix that scores the proximity of the aligned amino acids. Sequence-to-sequence alignments use symmetric substitution matrices, whereas threading protocols use asymmetric matrices that test the fitness of the probe sequence to the structure of the template protein. We propose a linear combination of the threading and sequence-alignment scoring functions to produce a single (mixed) scoring table. By fitting a single parameter (the relative contribution of the BLOSUM 50 matrix and the THOM2 threading energy table) we obtain a significant increase in prediction capacity in the twilight zone of homology modeling (detecting sequences with <25% sequence identity and with very similar structures). For a difficult test of 176 homologous pairs with no signal of sequence similarity, the mixed model detects between 40 and 100% more protein pairs than pure threading. Surprisingly, the linear combination of the two models performs better than either threading or sequence alignment alone when the percentage of sequence identity is low. Finally, we suggest that further enrichment of substitution matrices with more structural descriptors, such as exposed surface area or secondary structure, is expected to enhance the signal as well.

Proteins 2004;54:41–48. © 2003 Wiley-Liss, Inc.

Key words: sequence alignment; threading; fitness function; sequence-to-structure matching; energy function; Z-score

The calculations were performed on the Dell Edge cluster of the Cornell Theory Center, funded by the tri-institutional grant.
Grant sponsor: National Science Foundation; Grant number: 9988519 to R.E. Grant sponsor: NSERC Canadian fellowship to O.T. Grant sponsor: a tri-institutional grant to Cornell and Rockefeller Universities and Memorial Sloan Kettering Cancer Center to J.P.
*Correspondence to: R. Elber, Department of Computer Science, Cornell University, Upson Hall 4130, Ithaca, NY 14853. E-mail: [email protected]
Received 11 December 2002; Accepted 25 March 2003
PROTEINS: Structure, Function, and Bioinformatics 54:41–48 (2004). © 2003 Wiley-Liss, Inc.

INTRODUCTION

Annotation and classification of proteins rely on accurate and efficient comparison of pairs of proteins. An essential ingredient of the comparison algorithm is the substitution matrix, $T$. For a pair of amino acid types $\alpha$ and $\beta$ in environments $x$ and $y$, the substitution matrix provides a score for their exchange between the two proteins, $T \equiv T(\alpha, x \mid \beta, y)$. The score of an alignment (ignoring for the moment penalties for indels) is a sum over all substitution scores.

The environment consists of features $(x, y)$ that supplement the direct score of amino acid substitution, which we denote by $T(\alpha \mid \beta)$. For example, it may include (i) multiple sequence information [1], (ii) secondary structure data [2], (iii) exposed surface area [3], and (iv) many other structural and functional fingerprints. Here we consider the information content of only a pair of proteins. Multiple sequence information [feature (i)] is not discussed here and can be added (in principle) once the scoring of a pair is optimized.

One class of environment features is structural information. An alignment of a probe sequence into the shape of another protein is called threading and is usually associated with an energy function [4]; the energy measures the quality of the sequence-to-structure fit [4]. The amino acids are aligned into a known shape and three-dimensional interactions are scored, measuring protein stability. The sequence-to-sequence and sequence-to-structure alignments are done separately and have their own corresponding substitution matrices. For sequence alignment we have $T(\alpha \mid \beta)$ and for sequence-to-structure alignment we use $T'(\alpha \mid y)$.

It is interesting to note that one type of substitution matrix dominates the scoring of sequence-to-sequence alignments in proteins (BLOSUM 50; Ref. 5), whereas there is no dominant scoring scheme (energy function) for matching sequences into structures. The BLOSUM 50 matrix was used as an example because we have considerable experience in using it and in comparing its results with threading approaches [4]. We anticipate a similar enhancement in recognition for other sequence-substitution matrices; however, we did not do calculations with other matrices. Part of the reason for the larger diversity of threading energy functions is the higher complexity of three-dimensional interactions compared with one-dimensional substitutions, which makes it more difficult to find the best choice. Another reason is the significant success of BLOSUM 50 in identifying evolutionary relationships compared with the much weaker sensitivity of stability energies.

Nevertheless, an interesting complementary relationship was observed in a number of studies [4]. In the twilight zone of similarity detection by sequence alignment it is possible to find remote evolutionary relationships by sequence-to-structure matching. Threading detects a significantly smaller number of similar protein pairs than sequence alignment; however, the set of hits in threading is not a subset of the sequence alignment hits. Therefore, threading alone is a potentially useful tool when sequence alignment fails to recover a signal.

Merging of threading and sequence signals is done after separate alignments and scoring have been performed. The raw scores or the statistical significance measures (e.g., the Z-scores [6]) are combined in an empirical formula [7] or in a neural net [8] to take advantage of the complementarities of the two techniques.

Here we propose another combination of sequence and structure signals, at the level of the substitution matrix. A new substitution matrix, $M(\alpha \mid \beta, y)$, is defined as a linear combination of $T(\alpha \mid \beta)$ and $T'(\alpha \mid y)$:

$$M(\alpha \mid \beta, y) = \lambda T(\alpha \mid \beta) + (1 - \lambda) T'(\alpha \mid y) \qquad (1)$$

The parameter $\lambda$ is a constant mixing term
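
As an illustration of Eq. (1), the following is a minimal Python sketch of a mixed score and of an ungapped alignment score computed as a sum over substitution scores. It is not the authors' implementation. The table values are invented for the example (they are not BLOSUM 50 or THOM2 entries), the two environment classes "buried" and "exposed" are hypothetical stand-ins for the template environment y, and both tables are assumed to be expressed on a common scale in which a higher value means a better match. Indel penalties are ignored, as in the text.

```python
from typing import Dict, Sequence, Tuple

# Toy symmetric sequence table T(a|b). Values are made up, NOT BLOSUM 50.
T_SEQ: Dict[Tuple[str, str], float] = {
    ("A", "A"): 5.0, ("A", "L"): -2.0, ("A", "K"): -1.0,
    ("L", "L"): 5.0, ("L", "K"): -3.0,
    ("K", "K"): 6.0,
}

# Toy asymmetric threading table T'(a|y). Values are made up, NOT THOM2.
# Assumption: already converted so that higher means a better fit.
T_THREAD: Dict[Tuple[str, str], float] = {
    ("A", "buried"): 1.0, ("A", "exposed"): 0.0,
    ("L", "buried"): 2.0, ("L", "exposed"): -2.0,
    ("K", "buried"): -3.0, ("K", "exposed"): 1.0,
}


def seq_score(a: str, b: str) -> float:
    """Symmetric lookup: T(a|b) == T(b|a)."""
    return T_SEQ[(a, b)] if (a, b) in T_SEQ else T_SEQ[(b, a)]


def mixed_score(a: str, b: str, y: str, lam: float) -> float:
    """Eq. (1): M(a|b,y) = lam * T(a|b) + (1 - lam) * T'(a|y)."""
    return lam * seq_score(a, b) + (1.0 - lam) * T_THREAD[(a, y)]


def ungapped_alignment_score(
    probe: str, template: str, envs: Sequence[str], lam: float
) -> float:
    """Sum of mixed substitution scores over aligned positions (no indels)."""
    if not (len(probe) == len(template) == len(envs)):
        raise ValueError("probe, template and envs must have equal length")
    return sum(mixed_score(a, b, y, lam) for a, b, y in zip(probe, template, envs))


if __name__ == "__main__":
    envs = ["buried", "buried", "exposed"]
    for lam in (1.0, 0.5, 0.0):
        print(lam, ungapped_alignment_score("ALK", "LLK", envs, lam))
```

In this sketch, setting lam to 1 gives pure sequence alignment scoring and setting it to 0 gives pure threading scoring. The mixed table is asymmetric whenever lam is less than 1, because the threading term depends on the probe residue and the template environment only.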

