UMD CMSC 838T - Profile analysis: Detection of distantly related proteins - D547931


School name University of Maryland, College Park

Course Cmsc 838t- Advanced Topics in Programming Languages

Pages 4



Proc. Natl. Acad. Sci. USA, Vol. 84, pp. 4355-4358, July 1987
Biochemistry

Profile analysis: Detection of distantly related proteins
(amino acid/sequence comparison/protein structure/globin structure/immunoglobulin structure)

MICHAEL GRIBSKOV*, ANDREW D. McLACHLAN†, AND DAVID EISENBERG*

*Molecular Biology Institute and Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90024; and †Medical Research Council, Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, England, United Kingdom

Communicated by Paul Boyer, February 17, 1987 (received for review November 19, 1986)

ABSTRACT Profile analysis is a method for detecting distantly related proteins by sequence comparison. The basis for comparison is not only the customary Dayhoff mutational-distance matrix but also the results of structural studies and information implicit in the alignments of the sequences of families of similar proteins. This information is expressed in a position-specific scoring table (profile), which is created from a group of sequences previously aligned by structural or sequence similarity. The similarity of any other sequence (target) to the group of aligned sequences (probe) can be tested by comparing the target to the profile using dynamic programming algorithms. The profile method differs in two major respects from methods of sequence comparison in common use: (i) Any number of known sequences can be used to construct the profile, allowing more information to be used in the testing of the target than is possible with pairwise alignment methods. (ii) The profile includes the penalties for insertion or deletion at each position, which allow one to include the probe secondary structure in the testing scheme. Tests with globin and immunoglobulin sequences show that profile analysis can distinguish all members of these families from all other sequences in a database containing 3800 protein sequences.

Our ability to determine the three-dimensional structures of proteins has been outstripped by our capacity to determine amino acid sequences from DNA sequences. New ways of inferring structure from sequence are needed, and a promising method is sequence comparison (1-3): if a newly discovered sequence is sufficiently similar to the sequence of a protein of known structure, we can infer that the two proteins have similar structures (e.g., see refs. 4 and 5). One problem in making such an inference is deciding what degree of sequence similarity is necessary to infer structural similarity (6, 7). A different problem, which we address in this paper, is the detection of similar but distantly related proteins. This problem is illustrated by the globin family (8). The globin polypeptide chains from organisms as diverse as humans, insects, and plants are folded in the same general three-dimensional pattern, yet there are only two positions within the some 150 residues of the chain that contain the same amino acid in all globins. That is, this "globin fold" is encoded in many different amino acid sequences, some differing from others in as many as 130 positions. Any single globin sequence represents just one realization of the globin fold. In attempting to decide whether an amino acid sequence encodes the globin fold, we need a pattern or "profile" that represents the fold. The profile described below represents the fold as a position-dependent scoring matrix, giving our best estimate of the likelihood that each amino acid can fit into the known fold.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Common methods for detection of similarity depend on pairwise alignment of sequences, for example the dot matrix method (9, 10) or dynamic programming methods (11-14).
Another class of methods comprises the rapid database searching methods (15, 16). All of these normally test every sequence in the database independently against a single probe sequence without using information implicit in the alignments of families of related sequences or including information available from structural studies. [An exception is the family comparison dot matrix method (9), which, however, does not allow for insertion or deletion.] Profile analysis brings in both structural and family information at the expense of a modest increase in computation time.

METHODS

Construction of the Profile (PROFMAKE). Profile analysis has two steps (Fig. 1a): (i) construction of the profile with the program PROFMAKE, and (ii) comparison of the profile with a database of sequences or a single sequence (program PROFANAL). The starting point for the creation of a profile is a sequence or group of sequences (the probe). This probe is usually a group of typical sequences of functionally related proteins that have been aligned by similarity in sequence or three-dimensional structure. Each sequence can be given a weight, which is useful when several of them are very similar. It is also possible to make a profile from a single sequence if additional information is used. The profile is a sequence position-specific scoring matrix M(p,a) composed of 21 columns and N rows (N = length of probe). The row p corresponds to a sequence position of the probe. The first 20 columns of each row specify the score for finding, at that position in the target, each of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position (Fig. 1b). In PROFMAKE, the profile is generated from the probe by using a comparison table derived from the mutational distance matrix (MDM78) of Dayhoff (17, 18).
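
As an illustration of the layout just described (one row per probe position, residue scores plus one insertion/deletion penalty), the following Python sketch scores a target against a small hand-made profile by dynamic programming. It is a minimal sketch under stated assumptions, not the PROFANAL program: the recurrence (local alignment with a single linear gap cost taken from the current profile row), the function name `score_target`, the three-letter alphabet, and all numerical values in `toy_profile` are hypothetical choices made only for this example.

```python
def score_target(profile, target):
    """Best local alignment score of `target` against `profile`.

    profile: list of (scores, gap_penalty) rows, one per probe position.
             `scores` maps residue -> score; `gap_penalty` is that
             position's insertion/deletion cost (the extra column).
    Assumption: a Smith-Waterman-style recurrence with a linear gap cost.
    """
    n, m = len(profile), len(target)
    # h[i][j] = best score of an alignment ending at profile row i, target residue j
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        scores, gap = profile[i - 1]
        for j in range(1, m + 1):
            match = h[i - 1][j - 1] + scores.get(target[j - 1], 0.0)
            skip_profile_row = h[i - 1][j] - gap   # profile position left unmatched
            skip_target_res = h[i][j - 1] - gap    # extra residue in the target
            h[i][j] = max(0.0, match, skip_profile_row, skip_target_res)
            best = max(best, h[i][j])
    return best


# Hypothetical 3-position profile over a toy alphabet {A, C, E}.
# Position 2 has a low gap penalty, as one might assign to a loop region.
toy_profile = [
    ({"A": 2.0, "C": -1.0, "E": -1.0}, 1.5),
    ({"A": -0.25, "C": 1.25, "E": -1.0}, 0.5),
    ({"A": -1.0, "C": -1.0, "E": 1.25}, 1.5),
]

print(score_target(toy_profile, "ACE"))  # target matching every position
print(score_target(toy_profile, "AE"))   # target lacking the middle position
```

Because each row carries its own penalty, a target that lacks the middle position loses only that position's match score and its small gap cost, which is how position-specific penalties let structural knowledge of the probe influence the comparison.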
The value of the profile for amino acid $a$ at position $p$ is $M(p,a) = \sum_{b=1}^{20} W(p,b) \times Y(a,b)$, where $Y(a,b)$ is Dayhoff's matrix and $W(p,b)$ is a weight for the appearance of amino acid $b$ at position $p$. This weight is determined as follows: Suppose that amino acid $b$ appears $n(b,p)$ times in position $p$ in the $N_R$ probe sequences. Then a simple average weight is given by $W(b,p) =$
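
The weighted sum that defines M(p,a) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PROFMAKE program. The preview text breaks off before the weight formula, so the sketch assumes the plain frequency n(b,p)/N_R as one natural reading of "simple average weight". The comparison table is a hypothetical match/mismatch stand-in rather than the MDM78 values, the alphabet is cut down to four letters, the probe alignment is invented and gap-free, and the gap-penalty column is omitted.

```python
from collections import Counter

ALPHABET = "ACDE"  # toy alphabet; the paper uses all 20 amino acids


def toy_comparison_table(alphabet, match=2.0, mismatch=-1.0):
    """Hypothetical stand-in for Y(a,b); these are NOT the Dayhoff MDM78 values."""
    return {a: {b: (match if a == b else mismatch) for b in alphabet}
            for a in alphabet}


def frequency_weights(column, alphabet):
    """W(p,b), ASSUMED here to be n(b,p) / N_R for one alignment column."""
    counts = Counter(column)   # n(b,p) for each residue b seen in the column
    n_r = len(column)          # N_R, the number of probe sequences
    return {b: counts[b] / n_r for b in alphabet}


def build_profile(aligned_probe, alphabet, y):
    """Return one row per probe position: M(p,a) = sum_b W(p,b) * Y(a,b)."""
    profile = []
    for p in range(len(aligned_probe[0])):
        column = [seq[p] for seq in aligned_probe]
        w = frequency_weights(column, alphabet)
        row = {a: sum(w[b] * y[a][b] for b in alphabet) for a in alphabet}
        profile.append(row)
    return profile


# Invented, gap-free alignment of four probe sequences.
probe = ["ACD", "ACE", "AAE", "ACE"]
y = toy_comparison_table(ALPHABET)
profile = build_profile(probe, ALPHABET, y)

for p, row in enumerate(profile, start=1):
    print(p, {a: round(score, 2) for a, score in row.items()})
```

In this toy case a fully conserved column gives the conserved residue the full match score, while a mixed column spreads the score across the residues observed there, so a residue never seen at a position can still score above the mismatch value when it is similar to the residues that were seen.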
