UMD CMSC 838T - Amino acid substitution matrices from protein blocks - D2101919


School name University of Maryland, College Park

Course Cmsc 838t- Advanced Topics in Programming Languages

Pages 6


Proc. Natl. Acad. Sci. USA
Vol. 89, pp. 10915-10919, November 1992
Biochemistry

Amino acid substitution matrices from protein blocks
(amino acid sequence/alignment algorithms/data base searching)

STEVEN HENIKOFF* AND JORJA G. HENIKOFF
Howard Hughes Medical Institute, Basic Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98104

Communicated by Walter Gilbert, August 28, 1992 (received for review July 13, 1992)

ABSTRACT Methods for alignment of protein sequences typically measure similarity by using a substitution matrix with scores for all possible exchanges of one amino acid with another. The most widely used matrices are based on the Dayhoff model of evolutionary rates. Using a different approach, we have derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins. This led to marked improvements in alignments and in searches using queries from each of the groups.

Among the most useful computer-based tools in modern biology are those that involve sequence alignments of proteins, since these alignments often provide important insights into gene and protein function. There are several different types of alignments: global alignments of pairs of proteins related by common ancestry throughout their lengths, local alignments involving related segments of proteins, multiple alignments of members of protein families, and alignments made during data base searches to detect homology. In each case, competing alignments are evaluated by using a scoring scheme for estimating similarity. Although several different scoring schemes have been proposed (1-6), the mutation data matrices of Dayhoff (1, 7-9) are generally considered the standard and are often the default in alignment and searching programs. In the Dayhoff model, substitution rates are derived from alignments of protein sequences that are at least 85% identical.
However, the most common task involving substitution matrices is the detection of much more distant relationships, which are only inferred from substitution rates in the Dayhoff model. Therefore, we wondered whether a better approach might be to use alignments in which these relationships are explicitly represented. An incentive for investigating this possibility is that implementation of an improved matrix in numerous important applications requires only trivial effort.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

METHODS

Deriving a Frequency Table from a Data Base of Blocks. Local alignments can be represented as ungapped blocks with each row a different protein segment and each column an aligned residue position. Previously, we described an automated system, PROTOMAT, for obtaining a set of blocks given a group of related proteins (10). This system was applied to a catalog of several hundred protein groups, yielding a data base of >2000 blocks. Consider a single block representing a conserved region of a protein family. For a new member of this family, we seek a set of scores for matches and mismatches that best favors a correct alignment with each of the other segments in the block relative to an incorrect alignment. For each column of the block, we first count the number of matches and mismatches of each type between the new sequence and every other sequence in the block. For example, if the residue of the new sequence that aligns with the first column of the first block is A and the column has 9 A residues and 1 S residue, then there are 9 AA matches and 1 AS mismatch. This procedure is repeated for all columns of all blocks with the summed results stored in a table. The new sequence is added to the group.
For another new sequence, the same procedure is followed, summing these numbers with those already in the table. Notice that successive addition of each sequence to the group leads to a table consisting of counts of all possible amino acid pairs in a column. For example, in the column consisting of 9 A residues and 1 S residue, there are 8 + 7 + ... + 1 = 36 possible AA pairs, 9 AS or SA pairs, and no SS pairs. Counts of all possible pairs in each column of each block in the data base are summed. So, if a block has a width of w amino acids and a depth of s sequences, it contributes ws(s - 1)/2 amino acid pairs to the count [(1 × 10 × 9)/2 = 45 in the above example]. The result of this counting is a frequency table listing the number of times each of the 20 + 19 + ... + 1 = 210 different amino acid pairs occurs among the blocks. The table is used to calculate a matrix representing the odds ratio between these observed frequencies and those expected by chance.

Computing a Logarithm of Odds (Lod) Matrix. Let the total number of amino acid i, j pairs ($1 \le j \le i \le 20$) for each entry of the frequency table be $f_{ij}$. Then the observed probability of occurrence for each i, j pair is

$$q_{ij} = f_{ij} \Big/ \sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}.$$

For the column of 9 A residues and 1 S residue in the example, where $f_{AA} = 36$ and $f_{AS} = 9$, $q_{AA} = 36/45 = 0.8$ and $q_{AS} = 9/45 = 0.2$. Next we estimate the expected probability of occurrence for each i, j pair. It is assumed that the observed pair frequencies are those of the population. For the example, 36 pairs have A in both positions of the pair and 9 pairs have A at only one of the two positions, so that the expected probability of A in a pair is [36 + (9/2)]/45 = 0.9 and that of S is (9/2)/45 = 0.1. In general, the probability of occurrence of the ith amino acid in an i, j pair is

$$p_i = q_{ii} + \sum_{j \ne i} q_{ij}/2.$$

The expected probability of occurrence $e_{ij}$ for each i, j pair is then $p_i p_j$ for $i = j$ and $p_i p_j + p_j p_i = 2 p_i p_j$ for $i \ne j$.
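The pair counting and the probability definitions above can be sketched in Python for the single example column of 9 A residues and 1 S residue. This is only an illustration under stated assumptions: the function names (`count_pairs`, `observed_probabilities`, `residue_probabilities`, `expected_probability`) are hypothetical and do not come from the paper, and a real frequency table would sum counts over every column of every block rather than one column.

```python
from collections import Counter
from itertools import combinations


def count_pairs(column):
    """Count unordered residue pairs among all sequences in one block column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        # Sort the two residues so that AS and SA share a single table entry.
        counts[tuple(sorted((a, b)))] += 1
    return counts


def observed_probabilities(counts):
    """q_ij: each pair count divided by the total number of pairs."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def residue_probabilities(q):
    """p_i: q_ii plus half of every q_ij in which residue i appears once."""
    p = Counter()
    for (a, b), prob in q.items():
        if a == b:
            p[a] += prob
        else:
            p[a] += prob / 2
            p[b] += prob / 2
    return dict(p)


def expected_probability(p, a, b):
    """e_ij: p_i * p_j for identical residues, 2 * p_i * p_j otherwise."""
    return p[a] * p[b] if a == b else 2 * p[a] * p[b]


# Hypothetical encoding of the example column: 9 A residues and 1 S residue.
column = "AAAAAAAAAS"
f = count_pairs(column)            # pair counts f_ij for this column
q = observed_probabilities(f)      # observed probabilities q_ij
p = residue_probabilities(q)       # residue probabilities p_i
e_AS = expected_probability(p, "A", "S")
```

With w = 1 and s = 10, the counts in `f` sum to ws(s - 1)/2 = 45, matching the block contribution given in the text.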
In the example, the expected probability of AA is 0.9 × 0.9 = 0.81, that of AS + SA is 2 × (0.9 × 0.1) = 0.18, and that of SS is 0.1 × 0.1 = 0.01. An odds ratio matrix is calculated where each entry is $q_{ij}/e_{ij}$. A lod ratio is then calculated in bit units as $s_{ij} = \log_2(q_{ij}/e_{ij})$. If the observed frequencies are as expected, $s_{ij} = 0$; if less than
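As a rough illustration of the lod calculation, the sketch below applies the base-2 log of the odds ratio to the observed and expected probabilities from the single-column example (0.8 and 0.2 observed; 0.81, 0.18, and 0.01 expected). The function name `lod_bits` is hypothetical, and returning `None` for a pair that was never observed is an assumption made here only to avoid taking the logarithm of zero; the text shown does not say how such entries are handled.

```python
import math


def lod_bits(q_ij, e_ij):
    """Log-odds score in bits: log2 of observed over expected probability."""
    if q_ij == 0:
        # Assumption: an unobserved pair has no defined score in this sketch.
        return None
    return math.log2(q_ij / e_ij)


s_AA = lod_bits(0.8, 0.81)   # observed slightly below expected
s_AS = lod_bits(0.2, 0.18)   # observed slightly above expected
s_SS = lod_bits(0.0, 0.01)   # SS pair never observed in the example column
```

In this toy column the AA score is slightly negative and the AS score slightly positive, which only reflects how small the example is, not the values in any published matrix.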
