UMD CMSC 838T - Amino Acid Substitution Matrices from an Information Theoretic Perspective


J. Mol. Biol. (1991) 219, 555-565

Amino Acid Substitution Matrices from an Information Theoretic Perspective

Stephen F. Altschul

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, U.S.A.

(Received 1 October 1990; accepted 12 February 1991)

Protein sequence alignments have become an important tool for molecular biologists. Local alignments are frequently constructed with the aid of a "substitution score matrix" that specifies a score for aligning each pair of amino acid residues. Over the years, many different substitution matrices have been proposed, based on a wide variety of rationales. Statistical results, however, demonstrate that any such matrix is implicitly a "log-odds" matrix, with a specific target distribution for aligned pairs of amino acid residues. In the light of information theory, it is possible to express the scores of a substitution matrix in bits and to see that different matrices are better adapted to different purposes. The most widely used matrix for protein sequence comparison has been the PAM-250 matrix. It is argued that for database searches the PAM-120 matrix generally is more appropriate, while for comparing two specific proteins with suspected homology the PAM-200 matrix is indicated. Examples discussed include the lipocalins, human α1B-glycoprotein, the cystic fibrosis transmembrane conductance regulator and the globins.

Keywords: homology; sequence comparison; statistical significance; alignment algorithms; pattern recognition

1. Introduction

General methods for protein sequence comparison were introduced to molecular biology 20 years ago and have since gained widespread use. Most early attempts to measure protein sequence similarity focused on global sequence alignments, in which every residue of the two sequences compared had to participate (Needleman & Wunsch, 1970; Sellers, 1974; Sankoff & Kruskal, 1983).
However, because distantly related proteins may share only isolated regions of similarity, e.g. in the vicinity of an active site, attention has shifted to local as opposed to global sequence similarity measures. The basic idea is to consider only relatively conserved subsequences; dissimilar regions do not contribute to or subtract from the measure of similarity. Local similarity may be studied in a variety of ways. These include measures based on the longest matching segments of two sequences with a specified number or proportion of mismatches (Arratia et al., 1986; Arratia & Waterman, 1989), as well as methods that compare all segments of a fixed, predefined "window" length (McLachlan, 1971). The most common practice, however, is to consider segments of all lengths, and choose those that optimize a similarity measure (Smith & Waterman, 1981; Goad & Kanehisa, 1982; Sellers, 1984). This has the advantage of placing no a priori restrictions on the length of the local alignments sought. Most database search methods have been based on such local alignments (Lipman & Pearson, 1985; Pearson & Lipman, 1988; Altschul et al., 1990).

To evaluate local alignments, scores generally are assigned to each aligned pair of residues (the set of such scores is called a substitution matrix), as well as to residues aligned with nulls; the score of the overall alignment is then taken to be the sum of these scores. Specifying an appropriate amino acid substitution matrix is central to protein comparison methods and much effort has been devoted to defining, analyzing and refining such matrices (McLachlan, 1971; Dayhoff et al., 1978; Schwartz & Dayhoff, 1978; Feng et al., 1985; Rao, 1987; Risler et al., 1988). One hope has been to find a matrix best adapted to distinguishing distant evolutionary relationships from chance similarities.
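As a minimal sketch of the scoring rule just described (the alignment score is the sum of the scores of its aligned pairs, including residues aligned with nulls), the following Python fragment may help. The score values, the null penalty, and the names `TOY_SCORES`, `NULL_SCORE`, `pair_score` and `alignment_score` are all hypothetical and chosen only for illustration; they are not taken from any PAM matrix or from the paper.

```python
from typing import Dict, Tuple

# Hypothetical toy substitution scores (illustration only, not PAM values).
# Only one orientation of each pair is stored; lookup is symmetric.
TOY_SCORES: Dict[Tuple[str, str], int] = {
    ("A", "A"): 2, ("A", "G"): 1, ("A", "W"): -6,
    ("G", "G"): 5, ("G", "W"): -7, ("W", "W"): 17,
}

# Hypothetical score for a residue aligned with a null, written "-".
NULL_SCORE = -8


def pair_score(a: str, b: str) -> int:
    """Score one aligned column: residue/residue or residue/null."""
    if a == "-" or b == "-":
        return NULL_SCORE
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]  # symmetric lookup


def alignment_score(row1: str, row2: str) -> int:
    """Sum the column scores of two equal-length alignment rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    return sum(pair_score(a, b) for a, b in zip(row1, row2))
```

For example, under these assumed values `alignment_score("AG-W", "AGAW")` adds the scores for A/A, G/G, a null column and W/W.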
Recent mathematical results (Karlin & Altschul, 1990; Karlin et al., 1990) allow all substitution matrices to be viewed in a common light, and provide a rationale for selecting particular sets of "optimal" scores for local protein sequence comparison.

2. The Statistical Significance of Local Sequence Alignments

Global alignments are of essentially no use unless they can allow gaps, but this is not true for local alignments. The ability to choose segments with arbitrary starting positions in each sequence means that biologically significant regions frequently may be aligned without the need to introduce gaps. While, in general, it is desirable to allow gaps in local alignments, doing so greatly decreases their mathematical tractability. The results described here apply rigorously only to local alignments that lack gaps, i.e. to segments of equal length from each of the two sequences compared. Some recent database search tools have focused on finding such alignments (Altschul & Lipman, 1990; Altschul et al., 1990). However, the statistics of optimal scores for local alignments that include gaps (Smith et al., 1985; Waterman et al., 1987) are broadly analogous to those for the no-gap case (Karlin & Altschul, 1990; Karlin et al., 1990), where more precise results are available. Therefore, one may hope that many of the basic ideas presented below will generalize to local alignments that include gaps.

Formally, we assume that the aligned amino acids a_i and a_j are assigned the substitution score s_ij. Given two protein sequences, the pair of equal length segments that, when aligned, have the greatest aggregate score we call the Maximal Segment Pair (MSP). An MSP may be of any length; its score is the MSP score. Since any two protein sequences, related or unrelated, will have some MSP score, it is important to know how great a score one can expect to find simply by chance. To address this question one needs some model of chance.
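The Maximal Segment Pair defined above can be found by direct search, which the following sketch illustrates. It assumes a simple approach that is not described in the text: every diagonal (fixed offset between the two sequences) is scanned with a maximum-sum contiguous run computation. The function names `msp` and `toy_score`, and the match/mismatch values of +2 and -1, are hypothetical and stand in for a real substitution matrix.

```python
from typing import Callable, Optional, Tuple


def toy_score(a: str, b: str) -> int:
    """Hypothetical substitution score: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


def msp(seq1: str, seq2: str,
        score: Callable[[str, str], int]) -> Optional[Tuple[int, int, int, int]]:
    """Return (msp_score, start_in_seq1, start_in_seq2, length).

    Gaps are not allowed, so each candidate segment pair lies on one
    diagonal. Returns None if either sequence is empty.
    """
    best = None
    n, m = len(seq1), len(seq2)
    for d in range(-(m - 1), n):          # d = (index in seq1) - (index in seq2)
        i, j = (d, 0) if d >= 0 else (0, -d)
        run_score = 0                     # best run ending at the previous column
        run_start = 0                     # offset along the diagonal where it began
        k = 0
        while i + k < n and j + k < m:
            s = score(seq1[i + k], seq2[j + k])
            if run_score <= 0:            # a non-positive prefix never helps
                run_score, run_start = s, k
            else:
                run_score += s
            if best is None or run_score > best[0]:
                best = (run_score, i + run_start, j + run_start,
                        k - run_start + 1)
            k += 1
    return best
```

With these assumed scores, `msp("MKGWGA", "PPGWGT", toy_score)` picks out the shared `GWG` segment. The running time is proportional to the product of the sequence lengths, which is adequate for illustration but not for database searches.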
The simplest is to assume that in the two proteins compared, the amino acid a_i appears randomly with the probability p_i. These probabilities are chosen to reflect the observed frequencies of the amino acids in actual proteins. For simplicity of discussion we will assume both proteins share the same amino acid probability distribution; more generally, one can allow them to have

