UMD CMSC 838T - Methods for assessing the statistical significance of molecular sequence


Proc. Natl. Acad. Sci. USA
Vol. 87, pp. 2264-2268, March 1990
Evolution

Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes
(sequence alignment/protein sequence features)

SAMUEL KARLIN* AND STEPHEN F. ALTSCHUL†

*Department of Mathematics, Stanford University, Stanford, CA 94305; and †National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894

Contributed by Samuel Karlin, December 26, 1989

ABSTRACT An unusual pattern in a nucleic acid or protein sequence or a region of strong similarity shared by two or more sequences may have biological significance. It is therefore desirable to know whether such a pattern can have arisen simply by chance. To identify interesting sequence patterns, appropriate scoring values can be assigned to the individual residues of a single sequence or to sets of residues when several sequences are compared. For single sequences, such scores can reflect biophysical properties such as charge, volume, hydrophobicity, or secondary structure potential; for multiple sequences, they can reflect nucleotide or amino acid similarity measured in a wide variety of ways. Using an appropriate random model, we present a theory that provides precise numerical formulas for assessing the statistical significance of any region with high aggregate score. A second class of results describes the composition of high-scoring segments. In certain contexts, these permit the choice of scoring systems which are “optimal” for distinguishing biologically relevant patterns. Examples are given of applications of the theory to a variety of protein sequences, highlighting segments with unusual biological features.
These include distinctive charge regions in transcription factors and protooncogene products, pronounced hydrophobic segments in various receptor and transport proteins, and statistically significant subalignments involving the recently characterized cystic fibrosis gene.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Nucleic acid and protein sequence analysis has become an important tool for the molecular biologist. Determining what is likely or unlikely to occur by chance may help in identifying sequence features of interest for experimental study. A pattern of potential interest in a protein sequence might be an unusual local concentration of charged residues or of potential glycosylation sites; a region of high similarity shared by two or more sequences might be evidence of evolutionary homology or of common function.

Statistical methods for evaluating sequence patterns can be based on theoretical models or on permutation reconstructions of the observed data (refs. 1-4; for a recent review on patterns in DNA and amino acid sequences and their statistical significance, see ref. 5). Here we use a “random” model appropriate to the data to provide a benchmark for analyzing various data statistics. The independence random model generates successive letters of a sequence in an independent fashion such that letter a_j is selected with probability p_j. In the case of proteins, the p_j are usually specified as the actual amino acid frequencies in the observed sequence. A random first-order Markov model prescribes p_jk as the conditional probability of sampling letter a_k following letter a_j. (In this case the p_jk would correspond to the observed diresidue frequencies in a protein sequence.) More complex random models could accommodate more elaborate long-range dependencies.
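The two random models described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`letter_frequencies`, `transition_probabilities`, `sample_independent`, `sample_markov`) are hypothetical, p_j is estimated as the observed letter frequency, and p_jk is estimated from counts of adjacent letter pairs, normalized per preceding letter. The fallback used when a letter has no observed successor is a choice made here, not something the paper specifies.

```python
import random
from collections import Counter, defaultdict

def letter_frequencies(seq):
    """Estimate p_j as the observed frequency of each letter in seq."""
    counts = Counter(seq)
    total = len(seq)
    return {letter: n / total for letter, n in counts.items()}

def transition_probabilities(seq):
    """Estimate p_jk (letter a_k following letter a_j) from adjacent-pair counts."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(seq, seq[1:]):
        pair_counts[prev][nxt] += 1
    return {
        prev: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for prev, row in pair_counts.items()
    }

def sample_independent(freqs, length, rng):
    """Independence model: each letter is drawn independently with probability p_j."""
    letters = sorted(freqs)
    weights = [freqs[a] for a in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))

def sample_markov(freqs, trans, length, rng):
    """First-order Markov model: first letter from p_j, later letters from p_jk."""
    if length == 0:
        return ""
    letters = sorted(freqs)
    weights = [freqs[a] for a in letters]
    current = rng.choices(letters, weights=weights, k=1)[0]
    out = [current]
    while len(out) < length:
        row = trans.get(current)
        if row:
            successors = sorted(row)
            current = rng.choices(successors, weights=[row[a] for a in successors], k=1)[0]
        else:
            # Assumption: a letter with no observed successor falls back to p_j.
            current = rng.choices(letters, weights=weights, k=1)[0]
        out.append(current)
    return "".join(out)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when random sequences serve as a benchmark for an observed statistic.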
For these models, theoretical results (distributional properties) have previously been obtained for a variety of sequence statistics such as the length of the longest run of a given letter or pattern (allowing for a fixed number of errors), the length of the longest word (oligonucleotide, peptide) in a sequence satisfying a prescribed relationship (e.g., r-fold repeat, dyad pairing), and counts and spacings of long repeats (5-14). Several of these analyses have been extended to deal with comparisons within and between multiple sequences, including the identification and statistical evaluation of long common words and multidimensional count occurrence distributions for various word relationships (e.g., refs. 5, 7, 8, 12).

One limitation to the applicability of these results has been their inability to allow for properties or mismatches that vary in degree. For example, in describing the charge or hydrophobicity of amino acid residues, it would be more informative to use different score levels, and when comparing sequences one may wish to count a mismatch between isoleucine and valine differently than a mismatch between glycine and tryptophan.

In this paper we describe a rigorous statistical theory that provides explicit formulas for characterizing significant sequence configurations with reference to a general scoring scheme. In particular, we determine the distribution of high aggregate segment scores and the distribution of the number of separate segments of significantly high score. A second class of results deals with the letter composition of high-scoring segments, which in certain contexts provides a method for choosing suitable scoring schemes.
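To make the notion of a high aggregate segment score concrete, the following Python sketch finds the contiguous segment with the largest total score under a per-letter scoring scheme, using a single running-sum pass. Everything named here is an assumption for illustration: the scoring values (+2 for K and R, -1 for every other residue) are hypothetical and not taken from the paper, and `empirical_p_value` estimates significance by shuffling the sequence, which is a simulation stand-in rather than the explicit formulas the paper derives.

```python
import random

# Hypothetical scoring scheme (an assumption, not from the paper):
# basic residues score +2, every other residue scores -1.
POSITIVE_CHARGE_SCORES = {"K": 2, "R": 2}
DEFAULT_SCORE = -1

def max_scoring_segment(seq, scores, default=DEFAULT_SCORE):
    """Return (best_score, start, end) for the segment seq[start:end] with the
    highest aggregate score, or (0, 0, 0) if no segment scores above zero."""
    best, best_start, best_end = 0, 0, 0
    running, run_start = 0, 0
    for i, letter in enumerate(seq):
        running += scores.get(letter, default)
        if running <= 0:
            # A non-positive prefix can never help; restart after position i.
            running, run_start = 0, i + 1
        elif running > best:
            best, best_start, best_end = running, run_start, i + 1
    return best, best_start, best_end

def empirical_p_value(seq, scores, trials, rng, default=DEFAULT_SCORE):
    """Fraction of shuffled sequences whose best segment score reaches the
    observed one (with the usual +1 correction so the estimate is never zero)."""
    observed, _, _ = max_scoring_segment(seq, scores, default)
    letters = list(seq)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        shuffled_best, _, _ = max_scoring_segment(letters, scores, default)
        if shuffled_best >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```

For example, `max_scoring_segment("GGKRKGG", POSITIVE_CHARGE_SCORES)` returns `(6, 2, 5)`, identifying the segment `KRK`. Shuffling preserves the letter composition of the sequence, so the estimate is conditional on the observed residue counts.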
We will discuss the theory in two primary contexts: (i) the analysis of a single protein sequence with the objective of identifying segments with statistically significant high scores for hydropathy strength, charge concentration, size profile, phosphorylation potential, or secondary structure propensity; (ii) multiple sequence comparisons for establishing evolutionary histories or protein segments with common function and/or structure.

Scoring assignments for nucleotides or amino acids may arise from a variety of considerations. Scoring criteria can be provided by biochemical properties (e.g., charge, hydrophobicity), physical properties (e.g., molecular weight,

