
Methods for Assessing the Statistical Significance of Molecular Sequence Features by Using General Scoring Schemes
Samuel Karlin† and Stephen F. Altschul‡
Proceedings of the National Academy of Sciences of the United States of America, Vol. 87, No. 6 (Mar., 1990), pp. 2264-2268.
Stable URL: http://links.jstor.org/sici?sici=0027-8424%28199003%2987%3A6%3C2264%3AMFATSS%3E2.0.CO%3B2-T

Proc. Natl. Acad. Sci. USA Vol. 87, pp. 2264-2268, March 1990. Evolution
(sequence alignment/protein sequence features)

†Department of Mathematics, Stanford University, Stanford, CA 94305; and ‡National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894

Contributed by Samuel Karlin, December 26, 1989

ABSTRACT An unusual pattern in a nucleic acid or protein sequence or a region of strong similarity shared by two or more sequences may have biological significance. It is therefore desirable to know whether such a pattern can have arisen simply by chance. To identify interesting sequence patterns, appropriate scoring values can be assigned to the individual residues of a single sequence or to sets of residues when several sequences are compared. For single sequences, such scores can reflect biophysical properties such as charge, volume, hydrophobicity, or secondary structure potential; for multiple sequences, they can reflect nucleotide or amino acid similarity measured in a wide variety of ways. Using an appropriate random model, we present a theory that provides precise numerical formulas for assessing the statistical significance of any region with high aggregate score. A second class of results describes the composition of high-scoring segments. In certain contexts, these permit the choice of scoring systems which are "optimal" for distinguishing biologically relevant patterns. Examples are given of applications of the theory to a variety of protein sequences, highlighting segments with unusual biological features. These include distinctive charge regions in transcription factors and protooncogene products, pronounced hydrophobic segments in various receptor and transport proteins, and statistically significant subalignments involving the recently characterized cystic fibrosis gene.
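The idea of assigning a score to each residue and looking for a region with a high aggregate score can be illustrated with a short Python sketch. It is not taken from the paper: the function name max_scoring_segment, the score values (+2 for the acidic residues D and E, -1 for everything else), and the test sequence are all assumptions chosen for illustration, and the search uses the standard linear-time maximum-subarray scan rather than any procedure stated in the text.

```python
def max_scoring_segment(sequence, scores, default=-1):
    """Return (best_score, start, end) for the highest-scoring contiguous
    segment of `sequence`; `end` is exclusive.

    `scores` maps a residue letter to its score; letters not listed get
    `default`. If no segment has a positive score, (0, 0, 0) is returned.
    """
    best_score, best_start, best_end = 0, 0, 0
    running, run_start = 0, 0
    for i, residue in enumerate(sequence):
        running += scores.get(residue, default)
        if running <= 0:
            # A non-positive prefix can never help a later segment: restart.
            running, run_start = 0, i + 1
        elif running > best_score:
            best_score, best_start, best_end = running, run_start, i + 1
    return best_score, best_start, best_end


# Hypothetical scoring scheme: reward acidic residues, penalize the rest.
ACIDIC_SCORES = {"D": 2, "E": 2}

result = max_scoring_segment("AKDEEDGLDA", ACIDIC_SCORES)
print(result)  # (8, 2, 6), i.e. the segment "DEED"
```

The sketch only locates the segment and reports its score; deciding whether that score is unusual under a random model is the question the theory in the paper addresses.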
Nucleic acid and protein sequence analysis has become an important tool for the molecular biologist. Determining what is likely or unlikely to occur by chance may help in identifying sequence features of interest for experimental study. A pattern of potential interest in a protein sequence might be an unusual local concentration of charged residues or of potential glycosylation sites; a region of high similarity shared by two or more sequences might be evidence of evolutionary homology or of common function.

Statistical methods for evaluating sequence patterns can be based on theoretical models or on permutation reconstructions of the observed data (refs. 1-4; for a recent review on patterns in DNA and amino acid sequences and their statistical significance, see ref. 5). Here we use a "random" model appropriate to the data to provide a benchmark for analyzing various data statistics. The independence random model generates successive letters of a sequence in an independent fashion such that letter a_j is selected with probability p_j. In the case of proteins, the p_j are usually specified as the actual amino acid frequencies in the observed sequence. A random first-order Markov model prescribes p_jk as the conditional probability of sampling letter a_k following letter a_j. (In this case the p_jk would correspond to the observed diresidue frequencies in a protein sequence.) More complex random models could accommodate more elaborate long-range dependencies.
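The two random models just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: all function names are hypothetical, the letter probabilities are estimated as frequencies in an observed sequence, the conditional probabilities are estimated from counts of adjacent letter pairs, and the fallback used when a letter has no observed successor is an assumption of this sketch.

```python
import random
from collections import Counter, defaultdict


def letter_frequencies(seq):
    """Estimate each letter's probability as its frequency in `seq`."""
    counts = Counter(seq)
    return {a: c / len(seq) for a, c in counts.items()}


def transition_probabilities(seq):
    """Estimate P(next letter | current letter) from adjacent pairs in `seq`."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(seq, seq[1:]):
        pair_counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(cnt.values()) for nxt, c in cnt.items()}
        for prev, cnt in pair_counts.items()
    }


def draw(dist, rng):
    """Draw one letter from a {letter: probability} dictionary."""
    letters = list(dist)
    return rng.choices(letters, weights=[dist[a] for a in letters], k=1)[0]


def sample_independent(p, length, rng):
    """Independence model: every position is drawn separately from p."""
    return "".join(draw(p, rng) for _ in range(length))


def sample_markov(p, trans, length, rng):
    """First-order Markov model: each letter depends on the one before it."""
    if length <= 0:
        return ""
    out = [draw(p, rng)]  # first letter drawn from the marginal frequencies
    while len(out) < length:
        row = trans.get(out[-1])
        # Assumption: if the current letter was never followed by anything
        # in the observed data, fall back to the marginal frequencies.
        out.append(draw(row if row else p, rng))
    return "".join(out)


observed = "MKKLDEEDKKAEEL"  # made-up sequence, for illustration only
rng = random.Random(0)       # fixed seed so runs are repeatable
p = letter_frequencies(observed)
trans = transition_probabilities(observed)
print(sample_independent(p, 20, rng))
print(sample_markov(p, trans, 20, rng))
```

Sequences drawn this way provide the benchmark against which a statistic computed on the observed sequence can be compared.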
For these models, theoretical results (distributional properties) have previously been obtained for a variety of sequence statistics such as the length of the longest run of a given letter or pattern (allowing for a fixed number of errors), the length of the longest word (oligonucleotide, peptide) in a sequence satisfying a prescribed relationship (e.g., r-fold repeat, dyad pairing), and counts and spacings of long repeats (5-14). Several of these analyses have been extended to deal with comparisons within and between multiple sequences, including the identification and statistical evaluation of long common words and multidimensional count occurrence distributions for various word relationships (e.g., refs. 5, 7, 8, 12). One limitation to the applicability of these results has been their inability to allow for properties or mismatches that vary in degree. For example, in describing the charge or hydrophobicity of amino

