Stanford CS 374 - Study Notes


A Model of the Statistical Power of Comparative Genome Sequence Analysis

Sean R. Eddy
Howard Hughes Medical Institute and Department of Genetics, Washington University School of Medicine, Saint Louis, Missouri, United States of America

Comparative genome sequence analysis is powerful, but sequencing genomes is expensive. It is desirable to be able to predict how many genomes are needed for comparative genomics, and at what evolutionary distances. Here I describe a simple mathematical model for the common problem of identifying conserved sequences. The model leads to some useful rules of thumb. For a given evolutionary distance, the number of comparative genomes needed for a constant level of statistical stringency in identifying conserved regions scales inversely with the size of the conserved feature to be detected. At short evolutionary distances, the number of comparative genomes required also scales inversely with distance. These scaling behaviors provide some intuition for future comparative genome sequencing needs, such as the proposed use of "phylogenetic shadowing" methods using closely related comparative genomes, and the feasibility of high-resolution detection of small conserved features.

Citation: Eddy SR (2005) A model of the statistical power of comparative genome sequence analysis. PLoS Biol 3(1): e10.

Introduction

Comparative genome sequence analysis is a powerful means of identifying functional DNA sequences by their evolutionary conservation [1,2,3]. It will be instrumental for achieving the goal of the Human Genome Project to comprehensively identify functional elements in the human genome [4]. How many comparative genome sequences do we need? Where is the point of diminishing returns, after which sequencing another koala or bat does not contribute significant information to human genome analysis?
Since sequencing is expensive and capacity remains limited, one would like to address this issue as rigorously as possible.

Empirical evaluations of candidate comparative genomes have become important in allocating sequencing resources. Pilot sequencing and analysis in Saccharomyces and Drosophila species were done to choose appropriate species for comparative genome sequencing [5,6]. A pilot sequencing effort is underway for a number of mammalian genomes to evaluate their utility for human genome analysis [4]. Given the complexity of genomes, empirical studies are necessary. However, one would also like to complement this with higher-level, general insights that are independent of the details of particular analysis programs, organisms, and genomic features.

Cooper et al. proposed a mathematical model of one important type of comparative genome analysis [7]. They framed a question amenable to quantitative modeling: how many comparative genomes, and at what distances, are required to detect that an individual base in a target genome is "neutral" (inferred to be evolving at the neutral rate) as opposed to "conserved" (inferred to be under purifying selection)? Their model infers a nucleotide site to be conserved if it is 100% identical to homologous sites in N comparative genomes. The key parameters are the independent branch lengths ($d_i$) contributed to a phylogeny by each new comparative genome ($i$), measured in neutral substitutions per site. More neutral evolutionary distance makes it more likely that neutral sites will have one or more substitutions in the alignment. Analytical strength increases as a function of the total neutral branch length in the phylogeny ($\sum_i d_i$), because the probability that a neutral site has no changes in any branch of the phylogeny (and thus would be misclassified as conserved) is taken to be approximately $e^{-\sum_i d_i}$.
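
As a rough illustration of the invariant-site model just described, the following Python sketch computes the approximate false positive probability $e^{-\sum_i d_i}$ from a list of independent branch lengths, and inverts the same formula to find the total branch length needed for a target false positive probability. The function names and the example branch lengths are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math


def false_positive_probability(branch_lengths):
    """Approximate probability that a neutral site shows no substitution
    on any branch, i.e. exp(-sum of d_i), with each d_i measured in
    neutral substitutions per site."""
    total_branch_length = sum(branch_lengths)
    return math.exp(-total_branch_length)


def required_total_branch_length(target_fp):
    """Total neutral branch length at which exp(-total) equals target_fp."""
    if not 0.0 < target_fp <= 1.0:
        raise ValueError("target_fp must be in (0, 1]")
    return -math.log(target_fp)


if __name__ == "__main__":
    # Hypothetical independent branch lengths for four comparative genomes.
    example_branch_lengths = [0.1, 0.25, 0.4, 0.15]
    fp = false_positive_probability(example_branch_lengths)
    print(f"total branch length = {sum(example_branch_lengths):.2f}")
    print(f"approximate FP per site = {fp:.4f}")
    print(f"branch length needed for FP = 0.01: "
          f"{required_total_branch_length(0.01):.2f}")
```

Because the approximation depends only on the sum of the branch lengths, adding a genome helps only to the extent that it contributes new, independent branch length to the phylogeny.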
Based on the model, they concluded that 5.0 neutral substitutions/site of total branch length (about 10–20 well-chosen mammalian genomes) would approach "single nucleotide resolution" for human genome analysis, with a false positive probability (FP) of less than $e^{-5.0}$ per invariant site.

This model has some limitations that seem serious enough to question the proposed target of 10–20 mammalian genomes. Most importantly, it assumes that conserved sites are invariant. Few conserved features are absolutely invariant. If invariance is required to infer conservation, the fraction of truly conserved sites that are wrongly inferred to be neutral (because a substitution is seen in one of the comparative genomes) asymptotically approaches one as the number of comparative genomes or their evolutionary distance increases. We want to consider not just our FP, but our statistical power: our ability to successfully detect features that are conserved.

Additionally, single nucleotide resolution may not be the most relevant goal. It is useful to consider single nucleotide resolution as an ultimate limit on comparative analyses (one can imagine plausible analyses of single bases, and certainly individual codons), but we are mostly concerned with identifying conserved features of greater length, such as exons or transcription factor binding sites.

Received June 9, 2004; Accepted November 2, 2004; Published January 4, 2005
DOI: 10.1371/journal.pbio.0030010
Copyright: © 2005 Sean R. Eddy. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abbreviations: FP, false positive probability; FN, false negative probability; HKY, Hasegawa-Kishino-Yano; LLR, log likelihood ratio
Academic Editor: Ross C. Hardison, Pennsylvania State University, United States of America
E-mail: [email protected]
PLoS Biology | www.plosbiology.org | January 2005 | Volume 3 | Issue 1 | e10. Open access, freely available online.

Nonetheless, the level of abstraction introduced by Cooper et al. is attractive. There is a need for better intuitions for planning comparative genome sequencing. How many more comparative genomes are needed as one looks for smaller and smaller conserved features, from exons to regulatory sites to single codons or even single nucleotides? How many more genomes are needed as one uses more and more closely related comparative genomes, in order to improve the chances that homologous lineage-specific features are found and correctly aligned [8,9]? Precise answers will be elusive, because genome biology is complex, but perhaps there are rough, useful scaling relationships amongst comparative genome number,

