Stanford CS 374 - Subtree power analysis and species selection for comparative genomics

Subtree power analysis and species selection for comparative genomics

Jon D. McAuliffe†, Michael I. Jordan†‡, and Lior Pachter§¶

Departments of †Statistics and §Mathematics and ‡Division of Computer Science, University of California, Berkeley, CA 94720

Communicated by Peter J. Bickel, University of California, Berkeley, CA, April 6, 2005 (received for review December 13, 2004)

Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. Analysis of a phylogenetic star topology shows theoretically that the optimal species subset is not in general the most evolutionarily diverged subset. We then demonstrate this finding empirically in a study of vertebrate species. Our results suggest that marsupials are prime sequencing candidates.

hypothesis testing | likelihood ratio | sequence analysis

Comparative genomic methods can reveal conserved regions in multiple organisms, including functional elements undetected by single-sequence analyses (1, 2). Individual studies have demonstrated the effectiveness of genomic comparison for specific regions and elements (3–7). Such successes indicate that comparative considerations should play a major role in decisions about what unsequenced species to sequence next. For comparative purposes, sequencing choices must first of all be guided by specification of the widest range of species sharing the function or character in question, which we call the lineal scope (8). Boffelli et al. (9) discuss the utility of comparisons in lineal scopes ranging from the primate clade to the vertebrate tree.

Most lineal scopes selected in practice will include far more extant species than can be sequenced with today’s resources. Thus, sequencing prioritization is an unavoidable issue, both for smaller-scale efforts targeting particular regions and for whole-genome projects, whose focus should reflect in part the aggregate needs of comparative analyses. Few studies on comparative methods provide a quantitative framework for decision-making about what to sequence. An exception is the work of Sidow and others (10, 11): given a set of sequenced organisms and an inferred phylogeny, Cooper et al. (10) argue that decisions should be based on maximizing additive evolutionary divergence in a phylogenetic tree.

While additive divergence captures part of the problem underlying organism choice, it fails to reflect the inherent tradeoff that characterizes the problem. On the one hand, the success of procedures for assessing conservation does depend on sufficient evolutionary distance among the sequences (4, 5, 12). On the other hand, a given set of species may have diverged too far from one another to be useful, even when orthology is preserved: in the limit of large evolutionary distance, conservation and nonconservation are just as indistinguishable as at distance zero (13). Furthermore, phylogenetic topology affects the power of comparative methods in counterintuitive ways.

Here, we present a decision-theoretic framework that captures these issues, providing a procedure for making systematic quantitative choices of species to sequence. Statistical power is our optimality criterion for species selection. Thus, we measure the effectiveness of a species subset directly in terms of error rates for detecting and overlooking conservation at a single orthologous site.
Measuring power disentangles effects due to the number of species used from effects due to relative evolutionary distances in the phylogeny. We illustrate these ideas theoretically, in an analysis of a star phylogeny, and practically, with an empirically derived phylogeny on 21 representative vertebrate species. The results indicate that adding the dunnart or a closely related marsupial to finished and underway vertebrate sequences would most increase the power to detect conservation at single-nucleotide resolution.

Decision-Theoretic Setting

We study conservation detection in the following decision-theoretic setting. The data x are the nucleotides at an orthologous site across a set of species, i.e., an ungapped alignment column. We view these bases as corresponding to the leaves of a phylogeny with unobserved ancestral bases. We take as given the phylogenetic topology, the Markov substitution process along the branches, and the branch lengths. The phylogeny induces the observed-data probability distribution p(x; r) as the marginal distribution on its leaves, which can be evaluated efficiently for any x and r (14). The parameter r > 0 is an unknown global mutation rate shared among all branches. We choose two threshold values r_N > r_C for r: an actual mutation rate of at least r_N corresponds by definition to a nonconserved site, whereas a rate no more than r_C means the site is strongly conserved. When r_N > r > r_C, the conservation is too weak to interest us.

The decision-theoretic goals are now twofold. First, fixing a set of species, we wish to select a decision rule δ(x) that declares the site either nonconserved (δ(x) = 0) or conserved (δ(x) = 1) by using only data from those species. Every nontrivial δ(x) will have positive probability of making two mistakes: when r ≥ r_N, Pr(δ(X) = 1) is the probability it erroneously detects conservation, and when r ≤ r_C, Pr(δ(X) = 0) is the probability it overlooks conservation. Minimizing these probabilities guides our choice of δ(x).
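
To make the marginal distribution p(x; r) concrete, here is a minimal sketch for the simplest case, a star phylogeny. It assumes a Jukes-Cantor substitution process, a uniform distribution on the root base, and branch lengths measured in expected substitutions per site at r = 1; none of these choices is fixed by the text above. The names `jc_prob`, `star_likelihood`, `prob_declared_conserved`, and `unanimous` are hypothetical. The second function evaluates Pr(δ(X) = 1) for a given rule by enumerating all 4^n columns, which is only practical for a handful of species.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(a, b, d):
    """Jukes-Cantor probability that base a is read as base b after
    d expected substitutions per site (assumed model, not from the text)."""
    e = math.exp(-4.0 * d / 3.0)
    return (0.25 + 0.75 * e) if a == b else (0.25 - 0.25 * e)

def star_likelihood(column, branch_lengths, r):
    """p(x; r) on a star tree: sum over the unobserved root base of the
    product of leaf probabilities, with every branch length scaled by r."""
    total = 0.0
    for root in BASES:
        term = 0.25  # uniform root distribution (assumption)
        for base, t in zip(column, branch_lengths):
            term *= jc_prob(root, base, r * t)
        total += term
    return total

def prob_declared_conserved(rule, branch_lengths, r):
    """Pr(rule(X) = 1) at rate r, by summing p(x; r) over all columns
    that the rule declares conserved."""
    n = len(branch_lengths)
    return sum(
        star_likelihood(col, branch_lengths, r)
        for col in product(BASES, repeat=n)
        if rule(col) == 1
    )

def unanimous(column):
    """Illustrative decision rule: declare conserved only if all bases agree."""
    return 1 if len(set(column)) == 1 else 0
```

With the illustrative rule `unanimous`, `prob_declared_conserved(unanimous, lengths, r)` evaluated at a rate r ≥ r_N is a false-detection probability, and one minus the same quantity at a rate r ≤ r_C is the probability of overlooking conservation.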
We formulate a Neyman–Pearson hypothesis test (15) of the null hypothesis H_0: r ≥ r_N versus the alternative hypothesis H_A: r ≤ r_C, stipulating a maximum allowed probability α of falsely rejecting H_0 (falsely declaring conservation). While control of this error probability is a central concern (10), we also want to find a test δ(x) with large power to detect conservation, or equivalently small probability of overlooking conservation.

The second goal is to maximize the power of δ(x) over the choice of species subset in the larger phylogeny determined by the chosen lineal scope. This amounts to choosing a subtree in the phylogeny, with the chosen species as its leaves. The choice of subtree
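
The test and the subset search described above can be sketched in the star-phylogeny setting. The code below is an illustration under stated assumptions, not the authors' procedure: it uses a Jukes-Cantor model with a uniform root, replaces the composite hypotheses by the boundary rates r = r_N and r = r_C, and randomizes on the boundary column so that the test has size exactly α. The function names and the exhaustive search over k-leaf subsets are hypothetical choices made for illustration.

```python
import math
from itertools import combinations, product

BASES = "ACGT"

def star_likelihood(column, branch_lengths, r):
    """p(x; r) on a star tree under Jukes-Cantor (assumed model)."""
    total = 0.0
    for root in BASES:
        term = 0.25  # uniform root distribution (assumption)
        for base, t in zip(column, branch_lengths):
            e = math.exp(-4.0 * r * t / 3.0)
            term *= (0.25 + 0.75 * e) if base == root else (0.25 - 0.25 * e)
        total += term
    return total

def np_power(branch_lengths, r_n, r_c, alpha):
    """Power at r = r_c of the size-alpha likelihood ratio test of
    r = r_n (null) against r = r_c (alternative)."""
    scored = []
    for col in product(BASES, repeat=len(branch_lengths)):
        p_null = star_likelihood(col, branch_lengths, r_n)
        p_alt = star_likelihood(col, branch_lengths, r_c)
        scored.append((p_alt / p_null, p_null, p_alt))
    scored.sort(reverse=True)  # largest likelihood ratio first

    size = power = 0.0
    for _, p_null, p_alt in scored:
        if size + p_null <= alpha:
            # Column fits entirely inside the rejection region.
            size += p_null
            power += p_alt
        else:
            # Reject on this boundary column with probability gamma
            # so that the total size equals alpha.
            gamma = (alpha - size) / p_null
            power += gamma * p_alt
            break
    return power

def best_subset(branch_lengths, k, r_n, r_c, alpha):
    """Indices of the k leaves whose star subtree gives the highest power."""
    return max(
        combinations(range(len(branch_lengths)), k),
        key=lambda idx: np_power([branch_lengths[i] for i in idx], r_n, r_c, alpha),
    )
```

For example, `best_subset([0.2, 0.4, 0.6, 0.8], 2, r_n=1.0, r_c=0.1, alpha=0.05)` returns the index pair with the highest power among the six two-leaf subtrees. Because the leaves of a star tree hang directly off the root, dropping a leaf leaves the other branch lengths unchanged; on a general topology the branch lengths of the subtree would have to be recomputed.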

