Stanford CS 374 - Subtree power analysis and species selection for comparative genomics - D1926485


School name Stanford University

Course Cs 374- Algorithms in Biology

Pages 6

This preview shows page 1-2 out of 6 pages.

Subtree power analysis and species selection for comparative genomics

Jon D. McAuliffe†, Michael I. Jordan†‡, and Lior Pachter§¶

Departments of †Statistics and §Mathematics and ‡Division of Computer Science, University of California, Berkeley, CA 94720

Communicated by Peter J. Bickel, University of California, Berkeley, CA, April 6, 2005 (received for review December 13, 2004)

Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. Analysis of a phylogenetic star topology shows theoretically that the optimal species subset is not in general the most evolutionarily diverged subset. We then demonstrate this finding empirically in a study of vertebrate species. Our results suggest that marsupials are prime sequencing candidates.

hypothesis testing | likelihood ratio | sequence analysis

Comparative genomic methods can reveal conserved regions in multiple organisms, including functional elements undetected by single-sequence analyses (1, 2). Individual studies have demonstrated the effectiveness of genomic comparison for specific regions and elements (3–7). Such successes indicate that comparative considerations should play a major role in decisions about what unsequenced species to sequence next. For comparative purposes, sequencing choices must first of all be guided by specification of the widest range of species sharing the function or character in question, which we call the lineal scope (8). Boffelli et al.
(9) discuss the utility of comparisons in lineal scopes ranging from the primate clade to the vertebrate tree.

Most lineal scopes selected in practice will include far more extant species than can be sequenced with today’s resources. Thus, sequencing prioritization is an unavoidable issue, both for smaller-scale efforts targeting particular regions and for whole-genome projects, whose focus should reflect in part the aggregate needs of comparative analyses. Few studies on comparative methods provide a quantitative framework for decision-making about what to sequence. An exception is the work of Sidow and others (10, 11): given a set of sequenced organisms and an inferred phylogeny, Cooper et al. (10) argue that decisions should be based on maximizing additive evolutionary divergence in a phylogenetic tree.

While additive divergence captures part of the problem underlying organism choice, it fails to reflect the inherent tradeoff that characterizes the problem. On the one hand, the success of procedures for assessing conservation does depend on sufficient evolutionary distance among the sequences (4, 5, 12). On the other hand, a given set of species may have diverged too far from one another to be useful, even when orthology is preserved: in the limit of large evolutionary distance, conservation and nonconservation are just as indistinguishable as at distance zero (13). Furthermore, phylogenetic topology affects the power of comparative methods in counterintuitive ways.

Here, we present a decision-theoretic framework that captures these issues, providing a procedure for making systematic quantitative choices of species to sequence. Statistical power is our optimality criterion for species selection. Thus, we measure the effectiveness of a species subset directly in terms of error rates for detecting and overlooking conservation at a single orthologous site.
Measuring power disentangles effects due to the number of species used from effects due to relative evolutionary distances in the phylogeny. We illustrate these ideas theoretically, in an analysis of a star phylogeny, and practically, with an empirically derived phylogeny on 21 representative vertebrate species. The results indicate that adding the dunnart or a closely related marsupial to finished and underway vertebrate sequences would most increase the power to detect conservation at single-nucleotide resolution.

Decision-Theoretic Setting

We study conservation detection in the following decision-theoretic setting. The data x are the nucleotides at an orthologous site across a set of species, i.e., an ungapped alignment column. We view these bases as corresponding to the leaves of a phylogeny with unobserved ancestral bases. We take as given the phylogenetic topology, the Markov substitution process along the branches, and the branch lengths. The phylogeny induces the observed-data probability distribution p(x; r) as the marginal distribution on its leaves, which can be evaluated efficiently for any x and r (14). The parameter r > 0 is an unknown global mutation rate shared among all branches. We choose two threshold values r_N > r_C for r: an actual mutation rate of at least r_N corresponds by definition to a nonconserved site, whereas a rate no more than r_C means the site is strongly conserved. When r_N > r > r_C, the conservation is too weak to interest us.

The decision-theoretic goals are now twofold. First, fixing a set of species, we wish to select a decision rule δ(x) that declares the site either nonconserved (δ(x) = 0) or conserved (δ(x) = 1) by using only data from those species. Every nontrivial δ(x) will have positive probability of making two mistakes: when r ≥ r_N, Pr(δ(X) = 1) is the probability it erroneously detects conservation, and when r ≤ r_C, Pr(δ(X) = 0) is the probability it overlooks conservation. Minimizing these probabilities guides our choice of δ(x).
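To make the observed-data distribution p(x; r) from the Decision-Theoretic Setting concrete, the following is a minimal sketch for a star phylogeny. The excerpt does not name the Markov substitution process, so the Jukes-Cantor (JC69) model, the uniform root distribution, the branch lengths, and the function names `jc69_transition` and `star_likelihood` are all illustrative assumptions rather than the authors' choices.

```python
import itertools
import math

BASES = "ACGT"

def jc69_transition(same, branch_length, rate):
    """Assumed JC69 model: probability that a branch ends at the same base
    (same=True) or at one specific different base (same=False)."""
    decay = math.exp(-4.0 * rate * branch_length / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay

def star_likelihood(column, branch_lengths, rate):
    """p(x; r) on a star tree: sum over the unobserved root base of the
    product of leaf probabilities, with a uniform root distribution."""
    total = 0.0
    for root in BASES:
        prob = 0.25
        for base, t in zip(column, branch_lengths):
            prob *= jc69_transition(base == root, t, rate)
        total += prob
    return total

# Hypothetical three-species star tree with equal branch lengths.
branch_lengths = [0.3, 0.3, 0.3]
columns = ["".join(c) for c in itertools.product(BASES, repeat=3)]

# p(x; r) is a probability distribution over alignment columns.
total_mass = sum(star_likelihood(c, branch_lengths, rate=1.0) for c in columns)

# A fully identical column is more probable at a low rate than at a high one.
p_conserved_slow = star_likelihood("AAA", branch_lengths, rate=0.2)
p_conserved_fast = star_likelihood("AAA", branch_lengths, rate=1.0)
print(total_mass, p_conserved_slow, p_conserved_fast)
```

On a general tree the same marginal would be computed by a recursive pass over the internal nodes; the star tree needs only the single sum over the root base shown here.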
We formulate a Neyman–Pearson hypothesis test (15) of the null hypothesis H_0: r ≥ r_N versus the alternative hypothesis H_A: r ≤ r_C, stipulating a maximum allowed probability α of falsely rejecting H_0 (falsely declaring conservation). While control of this error probability is a central concern (10), we also want to find a test δ(x) with large power to detect conservation, or equivalently small probability of overlooking conservation.

The second goal is to maximize the power of δ(x) over the choice of species subset in the larger phylogeny determined by the chosen lineal scope. This amounts to choosing a subtree in the phylogeny, with the chosen species as its leaves. The choice of subtree
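As a rough illustration of the hypothesis test and of power as a selection criterion, the sketch below computes the exact power of a likelihood-ratio test on a star tree with equal branch lengths. It rests on several simplifying assumptions that are not taken from the paper: a JC69 substitution model, a simple-versus-simple comparison at the boundary rates r = r_N and r = r_C instead of the full composite hypotheses, and a non-randomized rejection region, so the actual size can fall below α. The names `pattern_prob` and `lrt_power` and all numeric settings are hypothetical.

```python
import itertools
import math
from collections import defaultdict

def pattern_prob(counts, t, rate):
    """Assumed JC69 star tree, equal branch length t (t > 0): probability
    of one column whose four base counts are `counts`."""
    decay = math.exp(-4.0 * rate * t / 3.0)
    p_same, p_diff = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    n = sum(counts)
    # Sum over the unobserved root base (uniform root distribution).
    return sum(0.25 * p_same**c * p_diff**(n - c) for c in counts)

def lrt_power(n_species, t, r_cons, r_non, alpha):
    """Power at r = r_cons of a non-randomized likelihood-ratio test of
    H_0: r = r_non with size at most alpha."""
    # Columns with the same sorted base counts share a likelihood ratio,
    # so they are pooled into one tier: [null mass, alternative mass].
    tiers = defaultdict(lambda: [0.0, 0.0])
    for column in itertools.product(range(4), repeat=n_species):
        counts = tuple(sorted(column.count(b) for b in range(4)))
        tiers[counts][0] += pattern_prob(counts, t, r_non)
        tiers[counts][1] += pattern_prob(counts, t, r_cons)
    # Reject H_0 on the tiers with the largest ratio while size <= alpha.
    ranked = sorted(tiers.values(), key=lambda m: m[1] / m[0], reverse=True)
    size = power = 0.0
    for null_mass, alt_mass in ranked:
        if size + null_mass > alpha:
            break
        size += null_mass
        power += alt_mass
    return power

# Hypothetical settings: 5 species, r_C = 0.2, r_N = 1.0, alpha = 0.05.
powers = {t: lrt_power(5, t, r_cons=0.2, r_non=1.0, alpha=0.05)
          for t in (0.01, 1.0, 10.0)}
print(powers)
```

Under these assumed settings the power is zero at the shortest branch length, because the all-identical column is too probable under H_0 to serve as a rejection region, small at the longest, and largest at the intermediate length. This matches the tradeoff described in the introduction, where very close and very distant species are both uninformative.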