Stanford STATS 345 - Sequence Data and Multiple Alignments - D1804610

Home> Schools> Stanford University> Statistics (STATS) > STATS 345> Sequence Data and Multiple Alignments

Stanford STATS 345 - Sequence Data and Multiple Alignments

Course Stats 345- Statistical And Machine Learning Methods For Genomics (bio 268, Biomedin 245, Cs 373, Gene 245)

Pages 41

Download Save

Unformatted text preview:

Supplementary Material1 Sequence Data and Multiple AlignmentsThe five vertebrate, four insect, two worm, and seven yeast genomes used in the analysis are summarizedin Table S1, and the four genome-wide multiple alignments are summarized in Table S2. All four multiplealignments are downloadable as supplementary data from http://www.cse.ucsc.edu/∼acs/conservation. Con-tinually updated versions are also browsable and downloadable via the UCSC Genome Browser (http://gen-ome.ucsc.edu).2 Prediction of Conserved Elements Using PhastCons2.1 The ModelLet the nonconserved phylogenetic model be defined as ψn= (Q, π, τ , β), where Q is a 4 × 4 substitutionrate matrix indicating the instantaneous rate of replacement of each nucleotide for each other nucleotide, πis a vector of background (equilibrium) probabilities for the four bases, τ is a binary tree representing thetopology of the phylogeny, and β is a vector of non-negative real numbers representing the branch lengthsof the phylogeny in expected substitutions per site (Siepel and Haussler, 2004a, 2005). The conservedphylogenetic model, denoted ψc, is identical to ψnexcept for the scaling parameter ρ; it is defined asψc= (Q, π, τ , ρβ). The tree topology τ is assumed to be known, and the background distribution π isassumed to be well approximated by the relative frequencies of the four bases in the data set, which caneasily be obtained in a preprocessing step. Therefore, the free parameters of the phylo-HMM are summarizedby the parameter vector θ = (Q, β, ρ, µ, ν), where µ and ν define the transition probabilities and initial-stateprobabilities of the Markov chain for state transitions (Figure 1).The rate matrix Q (treated here as a single free parameter, for simplicity) can be defined by any ofseveral parametric DNA substitution models (see, e.g., Whelan et al., 2001; Felsenstein, 2004). For all resultsdiscussed in this paper, the five-parameter general reversible (REV) (Tavar´e, 1986) substitution model wasused. Assuming the REV model and n present-day species (implying an unrooted tree with 2n−3 branches),the total number of free parameters is 2n + 5. The tree topologies assumed for each data set were as shownin Figure 2. The topologies for the vertebrates and insects are fairly well established, and there was only onetopology possible for the two worms; for the yeasts, we used the topology reported by Rokas et al. (2003).The model used by phastCons can be shown to be a special case of Felsenstein and Churchill’s (1996)1phylo-HMM with two rate categories (states), rates r1= ρ and r2= 1, stationary distribution (f1, f2) =(νµ+ν,µµ+ν), and autocorrelation parameter λ = 1 − µ − ν.2.2 Parameter EstimationLet x = (x1, . . . , xL) be a multiple alignment of n rows (sequences) and L columns, whose ith columnhas probabilities P (xi|ψc) and P (xi|ψn) of being “emitted” by the conserved and nonconserved models,respectively. Because the ancestral bases associated with xi(i.e., the bases at internal nodes of the tree) areunknown, these “emission” probabilities are marginal probabilities, obtained by summing over all possibleancestral bases. Let z = (z1, . . . , zL) be a path through the phylo-HMM, where zi∈ {c, n} indicates whetherthe conserved (c) or nonconserved (n) s tate is visited at position i. Like the ancestral bases, the path isunobserve d, so to compute the marginal probability of the data given the model (the likelihood of the model),one must also sum over paths. The joint probability of the data and a specific path is simply a productover the p ositions of the alignment of emission probabilities and “transition” probabilities (the probabilityof each state zigiven the previous state zi−1). The likelihood function is:L(θ|x) = P (x|θ) =XzLYi=1P (xi|zi, θ)P (zi|zi−1, θ) (1)where we use the notational convention that P (z1|z0) is the probability of beginning with state z1. The“double marginalization” required to compute the likelihood of a phylo-HMM can be accomplished efficientlyusing a combination of Felsenstein’s (1981) “pruning” algorithm and the “forward” algorithm for HMMs(Fe lsenste in and Churchill, 1996; Durbin et al., 1998b; see also Siepel and Haussler, 2004a).In an EM algorithm, a maximum of the likelihood function is found by iteratively maximizing a relatedbut often more tractable quantity, sometimes called the “expected complete log likelihood.” The algorithmconsists of alternating expectation (E) and maximization (M) steps, which are guaranteed eventually toresult in convergence to a local maximum of the likelihood function (Dempster et al., 1977; Durbin et al.,1998b). Let the “complete log likelihood,” denoted l(θ|x, z), be the log of the joint probability of an observedalignment and a specific path:l(θ|x, z) = log P (x, z|θ) = logLYi=1P (xi|zi, θ)P (zi|zi−1, θ)= logYx∈X ,z∈Z[P (x|z, θ)]ux,zYz1,z2∈Z[P (z2|z1, θ)]vz1,z2f(µ, ν)≈Xx∈X ,z∈Zux,zlog P (x|z, θ) +Xz1,z2∈Zvz1,z2log P (z2|z1, θ) (2)where X is the set of distinct columns (s ometime s called “patterns”) in x, Z is the set of states in the2phylo-HMM (Z = {c, n}), ux,zis the number of instances in which a column x is emitted by a state z, andvz1,z2is the number of instances in which a state z1is followed by a state z2. (The counts ux,zand vz1,z2aresufficient statistics for l(θ|x, z).) The function f(µ, ν), dropped in the third step, is equal toνµ+νif z1= cand equal toµµ+νif z1= n; it can safely be ignored as long as the length of the alignment L  1. Theexpected complete log likelihood is the expectation of l(θ|x, z) considering all possible paths, conditional onthe data and a particular version of the parameter vector (θ(t)). Using the notation h·iθ(t)to indicate theseconditional expectations, and letting Ux,zand Vz1,z2be random variables corresponding to the counts ux,zand vz1,z2, the expected complete log likelihood is:hl(θ|X, Z)iθ(t)≈*Xx∈X ,z∈ZUx,zlog P (x|z, θ) +Xz1,z2∈ZVz1,z2log P (z2|z1, θ)+θ(t)=Xx∈X ,z∈ZhUx,ziθ(t)log P (x|z, θ) +Xz1,z2∈ZhVz1,z2iθ(t)log P (z2|z1, θ) (3)where we show approximate equality only because we have dropped the function f(µ, ν). (This descriptionof the EM algorithm follows one given by M. Jordan, book in prep.)The E step of the algorithm consists of computing the “expected counts” hUx,ziθ(t)and hVz1,z2iθ(t). Thisstep can be accomplished using a forward/backward

View Full Document


School:
Email:
New Password:
Confirm Password:

Stanford STATS 345 - Sequence Data and Multiple Alignments

Sign up for free to view:

Please select your school