Berkeley STATISTICS 246 - Molecular evolution - D1876253

School name University of California, Berkeley

Course Statistics 246- Statistical Genetics

Pages 27

This preview shows page 1-2-3-25-26-27 out of 27 pages.

Molecular evolution, cont.
Statistics 246, Spring 2006
Week 6, Lecture 1

Statistical motivation for alignment scores

Alignment:

AGCTGATCA...
AACCGGTTA...

Hypotheses:
H = homologous (independent sites, Jukes-Cantor)
R = random (independent sites, equal frequencies)

$$\mathrm{pr}(\text{data} \mid H) = \mathrm{pr}(AA \mid H)\,\mathrm{pr}(GA \mid H)\,\mathrm{pr}(CC \mid H)\cdots = (1-p)^{a}\,p^{d},$$

where $a$ = # agreements, $d$ = # disagreements, and $p = \tfrac{3}{4}\,(1 - e^{-8\alpha t})$.

$$\mathrm{pr}(\text{data} \mid R) = \mathrm{pr}(AA \mid R)\,\mathrm{pr}(GA \mid R)\,\mathrm{pr}(CC \mid R)\cdots = \left(\tfrac{1}{4}\right)^{a}\left(\tfrac{3}{4}\right)^{d},$$

$$\log\frac{\mathrm{pr}(\text{data} \mid H)}{\mathrm{pr}(\text{data} \mid R)} = a \log\frac{1-p}{1/4} + d \log\frac{p}{3/4} = a \times \sigma + d \times (-\mu).$$

(Each site is scored only as an agreement or a disagreement, so it contributes $1-p$ or $p$ under H and $1/4$ or $3/4$ under R.)

Since $p < 3/4$, $\sigma = \log\{(1-p)/(1/4)\} > 0$, while $-\mu = \log\{p/(3/4)\} < 0$.
Thus the alignment score is $a \times \sigma + d \times (-\mu)$, where the match score is $\sigma > 0$ and the mismatch penalty is $-\mu < 0$.

Large and small evolutionary distances

Recall that $p = \tfrac{3}{4}(1 - e^{-8\alpha t})$, $\sigma = \log\{(1-p)/(1/4)\}$, $-\mu = \log\{p/(3/4)\}$.

Now note that if $\alpha t \approx 0$, then $p \approx 6\alpha t$ and $1-p \approx 1$, and so $\sigma \approx \log 4$, while $-\mu \approx \log(8\alpha t)$ is large and negative. That is, we see a large difference between the values of $\sigma$ and $\mu$ for small distances. Does this make sense?

Conversely, if $\alpha t$ is large, write $p = \tfrac{3}{4}(1-\varepsilon)$ with $\varepsilon = e^{-8\alpha t}$ small. Then $p/(3/4) = 1-\varepsilon$, giving $\mu = -\log(1-\varepsilon) \approx \varepsilon$, while $1-p = (1+3\varepsilon)/4$, so $(1-p)/(1/4) = 1+3\varepsilon$ and $\sigma = \log(1+3\varepsilon) \approx 3\varepsilon$. Thus the scores are in the ratio of about 3 (for a match) to 1 (for a mismatch) for large distances. This makes sense, as mismatches will on average be about 3 times more frequent than matches.

DNA sequence alignment

The preceding discussion shows that molecular evolutionary ideas underlie sequence alignment. Does that make sense? It attempts to clarify the statement made earlier, that the matrix which performs best at sequence alignment will be the matrix that reflects the evolutionary separation of the sequences being aligned.

To permit our analysis to deal with real DNA sequence alignment, we need to include the idea of searching for a best alignment, and incorporate insertions and deletions.
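
As a concrete check of the Jukes-Cantor scores discussed above, the following is a minimal sketch that computes the match score σ and mismatch score −µ for a given value of αt and scores the gap-free example alignment. The function names `jc_scores` and `alignment_score` and the particular αt values are assumptions chosen for illustration; they do not come from the lecture.

```python
import math


def jc_scores(alpha_t):
    """Return (sigma, neg_mu) for Jukes-Cantor with p = (3/4)(1 - exp(-8*alpha*t))."""
    p = 0.75 * (1.0 - math.exp(-8.0 * alpha_t))
    sigma = math.log((1.0 - p) / 0.25)   # match score, positive since p < 3/4
    neg_mu = math.log(p / 0.75)          # mismatch score, negative since p < 3/4
    return sigma, neg_mu


def alignment_score(x, y, alpha_t):
    """Log-odds score a*sigma + d*(-mu) of a gap-free alignment of x and y."""
    if len(x) != len(y):
        raise ValueError("gap-free alignment needs sequences of equal length")
    a = sum(1 for u, v in zip(x, y) if u == v)   # number of agreements
    d = len(x) - a                               # number of disagreements
    sigma, neg_mu = jc_scores(alpha_t)
    return a * sigma + d * neg_mu


# The two fragments shown on the slide (without the trailing "...").
x, y = "AGCTGATCA", "AACCGGTTA"

# Illustrative values of alpha*t: small, moderate, large (hypothetical choices).
for alpha_t in (0.001, 0.05, 1.0):
    sigma, neg_mu = jc_scores(alpha_t)
    print(f"alpha*t={alpha_t}: sigma={sigma:.4f}, -mu={neg_mu:.4f}, "
          f"score={alignment_score(x, y, alpha_t):.4f}")
```

For small αt the printed σ should sit close to log 4 while −µ is strongly negative, and for large αt the ratio σ/µ should approach 3, matching the two limiting cases described above.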
Searching over alignments with insertions and deletions is most elegantly done via the notion of a pair HMM; see the book by Durbin et al., Biological Sequence Analysis, Cambridge University Press.

Extension to protein sequence comparisons

We can do the same with any other Markov substitution matrix for molecular evolution, e.g. with a PAM or BLOSUM matrix of probabilities, defined shortly.

$$\text{data} = \begin{pmatrix} a_1 & \cdots & a_m \\ b_1 & \cdots & b_m \end{pmatrix},$$

a gap-free alignment of two amino acid sequence fragments.

$$\mathrm{pr}(\text{data} \mid H) = \prod_{i=1}^{m} \pi(a_i)\, p(a_i, b_i; 2t), \qquad \mathrm{pr}(\text{data} \mid R) = \prod_{i=1}^{m} \pi(a_i)\,\pi(b_i),$$

$$\log\frac{\mathrm{pr}(\text{data} \mid H)}{\mathrm{pr}(\text{data} \mid R)} = \sum_{i=1}^{m} \log\frac{p(a_i, b_i; 2t)}{\pi(b_i)}.$$

The elements of a log-odds score matrix are typically > 0 on the diagonal and < 0 off the diagonal, but not always. Also, the relative sizes of match and mismatch scores change as the number of PAMs changes. Thus PAM(120) is more stringent than PAM(250), while PAM(360) is less stringent than PAM(250). In particular, PAM(0) = the identity matrix is the toughest.

There are plenty of score matrices based on other principles.

The stationary distribution

A probability distribution $\pi$ on {A,C,G,T} is a stationary distribution of the Markov chain with transition probability matrix $P = P(i,j)$ if, for all $j$,

$$\sum_i \pi(i)\,P(i,j) = \pi(j).$$

Exercise. (Some conditions apply.) Given any initial distribution, the distribution at time $t$ of a chain with transition matrix $P$ converges to $\pi$ as $t \to \infty$.

Exercise. For the Jukes-Cantor and Kimura models, the uniform distribution is stationary.

We often assume that the ancestor sequence is i.i.d. $\pi$.

Reversibility

A Markov chain is called reversible if it satisfies the detailed balance condition: for all $i, j$,

$$\pi(i)\,P(i,j) = P(j,i)\,\pi(j).$$

Under reversibility, the human sequence can be considered the ancestor of the orangutan sequence and vice versa. The proof is on the next slide, where anc denotes the ancestor of humans and orangutans. This turns out to be helpful for some calculations.

Exercise. Both the Jukes-Cantor and Kimura models are reversible.

Proof

$$\begin{aligned}
\mathrm{pr}(\text{orangutan} = G,\ \text{human} = C)
&= \sum_i \mathrm{pr}(\text{anc} = i)\,\mathrm{pr}(\text{orangutan} = G \mid \text{anc} = i)\,\mathrm{pr}(\text{human} = C \mid \text{anc} = i) \\
&= \sum_i \pi(i)\,P(t,i,G)\,P(t,i,C) \\
&= \sum_i \pi(G)\,P(t,G,i)\,P(t,i,C) && \text{(by reversibility)} \\
&= \pi(G)\,P(2t,G,C) && \text{(by the addition rule)} \\
&= F(2t,G,C).
\end{aligned}$$

Here the matrix $F(2t)$ is the joint distribution of the nucleotides at the given position at times 0 and $2t$. It is symmetric, and both its rows and columns sum to the stationary distribution $\pi$.

Exercise. Check these last assertions, and also that the root could equally be at any location $s$ from orangutan and $2t-s$ from human.

PAM matrices

The PAM (point accepted mutations) matrices by Dayhoff et al. (1968) were the first empirical substitution matrices. In the same work, they also invented the notion of evolutionary time, described in the previous lecture.

Their dataset consisted of families of closely related amino acid sequences, such that every pair of homologous sequences was more than 85% identical.

A tree was constructed for each family, and the ancestral sequences were inferred by parsimony. The numbers of occurrences of all 400 types of amino acid substitutions between neighboring sequences were collected into a 20×20 frequency table $C$, which is symmetrized by adding its transpose to it. If $C$ is standardized, i.e., each entry is divided by the sum of all entries, then we have a symmetric joint distribution of two residues separated by a certain evolutionary distance.

Clearly, dividing each row of $C$ by its sum gives a transition matrix $P$.
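
The construction just described (symmetrize the count table, standardize it to a joint distribution, divide rows by their sums to get P) can be sketched in a few lines, together with a numerical check that the resulting chain satisfies detailed balance. This is only an illustration under stated assumptions: the 4×4 table `raw_counts` is invented toy data standing in for the 20×20 amino acid table, its diagonal is assumed to hold counts of unchanged residues, and the helper names are hypothetical.

```python
def symmetrize(counts):
    """Add the transpose of a square count table to the table itself."""
    n = len(counts)
    return [[counts[i][j] + counts[j][i] for j in range(n)] for i in range(n)]


def standardize(table):
    """Divide every entry by the grand total, giving a joint distribution."""
    total = sum(sum(row) for row in table)
    return [[x / total for x in row] for row in table]


def row_normalize(table):
    """Divide each row by its sum, giving a transition matrix."""
    return [[x / sum(row) for x in row] for row in table]


# Hypothetical toy counts (NOT Dayhoff's data); diagonal = unchanged residues.
raw_counts = [
    [90, 3, 5, 2],
    [1, 80, 2, 7],
    [4, 1, 85, 3],
    [2, 6, 1, 95],
]

C = symmetrize(raw_counts)
F = standardize(C)               # symmetric joint distribution of two residues
pi = [sum(row) for row in F]     # marginal (equilibrium) distribution
P = row_normalize(C)             # transition matrix

n = len(P)
# Detailed balance: pi(i) P(i,j) should equal pi(j) P(j,i) for all i, j.
max_gap = max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
              for i in range(n) for j in range(n))
print("largest detailed-balance violation:", max_gap)
```

Because π(i)P(i,j) reduces to the symmetric entry F(i,j), the reported violation should be at the level of floating-point rounding, which is why a chain estimated this way is reversible with equilibrium distribution π.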
Dayhoff et al. used a somewhat roundabout yet interesting approach involving the notion of mutability.

PAM matrices, cont.

The mutability $m(a)$ of amino acid $a$ is the probability that $a$ is replaced (substituted by a different amino acid), and is estimated by

$$m(a) = \frac{\sum_{b \neq a} C(a,b)}{\sum_{b} C(a,b)}.$$

The transition probability from amino acid $a$ to a different amino acid $b$ is estimated as

$$P(a,b) = m(a)\,\frac{C(a,b)}{\sum_{c \neq a} C(a,c)} = \frac{C(a,b)}{\sum_{c} C(a,c)}.$$

The equilibrium distribution $\pi$ of amino acids is estimated by the row (or column) sums of the standardized $C$.

To get the transition matrix $P(1)$ for 1 PAM from $P$, consider the family of transition matrices $P_\lambda$ defined by $P_\lambda(a,a) = 1 - \lambda\, m(a)$ and $P_\lambda(a,b) = \lambda\, P(a,b)$ when $a \neq b$.

PAM matrices, yet more

It is easy to verify that $P_\lambda$ is a transition matrix with equilibrium distribution $\pi$. After 1 PAM of substitution, approximately 99% of the amino acids are not replaced. Thus $\lambda$ is chosen so that

$$\sum_a \pi(a)\,P_\lambda(a,a) = 0.99, \qquad \text{or} \qquad \lambda = \frac{0.01}{\sum_a \pi(a)\,m(a)}.$$

With $\lambda$ set to this value, $P_\lambda$ is roughly equal to $P(1)$.

The substitution matrix for 1 PAM is a scaled version of the standard formula:

$$\mathrm{PAM1}(a,b) = 10 \log_{10}\frac{P(1,a,b)}{\pi(b)}.$$

This matrix is suitable for aligning closely related protein
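
The mutability-based steps above can be traced numerically with a small sketch. It assumes a symmetric toy count table `toy_C` over 4 hypothetical residue types (invented for illustration, not Dayhoff's 20×20 data), and the function name `pam1_from_counts` is likewise an assumption. Here λ is obtained by solving the constraint ∑ π(a) P_λ(a,a) = 0.99 directly, which gives λ = 0.01 / ∑ π(a) m(a).

```python
import math


def pam1_from_counts(C, target_unchanged=0.99):
    """Return (pi, m, lam, P1, scores) built from a symmetric count table C."""
    n = len(C)
    row_sums = [sum(row) for row in C]
    total = sum(row_sums)
    # Equilibrium distribution: row sums of the standardized table.
    pi = [r / total for r in row_sums]
    # Transition probabilities P(a,b) = C(a,b) / sum_c C(a,c).
    P = [[C[a][b] / row_sums[a] for b in range(n)] for a in range(n)]
    # Mutability m(a) = sum_{b != a} C(a,b) / sum_b C(a,b).
    m = [sum(C[a][b] for b in range(n) if b != a) / row_sums[a] for a in range(n)]
    # Solve sum_a pi(a) * (1 - lam * m(a)) = target_unchanged for lam.
    lam = (1.0 - target_unchanged) / sum(pi[a] * m[a] for a in range(n))
    # P_lambda: shrink off-diagonal entries by lam, put the rest on the diagonal.
    P1 = [[1.0 - lam * m[a] if a == b else lam * P[a][b] for b in range(n)]
          for a in range(n)]
    # Log-odds scores 10 * log10(P1(a,b) / pi(b)); needs all entries positive.
    scores = [[10.0 * math.log10(P1[a][b] / pi[b]) for b in range(n)]
              for a in range(n)]
    return pi, m, lam, P1, scores


# Hypothetical symmetric counts; diagonal entries count unchanged residues.
toy_C = [
    [180, 4, 9, 4],
    [4, 160, 3, 13],
    [9, 3, 170, 4],
    [4, 13, 4, 190],
]

pi, m, lam, P1, scores = pam1_from_counts(toy_C)
print("lambda =", lam)
for row in scores:
    print([round(s, 2) for s in row])
```

With this toy table the diagonal scores come out positive and the off-diagonal scores negative, the typical pattern for a log-odds matrix at a short evolutionary distance.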
