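Molecular evolution (cont.)
Statistics 246, Week 6, Spring 2006, Lecture 1

1 Statistical motivation for alignment scores

Alignment:

AGCTGATCA
AACCGGTTA

Hypotheses:
H = homologous (independent sites, Jukes-Cantor)
R = random (independent sites, equal frequencies)

Up to a factor common to both hypotheses, which cancels in the ratio,

$$ \mathrm{pr}(\mathrm{data} \mid H) = \mathrm{pr}(AA \mid H)\,\mathrm{pr}(GA \mid H)\,\mathrm{pr}(CC \mid H)\cdots \propto (1-p)^a\, p^d, $$

where $a$ = number of agreements, $d$ = number of disagreements, and $p = \tfrac{3}{4}\left(1 - e^{-8\alpha t}\right)$, and

$$ \mathrm{pr}(\mathrm{data} \mid R) = \mathrm{pr}(AA \mid R)\,\mathrm{pr}(GA \mid R)\,\mathrm{pr}(CC \mid R)\cdots \propto \left(\tfrac{1}{4}\right)^a \left(\tfrac{3}{4}\right)^d. $$

Hence

$$ \log\frac{\mathrm{pr}(\mathrm{data} \mid H)}{\mathrm{pr}(\mathrm{data} \mid R)} = a \log\frac{1-p}{1/4} + d \log\frac{p}{3/4} = a\,\sigma + d\,(-\mu), $$

with $\sigma = \log\frac{1-p}{1/4}$ and $-\mu = \log\frac{p}{3/4}$.

Since $p < 3/4$, we have $\sigma = \log\frac{1-p}{1/4} > 0$ while $-\mu = \log\frac{p}{3/4} < 0$. Thus the alignment score is $a\,\sigma + d\,(-\mu)$, where the match score $\sigma > 0$ and the mismatch penalty is $-\mu < 0$.

2 Large and small evolutionary distances

Recall that $p = \tfrac{3}{4}\left(1 - e^{-8\alpha t}\right)$, $\sigma = \log\frac{1-p}{1/4}$, and $-\mu = \log\frac{p}{3/4}$.

Now note that if $\alpha t \approx 0$, then $p \approx 6\alpha t$ and $1 - p \approx 1$, and so $\sigma \approx \log 4$, while $-\mu \approx \log 8\alpha t$ is large and negative. That is, we see a large difference in the two values of $\sigma$ and $\mu$ for small distances. Does this make sense?

Conversely, if $\alpha t$ is large, $p = \tfrac{3}{4}(1 - \epsilon)$, hence $\frac{p}{3/4} = 1 - \epsilon$, giving $\mu = -\log(1 - \epsilon) \approx \epsilon$, while $1 - p = \tfrac{1}{4}(1 + 3\epsilon)$, so $\frac{1-p}{1/4} = 1 + 3\epsilon$ and $\sigma = \log(1 + 3\epsilon) \approx 3\epsilon$. Thus the scores are about 3 (for a match) to 1 (for a mismatch) for large distances. This makes sense, as mismatches will on average be about 3 times more frequent than matches.

3 DNA sequence alignment

The preceding discussion shows that molecular evolutionary ideas underlie sequence alignment. Does that make sense? It attempts to clarify the statement made earlier that the matrix which performs best at sequence alignment will be the matrix that reflects the evolutionary separation of the sequences being aligned.

To permit our analysis to deal with real DNA sequence alignment, we need to include the idea of searching for a best alignment and incorporate insertions and deletions. This is most elegantly done via the notion of a pair HMM; see the book by Durbin et al., Biological Sequence Analysis, Cambridge U. Press.

4 Extension to protein sequence comparisons

We can do the same with any other Markov substitution matrix for molecular evolution, e.g. with a PAM or BLOSUM matrix of probabilities (defined shortly). Take

$$ \mathrm{data} = \begin{pmatrix} a_1 & \cdots & a_m \\ b_1 & \cdots & b_m \end{pmatrix}, $$

a gap-free alignment of two amino acid sequence fragments.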
Then

$$ \mathrm{pr}(\mathrm{data} \mid H) = \prod_{i=1}^{m} \pi_{a_i}\, p_{a_i b_i}(2t), \qquad \mathrm{pr}(\mathrm{data} \mid R) = \prod_{i=1}^{m} \pi_{a_i}\, \pi_{b_i}, $$

$$ \log\frac{\mathrm{pr}(\mathrm{data} \mid H)}{\mathrm{pr}(\mathrm{data} \mid R)} = \sum_{i=1}^{m} \log\frac{p_{a_i b_i}(2t)}{\pi_{b_i}}. $$

The elements of a log-odds score matrix are typically $> 0$ on the diagonal and $< 0$ off the diagonal, but not always. Also, the relative sizes of match and mismatch scores change as the PAM distance changes. Thus PAM 120 is more stringent than PAM 250, while PAM 360 is less stringent than it. In particular, PAM 0 (the identity matrix) is the toughest. There are plenty of score matrices based on other principles.

5 The stationary distribution

A probability distribution $\pi$ on $\{A, C, G, T\}$ is a stationary distribution of the Markov chain with transition probability matrix $P = (P(i,j))$ if, for all $j$,

$$ \sum_i \pi(i)\, P(i,j) = \pi(j). $$

Exercise. (Some conditions apply.) Given any initial distribution, the distribution at time $t$ of a chain with transition matrix $P$ converges to $\pi$ as $t \to \infty$.

Exercise. For the Jukes-Cantor and Kimura models, the uniform distribution is stationary.

We often assume that the ancestor sequence is i.i.d. $\pi$.

6 Reversibility

A Markov chain is called reversible if it satisfies the detailed balance condition: for all $i, j$,

$$ \pi(i)\, P(i,j) = \pi(j)\, P(j,i). $$

Under reversibility, the human sequence can be considered the ancestor of the orangutan sequence and vice versa (proof on the next slide, where anc denotes the common ancestor of humans and orangutans). This turns out to be helpful for some calculations.

Exercise. Both the Jukes-Cantor and Kimura models are reversible.

7 Proof

$$
\begin{aligned}
\mathrm{pr}(\text{orangutan} = G,\ \text{human} = C)
&= \sum_i \mathrm{pr}(\text{anc} = i)\, \mathrm{pr}(\text{orangutan} = G \mid \text{anc} = i)\, \mathrm{pr}(\text{human} = C \mid \text{anc} = i) \\
&= \sum_i \pi(i)\, P_t(i,G)\, P_t(i,C) \\
&= \sum_i \pi(G)\, P_t(G,i)\, P_t(i,C) && \text{(by reversibility)} \\
&= \pi(G)\, P_{2t}(G,C) && \text{(by the addition rule)} \\
&= F_{2t}(G,C).
\end{aligned}
$$

Here the matrix $F_{2t}$ is the joint distribution of the nucleotides at the given position at times 0 and $2t$. It is symmetric, and both its rows and columns sum to the stationary distribution.

Exercise. Check these last assertions, and also that the root could be at any location: $s$ from orangutan and $2t - s$ from human, so that the total separation remains $2t$.

8 PAM matrices

The PAM (point accepted mutations) matrices by Dayhoff et al. (1968) were the first empirical substitution matrices. In the same work they also
invented the notion of evolutionary time described in the previous lecture. Their dataset consisted of families of closely related amino acid sequences, such that every pair of homologous sequences was more than 85% identical. A tree was constructed for each family, and the ancestral sequences were inferred by parsimony. The numbers of occurrences of all 400 types of amino acid substitutions between neighboring sequences were collected into a $20 \times 20$ frequency table $C$, which is symmetrized by adding the transpose to itself.

If $C$ is standardized (i.e., each entry is divided by the sum of all entries), then we have a symmetric joint distribution of two residues separated by a certain evolutionary distance. Clearly, dividing each row of $C$ by its sum gives a transition matrix $P$. Dayhoff et al. used a somewhat roundabout yet interesting approach involving the notion of mutability.

9 PAM matrices (cont.)

The mutability $m(a)$ of amino acid $a$ is the probability that $a$ is replaced (substituted) by a different amino acid, and is estimated by

$$ m(a) = \frac{\sum_{b \neq a} C(a,b)}{\sum_{b} C(a,b)}. $$

The transition probability from amino acid $a$ to amino acid $b \neq a$ is estimated as

$$ P(a,b) = m(a)\, \frac{C(a,b)}{\sum_{b' \neq a} C(a,b')} = \frac{C(a,b)}{\sum_{b'} C(a,b')}. $$

The equilibrium distribution $\pi$ of amino acids is estimated by the row (or column) sums of the standardized $C$.

To get the transition matrix $P^{(1)}$ for 1 PAM from $P$, consider the family of transition matrices $P_\lambda$ defined by

$$ P_\lambda(a,a) = 1 - \lambda\, m(a), \qquad P_\lambda(a,b) = \lambda\, P(a,b) \ \text{ when } a \neq b. $$

10 PAM matrices (yet more)

It is easy to verify that $P_\lambda$ is a transition matrix with equilibrium distribution $\pi$. After 1 PAM of substitution, approximately 99% of the amino acids are not replaced. Thus $\lambda$ is chosen so that

$$ \sum_a \pi(a)\, P_\lambda(a,a) = 0.99, \qquad \text{or} \qquad \lambda = \frac{0.01}{\sum_a \pi(a)\, m(a)}. $$

With $\lambda$ set to this value, $P_\lambda$ is roughly equal to $P^{(1)}$.

The substitution matrix for 1 PAM is a scaled version of the standard formula:

$$ \mathrm{PAM1}(a,b) = 10 \log_{10} \frac{P^{(1)}(a,b)}{\pi(b)}. $$

This matrix is suitable for aligning closely related protein sequences. $P^{(1)}$ is raised to several different powers, the highest being 250, which gives the PAM250 substitution …
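
The PAM construction on slides 8 to 10 (mutability, the family of matrices indexed by lambda, the choice of lambda, and the log-odds scores) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not Dayhoff et al.'s actual procedure: the function names `mat_mul` and `pam_from_counts` are made up here, the 3-letter count table is hypothetical toy data rather than real amino acid counts, and the matrix with the chosen lambda is simply used as the 1-PAM matrix, although the slides only say the two are roughly equal.

```python
import math


def mat_mul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_from_counts(C, n_pam=1):
    """Return (pi, P_n, scores) from a symmetric count table C.

    Hypothetical helper: assumes C is square, symmetric, and has strictly
    positive entries (so that every log-odds score is finite).
    """
    n = len(C)
    row_sums = [sum(row) for row in C]
    total = sum(row_sums)

    # Equilibrium distribution: row sums of the standardized table.
    pi = [r / total for r in row_sums]

    # Mutability m(a): share of row a that lies off the diagonal.
    m = [(row_sums[a] - C[a][a]) / row_sums[a] for a in range(n)]

    # Row-normalized transition matrix P(a, b) = C(a, b) / sum_b C(a, b).
    P = [[C[a][b] / row_sums[a] for b in range(n)] for a in range(n)]

    # Choose lambda so that sum_a pi(a) * P_lambda(a, a) = 0.99.
    lam = 0.01 / sum(pi[a] * m[a] for a in range(n))

    # P_lambda, used here as the 1-PAM transition matrix (an assumption).
    P1 = [[1.0 - lam * m[a] if a == b else lam * P[a][b] for b in range(n)]
          for a in range(n)]

    # Raise the 1-PAM matrix to the power n_pam by repeated multiplication.
    Pn = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    for _ in range(n_pam):
        Pn = mat_mul(Pn, P1)

    # Log-odds scores: 10 * log10( P_n(a, b) / pi(b) ).
    scores = [[10.0 * math.log10(Pn[a][b] / pi[b]) for b in range(n)]
              for a in range(n)]
    return pi, Pn, scores


# Hypothetical symmetric counts for a toy 3-letter alphabet.
toy_counts = [[900, 60, 40],
              [60, 800, 140],
              [40, 140, 820]]

pi, P1, pam1 = pam_from_counts(toy_counts, n_pam=1)
_, P250, pam250 = pam_from_counts(toy_counts, n_pam=250)

for name, S in (("PAM1", pam1), ("PAM250", pam250)):
    print(name)
    for row in S:
        print("  " + "  ".join(f"{x:7.2f}" for x in row))
```

Comparing the two printed matrices shows how the gap between match and mismatch scores shrinks as the PAM distance grows, which is the sense in which a lower PAM number is more stringent. Plain lists are used so that the sketch needs only the standard library; for a real 20 x 20 table a numerical library would be the more natural choice.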