Molecular evolution, cont.
Lecture 14, Statistics 246, March 9, 2004

1 Scoring matrices

Scoring or substitution matrices are expressions measuring the evolutionary similarity of amino acids or nucleotide bases at different evolutionary distances. They take the form

$$S(a,b) = \log \frac{f(a,b)}{\pi_a \pi_b},$$

where $f(a,b)$ is a joint distribution on pairs and $\pi$ is the background distribution. Usually $f$ is symmetric and $\pi$ is the common marginal distribution.

At times people introduce scoring matrices without a probabilistic justification, and we will mention a couple later.

The two most widely used scoring matrices for sequence alignment, the PAM and the BLOSUM series, are associated with implicit statistical tests based on models. They test a null hypothesis of non-homology versus an alternative of homology (similarity due to common ancestry) at a given evolutionary distance, for the sequences being compared.

2 Scoring matrices for alignment

The statistical ideas underlying the scoring matrices are of interest and value in themselves. The PAM series are based on the Markov chain models from molecular evolution that we met in the last lecture, i.e. they are of the type we used to correct observed sequence distances (Jukes-Cantor, Kimura, etc.) and which we will also see used in maximum likelihood phylogenetic inference. The derivation of the BLOSUM series is different, but nonetheless interesting.

As we will not be discussing either local or global sequence alignment, pairwise or multiple, I'll refer you to earlier versions of this course, or to the many excellent discussions in the literature on the topic. These include, but are not restricted to, the well-known books by M. S. Waterman (1995) and by R. Durbin et al. (1998).

3 How scoring matrices work

134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM

[Figure: the BLOSUM62 matrix, with rows and columns ordered C S T P A G N D E Q H R K M I L V F Y W and the entries for the aligned pairs D/D and D/R highlighted. From Henikoff 1996.]

4 Scoring matrices, cont.

Scoring matrices can also be based on physico-chemical similarities. This can be useful for comparing two sequences according to the properties of their residues, which may highlight regions of structural similarity.

They can also be identity matrices: by stressing only identities in the alignment, stretches of sequence that may have diverged will not penalise any remaining common features.

As the direct source of residue-by-residue comparison scores, the scoring matrix you choose will have a major impact on the alignment calculated. The matrix that performs best will be the matrix that reflects the evolutionary separation of the sequences being aligned.

5 Statistical motivation for alignment scores

Alignment:
AGCTGATCA
AACCGGTTA

Hypotheses:
H = homologous (independent sites, Jukes-Cantor)
R = random (independent sites, equal frequencies)

$$\mathrm{pr}(\text{data} \mid H) = \mathrm{pr}(AA \mid H)\,\mathrm{pr}(GA \mid H)\,\mathrm{pr}(CC \mid H) \cdots = (1-p)^a\, p^d,$$

where $a$ = number of agreements, $d$ = number of disagreements, and $p = \tfrac{3}{4}\left(1 - e^{-8\alpha t}\right)$.

$$\mathrm{pr}(\text{data} \mid R) = \mathrm{pr}(AA \mid R)\,\mathrm{pr}(GA \mid R)\,\mathrm{pr}(CC \mid R) \cdots = \left(\tfrac{1}{4}\right)^a \left(\tfrac{3}{4}\right)^d.$$

Here each site is recorded only as an agreement or a disagreement. The factor $(1/4)^a (1/12)^d$ that accounts for the particular bases observed is the same under H and R, so it cancels in the ratio:

$$\log \frac{\mathrm{pr}(\text{data} \mid H)}{\mathrm{pr}(\text{data} \mid R)} = a \log \frac{1-p}{1/4} + d \log \frac{p}{3/4} = a\,\sigma + d\,(-\mu).$$

Since $p < 3/4$, $\sigma = \log \frac{1-p}{1/4} > 0$, while $-\mu = \log \frac{p}{3/4} < 0$.

Thus the alignment score is $a\,\sigma + d\,(-\mu)$, where the match score is $\sigma > 0$ and the mismatch penalty is $-\mu < 0$.

6 Large and small evolutionary distances

Recall that

$$p = \tfrac{3}{4}\left(1 - e^{-8\alpha t}\right), \qquad \sigma = \log \frac{1-p}{1/4}, \qquad -\mu = \log \frac{p}{3/4}.$$

Now note that if $t \approx 0$, then $p \approx 6\alpha t$ and $1 - p \approx 1$, and so $\sigma \approx \log 4$, while $-\mu \approx \log 8\alpha t$ is large and negative. That is, we see a big difference in the two values of $\sigma$ and $\mu$ for small distances.

Conversely, if $t$ is large, write $\varepsilon = e^{-8\alpha t}$, so that $p = \tfrac{3}{4}(1 - \varepsilon)$. Hence $\frac{p}{3/4} = 1 - \varepsilon$, giving $-\mu = \log(1 - \varepsilon) \approx -\varepsilon$, while $1 - p = \frac{1 + 3\varepsilon}{4}$, so $\frac{1-p}{1/4} = 1 + 3\varepsilon$ and $\sigma = \log(1 + 3\varepsilon) \approx 3\varepsilon$.

Thus the scores are in the ratio of about 3 for a match to $-1$ for a mismatch for large distances. This makes sense, as mismatches will on average be about 3 times more frequent than matches.

The preceding discussion clarifies the statement made earlier, that the matrix which performs best will be the matrix that reflects the evolutionary separation of the sequences being aligned.

7 Extension to protein sequence comparisons

We can do the same with any other Markov substitution matrix for molecular evolution, e.g. with a PAM or BLOSUM matrix of probabilities (defined shortly). Let

$$\text{data} = \begin{pmatrix} a_1 \dots a_m \\ b_1 \dots b_m \end{pmatrix}$$

be a gap-free alignment of two amino acid sequence fragments. Then

$$\mathrm{pr}(\text{data} \mid H) = \prod_{i=1}^{m} \pi_{a_i}\, p(a_i, b_i, 2t), \qquad \mathrm{pr}(\text{data} \mid R) = \prod_{i=1}^{m} \pi_{a_i} \pi_{b_i},$$

$$\log \frac{\mathrm{pr}(\text{data} \mid H)}{\mathrm{pr}(\text{data} \mid R)} = \sum_{i=1}^{m} \log \frac{p(a_i, b_i, 2t)}{\pi_{b_i}}.$$

The elements of a log-odds score matrix are typically $> 0$ on the diagonal and $< 0$ off the diagonal, but not always. Also, the relative sizes of match and mismatch scores change as the PAM distance changes. Thus PAM 120 is more stringent than PAM 250, while PAM 360 is less stringent than PAM 250. In particular PAM 0, the identity matrix, is the most stringent.

There are plenty of score matrices based on other principles.

8 The stationary distribution

A probability distribution $\pi$ on $\{A, C, G, T\}$ is a stationary distribution of the Markov chain with transition probability matrix $P = (P(i,j))$ if, for all $j$,

$$\sum_i \pi_i P(i,j) = \pi_j.$$

Exercise. Given any initial distribution, the distribution at time $t$ of a chain with transition matrix $P$ converges to $\pi$ as $t \to \infty$.

Exercise. For the Jukes-Cantor and Kimura models, the uniform distribution is stationary. (Hint: diagonalize their infinitesimal rate matrices.)

We often assume that the ancestor sequence is i.i.d. $\pi$.

9 Reversibility

A Markov chain is called reversible if it satisfies the detailed balance condition: for all $i, j$, $\pi_i P(i,j) = \pi_j P$ …
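Returning to the Jukes-Cantor scores of slides 5 and 6: the per-site match score log((1 - p)/(1/4)) and mismatch penalty log(p/(3/4)), with p = (3/4)(1 - e^(-8αt)), can be evaluated numerically to see the small-distance and large-distance behaviour. The following is a minimal sketch, assuming natural logarithms and treating the product αt as a single parameter `alpha_t`; the function names are hypothetical and not part of the lecture.

```python
import math


def jc_mismatch_prob(alpha_t):
    """p = (3/4)(1 - exp(-8*alpha*t)): chance that two homologous sites differ."""
    if alpha_t <= 0:
        raise ValueError("alpha_t must be positive")
    return 0.75 * (1.0 - math.exp(-8.0 * alpha_t))


def jc_scores(alpha_t):
    """Return (sigma, mu): match score and size of the mismatch penalty."""
    p = jc_mismatch_prob(alpha_t)
    sigma = math.log((1.0 - p) / 0.25)  # positive, since p < 3/4
    mu = -math.log(p / 0.75)            # positive, since p < 3/4
    return sigma, mu


def count_sites(x, y):
    """Count (agreements, disagreements) in a gap-free pairwise alignment."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    a = sum(1 for s, t in zip(x, y) if s == t)
    return a, len(x) - a


def alignment_score(x, y, alpha_t):
    """Log-odds score a*sigma - d*mu for homology versus random."""
    a, d = count_sites(x, y)
    sigma, mu = jc_scores(alpha_t)
    return a * sigma - d * mu


if __name__ == "__main__":
    # alpha_t values chosen only for illustration
    for alpha_t in (0.001, 0.05, 0.5, 1.0):
        sigma, mu = jc_scores(alpha_t)
        score = alignment_score("AGCTGATCA", "AACCGGTTA", alpha_t)
        print(f"alpha*t={alpha_t}: sigma={sigma:.4f}, mu={mu:.4f}, "
              f"sigma/mu={sigma / mu:.3f}, score={score:.3f}")
```

Running the script prints the two scores and their ratio at a few distances. For small `alpha_t` the match score sits near log 4 while the mismatch penalty is large, and for large `alpha_t` the ratio of match score to mismatch penalty approaches 3.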