Molecular evolution cont Comparing estimation methods Application to human and mouse sequences Lecture 16 Statistics 246 March 16 2004 1 The resolvent method M ller and Vingron 2000 proposed a fast estimation method for sequence pairs based on resolvents It can in fact be applied to multiple aligments generated by a reversible Markov process For 0 the resolvent R of a rate matrix Q is given by R I Q 1 Solving for Q gives the following formula Q I R 1 It turns out that the resolvent is the Laplace transform of the transition matrices R t 0 e P t dt 2 Resolvent method cont This is the key formula Prove it Given many pairs of sequences not necessarily disjoint that are separated by t PAMs an unbiased estimate of P t can be obtained by normalizing the symmetrized sum of frequency tables If P t can be estimated for a wide range of t then we can get an estimate of Q via the last two equations That is the idea In practice there are two issues i the distances are unknown and so must be estimated by ML and ii the estimated distances are discrete and so interpolation must be used to estimate the rate matrix Let the estimated distances be 0 t1 tN Then the integral is approximately equal to the sum of N pieces 3 Resolvent method cont t1 tN 0 t N 1 R e t P t dt The kth int egral is approximated by linear int erpolation tk t t k 1 t tk 1 e P t k 1 t t P tk P tk 1 k k 1 which can be evaluated exactly after replacing the Ps by their estimates Summing these integrals gives an estimate of R and by inversion of Q 4 REVIEW BLOck SUbstitution Matrices BLOSUM Henikoff and Henikoff 1992 used an ad hoc method that takes time inhomogeneity into account to construct the BLOSUM block substitution matrix matrices The input is a set of blocks which are gap free multiple alignments of segments of homologous amino acid sequences A frequency table is derived from the blocks by summing over the match and mismatch patterns from all within block pair wise comparisons Since a mismatch such as an A aligned to an S can be written in two ways AS and SA we get rid of the ambiguity by using only AS In general a mismatch is represented by sequences XY where X precedes Y alphabetically For example suppose that in a block with six sequences two columns are as follows AD AD AE AE AD SD 5 REVIEW BLOSUM matrices cont There are a total of 15 pairwise comparisons The left column contributes 10 AA and 5 AS pairs to the frequency table Similarly the right column contributes 6 DD 1 EE and 8 DE pairs Adding these column contributions within the block and then across all blocks gives a triangular frequency table The matrix is symmetrized by adding itself to its transpose Dividing the matrix by its sum yields a symmetric joint distribution and a substitution matrix is obtained as described for the PAM matrices To downweight the contribution of the more closely related sequences to the frequency table the sequences within each block are clustered Let be a fixed number between 0 and 100 Sequences that are more than similar are greedily clustered In other words any two sequences that are more than similar are put in the same cluster and if each sequence already belongs to some cluster then the two clusters are combined to form a larger cluster In the end the sequences within a block are partitioned into disjoint clusters so that any two sequences from distinct clusters are less than similar It is clear that the clusters are well defined i e independent of the initial choice of sequences 6 REVIEW BLOSUM matrices cont Sequences in the same cluster are downweighted by the cluster size in crosscluster pairwise comparisons and pairwise comparisons of sequence in the same cluster do not contribute to the frequency table In the example suppose that the first four sequences are clustered while the last two sequences are not Then the contribution of the left column is the same as an A A S column 1 AA 2 AS pairs The right column is effectively D E DD where D E represents half a D and half an E Its contribution is 2 DD 1 1 2 1 2 and 1 DE 1 2 1 2 pairs Equivalently sequences in the same cluster are replaced by an average sequence with fractional number of residues at each position Then the frequency table is derived as if the average sequences are real sequences blocks that have only one cluster are left out Let the symmetric joint distribution be denoted by f Let be the row of column sum of f Then the substitution matrix BLOSUM is given by S a b 2log2 f a b a b 7 REVIEW BLOSUM final remarks If is 100 then every cluster is of size 1 so f is an average of the substitution patterns over all distances If is a small value such as 20 then there are a small number of large clusters and the similarity between the sequences from distinct clusters ranges from 0 to 20 so f depends only on the substitution patterns of the distantly related sequences Thus reducing the threshold gives substitution matrices which are more suitable for aligning distantly related sequences In this context BLOSUM40 is to be preferred to BLOSUM80 when aligning distantly related protein sequences 8 An inhomogeneous process In order to compare methods of estimating rate matrices we are going to define a simple class of inhomogeneous processes i e processes for which no single rate matrix Q describes the evolution Let Q1 be a calibrated reversible rate matrix with all off diagonal entries strictly positive and let be its stationary distribution Define a reversible matrix Q0 as follows Q0 a b b if a b Q0 a a b a b if a b Clearly Q0 is the simplest reversible rate matrix with stationary distribution Calibrate Q0 by scaling and note that Q0 commutes with Q1 Consider the process Q defined by Q x Q0 if 0 x 100 Q x Q1 if 100 x 200 Note that Q has stationary distribution is reversible and calibrated Indeed the same would be true of the more general form Q x Q0 g x Q1 Q0 x 0 for any integrable g 0 0 1 Using some product integral theory we omit we can get well defined inhomogeneous Markov processes having the Q x as instantaneous rate matrices at time x 0 We use one further notion a joint distribution f on amino acids that is embeddable in a continuous time Markov process can be written f exp Q t for a calibrated Q stationary distribution and effective divergence time t 9 Simulation comparison of methods We will compare the BLOSUM resolvent and ML methods in the following way Let 30 35 85 Given sequence blocks generated by a substitution process run …
View Full Document