Molecular evolution, cont.: Estimating rate matrices
Lecture 15, Statistics 246, March 11, 2004

REVIEW: The Jukes-Cantor model (1969)

[Diagram: common ancestor of human and orang, then $t$ time units under rate matrix $Q$, with transition matrix $P(t)$, to human (now).]

Consider e.g. the 2nd position in a-globin2 Alu1.

$$Q = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}, \qquad P(t) = \begin{pmatrix} r & s & s & s \\ s & r & s & s \\ s & s & r & s \\ s & s & s & r \end{pmatrix},$$

where $r = \frac{1 + 3e^{-4\alpha t}}{4}$ and $s = \frac{1 - e^{-4\alpha t}}{4}$.

REVIEW: Jukes-Cantor adjustment

[Diagram: common ancestor with two branches of length $t$, one leading to orang (G) and one to human (C); still the 2nd position in a-globin Alu 1.]

Assume that the common ancestor has A, G, C or T with probability 1/4. Then the chance of the nt differing is

$$p = \frac{3}{4}\left(1 - e^{-8\alpha t}\right) = \frac{3}{4}\left(1 - e^{-4k/3}\right), \quad \text{since } k = 2 \times 3\alpha t.$$

Solving for $k$ (estimating the distance in PAMs):

$$k = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right).$$

REVIEW: Estimating the evolutionary distance between two sequences

Suppose two aligned protein sequences $a_1 \ldots a_n$ and $b_1 \ldots b_n$ are separated by $t$ PAMs. Under a reversible substitution model that is i.i.d. across sites, the likelihood function of $t$ is

$$L(t) = \mathrm{pr}(a_1 \ldots a_n,\ b_1 \ldots b_n \mid \text{model}) = \prod_k F(t, a_k, b_k) = \prod_{a,b} F(t, a, b)^{c(a,b)},$$

where $c(a,b) = \#\{k : a_k = a,\ b_k = b\}$ and $F(t,a,b) = \pi(a)P(t,a,b) = \pi(b)P(t,b,a) = F(t,b,a)$.

Maximizing this quantity in $t$, with $F$ known, gives the maximum likelihood estimate of $t$. This generalizes the distance correction with Jukes-Cantor.

Acknowledgement

Von Bing Yap, for joint work summarized here and in the previous and next lecture.

From aligned DNA or protein sequences to evolutionary trees

The starting point for a molecular phylogenetic analysis is a set of sequences, almost always aligned. The end result is almost always a tree. Along the way, attention needs to be paid to the substitution process operating in the sequences, and to possible rate variation along the sequences and down the tree.

The two main approaches to tree building are (a) distance-based methods, which work from pairwise distances between the sequences, and (b) character-based methods, which work directly from the multiply aligned sequences. We'll briefly mention both, referring you to the literature for fuller details. Both make use of rate matrices.

Building trees: distance methods

There are many ways of building trees using distance methods. All start by
computing the pairwise distances between the sequences to be at the tips of the tree, usually along the lines we discussed in the last lecture, i.e. ML distance using a rate matrix.

One of the oldest distance methods, still widely used though rather discredited in the molecular evolutionary context, is UPGMA. This stands for unweighted pair group method with arithmetic means. It is easy to understand quickly, and so I will describe it. I don't recommend it.

A more recent and much more satisfactory method in molecular evolution is the neighbour-joining approach (abbrev. NJ). It takes longer to explain, so I won't give it here. There are many places where the details of this and other methods are given, including Durbin et al. (1998) and the recent excellent book by the master, Joseph Felsenstein: Inferring Phylogenies, Sinauer, 2004.

Beta globins revisited

[Figure: multiple alignment of the beta-globin sequences BG human, BG macaque, BG bovine, BG platypus, BG chicken and BG shark, with position markers from 10 to 140. The legend marks residues that are the same as the reference sequence and deletions.]

UPGMA tree for beta globins

[Figure: UPGMA tree with leaves BG shark, BG chicken, BG platypus, BG bovine, BG macaque and BG human.]

Neighbor joining tree
for globins

[Figure: neighbor-joining tree with leaves myo human, alpha human, epsilon human, gamma human, delta human and beta human.]

Today's main task

We will discuss three methods of estimating a calibrated reversible rate matrix Q, given aligned leaf sequences on multiple unrooted phylogenetic trees whose topologies and branch lengths are known. All three methods are consistent, in the sense that the methods are asymptotically unbiased as the sequences become infinitely long. Moreover, evolutionary distances between sequences are explicitly accounted for, unlike with the PAM method.

The maximum likelihood (ML) method is natural for any parametric model and has well-known theoretical properties. Maximum partial likelihood (MPL) is particularly well suited to Markov processes and can be efficiently implemented via an EM algorithm. The resolvent method (RES) is quite a different technique.

In practice, the phylogenetic tree can also be estimated from the data. When the tree topology is known, e.g. when there are just 2 or 3 leaf nodes, the branch lengths can be estimated by ML given the rate matrix. One can estimate both the rate matrix and the branch lengths by alternating between the two steps: (a) estimate Q given the branch lengths; (b) estimate the branch lengths given Q. Estimating the tree topology as well is a harder …
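
Returning to the review slides: the claim that the ML distance generalizes the Jukes-Cantor correction can be checked numerically. The sketch below is only an illustration under stated assumptions. The two sequences are made up, the function names (`jc_distance`, `count_pairs`, `jc_log_likelihood`, `maximize`) are hypothetical, and the golden-section search is a simple stand-in for whatever optimizer one would really use. Here k is the expected number of substitutions per site between the two sequences, and the closed form used is the standard solution k = -(3/4) ln(1 - 4p/3).

```python
import math


def jc_distance(p_diff):
    """Closed-form Jukes-Cantor adjustment: k = -(3/4) ln(1 - 4p/3)."""
    if not 0.0 <= p_diff < 0.75:
        raise ValueError("p_diff must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p_diff / 3.0)


def count_pairs(seq_a, seq_b):
    """c(a, b): number of aligned sites k with a_k = a and b_k = b."""
    counts = {}
    for a, b in zip(seq_a, seq_b):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


def jc_log_likelihood(k, counts):
    """log L(k) = sum over (a, b) of c(a, b) * log F(k, a, b) under Jukes-Cantor.

    F(k, a, b) = pi(a) * P(a, b) with pi(a) = 1/4, where P is taken over the
    whole path between the two sequences (expected k substitutions per site).
    """
    decay = math.exp(-4.0 * k / 3.0)
    f_same = 0.25 * (1.0 + 3.0 * decay) / 4.0  # a == b
    f_diff = 0.25 * (1.0 - decay) / 4.0        # a != b
    return sum(c * math.log(f_same if a == b else f_diff)
               for (a, b), c in counts.items())


def maximize(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:  # the maximum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)
        else:        # the maximum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Made-up aligned sequences that differ at 3 of 20 sites.
    seq_a = "ACGTACGTACGTACGTACGT"
    seq_b = "ATGTACATACGTACGCACGT"
    counts = count_pairs(seq_a, seq_b)
    p_hat = sum(c for (a, b), c in counts.items() if a != b) / len(seq_a)
    k_closed = jc_distance(p_hat)
    k_ml = maximize(lambda k: jc_log_likelihood(k, counts), 1e-9, 10.0)
    print(f"closed form: {k_closed:.6f}, numerical ML: {k_ml:.6f}")
```

Under Jukes-Cantor the two numbers should agree up to the tolerance of the search, because the ML estimate of the proportion of differing sites is the observed proportion. For a general reversible rate matrix there is no closed form, and only the numerical maximization carries over.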
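
The slides name UPGMA (unweighted pair group method with arithmetic means), but the procedure itself does not appear in the text. As a hedged sketch of the usual textbook version, not necessarily the form presented in the lecture: repeatedly merge the two closest clusters, place the new node at half their distance, and define the distance from the merged cluster to every other cluster as the average over leaf pairs. The function name `upgma`, the nested-tuple tree representation and the distances in the usage note are all made up for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree from pairwise leaf distances.

    names: list of leaf labels.
    dist:  dict mapping each unordered pair of labels, given as a tuple
           (a, b), to the distance between them.
    Returns (tree, height): a nested tuple of labels and the root height.
    """
    # Active clusters: id -> (subtree, number of leaves, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    index = {name: i for i, name in enumerate(names)}
    d = {frozenset((index[a], index[b])): v for (a, b), v in dist.items()}
    next_id = len(names)

    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        tree_i, size_i, _ = clusters.pop(i)
        tree_j, size_j, _ = clusters.pop(j)
        height = d[frozenset((i, j))] / 2.0
        # Arithmetic mean over leaf pairs: every leaf counts equally, so the
        # cluster-level distances are weighted by cluster size.
        for k in clusters:
            d[frozenset((next_id, k))] = (
                size_i * d[frozenset((i, k))] + size_j * d[frozenset((j, k))]
            ) / (size_i + size_j)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j, height)
        next_id += 1

    tree, _, height = next(iter(clusters.values()))
    return tree, height
```

With the made-up distances d(A,B) = 2, d(A,C) = d(B,C) = 6 and d(A,D) = d(B,D) = d(C,D) = 10, the call `upgma(["A", "B", "C", "D"], dist)` returns the tree `("D", ("C", ("A", "B")))` with root height 5.0. These distances are ultrametric (clock-like), which is the situation UPGMA implicitly assumes.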