UW-Madison GENETICS 629 - Maximum Likelihood and the Bootstrap


Maximum Likelihood and the Bootstrap
Bret Larget
Departments of Botany and of Statistics
University of Wisconsin-Madison
September 29, 2011

Principle of Maximum Likelihood

Given parameters $\theta$ and data $X$, the function $f(X \mid \theta)$ is the probability of observing data $X$ given parameter $\theta$. (Both $X$ and $\theta$ can be multi-dimensional.)
Keeping $\theta$ fixed and treating $f$ as a function of $X$, the total probability is one.
The function $L(\theta) = f(X \mid \theta)$, with $X$ fixed and $\theta$ unknown, is called the likelihood function.
The principle of maximum likelihood is to estimate $\theta$ with the value $\hat\theta$ that maximizes $L(\theta)$.
In practice, it is common to maximize the log-likelihood, $\ell(\theta) = \ln L(\theta)$. This is because $X$ often takes the form of an independent sample, so that

$$L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta), \qquad \ell(\theta) = \sum_{i=1}^{n} \ln f(X_i \mid \theta)$$

Coin-tossing Example

A coin has probability $\theta$ of landing heads.
Consider tossing the coin 100 times. The probability of each single sequence with exactly $x$ heads is $f(x \mid \theta) = \theta^{x}(1 - \theta)^{100 - x}$.
Say we observe the sequence HHTHTHHT . . . TTH, where heads appear 57 times.
The maximum likelihood estimate is the value $\hat\theta$ that maximizes the function

$$L(\theta) = \theta^{57}(1 - \theta)^{43},$$

or, equivalently, that maximizes

$$\ell(\theta) = 57 \ln \theta + 43 \ln(1 - \theta).$$

Simple calculus and common sense lead to the estimate $\hat\theta = 0.57$.

Maximum-likelihood edge lengths

For the Jukes-Cantor model, a pair of sequences has $x$ sites with observed differences and $n - x$ sites with the same base.
The probability of any given sequence pair is

$$L(d) = \left(\frac{1}{4}\right)^{n} \times \left(\frac{1}{4} - \frac{1}{4} e^{-\frac{4}{3} d}\right)^{x} \times \left(\frac{1}{4} + \frac{3}{4} e^{-\frac{4}{3} d}\right)^{n - x}$$

which has the form

$$L(\theta) = C \times \theta^{x} (1 - 3\theta)^{n - x}, \qquad \text{where } \theta = \frac{1}{4} - \frac{1}{4} e^{-\frac{4}{3} d}.$$

Solving the calculus problem yields

$$\hat\theta = \frac{x}{3n}.$$

Plugging in and solving for $d$ gives

$$\hat d = -\frac{3}{4} \ln\left(1 - \frac{4}{3} \cdot \frac{x}{n}\right)$$

Computing Likelihood on a Tree

[Figure: a tree with observed tip states G, G, A, G, G]

Transition Probabilities

$$P(0.1) = \begin{pmatrix} 0.90 & 0.04 & 0.04 & 0.03 \\ 0.03 & 0.91 & 0.04 & 0.03 \\ 0.03 & 0.04 & 0.91 & 0.03 \\ 0.03 & 0.04 & 0.04 & 0.90 \end{pmatrix}$$

$$P(0.2) = \begin{pmatrix} 0.81 & 0.07 & 0.07 & 0.05 \\ 0.05 & 0.83 & 0.07 & 0.05 \\ 0.05 & 0.07 & 0.83 & 0.05 \\ 0.05 & 0.07 & 0.07 & 0.81 \end{pmatrix}$$

$$P(0.4) = \begin{pmatrix} 0.67 & 0.13 & 0.13 & 0.08 \\ 0.08 & 0.71 & 0.13 & 0.08 \\ 0.08 & 0.13 & 0.71 & 0.08 \\ 0.08 & 0.13 & 0.13 & 0.67 \end{pmatrix}$$

Model Selection

12 rbcL genes from 12 plant species

Model          p     log-likelihood
JC69           21    -6262.01
K80            22    -6113.86
HKY85          25    -6101.76
HKY85 + Γ5     26    -5764.26
HKY85 + C      35    -5624.70

Here p is the number of parameters. The AIC criterion is to select the model with the lowest AIC score, which is

$$\mathrm{AIC} = -2 \ln(\text{likelihood}) + 2 \times (\text{number of parameters})$$

AIC balances the competing goals of fitting the data well (high likelihood) and keeping the model simple (few parameters).
For these data, the HKY85 + C model is the best among those compared. Relative to HKY85 + Γ5, using 9 more parameters yielded an improvement in log-likelihood of over 139, which lowered the AIC by about 261 (2 × 139.56 − 2 × 9 = 261.12).

The Bootstrap: A brief history

The bootstrap was introduced to the world by Brad Efron, chair of the Department of Statistics at Stanford University, in 1979.
The bootstrap is one of the most widely used new methods in statistics invented within the past 50 years.
In a special issue of Statistical Science that celebrates the 25th anniversary of the bootstrap, Brad Efron uses its application to phylogenetics as one of a small number of examples to illustrate its use and importance.

The General Bootstrap Framework

We have a sample $x_1, \ldots, x_n$ drawn from a distribution $F$, from which we wish to estimate a parameter $\theta$ using a statistic $\hat\theta = T(x_1, \ldots, x_n)$. (We might think of $\theta$ as being the median of the distribution, for example, and $\hat\theta = T(x_1, \ldots, x_n)$ as the sample median.)
If we wanted to compute the standard error of the estimate, we would ideally compute the standard deviation of $T(X_1, \ldots, X_n)$, where $X_i \sim \text{iid } F$.
We could estimate this to any desired degree of accuracy by generating a large enough number (say $B$) of random samples $X_1, \ldots, X_n$, computing $\hat\theta_i = T(X_1, \ldots, X_n)$ for the $i$th such sample, and then computing the standard deviation of these estimates:

$$\sqrt{\frac{\sum_{i=1}^{B} (\hat\theta_i - \theta)^2}{B}}$$

The Key Idea

Unfortunately, we cannot take multiple samples from $F$.
However, our original sample $x_1, \ldots, x_n$ is an estimate of the distribution $F$.
Instead of taking samples from $F$, we could sample from the estimated distribution $\hat F$ by sampling from our original sample with replacement.

The Procedure

We sample $n$ values $x^*_1, \ldots, x^*_n$ with replacement from $x_1, \ldots, x_n$.
It is very likely that some of the original $x$ values will be sampled multiple times and others will not be sampled at all.
For each sample, compute the estimate of $\theta$ using the original statistic. The $i$th estimate is $\hat\theta^*_i = T(x^*_1, \ldots, x^*_n)$.
Repeat this $B$ times and compute the standard deviation of the bootstrap estimates around the estimate from the original sample:

$$\sqrt{\frac{\sum_{i=1}^{B} (\hat\theta^*_i - \hat\theta)^2}{B}}$$

Why it works

If the sampling distribution of the bootstrap sample estimate $\hat\theta^*$ around the estimate $\hat\theta$ is similar to the sampling distribution of the estimate $\hat\theta$ around the true value $\theta$, then the bootstrap standard error will be a good estimate of the real standard error.
The bootstrap can be used to estimate bias and variance, to construct confidence intervals, and to test hypotheses in many situations.
It does depend critically on the assumption of independence of the original sample.

Consensus Trees

A strict consensus tree shows only those clades that appear in every sampled tree.
A majority rule consensus tree shows all clades that appear in more than half the sample of trees. (Notice that two clades that each appear in more than half the sampled trees must appear in at least one tree together, implying that they are compatible with one another.)
A priority consensus tree adds clades to the majority rule consensus tree in order of decreasing frequency in the sample, provided that these clades do not conflict with a clade with higher frequency.

Example

[Figure: five sampled trees with tip orders E K V P A; E V P K A; E V P K A; E V K P A; E V K P A]

Dynamic Exploration of Tree Samples

Show off Mark Derthick's Summary Tree Explorer.
Software is free and available at http://cityscape.inf.cs.cmu.edu/phylogeny/ .

Interpretation of Bootstrap Proportions

What does a bootstrap proportion mean? Let me count the ways.
Confidence that the clade is in the true tree.
Bayesian posterior probability that the clade is in the true tree.
One minus
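As a numerical check on the Jukes-Cantor edge-length formula from the "Maximum-likelihood edge lengths" slide, the closed-form estimate can be compared against the log-likelihood directly. This is a minimal sketch: the function names and the counts (10 differences in 100 sites) are illustrative assumptions, not values from the slides.

```python
import math


def jc69_distance(num_differences, num_sites):
    """Closed-form ML distance: d = -(3/4) ln(1 - (4/3)(x/n))."""
    if num_sites <= 0 or not 0 <= num_differences <= num_sites:
        raise ValueError("need 0 <= num_differences <= num_sites and num_sites > 0")
    p = num_differences / num_sites
    if p >= 0.75:
        # The logarithm's argument is non-positive; the likelihood keeps
        # increasing as d grows, so no finite maximizer exists.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc69_log_likelihood(d, num_differences, num_sites):
    """Log of L(d) = (1/4)^n * theta^x * (1 - 3*theta)^(n - x), for d > 0."""
    theta = 0.25 - 0.25 * math.exp(-4.0 * d / 3.0)
    return (num_sites * math.log(0.25)
            + num_differences * math.log(theta)
            + (num_sites - num_differences) * math.log(1.0 - 3.0 * theta))


# Hypothetical counts, for illustration only.
x, n = 10, 100
d_hat = jc69_distance(x, n)
```

Evaluating `jc69_log_likelihood` slightly to either side of `d_hat` should give smaller values, which is one way to confirm that the closed form sits at the maximum.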
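The AIC comparison on the "Model Selection" slide can be reproduced from the tabulated parameter counts and log-likelihoods. The sketch below applies AIC = −2 ln(likelihood) + 2 × (number of parameters) to each row; the ASCII model labels and variable names are choices made here for illustration.

```python
# (model name, number of parameters, log-likelihood) as given in the table.
models = [
    ("JC69", 21, -6262.01),
    ("K80", 22, -6113.86),
    ("HKY85", 25, -6101.76),
    ("HKY85+Gamma5", 26, -5764.26),
    ("HKY85+C", 35, -5624.70),
]


def aic(log_likelihood, num_parameters):
    """AIC = -2 * log-likelihood + 2 * (number of parameters)."""
    return -2.0 * log_likelihood + 2.0 * num_parameters


aic_scores = {name: aic(loglik, p) for name, p, loglik in models}

# The preferred model is the one with the lowest AIC score.
best_model = min(aic_scores, key=aic_scores.get)

# Drop in AIC when moving from HKY85+Gamma5 to HKY85+C.
aic_drop = aic_scores["HKY85+Gamma5"] - aic_scores["HKY85+C"]
```

With the formula as stated, nine extra parameters cost 18 AIC units, while the log-likelihood gain of 139.56 is worth 279.12, so the net drop is 261.12.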
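The bootstrap procedure described on the slides (resample n values with replacement, recompute the statistic, then take the root-mean-square deviation of the bootstrap estimates around the original estimate) can be sketched with the Python standard library. The function name, the default of 1000 resamples, the fixed seed, and the data are assumptions made for illustration. The sample median is used as the statistic T, following the example in "The General Bootstrap Framework".

```python
import math
import random
import statistics


def bootstrap_standard_error(sample, statistic, num_resamples=1000, seed=0):
    """Bootstrap standard error of statistic(sample).

    Uses the slide formula: sqrt(sum((theta_star_i - theta_hat)^2) / B),
    where theta_hat is the estimate from the original sample.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    theta_hat = statistic(sample)
    total_squared_deviation = 0.0
    for _ in range(num_resamples):
        # Draw n values with replacement from the original sample.
        resample = rng.choices(sample, k=n)
        theta_star = statistic(resample)
        total_squared_deviation += (theta_star - theta_hat) ** 2
    return math.sqrt(total_squared_deviation / num_resamples)


# Hypothetical data, for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
se_median = bootstrap_standard_error(data, statistics.median)
```

Any function of a sample can be passed as `statistic`. Because each resample treats the observations as interchangeable draws, the sketch relies on the independence assumption noted under "Why it works".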

