Berkeley STATISTICS 246 - Molecular evolution


Molecular evolution, cont. Comparing estimation methods. Application to human and mouse sequences
The resolvent method
Resolvent method, cont.
Slide 4
REVIEW: BLOck SUbstitution Matrices (BLOSUM)
REVIEW: BLOSUM matrices, cont.
Slide 7
REVIEW: BLOSUM, final remarks.
An inhomogeneous process
Simulation comparison of methods
Simulation comparison, cont.
Reversible homogeneous process
Smooth inhomogeneous process
Discontinuous inhomogeneous process
Summary of comparison
Modelling genomic DNA base substitution
Data preprocessing and summary
Plot of percent identity against position along chr 22 for each HSP
Plot of percent identity against position for the first 1,000 HSPs
Plot of percent identity against position for the 5,001st to 6,000th HSP
Plot of human GC-content against position along chromosome 22
Scatterplot of human and mouse GC-content for each HSP
Comparing base compositions
Histogram of imbalance between human and mouse base compositions
Plot of composition imbalance against position along chromosome 22
Estimating reversible calibrated rate matrices
First results
Plot of percent identity against estimated distance in PAMs.
Goodness-of-fit
Goodness-of-fit, cont.
Histogram of 1,000 $\chi^2$ statistics on simulated HSPs.
GC-specific rate matrices
GC-specific rate matrices: symmetric parts and stationary distributions
Goodness-of-fit revisited
PowerPoint Presentation
Goodness-of-fit, almost concluded.
$\chi^2_{14}$ qq-plot of HSPs with composition close to (.2,.3,.2,.3)
Goodness-of-fit, concluded.

Molecular evolution, cont.
Comparing estimation methods.
Application to human and mouse sequences
Lecture 16, Statistics 246
March 16, 2004

The resolvent method

Müller and Vingron (2000) proposed a fast estimation method for sequence pairs based on resolvents. It can in fact be applied to multiple alignments generated by a reversible Markov process. For $\alpha > 0$, the resolvent $R_\alpha$ of a rate matrix $Q$ is given by $R_\alpha = (\alpha I - Q)^{-1}$. Solving for $Q$ gives the following formula: $Q = \alpha I - R_\alpha^{-1}$.
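The definition of the resolvent and its inversion can be checked numerically. The sketch below is only an illustration: the 4-state rate matrix `Q`, the value of `alpha`, and the helper names `resolvent` and `rate_from_resolvent` are all hypothetical and are not taken from the lecture or from Müller and Vingron.

```python
import numpy as np

# Hypothetical symmetric 4-state rate matrix (rows sum to zero).
# The numbers are illustrative only, not estimates from any data.
Q = np.array([
    [-0.9,  0.3,  0.4,  0.2],
    [ 0.3, -0.8,  0.2,  0.3],
    [ 0.4,  0.2, -0.9,  0.3],
    [ 0.2,  0.3,  0.3, -0.8],
])
alpha = 0.5  # any alpha > 0 (assumed value)


def resolvent(Q, alpha):
    """Return R_alpha = (alpha * I - Q)^{-1}."""
    n = Q.shape[0]
    return np.linalg.inv(alpha * np.eye(n) - Q)


def rate_from_resolvent(R, alpha):
    """Invert the definition: Q = alpha * I - R_alpha^{-1}."""
    n = R.shape[0]
    return alpha * np.eye(n) - np.linalg.inv(R)


R = resolvent(Q, alpha)
Q_back = rate_from_resolvent(R, alpha)

# The round trip should reproduce Q up to floating-point error.
print(np.allclose(Q_back, Q))
# Because the rows of Q sum to zero, the rows of alpha * R_alpha sum to one.
print(alpha * R.sum(axis=1))
```

In an actual application `R` would be an estimate rather than an exact inverse, so the recovered matrix need not have exactly zero row sums.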
It turns out that the resolvent is the Laplace transform of the transition matrices,

$$R_\alpha = \int_0^\infty e^{-\alpha t} P(t)\,dt.$$

Resolvent method, cont.

This is the key formula. (Prove it.) Given many pairs of sequences, not necessarily disjoint, that are separated by $t$ PAMs, an unbiased estimate of $P(t)$ can be obtained by normalizing the symmetrized sum of frequency tables. If $P(t)$ can be estimated for a wide range of $t$, then we can get an estimate of $Q$ via the last two equations. That is the idea. In practice, there are two issues: (i) the distances are unknown, and so must be estimated by ML, and (ii) the estimated distances are discrete, and so interpolation must be used to estimate the rate matrix. Let the estimated distances be $0 < t_1 < \dots < t_N$. Then the integral is approximately equal to the sum of $N$ pieces:

$$R_\alpha \approx \left(\int_0^{t_1} + \cdots + \int_{t_{N-1}}^{t_N}\right) e^{-\alpha t} P(t)\,dt.$$

The $k$th integral is approximated by linear interpolation (taking $t_0 = 0$ and $P(t_0) = I$ for the first piece),

$$\int_{t_{k-1}}^{t_k} e^{-\alpha t}\left(P(t_{k-1}) + \frac{t - t_{k-1}}{t_k - t_{k-1}}\,\bigl[P(t_k) - P(t_{k-1})\bigr]\right) dt,$$

which can be evaluated exactly, after replacing the $P$s by their estimates. Summing these integrals gives an estimate of $R_\alpha$, and by inversion, of $Q$.

REVIEW: BLOck SUbstitution Matrices (BLOSUM)

Henikoff and Henikoff (1992) used an ad hoc method that takes time inhomogeneity into account to construct the BLOSUM (block substitution matrix) matrices. The input is a set of blocks, which are gap-free multiple alignments of segments of homologous amino acid sequences. A frequency table is derived from the blocks by summing over the match and mismatch patterns from all within-block pairwise comparisons. Since a mismatch such as an A aligned to an S can be written in two ways, AS and SA, we get rid of the ambiguity by using only AS. In general, a mismatch is represented by XY, where X precedes Y alphabetically. For example, suppose that in a block with six sequences, two columns are as follows:

..AD..
..AD..
..AE..
..AE..
..AD..
..SD..
REVIEW: BLOSUM matrices, cont.

There are a total of 15 pairwise comparisons. The left column contributes 10 AA and 5 AS pairs to the frequency table. Similarly, the right column contributes 6 DD, 1 EE and 8 DE pairs. Adding these column contributions within the block, and then across all blocks, gives a triangular frequency table. The matrix is symmetrized by adding itself to its transpose. Dividing the matrix by its sum yields a symmetric joint distribution, and a substitution matrix is obtained as described for the PAM matrices.

To downweight the contribution of the more closely related sequences to the frequency table, the sequences within each block are clustered. Let $\theta$ be a fixed number between 0 and 100. Sequences that are more than $\theta$% similar are “greedily” clustered. In other words, any two sequences that are more than $\theta$% similar are put in the same cluster, and if each sequence already belongs to some cluster, then the two clusters are combined to form a larger cluster. In the end, the sequences within a block are partitioned into disjoint clusters, so that any two sequences from distinct clusters are less than $\theta$% similar. It is clear that the clusters are well-defined, i.e., independent of the initial choice of sequences.

Sequences in the same cluster are downweighted by the cluster size in cross-cluster pairwise comparisons, and pairwise comparisons of sequences in the same cluster do not contribute to the frequency table. In the example, suppose that the first four sequences are clustered while the last two sequences are not. Then the contribution of the left column is the same as that of an A-A-S column: 1 AA and 2 AS pairs. The right column is effectively (D/E)-D-D, where D/E represents half a D and half an E. Its contribution is 2 DD (1 + 1/2 + 1/2) and 1 DE (1/2 + 1/2) pairs. Equivalently, sequences in the same cluster are replaced by an “average” sequence with a fractional number of residues at each position.
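The pair counts for the two example columns, with and without clustering, can be reproduced with a short sketch. This is an illustration under stated assumptions, not the Henikoffs' implementation: the function names `column_pair_counts` and `joint_distribution` are hypothetical, and the clusters are supplied by hand rather than computed from sequence similarity.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def column_pair_counts(column, clusters):
    """Weighted pair counts for one alignment column.

    column   : string with one residue per sequence.
    clusters : list of lists of sequence indices (a partition, given by hand).
    Each sequence carries weight 1/len(cluster); only pairs of sequences
    from different clusters contribute.
    """
    counts = Counter()
    for c1, c2 in combinations(clusters, 2):
        weight = Fraction(1, len(c1) * len(c2))
        for i in c1:
            for j in c2:
                # Write a mismatch with the alphabetically smaller residue first.
                pair = "".join(sorted(column[i] + column[j]))
                counts[pair] += weight
    return counts


def joint_distribution(counts):
    """Symmetrize the triangular table (add its transpose) and normalize."""
    total = sum(counts.values())
    f = {}
    for pair, n in counts.items():
        a, b = pair
        if a == b:
            f[(a, a)] = n / total            # diagonal: 2n / (2 * total)
        else:
            f[(a, b)] = n / (2 * total)      # off-diagonal, both orders
            f[(b, a)] = n / (2 * total)
    return f


left, right = "AAAAAS", "DDEEDD"             # the two example columns

unclustered = [[i] for i in range(6)]        # every sequence on its own
print(dict(column_pair_counts(left, unclustered)))
print(dict(column_pair_counts(right, unclustered)))

clustered = [[0, 1, 2, 3], [4], [5]]         # first four sequences clustered
print(dict(column_pair_counts(left, clustered)))
print(dict(column_pair_counts(right, clustered)))
```

Exact fractions are used so that the half-weights from the (D/E) "average" residue add up without rounding.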
Then the frequency table is derived as if the average sequences are real sequences; blocks that have only one cluster are left out. Let the symmetric joint distribution be denoted by $f_\theta$. Let $\pi_\theta$ be the row or column sum of $f_\theta$. Then the substitution matrix BLOSUM $\theta$ is given by

$$S(\theta, a, b) = 2\log_2\left\{\frac{f_\theta(a,b)}{\pi_\theta(a)\,\pi_\theta(b)}\right\}.$$

REVIEW: BLOSUM, final remarks.

If $\theta$ is 100, then every cluster is of size 1, so $f_\theta$ is an average of the

