Unformatted text preview:

1Pairwise sequence alignment global and localMultiple sequence alignmentlocalglobalSubstitution matricesDatabase searchingBLASTEvolutionary tree reconstructionRNA structure predictionGene FindingProtein structure predictionSequence statisticsComputational genomics…1. BLAST (Karlin‐Altschul) StatisticsExpected number of ungapped alignments with score S found with random sequences is:E = Kmn e‐λSNote that E is proportional to the size of the search space, mn, and decreases exponentially with the score, Swhere K is a constant that depends on S[i,j] and can be computed from the theory for any scoring function. The parameter λ is specified by the equation1=Σpipje–λS[ij]2E‐values depend on K and λ, '2SmnEwhich in turn depend on the scoring matrix, S[i,j]. With normalized bit scores, E values from database searches with different scoring matrices can be compared.SKmneEBy normalizing the alignment scores2lnln'KSSwe obtain E values that are independent of K, λ and S[i,j].2. Normalized bit scores3. Target frequenciesGiven sequences a and b:•Alternate hypothesis (Ĥa): a and b are related at n PAMs divergence.– Residues i and j are aligned with “target” frequencies, qnij•Null hypothesis (Ĥ0): a and b are unrelated.– Residues i and j are aligned with background frequencies, pipjjpipijqjiSnn2log],[ Note that the PAM and BLOSUM matrices were constructed by estimating qijfrom data. However, any scoring matrix (that satisfies the appropriate assumptions for Karlin Altschul statistics) can be expressed as a log odds matrix of the form 3The frequencies qijin the above equation are the characteristic target frequencies of the matrix S[] . In other words, qijis the frequency with which i is aligned with j in Maximal Segment Pairs (MSPs) obtained with S[]. Recall that an MSP is “the highest scoring pair of identical length segments chosen from two sequences. The boundaries of an MSP are chosen to maximize its score, so an MSP may be of any length.”1],[ jinSeppqjiijnTarget frequencies for substitution matrix S[]can be estimated empirically as follows:• Generate “random” sequences from background probabilities• Find MSPs in pairs of random sequences using S[] to score alignments• Count target frequencies in those MSPsTarget frequencies can also be estimated theoretically using the equation:1Altschul et al. J Mol Biol 215: 403‐10 (1990)“Theorem” (Karlin and Altschul, 1990)The best scoring matrix for distinguishing significant alignments from chance alignments is the scoring matrix that gives the greatest difference in scores between related alignments and chance alignments. For sequences diverged by n PAMs, the best discrimination is obtained by the matrix corresponding to the qnijfrom related sequences at the evolutionary distance of interest.jpipijqjiSnn2log],[ 4Proof by contradiction:– Suppose 1. S*[ ] is the matrix that best distinguishes chance alignments from related alignments at a given evolutionary distance and let2. the frequencies of observing i paired with j in MSPs (locally maximal ungapped alignments) obtained with S*[ ] are not q*ij.–Then there exists some x and y in Σ that are aligned in MSPs with a frequency greater than q*ij.–We can increase the score of the MSPs by increasing the score for aligning xwith y , indicating that S*[x,y] does not have the best discriminatory power, leading to a contradiction.],[*jiSeppqjiijImplicationsBLAST will give reasonable accuracy as long as the empirical target frequencies, qij,, in the alignments of interest do not deviate too far from the theoretical target frequencies:],[ jiSeppqjiijReasonable accuracy can be achieved with two or three matrices.5The average score (in bits) per alignment position when using a PAM Mmatrix to compare sequences in fact separated by D PAMs(Calculated by simulation)Efficiency = Score with PAM MScore with PAM D= 94% efficiency Choosing your scoring matrix1. BLAST will give reasonable accuracy as long as the empirical target frequencies do not deviate too far from the theoretical target frequencies• Use PAM40, PAM120 & PAM240, or PAM120 & PAM2502. The lower the relative entropy, H, the longer the minimum alignment that is distinguishable from chance.6• The scale indicates % identity between aligned sequences• The Twilight Zone– Around 20%-35% identity– “Random” alignments begin to appearThe “Twilight” Zone1007550250Pairwise searchesPSSM/HMM searchesStructure prediction% IdentityTwilight zone4. Information content of substitution matrices and alignmentsRecall that the expected number of ungapped alignments with score S found with random sequences is: E = Kmn e‐λS. Intuitively, (for a fixed database) the probability of finding a significant match depends on the length of the query, m, and the score of the alignment, S. S, in turn, depends on the ratio between qijand pipj. Another way to think about this: the probability of finding a significant match depends the amount of information that can be derived from an alignment in order to discriminate between related sequences (Ĥa) and chance alignments (Ĥ0). To ask this question, we can take advantage of a formalism from information theory: Relative Entropy.7Relative Entropy also called the Kullback‐Leibler DivergenceSuppose you have a general framework with an alternate hypothesis Ĥaand a null hypothesis Ĥ0.Event i occurs with probability pi under the null hypothesis and probability qiunder the alternate hypothesis The relative entropy,iiiipqqH logis the expected discrimination information: the information available to discriminate in favor of hypothesis Ĥaagainst hypothesis Ĥ0, given Ĥais true. In the BLAST context, the relative entropy gives information available to distinguish related alignments at n PAMs (Ĥa) from chance alignments (Ĥ0) given a particular substitution matrix Sn[]jinijnjijiijnijnjiSqppqqH,,],[logBLOSUM PAM Sequence identity bits/site bits/site20 2.95 83%30 2.5760 2.00 63%90 1.18 100 1.18 43%80 0.99 120 0.98 38%60 0.66 160 0.70 30%50 0.52 200 0.51 25%45 0.38 250 0.36 20%The relative entropy of a substitution matrix is given in bits per position and can be calculated from S[] using the equationsjinijnjiSqH,],[],[ jinSeppqjiijnand8How long does a sequence have to be in order to find a statistically significant match at a given evolutionary divergence? One


View Full Document

CMU BSC 03711 - lecture

Documents in this Course
lecture

lecture

8 pages

Lecture

Lecture

3 pages

Homework

Homework

10 pages

Lecture

Lecture

17 pages

Delsuc05

Delsuc05

15 pages

hmwk1

hmwk1

2 pages

lecture

lecture

6 pages

Lecture

Lecture

10 pages

barnacle4

barnacle4

15 pages

review

review

10 pages

Homework

Homework

10 pages

Midterm

Midterm

12 pages

lecture

lecture

11 pages

lecture

lecture

32 pages

Lecture

Lecture

7 pages

Lecture

Lecture

17 pages

Lecture

Lecture

12 pages

Lecture

Lecture

21 pages

Lecture

Lecture

11 pages

Lecture

Lecture

28 pages

Homework

Homework

13 pages

Logistics

Logistics

11 pages

lecture

lecture

11 pages

Lecture

Lecture

8 pages

Lecture

Lecture

9 pages

lecture

lecture

8 pages

Problem

Problem

6 pages

Homework

Homework

10 pages

Lecture

Lecture

9 pages

Problem

Problem

7 pages

hmwk4

hmwk4

7 pages

Problem

Problem

6 pages

lecture

lecture

16 pages

Problem

Problem

8 pages

Problem

Problem

6 pages

Problem

Problem

13 pages

Problem

Problem

11 pages

Notes

Notes

7 pages

Lecture

Lecture

7 pages

Lecture

Lecture

10 pages

Lecture

Lecture

9 pages

Homework

Homework

15 pages

Lecture

Lecture

16 pages

Problem

Problem

15 pages

Load more
Download lecture
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view lecture and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view lecture 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?