1Pairwise sequence alignment global and localMultiple sequence alignmentlocalglobalSubstitution matricesDatabase searchingBLASTEvolutionary tree reconstructionRNA structure predictionGene FindingProtein structure predictionSequence statisticsComputational genomics…1. BLAST (Karlin‐Altschul) StatisticsExpected number of ungapped alignments with score S found with random sequences is:E = Kmn e‐λSNote that E is proportional to the size of the search space, mn, and decreases exponentially with the score, Swhere K is a constant that depends on S[i,j] and can be computed from the theory for any scoring function. The parameter λ is specified by the equation1=Σpipje–λS[ij]2E‐values depend on K and λ, '2SmnEwhich in turn depend on the scoring matrix, S[i,j]. With normalized bit scores, E values from database searches with different scoring matrices can be compared.SKmneEBy normalizing the alignment scores2lnln'KSSwe obtain E values that are independent of K, λ and S[i,j].2. Normalized bit scores3. Target frequenciesGiven sequences a and b:•Alternate hypothesis (Ĥa): a and b are related at n PAMs divergence.– Residues i and j are aligned with “target” frequencies, qnij•Null hypothesis (Ĥ0): a and b are unrelated.– Residues i and j are aligned with background frequencies, pipjjpipijqjiSnn2log],[ Note that the PAM and BLOSUM matrices were constructed by estimating qijfrom data. However, any scoring matrix (that satisfies the appropriate assumptions for Karlin Altschul statistics) can be expressed as a log odds matrix of the form 3The frequencies qijin the above equation are the characteristic target frequencies of the matrix S[] . In other words, qijis the frequency with which i is aligned with j in Maximal Segment Pairs (MSPs) obtained with S[]. Recall that an MSP is “the highest scoring pair of identical length segments chosen from two sequences. The boundaries of an MSP are chosen to maximize its score, so an MSP may be of any length.”1],[ jinSeppqjiijnTarget frequencies for substitution matrix S[]can be estimated empirically as follows:• Generate “random” sequences from background probabilities• Find MSPs in pairs of random sequences using S[] to score alignments• Count target frequencies in those MSPsTarget frequencies can also be estimated theoretically using the equation:1Altschul et al. J Mol Biol 215: 403‐10 (1990)“Theorem” (Karlin and Altschul, 1990)The best scoring matrix for distinguishing significant alignments from chance alignments is the scoring matrix that gives the greatest difference in scores between related alignments and chance alignments. For sequences diverged by n PAMs, the best discrimination is obtained by the matrix corresponding to the qnijfrom related sequences at the evolutionary distance of interest.jpipijqjiSnn2log],[ 4Proof by contradiction:– Suppose 1. S*[ ] is the matrix that best distinguishes chance alignments from related alignments at a given evolutionary distance and let2. the frequencies of observing i paired with j in MSPs (locally maximal ungapped alignments) obtained with S*[ ] are not q*ij.–Then there exists some x and y in Σ that are aligned in MSPs with a frequency greater than q*ij.–We can increase the score of the MSPs by increasing the score for aligning xwith y , indicating that S*[x,y] does not have the best discriminatory power, leading to a contradiction.],[*jiSeppqjiijImplicationsBLAST will give reasonable accuracy as long as the empirical target frequencies, qij,, in the alignments of interest do not deviate too far from the theoretical target frequencies:],[ jiSeppqjiijReasonable accuracy can be achieved with two or three matrices.5The average score (in bits) per alignment position when using a PAM Mmatrix to compare sequences in fact separated by D PAMs(Calculated by simulation)Efficiency = Score with PAM MScore with PAM D= 94% efficiency Choosing your scoring matrix1. BLAST will give reasonable accuracy as long as the empirical target frequencies do not deviate too far from the theoretical target frequencies• Use PAM40, PAM120 & PAM240, or PAM120 & PAM2502. The lower the relative entropy, H, the longer the minimum alignment that is distinguishable from chance.6• The scale indicates % identity between aligned sequences• The Twilight Zone– Around 20%-35% identity– “Random” alignments begin to appearThe “Twilight” Zone1007550250Pairwise searchesPSSM/HMM searchesStructure prediction% IdentityTwilight zone4. Information content of substitution matrices and alignmentsRecall that the expected number of ungapped alignments with score S found with random sequences is: E = Kmn e‐λS. Intuitively, (for a fixed database) the probability of finding a significant match depends on the length of the query, m, and the score of the alignment, S. S, in turn, depends on the ratio between qijand pipj. Another way to think about this: the probability of finding a significant match depends the amount of information that can be derived from an alignment in order to discriminate between related sequences (Ĥa) and chance alignments (Ĥ0). To ask this question, we can take advantage of a formalism from information theory: Relative Entropy.7Relative Entropy also called the Kullback‐Leibler DivergenceSuppose you have a general framework with an alternate hypothesis Ĥaand a null hypothesis Ĥ0.Event i occurs with probability pi under the null hypothesis and probability qiunder the alternate hypothesis The relative entropy,iiiipqqH logis the expected discrimination information: the information available to discriminate in favor of hypothesis Ĥaagainst hypothesis Ĥ0, given Ĥais true. In the BLAST context, the relative entropy gives information available to distinguish related alignments at n PAMs (Ĥa) from chance alignments (Ĥ0) given a particular substitution matrix Sn[]jinijnjijiijnijnjiSqppqqH,,],[logBLOSUM PAM Sequence identity bits/site bits/site20 2.95 83%30 2.5760 2.00 63%90 1.18 100 1.18 43%80 0.99 120 0.98 38%60 0.66 160 0.70 30%50 0.52 200 0.51 25%45 0.38 250 0.36 20%The relative entropy of a substitution matrix is given in bits per position and can be calculated from S[] using the equationsjinijnjiSqH,],[],[ jinSeppqjiijnand8How long does a sequence have to be in order to find a statistically significant match at a given evolutionary divergence? One
View Full Document