October 09 CSE182 CSE182-L6 P-value and E-value Dicitionary matching Pattern matchingOctober 09 CSE 182 Why is BLAST fast? • Assume that keyword searching does not consume any time and that alignment computation the expensive step. • Query m=1000, random Db n=107, no TP • SW = O(nm) = 1000*107 = 1010 computations • BLAST, W=11 • E(#11-mer hits)= 1000* (1/4)11 * 107=2384 • Number of computations = 2384*100*100=2.384*107 • Ratio=1010/(2.384*107)=420 • Further speed improvements are possibleOctober 09 CSE 182 Keyword Matching • How fast can we match keywords? • Hash table/Db index? What is the size of the hash table, for m=11 • Suffix trees? What is the size of the suffix trees? • Trie based search. We will do this in class. AATCA 567October 09 CSE182 Silly Quiz Skin patterns Facial FeaturesExpectation? • Some quantities can be reasonably guessed by taking a statistical sample, others not – Average weight of a group of 100 people – Average height of a group of 100 people – Average grade on a test • Give an example of a quantity that cannot. • When the distribution, and the expectation is known, it is easy to see when you see something significant. • If the distribution is not well understood, or the wrong distribution is chosen, a wrong conclusion can be drawn October 09 CSE 182October 09 CSE 182 P-value computation • BLAST: The matching regions are expanded into alignments, which are scored using SW, and an appropriate scoring matrix. • The results are presented in order of decreasing scores • The score is just a number. • How significant is the top scoring hits if it has a score S? • Expect/E-value (score S)= Number of times we would expect to see a random query generate a score S, or better • How can we compute E-value?October 09 CSE 182 What is a distribution function • Given a collection of numbers (scores) – 1, 2, 8, 3, 5, 3,6, 4, 4,1,5,3,6,7,…. • Plot its distribution as follows: – X-axis =each number – Y-axis (count/frequency/probability) of seeing that number – More generally, the x-axis can be a range to accommodate real numbersP-value • P-value: probability that a specific value (11) is achieved by chance. • Compute an scores obtained by chance – 1, 2, 8, 3, 5, 3,6,12,4, 4,1,5,3,6,7 • Compute a Distribution • 1-2 XXX!• 3-4 XXXX!• 5-6 XXXX!• 7-8 XX!• 9-10!• 11-13 X!• 15-17!• Are!October 09 CSE 182October 09 CSE 182 P-value computation • A simple empirical method: • Compute a distribution of scores against a random database. • Use an estimate of the area under the curve to get the probability. • OR, fit the distribution to one of the standard distributions.October 09 CSE 182 Z-scores for alignment • Initial assumption was that the scores followed a normal distribution. • Z-score computation: – For any alignment, score S, shuffle one of the sequences many times, and recompute alignment. Get mean and standard deviation – Look up a table to get a P-value € ZS=S −µσOctober 09 CSE 182 Blast E-value • Initial (and natural) assumption was that scores followed a Normal distribution • 1990, Karlin and Altschul showed that ungapped local alignment scores follow an exponential distribution • Practical consequence: – Longer tail. – Previously significant hits now not so significantOctober 09 CSE 182 Altschul Karlin statistics • For simplicity, assume that the database is a binary string, and so is the query. – Let match-score=1, – mismatch score=- ∞, – indel=-∞ (No gaps allowed) • What does it mean to get a score k?October 09 CSE 182 Exponential distribution • Random Database, Pr(1) = p • What is the expected number of hits to a sequence of k 1’s • Instead, consider a random binary Matrix. Expected # of diagonals of k 1s € (n − k) pk≅ nek ln p= ne−k ln1p € Λ = (n − k)(m − k) pk≅ nmek ln p= nme−k ln1p October 09 CSE 182 • As you increase k, the number decreases exponentially. • The number of diagonals of k runs can be approximated by a Poisson process • In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin & Altschul). • As the score increases, the number of alignments that achieve the score decreases exponentially € Pr[u] =Λue−Λu!Pr[u > 0] = 1− e−ΛOctober 09 CSE 182 Blast E-value • Choose a score such that the expected score between a pair of residues < 0 • Expected number of alignments with a score S • For small values, E-value and P-value are the same € E = Kmne−λS= mn2−λS−ln Kln 2 Pr(S ≥ x) = 1− e−Kmne−λxOctober 09 CSE 182 The last step in Blast • We have discussed – Alignments – Db filtering using keywords – Scoring matrices – E-values and P-values • The last step: Database filtering requires us to scan a large sequence fast for matching keywordsFa05 CSE 182 Keyword search • Recall: In BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to the keyword. • Question: Given a collection of strings (keywords), find all occurrences in a database string where they keyword might match.• End of lecture 6 October 09
View Full Document