UCSD CSE 182 - Lecture - D2481550

Home> Schools> University of California, San Diego> Computer Science & Engineering (CSE) > CSE 182> Lecture

DOC PREVIEW

UCSD CSE 182 - Lecture

School name University of California, San Diego

Course Cse 182- Biological Databases

Pages 18

This preview shows page 1-2-3-4-5-6 out of 18 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Slide 1Why is BLAST fast?Keyword MatchingSilly QuizExpectation?P-value computationWhat is a distribution functionP-valueP-value computationZ-scores for alignmentBlast E-valueAltschul Karlin statisticsExponential distributionSlide 14Blast E-valueThe last step in BlastKeyword searchSlide 181/13/19 CSE182CSE182-L6P-value and E-valueDicitionary matchingPattern matching1/13/19 CSE 182Why is BLAST fast?•Assume that keyword searching does not consume any time and that alignment computation the expensive step.•Query m=1000, random Db n=107, no TP•SW = O(nm) = 1000*107 = 1010 computations•BLAST, W=11•E(#11-mer hits)= 1000* (1/4)11 * 107=2384•Number of computations = 2384*100*100=2.384*107•Ratio=1010/(2.384*107)=420•Further speed improvements are possible1/13/19 CSE 182Keyword Matching•How fast can we match keywords?•Hash table/Db index? What is the size of the hash table, for m=11•Suffix trees? What is the size of the suffix trees?•Trie based search. We will do this in class.AATCA5671/13/19 CSE182Silly QuizSkin patternsFacial FeaturesExpectation?•Some quantities can be reasonably guessed by taking a statistical sample, others not–Average weight of a group of 100 people–Average height of a group of 100 people–Average grade on a test•Give an example of a quantity that cannot.•When the distribution, and the expectation is known, it is easy to see when you see something significant.•If the distribution is not well understood, or the wrong distribution is chosen, a wrong conclusion can be drawn1/13/19 CSE 1821/13/19 CSE 182P-value computation•BLAST: The matching regions are expanded into alignments, which are scored using SW, and an appropriate scoring matrix.•The results are presented in order of decreasing scores•The score is just a number.•How significant is the top scoring hits if it has a score S?•Expect/E-value (score S)= Number of times we would expect to see a random query generate a score S, or better•How can we compute E-value?1/13/19 CSE 182What is a distribution function•Given a collection of numbers (scores)–1, 2, 8, 3, 5, 3,6, 4, 4,1,5,3,6,7,….•Plot its distribution as follows:–X-axis =each number–Y-axis (count/frequency/probability) of seeing that number–More generally, the x-axis can be a range to accommodate real numbersP-value•P-value: probability that a specific value (11) is achieved by chance.•Compute an scores obtained by chance – 1, 2, 8, 3, 5, 3,6,12,4, 4,1,5,3,6,7•Compute a Distribution•1-2 XXX•3-4 XXXX•5-6 XXXX•7-8 XX•9-10•11-13 X•15-17•Are1/13/19 CSE 1821/13/19 CSE 182P-value computation•A simple empirical method:•Compute a distribution of scores against a random database.•Use an estimate of the area under the curve to get the probability.•OR, fit the distribution to one of the standard distributions.1/13/19 CSE 182Z-scores for alignment•Initial assumption was that the scores followed a normal distribution.•Z-score computation:–For any alignment, score S, shuffle one of the sequences many times, and recompute alignment. Get mean and standard deviation–Look up a table to get a P-value€ ZS=S − μσ1/13/19 CSE 182Blast E-value•Initial (and natural) assumption was that scores followed a Normal distribution•1990, Karlin and Altschul showed that ungapped local alignment scores follow an exponential distribution•Practical consequence: –Longer tail. –Previously significant hits now not so significant1/13/19 CSE 182Altschul Karlin statistics•For simplicity, assume that the database is a binary string, and so is the query.–Let match-score=1, –mismatch score=- , –indel=- (No gaps allowed)•What does it mean to get a score k?1/13/19 CSE 182Exponential distribution•Random Database, Pr(1) = p •What is the expected number of hits to a sequence of k 1’s•Instead, consider a random binary Matrix. Expected # of diagonals of k 1s€ (n − k) pk≅ nek ln p= ne−k ln1p ⎛ ⎝ ⎜ ⎞ ⎠ ⎟€ Λ =(n − k)(m − k) pk≅ nmek ln p= nme−k ln1p ⎛ ⎝ ⎜ ⎞ ⎠ ⎟1/13/19 CSE 182•As you increase k, the number decreases exponentially.•The number of diagonals of k runs can be approximated by a Poisson process•In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin & Altschul). •As the score increases, the number of alignments that achieve the score decreases exponentially € Pr[u] =Λue−Λu!Pr[u > 0] =1− e−Λ1/13/19 CSE 182Blast E-value•Choose a score such that the expected score between a pair of residues < 0•Expected number of alignments with a score S•For small values, E-value and P-value are the same€ E = Kmne−λS= mn2−λS−ln Kln 2 ⎛ ⎝ ⎜ ⎞ ⎠ ⎟Pr(S ≥ x) =1− e−Kmne− λx1/13/19 CSE 182The last step in Blast•We have discussed–Alignments–Db filtering using keywords–Scoring matrices–E-values and P-values•The last step: Database filtering requires us to scan a large sequence fast for matching keywordsFa05 CSE 182Keyword search•Recall: In BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to the keyword.•Question: Given a collection of strings (keywords), find all occurrences in a database string where they keyword might match.•End of lecture 61/13/19

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-4-5-6 out of 18 pages.

UCSD CSE 182 - Lecture

Sign up for free to view:

Please select your school