U of I CS 498 - Lecture 15



BLAST: Basic Local Alignment Search Tool
Altschul et al. J. Mol. Biol. 1990.

Motivation
• Sequence homology to a known protein suggests the function of a newly sequenced protein
• Bioinformatics task is to find homologous sequences in a database of sequences
• Databases of sequences are growing fast

Alignment
• Natural approach to check if the “query sequence” is homologous to a sequence in the database is to compute the alignment score of the two sequences
• Alignment score counts gaps (insertions, deletions) and replacements
• Minimizing the evolutionary distance

Alignment
• Global alignment: optimize the overall similarity of the two sequences
• Local alignment: find only relatively conserved subsequences
• Local similarity measures preferred for database searches
  – Distantly related proteins may only share isolated regions of similarity

Alignment
• Dynamic programming is the standard approach to sequence alignment
• Algorithm is quadratic in the length of the two sequences
• Not practical for searches against a very large database of sequences (e.g., whole genome)

Scoring alignments
• Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)
• Amino acid sequences: “PAM” matrix
  – Consider amino acid sequence alignment for very closely related proteins, extract replacement frequencies (probabilities), extrapolate to greater evolutionary distances
• DNA sequences: match = +5, mismatch = -4

BLAST: the MSP
• Given two sequences of the same length, the similarity score of their alignment (without gaps) is the sum of similarity values for each pair of aligned residues
• Maximal segment pair (MSP): highest scoring pair of identical length segments from the two sequences
• The similarity score of an MSP is called the MSP score
• BLAST heuristically aims to maximize this

Locally maximal segment pair
• A segment pair (segments of identical lengths) is locally maximal if its score cannot be improved by extending or shortening in either direction
• BLAST attempts to find all locally maximal segment pairs above some score cutoff.

Rapid approximation of MSP score
• Goal is to report those database sequences that have MSP score above some threshold S.
• Statistics tells us what is the highest threshold S at which “chance similarities” are likely to appear

Rapid approximation of MSP score
• BLAST minimizes time spent on database sequences whose similarity with the query has little chance of exceeding this cutoff S.
• Main strategy: seek only segment pairs (one from database, one from query) that contain a word pair with score >= T
• Intuition: if the sequence pair has to score above S, its best-matched word (of some predetermined small length) must score above T
• Lower T => Fewer false negatives
• Lower T => More pairs to analyze

Implementation
1. Compile a list of high scoring words
2. Scan database for hits to this word list
3. Extend hits

Compiling list of words
• Protein: list of all w-length words that score at least T when compared to some word in the query

Scanning the database for hits
• Find exact matches to list words
• Can be done in linear time
• Each word in the list points to all occurrences of the word in the query sequence

Extending hits
• Once a word pair with score >= T has been found, extend it in each direction.
• Extend until score >= S is obtained
• During extension, score may go up, and then down, and then up again
• Terminate if it goes down too much (a certain distance below the best score found for shorter extensions)
• One implementation allows gaps during extension

BLAST: approximating the MSP
• BLAST may not find all segment pairs above threshold S
• Trying to approximate the MSP
• Bounds on the error: not hard bounds, but statistical bounds
  – “Highly likely” to find the MSP

Statistics
• Suppose the MSP has been calculated by BLAST (and suppose this is the true MSP)
• Suppose this observed MSP scores S.
• What are the chances that the MSP score for two unrelated sequences would be >= S?
• If the chances are very low, then we can be confident that the two sequences are not unrelated

Statistics
• Given two random sequences of lengths m and n
• Probability that they will produce an MSP score of >= x ?

Statistics
• Number of separate SPs with score >= x is Poisson distributed with mean $y(x) = K m n \, e^{-\lambda x}$
• where λ is the positive solution of $\sum_{i,j} p_i p_j \, e^{\lambda s(i,j)} = 1$
• s(i,j) is the scoring matrix, $p_i$ is the frequency of i in random sequences

Statistics
• Poisson distribution with mean y: $\Pr(k) = e^{-y} y^{k} / k!$
• $\Pr(\#\mathrm{SPs} \ge \alpha) = 1 - \Pr(\#\mathrm{SPs} \le \alpha - 1) = 1 - e^{-y} \sum_{i=0}^{\alpha-1} \frac{y^{i}}{i!}$

Statistics
• For α = 1, $\Pr(\#\mathrm{SPs} \ge 1) = 1 - e^{-y(x)}$
• Choose S such that $1 - e^{-y(S)}$ is small
• Suppose the probability of having at least 1 MSP with score >= S is 0.001.
• This seems reasonably small
• However, if you test 10000 random sequences, you expect 10 to cross the threshold
• Therefore, require “E-value” to be small.
• That is, expected number of random sequence pairs with score >= S should be small.

More statistics
• We just saw how to choose threshold S
• How to choose T ?
• BLAST is trying to find segment pairs (SPs) scoring above S
• If an SP scores S, what is the probability that it will have a w-word match of score T or more?
• We want this probability to be high

More statistics: Choosing T
• Given a segment pair (from two random sequences) that scores S, what is the probability q that it will have no w-word match scoring above T?
• Want this q to be low
• Obtained from simulations
• Found to decrease exponentially as S increases

BLAST is the universally used bioinformatics
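
As a point of reference for the MSP definition above, the exact MSP score of two short sequences can be computed by brute force. This is a minimal sketch, not part of BLAST itself: it assumes the DNA scores quoted in the lecture (match = +5, mismatch = -4), and the function name `msp_score` is hypothetical. It walks every diagonal of the comparison matrix and keeps the best-scoring run of aligned residues (a maximum-subarray scan), so its running time grows with the product of the two lengths, which is the cost BLAST tries to avoid.

```python
MATCH, MISMATCH = 5, -4  # DNA scores quoted in the lecture


def msp_score(a, b):
    """Exact MSP score of sequences a and b (ungapped, non-empty segments).

    Each diagonal d = j - i of the comparison matrix is one gap-free
    alignment of a against b; the best contiguous run on any diagonal
    is the maximal segment pair.
    """
    best = None
    for d in range(-(len(a) - 1), len(b)):
        # Starting cell of diagonal d.
        i, j = (-d, 0) if d < 0 else (0, d)
        run = 0
        while i < len(a) and j < len(b):
            s = MATCH if a[i] == b[j] else MISMATCH
            # Either extend the current run or start a new one here.
            run = max(run + s, s)
            best = run if best is None else max(best, run)
            i += 1
            j += 1
    return best
```

For example, `msp_score("TTACGTACGG", "CCACGTACAA")` returns 30: the two sequences share the six-residue segment `ACGTAC` at the same offset, and every flanking pair is a mismatch.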

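
The three implementation steps (compile words, scan, extend) can be sketched for a single query and a single database sequence. This is an illustration under stated assumptions, not the published implementation: it uses the lecture's DNA scores instead of a protein matrix so the word list stays small, it extends without gaps, and the names `compile_words`, `scan`, `extend`, `search` and the parameter `drop` (how far the running score may fall below the best score before extension stops) are all hypothetical.

```python
from itertools import product

MATCH, MISMATCH = 5, -4  # DNA scores quoted in the lecture


def pair_score(x, y):
    return MATCH if x == y else MISMATCH


def word_score(u, v):
    """Ungapped score of two equal-length words."""
    return sum(pair_score(x, y) for x, y in zip(u, v))


def compile_words(query, w, T, alphabet="ACGT"):
    """Step 1: map every w-length word scoring >= T against some query
    word to the query positions it matches."""
    words = {}
    for pos in range(len(query) - w + 1):
        q_word = query[pos:pos + w]
        for cand in map("".join, product(alphabet, repeat=w)):
            if word_score(cand, q_word) >= T:
                words.setdefault(cand, []).append(pos)
    return words


def scan(subject, words, w):
    """Step 2: yield (query_pos, subject_pos) for each exact match
    between a subject word and a list word (one pass over the subject)."""
    for s in range(len(subject) - w + 1):
        for q in words.get(subject[s:s + w], ()):
            yield q, s


def extend(query, subject, q, s, w, drop):
    """Step 3: extend a hit right, then left, without gaps. A direction
    stops once the running score is more than `drop` below the best."""
    best = word_score(query[q:q + w], subject[s:s + w])
    run, right, k = best, 0, 0
    while q + w + k < len(query) and s + w + k < len(subject):
        run += pair_score(query[q + w + k], subject[s + w + k])
        k += 1
        if run > best:
            best, right = run, k
        elif best - run > drop:
            break
    run, left, k = best, 0, 1
    while q - k >= 0 and s - k >= 0:
        run += pair_score(query[q - k], subject[s - k])
        if run > best:
            best, left = run, k
        elif best - run > drop:
            break
        k += 1
    return best, q - left, s - left, left + w + right


def search(query, subject, w, T, S, drop):
    """Return {(score, query_start, subject_start, length)} with score >= S."""
    words = compile_words(query, w, T)
    found = set()
    for q, s in scan(subject, words, w):
        hit = extend(query, subject, q, s, w, drop)
        if hit[0] >= S:
            found.add(hit)
    return found
```

With `w=3` and `T=15` only exact three-letter words are seeds; lowering `T` to 6 admits words with one mismatch, which gives more hits to extend, as the lecture notes. The sketch does not trim a seed word's own mismatching ends, so with a low `T` it can report a segment that is not locally maximal.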

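
The statistics slides can be made concrete with a small numerical sketch. It assumes uniform base frequencies and the lecture's DNA scores, solves the equation for λ by bisection, and evaluates the mean y(x) and the probability of at least one SP scoring >= x. The function names are hypothetical, and K is left as a caller-supplied parameter because the lecture does not give a value for it.

```python
import math

# Assumptions for illustration: uniform base frequencies, lecture DNA scores.
DNA_FREQS = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def dna_score(i, j):
    return 5 if i == j else -4


def lambda_equation(lam, freqs, score):
    """Left side minus right side of sum_ij p_i p_j exp(lam * s(i,j)) = 1."""
    return sum(pi * pj * math.exp(lam * score(i, j))
               for i, pi in freqs.items()
               for j, pj in freqs.items()) - 1.0


def solve_lambda(freqs, score, lo=1e-6, hi=10.0, tol=1e-12):
    """Positive root of the equation above, found by bisection.

    lam = 0 is always a root; a positive root exists when the expected
    score of a random pair is negative, so the search starts just above 0.
    """
    if not (lambda_equation(lo, freqs, score) < 0.0
            < lambda_equation(hi, freqs, score)):
        raise ValueError("no sign change on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if lambda_equation(mid, freqs, score) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def expected_sps(x, m, n, K, lam):
    """Mean y(x) = K m n exp(-lam x) of the Poisson count of SPs scoring >= x."""
    return K * m * n * math.exp(-lam * x)


def p_value(x, m, n, K, lam):
    """Pr(#SPs >= 1) = 1 - exp(-y(x)); expm1 keeps precision for small y."""
    return -math.expm1(-expected_sps(x, m, n, K, lam))
```

Under these assumptions `solve_lambda(DNA_FREQS, dna_score)` comes out a little above 0.19. The multiple-testing point from the slides is plain arithmetic: a per-comparison probability of 0.001 over 10000 random sequences gives an expected 10000 × 0.001 = 10 chance hits, which is why the expected count is thresholded as well as the single-pair probability.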