U of I CS 498 - Lecture 15 - D1666922

School: University of Illinois

Course: CS 498 - Special Topics

Pages: 24

This preview shows page 1-2-23-24 out of 24 pages.

BLAST: Basic Local Alignment Search Tool
Altschul et al., J. Mol. Biol., 1990.

Motivation
• Sequence homology to a known protein suggests the function of a newly sequenced protein
• The bioinformatics task is to find homologous sequences in a database of sequences
• Databases of sequences are growing fast

Alignment
• A natural way to check whether the "query sequence" is homologous to a sequence in the database is to compute the alignment score of the two sequences
• The alignment score counts gaps (insertions, deletions) and replacements
• Minimizing the evolutionary distance

Alignment
• Global alignment: optimize the overall similarity of the two sequences
• Local alignment: find only relatively conserved subsequences
• Local similarity measures are preferred for database searches
  – Distantly related proteins may share only isolated regions of similarity

Alignment
• Dynamic programming is the standard approach to sequence alignment
• The algorithm is quadratic in the length of the two sequences
• Not practical for searches against a very large database of sequences (e.g., a whole genome)

Scoring alignments
• Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)
• Amino acid sequences: "PAM" matrix
  – Consider amino acid sequence alignments for very closely related proteins, extract replacement frequencies (probabilities), and extrapolate to greater evolutionary distances
• DNA sequences: match = +5, mismatch = -4

BLAST: the MSP
• Given two sequences of the same length, the similarity score of their alignment (without gaps) is the sum of the similarity values for each pair of aligned residues
• Maximal segment pair (MSP): the highest-scoring pair of identical-length segments from the two sequences
• The similarity score of an MSP is called the MSP score
• BLAST heuristically aims to maximize this

Locally maximal segment pair
• A segment pair (segments of identical length) is locally maximal if its score cannot be improved by extending or shortening it in either direction
• BLAST attempts to find all locally maximal segment pairs above some score cutoff

Rapid approximation of MSP score
• The goal is to report those database sequences that have an MSP score above some threshold S
• Statistics tells us the highest threshold S at which "chance similarities" are likely to appear

Rapid approximation of MSP score
• BLAST minimizes the time spent on database sequences whose similarity with the query has little chance of exceeding this cutoff S
• Main strategy: seek only segment pairs (one segment from the database, one from the query) that contain a word pair with score >= T
• Intuition: if the sequence pair has to score above S, its best-matching word (of some predetermined small length) must score above T
• Lower T => fewer false negatives
• Lower T => more pairs to analyze

Implementation
1. Compile a list of high-scoring words
2. Scan the database for hits to this word list
3. Extend hits

Compiling list of words
• Protein: list of all w-length words that score at least T when compared to some word in the query

Scanning the database for hits
• Find exact matches to list words
• Can be done in linear time
• Each word in the list points to all occurrences of the word in the query sequence

Extending hits
• Once a word pair with score >= T has been found, extend it in each direction
• Extend until a score >= S is obtained
• During extension, the score may go up, then down, and then up again
• Terminate if it goes down too much (a certain distance below the best score found for shorter extensions)
• One implementation allows gaps during extension

BLAST: approximating the MSP
• BLAST may not find all segment pairs above threshold S
• It is trying to approximate the MSP
• Bounds on the error: not hard bounds, but statistical bounds
  – "Highly likely" to find the MSP

Statistics
• Suppose the MSP has been calculated by BLAST (and suppose this is the true MSP)
• Suppose this observed MSP scores S
• What are the chances that the MSP score for two unrelated sequences would be >= S?
• If the chances are very low, then we can be confident that the two sequences are related

Statistics
• Given two random sequences of lengths m and n
• What is the probability that they will produce an MSP score >= x?

Statistics
• The number of separate SPs with score >= x is Poisson distributed with mean $y(x) = K m n \, e^{-\lambda x}$
• where $\lambda$ is the positive solution of $\sum_{i,j} p_i p_j \, e^{\lambda s(i,j)} = 1$
• $s(i,j)$ is the scoring matrix, $p_i$ is the frequency of $i$ in random sequences

Statistics
• Poisson distribution with mean $y$: $\Pr(\#\mathrm{SPs} = i) = e^{-y} y^{i} / i!$
• $\Pr(\#\mathrm{SPs} \ge \alpha) = 1 - \Pr(\#\mathrm{SPs} \le \alpha - 1) = 1 - e^{-y} \sum_{i=0}^{\alpha-1} \frac{y^{i}}{i!}$

Statistics
• For $\alpha = 1$, $\Pr(\#\mathrm{SPs} \ge 1) = 1 - e^{-y(x)}$
• Choose S such that $1 - e^{-y(S)}$ is small
• Suppose the probability of having at least one MSP with score >= S is 0.001
• This seems reasonably small
• However, if you test 10000 random sequences, you expect 10 to cross the threshold
• Therefore, require the "E-value" to be small
• That is, the expected number of random sequence pairs with score >= S should be small

More statistics
• We just saw how to choose the threshold S
• How to choose T?
• BLAST is trying to find segment pairs (SPs) scoring above S
• If an SP scores S, what is the probability that it will have a w-word match of score T or more?
• We want this probability to be high

More statistics: Choosing T
• Given a segment pair (from two random sequences) that scores S, what is the probability q that it will have no w-word match scoring above T?
• We want this q to be low
• Obtained from simulations
• Found to decrease exponentially as S increases

BLAST is the universally used bioinformatics
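
The first two implementation steps ("Compiling list of words" and "Scanning the database for hits") can be sketched roughly as below. This is only an illustration under stated assumptions: it uses a DNA alphabet with the match = +5, mismatch = -4 scores quoted in the slides instead of a protein substitution matrix, and the names `word_score`, `compile_word_list` and `scan_database` are hypothetical, not taken from the BLAST source. A dictionary lookup stands in for whatever linear-time matching structure a real implementation uses.

```python
from itertools import product

ALPHABET = "ACGT"          # assumption: DNA instead of the 20-letter protein alphabet
MATCH, MISMATCH = 5, -4    # DNA scores quoted in the slides


def word_score(u, v):
    """Ungapped score of two equal-length words."""
    return sum(MATCH if a == b else MISMATCH for a, b in zip(u, v))


def compile_word_list(query, w, T):
    """Map every w-length word scoring >= T against some query word
    to the list of query positions it matches."""
    query_words = [(i, query[i:i + w]) for i in range(len(query) - w + 1)]
    table = {}
    for letters in product(ALPHABET, repeat=w):
        candidate = "".join(letters)
        positions = [i for i, qw in query_words if word_score(candidate, qw) >= T]
        if positions:
            table[candidate] = positions
    return table


def scan_database(subject, table, w):
    """Slide a window over the database sequence and report exact matches
    to list words as (query_position, subject_position) hits."""
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    table = compile_word_list("ACGT", w=3, T=15)
    print(table)
    print(scan_database("TTACGTT", table, w=3))
```

Lowering `T` in this sketch enlarges the word list (more candidate words pass the threshold), which mirrors the trade-off on the slides: fewer false negatives, but more pairs to analyze.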
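
The ungapped segment-pair score from "BLAST: the MSP" and the termination rule from "Extending hits" can be illustrated with a short sketch. It assumes the DNA scores from the slides (match = +5, mismatch = -4); the function names and the drop-off parameter `x_drop` (the "certain distance below the best score") are hypothetical choices for this example, and the sketch extends until the drop-off rule fires or a sequence ends, without gaps.

```python
MATCH, MISMATCH = 5, -4  # DNA scores quoted in the slides


def pair_score(a, b):
    return MATCH if a == b else MISMATCH


def segment_score(s, t):
    """Similarity score of two equal-length segments aligned without gaps."""
    if len(s) != len(t):
        raise ValueError("segments must have identical length")
    return sum(pair_score(a, b) for a, b in zip(s, t))


def extend_one_direction(query, subject, qi, si, step, x_drop):
    """Extend from (qi, si) in direction `step` (+1 or -1).
    Returns (best score gain, number of residues kept)."""
    best = running = 0
    best_len = n = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += pair_score(query[qi], subject[si])
        n += 1
        if running > best:
            best, best_len = running, n
        elif best - running > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        si += step
    return best, best_len


def extend_hit(query, subject, q_start, s_start, w, x_drop=10):
    """Extend a word hit of length w in both directions.
    Returns (score, query_start, subject_start, length)."""
    seed = segment_score(query[q_start:q_start + w], subject[s_start:s_start + w])
    right_gain, right_len = extend_one_direction(
        query, subject, q_start + w, s_start + w, +1, x_drop)
    left_gain, left_len = extend_one_direction(
        query, subject, q_start - 1, s_start - 1, -1, x_drop)
    return (seed + left_gain + right_gain,
            q_start - left_len, s_start - left_len,
            w + left_len + right_len)


if __name__ == "__main__":
    print(extend_hit("GGACGTACGTCC", "TTACGTACGTAA", q_start=4, s_start=4, w=3))
```

Because only the best-scoring prefix of each extension is kept, the running score is allowed to dip and recover, as the slides describe, but trailing mismatches never end up in the reported segment pair.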
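
The quantities on the "Statistics" slides can be computed numerically. The sketch below solves $\sum_{i,j} p_i p_j e^{\lambda s(i,j)} = 1$ for its positive root by bisection, then evaluates $y(x) = K m n \, e^{-\lambda x}$ and the Poisson tail probability. It assumes uniform DNA base frequencies and the +5/-4 scores from the slides; the slides do not give a value for K, so the value used in the demo is a placeholder, and all function names are hypothetical.

```python
import math


def solve_lambda(probs, score, lo=1e-6, hi=10.0, iters=200):
    """Positive root of sum_{i,j} p_i p_j exp(lambda * s(i,j)) = 1, by bisection.
    Assumes the expected score is negative and some score is positive."""
    def f(lam):
        return sum(probs[a] * probs[b] * math.exp(lam * score(a, b))
                   for a in probs for b in probs) - 1.0
    if f(lo) >= 0 or f(hi) <= 0:
        raise ValueError("no sign change on [lo, hi]; check the scoring system")
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def expected_count(K, m, n, lam, x):
    """y(x) = K m n exp(-lambda x): expected number of SPs scoring >= x."""
    return K * m * n * math.exp(-lam * x)


def prob_at_least(alpha, y):
    """Pr(#SPs >= alpha) for a Poisson count with mean y."""
    return 1.0 - math.exp(-y) * sum(y ** i / math.factorial(i) for i in range(alpha))


if __name__ == "__main__":
    probs = {b: 0.25 for b in "ACGT"}          # assumption: uniform base frequencies
    score = lambda a, b: 5 if a == b else -4   # DNA scores from the slides
    lam = solve_lambda(probs, score)
    K = 0.1                                    # placeholder value, not from the slides
    y = expected_count(K, m=100, n=1000, lam=lam, x=60)
    print(lam, y, prob_at_least(1, y))
```

For small `y`, `prob_at_least(1, y)` is close to `y` itself, which is why the expected count (the "E-value") is the natural quantity to keep small when many database sequences are tested.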
