U of I CS 498 - Basic Local Alignment Search Tool - D2905611

Home> Schools> University of Illinois> Computer Science (CS) > CS 498> Basic Local Alignment Search Tool

DOC PREVIEW

U of I CS 498 - Basic Local Alignment Search Tool

School name University of Illinois

Course Cs 498- Special Topics

Pages 24

This preview shows page 1-2-23-24 out of 24 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 24 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 24 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 24 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 24 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 24 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990.MotivationAlignmentSlide 4Slide 5Scoring alignmentsBLAST: the MSPLocally maximal segment pairRapid approximation of MSP scoreSlide 10ImplementationCompiling list of wordsScanning the database for hitsExtending hitsBLAST: approximating the MSPStatisticsSlide 17Slide 18Slide 19Slide 20More statisticsMore statistics: Choosing TBLAST is the universally used bioinformatics toolhttp://flybase.net/blast/BLAST:Basic Local Alignment Search ToolAltschul et al. J. Mol Bio. 1990.Motivation•Sequence homology to a known protein suggest function of newly sequenced protein•Bioinformatics task is to find homologous sequence in a database of sequences•Databases of sequences growing fastAlignment•Natural approach to check if the “query sequence” is homologous to a sequence in the database is to compute alignment score of the two sequences•Alignment score counts gaps (insertions, deletions) and replacements•Minimizing the evolutionary distanceAlignment•Global alignment: optimize the overall similarity of the two sequences•Local alignment: find only relatively conserved subsequences•Local similarity measures preferred for database searches–Distantly related proteins may only share isolated regions of similarityAlignment•Dynamic programming is the standard approach to sequence alignment•Algorithm is quadratic in length of the two sequences•Not practical for searches against very large database of sequences (e.g., whole genome)Scoring alignments•Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)•Amino acid sequences: “PAM” matrix–Consider amino acid sequence alignment for very closely related proteins, extract replacement frequencies (probabilities), extrapolate to greater evolutionary distances•DNA sequences: match = +5, mismatch = -4BLAST: the MSP•Given two sequences of same length, the similarity score of their alignment (without gaps) is the sum of similarity values for each pair of aligned residues•Maximal segment pair (MSP): Highest scoring pair of identical length segments from the two sequences•The similarity score of an MSP is called the MSP score•BLAST heuristically aims to maximize thisLocally maximal segment pair•A segment pair (segments of identical lengths) is locally maximal if its score cannot be improved by extending or shortening in either direction•BLAST attempts to find all locally maximal segment pairs above some score cutoff.Rapid approximation of MSP score•Goal is to report those database sequences that have MSP score above some threshold S.•Statistics tells us what is the highest threshold S at which “chance similarities” are likely to appearRapid approximation of MSP score•BLAST minimizes time spent on database sequences whose similarity with the query has little chance of exceeding this cutoff S.•Main strategy: seek only segment pairs (one from database, one query) that contain a word pair with score >= T•Intuition: If the sequence pair has to score above S, its most well matched word (of some predetermined small length) must score above T•Lower T => Fewer false negatives•Lower T => More pairs to analyzeImplementation1. Compile a list of high scoring words2. Scan database for hits to this word list3. Extend hitsCompiling list of words•Protein: List of all w-length words that score at least T when compared to some word in queryScanning the database for hits•Find exact matches to list words•Can be done in linear time•Each word in list points to all occurrences of the word in query sequenceExtending hits•Once a word pair with score >= T has been found, extend it in each direction. •Extend until score >= S is obtained•During extension, score may go up, and then down, and then up again•Terminate if it goes down too much (a certain distance below the best score found for shorter extensions)•One implementation allows gaps during extensionBLAST: approximating the MSP•BLAST may not find all segment pairs above threshold S•Trying to approximate the MSP•Bounds on the error: not hard bounds, but statistical bounds–“Highly likely” to find the MSPStatistics•Suppose the MSP has been calculated by BLAST (and suppose this is the true MSP)•Suppose this observed MSP scores S.•What are the chances that the MSP score for two unrelated sequences would be >= S?•If the chances are very low, then we can be confident that the two sequences must not have been unrelatedStatistics•Given two random sequences of lengths m and n•Probability that they will produce an MSP score of >= x ?Statistics•Number of separate SPs with score >= x is Poisson distributed with mean y(x) = Kmn exp(-x) •where  is the positive solution of ∑pipjexp(s(i,j)) = 1•s(i,j) is the scoring matrix, pi is the frequency of i in random sequencesStatistics•Poisson distribution:Pr(x) = (e-  x)/x!•Pr(#SPs >= ) = 1 - Pr(#SPs <= -1)€ =1−e−yyii!i = 0α−1∑=1− e−yyii!i= 0α−1∑Statistics•For =1, Pr(#SPs >= 1) = 1-e-y(x)•Choose S such that 1-e-y(S) is small•Suppose the probability of having at least 1 MSP with score >= S is 0.001.•This seems reasonably small•However, if you test 10000 random sequences, you expect 10 to cross the threshold•Therefore, require “E-value” to be small.•That is, expected number of random sequence pairs with score >= S should be small.More statistics•We just saw how to choose threshold S•How to choose T ?•BLAST is trying to find segment pairs (SPs) scoring above S•If an SP scores S, what is the probability that it will have a w-word match of score T or more? •We want this probability to be highMore statistics: Choosing T•Given a segment pair (from two random sequences) that scores S, what is the probability q that it will have no w-word match scoring above T?•Want this q to be low •Obtained from simulations•Found to decrease exponentially as S increasesBLAST is the universally used bioinformatics

View Full Document