New version page

Princeton COS 551 - database

Documents in this Course
Load more
Upgrade to remove ads
Upgrade to remove ads
Unformatted text preview:

COS 551: Introduction to Computational Molecular BiologyDatabase SearchingFASTABLASTThe basic idea:Local Alignment StatisticsIf m is the size of the query and n is the size of the database, then you can show that by chance the expected number of hits with score at least s is given by:Kmne-(sK and ( are (computed) parameters that are depend on the substitution matrices and the probability of each amino acid occuring. This gives us the e-score. As a sanity check, we observe that if we double the size of the database, we double the numberWe are comparing the query with many strings, and must correct for that.BLAST in practiceReferencesCOS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner1 Database Searching • In database search, we typically have a large sequence database (e.g., hundreds of thousands of sequences) • Generally want to find similarity between a new query sequence and some known sequence in database. • We look for similarity in local alignments (not global), because similarities often span only segments of the sequences involved. The reason the sequence similarity spans such short stretches is because of the modular fashion of protein evolution (mixing and matching of domains) • Issues to consider: 1) Efficiency – We need to do many comparisons, so cannot use the optimal but relatively slow (quadratic) algorithms discussed so far (e.g., Smith-Waterman) 2) Significance – How do we tell when we find a significant match? • Methods: FASTA : http://www2.ebi.ac.uk/fasta3 Pearson & Lipman (1988). PNAS 85, 2444-2448. BLASTA : http://www.ncbi.nlm.nih.gov/BLAST Altschul et al. (1990). JMB 215, 403-440. Both methods are 10-100x faster than running Smith-Waterman. They are heuristics, so can also overlook (occasionally) similarities Smith-Waterman will find (i.e., they sacrifice sensitivity for speed). We will now review the basic ideas behind the FASTA and BLAST heuristics, and then give a quick introduction to the statistics of local alignments. FASTA Query sequence s Target or text seq t • Search 2 sequences for common window of fixed size k For proteins k = 2 For DNA k = 4-6 • Compare by sliding windows • All k-tuples are stored in a table 1 Note: these lecture notes have not been carefully edited, so there may be some mistakes.Here is the idea of FASTA method: Example: 1 2 3 4 5 6 7 s = A G A G A G t = A A G A G A G • Create a table of possible k-tuples (say, 4-tuples) for both s and t e.g., possible 4-tuples of s are (formed by sliding a window of size 4): s = A G A G A G • Each entry is the location within the sequence at which the k-tuple occurs k-tuple Positions at which k-tuple occurs AGAG s: 1,3 t: 2,4 GAGA s: 2 t: 3 AAGA t: 1 NOTE: You can pre-compute such a table for a database and use it repeatedly on separate occasions, or you can perform a linear scan of the text database with each use. • What are the good “offsets” for s and t? Keep track of an OFFSET vector, which ranges from 1 - |t| to |s| - 1 Initially, the vector = 0 • Now, find “hot spots”: For each table entry, look at one from s (position i) and one from t (at position j), and increment i-j in our offset table. AGAG: 1 – 2 = -1 1 – 4 = -3 3 – 2 = 1 3 – 4 = -1 GAGA: 2 – 3 = -1So now our offset table looks like: -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 (increment size) 0 3 1 # of occurrences • Can compute the offset vector by looking at just k-tuples occuring in the query sequences; do not have to look at all k-tuples • Back to our example, the largest offset is –1, which indicates the best (gapless) shift of t with respect to s is –1. Our example again: s: - A G A G A G t: A A G A G A G Note the relationship between the offsets and diagonals: A A G A G A G A G A G A G -1th diagonal 0th diagonal 0th diagonal: s & t line up from the start s AGAGAG- t AAGAGAG -1 diagonal: s -AGAGAG t AAGAGAG -2 diagonal: s --AGAGAG t AAGAGAG- 1st diagonal: s AGAGAG-- t -AAGAGAGSo finding a good offset requires finding a good diagonal. A dense diagonal has many k-tuple hits Actual method used: 1) FASTA method picks out best diagonal runs (score is based on number of “hotspots” and number of spaces between hotspots) and combines them into regions. 2) Pick best-scoring (say) 10 regions and use a substitution matrix (PAM) to find subalignments of maximal score 3) Allow gaps using “banding”(explained below) • So far we have not included gaps in the analysis. To allow gaps, use banding. • Only fill out (in our DP matrix) diagonals that are l away from the best alignment. Example: below, the solid line is the best alignment and l = 2 • We will expand the dynamic programming matrix only within these diagonals. Thus we are not filling out the entire matrix, only O(ln) of it. • Tradeoffs: The width of the band l is a trade off between optimality and speed k-tuple: increase k ⇒ lower sensitivity (need exact match) decrease k ⇒ increase sensitivity; more hotspots;more time BLAST The basic idea: 1) Condiser k-tuples in the query sequence. For each k-tuple, make a list of all other k-tuples (i.e., “seed words”) that have score greater than or equal to T when compared to it using a particular substitution matrix. Example:query = ARNDD k=4, T=15, matrix=PAM-120 There are 2 4 tuples: ARND, RNDD Seed words for ARND are: PAM-120 score ARND 18 SRND 16 ARDD 16 .. .. 2) Search for seed words in the database – these are called“hits.” There are a few ways to do this quickly - one is to build a deterministic finite automata that accepts for the appropriate words, and we search through the entire sequence database. Alternatively, we may do a table look up (which is pre-computed for database) • Say k = 4 • Map each word into a number (<204)indexed by 4-tuple protein sequence • Then each word’s entry in the table points to “hits” in the database • Note: only a subset of 204 words actually occurs. Therefore use hashing to build a smaller table. 3) BLAST then extends hits in both directions (building up amino acid by amino acid) • The algorithm stops if the current score is significantly less than the


View Full Document
Download database
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view database and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view database 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?