CMSC 858W Lecture Notes Mihai PopAdministriviaBaeza-Yates MethodBaeza-Yates MethodBLASTSpaced seeds and LSHSpaced seeds and LSHSpaced seeds and LSHAnother observationCMSC 858W Lecture Notes Mihai Pop3/25/2010Prepared by Ferhan TureAdministrivia•Project proposal• Due tomorrow (3/26)• Doesn’t need to be formal• Project deliverables (details later)• Report (most points from this one)• Academic paper style (~3-4 pages or more)• Cover basics of the ideas• Code• DocumentationBaeza-Yates Method• For doing alignment in cases where it’s known there will be at most k mismatches.• Based on the pigeonhole principle: If |P| is divided into k+1 equally-sized blocks, then at least one block does not contain any mismatches (i.e., there is an exact alignment for that block).• We can run an exact alignment algorithm (e.g. Aho- Corasick) on each of the k+1 blocks. Once we hit a block that aligns exactly to the corresponding part of the text, we run the expensive dynamic programming algorithm:PPTBaeza-Yates Method•The aligned shaded region is called a “seed”• Pattern is more likely to align to a region close to the seed: “first find short exact match, then extend to longer alignment”• Instead of |T|x|P| running time, the dynamic programming takes |P|x|P| running time.• It is guaranteed to not miss an alignmentBLAST•Heavily used alignment program• Tries all possible mutations of seeds• Assumes exact alignment of seeds, which are defined as a sequence of consecutive k characters.• Depending on k, there are two problems with this:• There may be too many matches for a short seed• Some regions may be missed if a long seed is used• IDEA: Use longer but nonconsecutive seeds by using the Locality Sensitive Hashing (LSH) approach.Spaced seeds and LSH•Idea is to hash similar k-mers to the same hash value.• If we use a hash function that keeps all characters but the third one (i.e., h(AGGAC)=h(AGCAC)=AGAC))• The hash function is called a “spaced seed”. It can be denoted by 1s and 0s (e.g. h = 11011)• How many spaces should we have and where should we put them? • If we want to find alignments with at least p% identity, then the proportion of 1s should be p%.• In order to determine the spaced seeds in a formal way, we’ll use dynamic programming.ATAGGACTT AAGCACP:T:Spaced seeds and LSH•M = length of seed• L = length of alignment• p = proportion of 1s in seed• i = 1...L-M• A = alignment• fs(i,b)=probability of finding a compatible seed s in first i positions of alignment when last M bits is b• Note: Seed s is not given. The idea is to iterate over possible s values, check if each is compatible or not, and sum the probability of finding a compatible one, given a position i and sequence b.• e.g., ACGATTCAGTCAAC TATGCATTCA110110110111ibAseeds0000...101010111110... incompatiblecompatiblecompatiblecompatibleSpaced seeds and LSH•Dynamic programming can solve this problem efficiently:• By dynamic programming, calculate for each seed s,• sumb fs(L-M, b) = Pr(seed s matches A) = sensitivity of s• Select the seed with highest sensitivity.• This needs to be done only once for a given p, as a preprocessing step. It is an NP-hard problem, therefore heuristics are used to speed up.fs(0,b) = #seeds compatible with A[1...M] / #possible seeds = #seeds compatible with A[1...M] / 2Mfs(i,b) = fs(i,0b’) x (1-p)+ fs(i,1b’) x pii-1b’bAnother observation• Observation by previous student Mohammad Ghodsi:• If there is an alignment of length L with k mismatches (r=k/L is the error rate), there is at least one sub-alignment of length M<L with error rate less than or equal to r.• Why useful? When searching through suffix tree, discard searches when you hit to a point with much higher error rate than
View Full Document