OSU BMI 731 - Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Unformatted text preview:

Effective Indexing and Filtering for Similarity Search in Large Biosequence DatabasesSlide 2Queries in general2 kinds of queriesSlide 5Slide 6Slide 7Measuring similarityDPAString/Genome DataHow to Handle SizeSlide 13Big Picture General Approach step by stepSlide 15Slide 16Slide 171ST Step: Partitioning into overlapping WindowsSlide 19Slide 20Different transformations & Distance FunctionsDifferent transformations & Distance Functions 2Distance Functions on the Vector SpacesFrequency Distance FDn Algorithm Example (n=1)FDn Why lower bound?Wavelet Distance WDn Algorithm Example (n=1)WDn Why lower bound?Slide 28Experiment DesignSlide 30Slide 31Sorted GraphsSlide 33Slide 34Nature of the distance functionsSlide 36Slide 37ResultsComparison of index structuresFuture WorkEffective Indexing and Filtering for Similarity Search in Large Biosequence DatabasesO. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics and Bioengineering (BIBE '03), pp. 359-366. Washington, DC. March 2003.BMI 731 - Winter'04 2Overview •Applications of queries•Background on queries•Current problem •Solutions and our solution•Comparison experiments and results•Future workBMI 731 - Winter'04 3Queries in general•We need a metric distance function–To measure the (dis)similarity btw objects•Dynamic programming Algorithm–O( |string1| * |string2| ) time and space•i.e. O(n2) where n is length of the strings –Especially bad for genetic sequence queries where you have long sequencesBMI 731 - Winter'04 42 kinds of queries-range queries–Retrieve all objects similar to query more than a certain degree BMI 731 - Winter'04 52 kinds of queriesk-nearest neighbor (k-NN) queries–Retrieve k most similar objects•No domain knowledge necessaryEx: 4 NNBMI 731 - Winter'04 62 kinds of queries-range queries•Requires domain knowledge–Data distribution & Distance definition too smallNone returnedBMI 731 - Winter'04 72 kinds of queries-range queries too largeAll returnedBMI 731 - Winter'04 8Measuring similarity•We need a metric distance function–To measure the (dis)similarity btw objects•Edit Distance (ED)–Three kinds of operations•Insert, delete, replace–ACTTAGC to AATGATAG–A C T - - T A G C R I I D  ED = 4 A A T G A T A G -– Dynamic programming Algorithm– O(mn) time and spaceBMI 731 - Winter'04 9DPABMI 731 - Winter'04 11String/Genome Data•Asks the most similar substrings in the database to the given string.•BLAST has -range queries–Naïve search (linear scan)–scalability problems •How to Handle Size–Partial information rather than whole database •Approximate the string data (compress) may fit in memory may be used for indexing, clusteringBMI 731 - Winter'04 12How to Handle Size•3 approaches to make use of compressed data1. Prune irrelevant data, I/O for non-pruned entries  calculate exact values for non-pruned (especially -range queries)2. Get approximate answers, virtually no I/O (I/O only for answers)(especially k-NN queries)3. Approximate pruning for -range queriesBMI 731 - Winter'04 13Overview •Background on queries•Current problem •Transformation and Indexing •Comparison experiments and results•Future workBMI 731 - Winter'04 14Big PictureGeneral Approach step by step•Transform (large) string data into (hopefully smaller sized) multi-dimensional vectors •Develop a distance function df in vector spaces to approximate the string similarity •Build a multi-dimensional indexing technique on top of multi-dimensional vectors -Preprocessing-•Implement one of the three approaches mentioned -Query-BMI 731 - Winter'04 15String DatabaseOverlapping WindowsWindowing1MultidimentionalVectorsIndexed with respect to some distance functionTransformation Into vectorSpace Indexing32PreprocessingBMI 731 - Winter'04 16Index of vectorsTransformationApproximateQuery(k-NN or -range)Query sequence1Index of vectorsExact Query(k-NN or -range)2a2bDoneThe vectors returned represent most of k-NN (or vectors in -range ) + some false positivesCandidate set Using the indexContinuedBMI 731 - Winter'04 17Calculate ED for each of them. (Remove false positives.)RefineI/O for strings represented by those vectors.3Candidate set Using the indexBMI 731 - Winter'04 181ST Step: Partitioning into overlapping Windows•AACCGGTTACGTACGT…•AACCGGTTACGTACGT…•AACCGGTTACGTACGT…e.g W=6e.g =2BMI 731 - Winter'04 192ND Step: Mapping Windows into Vector Space•Choose a tuple size k•Associate an int to each 4k k-tuples•Frequencies of those k-tuples, is the vector•If k=2  4k=16 k-tuples•AA, AC, AG, AT, •CA, CC, CG, CT•TA, TC, TG, TT•GA, GC, GG, GTBMI 731 - Winter'04 20Example Mapping•The integers assigned•AA=0, AC=1, AG=2, AT=3, •CA=4, CC=5, CG=6, CT=7•TA=8, TC=9, TG=10, TT=11•GA=12, GC=13, GG=14, GT=15•Assume window AACCGG•AA, AC, CC, CG, GG all occur once•1100011000100000 is the matching vector.BMI 731 - Winter'04 21Different transformations & Distance Functions•Tuple size  transformation size–1  4 (frequencies of A, C, G, T) FV1–2  16 (frequencies of 2-tuples) FV2BMI 731 - Winter'04 22Different transformations & Distance Functions 2•WVn transformation–String into halves x,y–FVns for x,yFVx,FVy–Concatenate addition and subtraction of them [ FVx + FVy, FVx-FVy]•Wavelet 1 on example–TCACTTAG–1st: divide into halves & find FV1 transformation•x:TCAC  1 2 0 1•y:TTAG  1 0 1 2 –2nd: add and subtract•2 2 1 3 0 2 –1 –1 WV1•Same operations on 2-tuples WV2BMI 731 - Winter'04 23Distance Functions on the Vector Spaces•All of them are proved to be lower-bounds to edit-distance•FD1  distance on FV1•FD2  distance on FV2•WD1  distance on WV1•WD2  distance on WV2BMI 731 - Winter'04 24Frequency Distance FDnAlgorithm Example (n=1)FDn (n-gram frequencies u,v)•posDist:=negDist:=0•for all dimensions ui,vi –If ui>vi then posDist:=ui-vi–elsenegDist:=ui-vi•Return max(posDist, negDist)/n•u:ACTTAGC2,2,1,2 v:AATGATAG4,0,2,2• – 2-4<0 negDist+=|2-4|–2-0>0 posDist+=|2-0|–1-2<0 negDist+=|1-2|–2-2=0 •posDist:2 negDist:3•FD1 is 3BMI 731 - Winter'04 25FDn Why lower bound? •On example–need to incresase A by 2 G by 1 3–need to decrease c by 2•We may “increase+decrease” if we can replace (back to slide #8)•So in best case


View Full Document

OSU BMI 731 - Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Download Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?