DOC PREVIEW
CMU CS 15826 - Multimedia Databases and Data Mining

This preview shows page 1-2 out of 7 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

C. Faloutsos1CMU SCS15-826: Multimedia Databases and Data MiningFractals - case studies - IIC. Faloutsos15-826 Copyright: C. Faloutsos (2005) 2CMU SCSOutlineGoal: ‘Find similar / interesting things’• Intro to DB• Indexing - similarity search• Data Mining15-826 Copyright: C. Faloutsos (2005) 3CMU SCSIndexing - Detailed outline• primary key indexing• secondary key / multi-key indexing• spatial access methods– z-ordering– R-trees– misc• fractals– intro– applications• text• ...15-826 Copyright: C. Faloutsos (2005) 4CMU SCSIndexing - Detailed outline• fractals– intro– applications• disk accesses for R-trees (range queries)• dimensionality reduction• selectivity in M-trees• dim. curse revisited• “fat fractals”• quad-tree analysis [Gaede+]15-826 Copyright: C. Faloutsos (2005) 5CMU SCSMetric trees - analysis• Problem: How many disk accesses, for an M-tree?• Given:– N (# of objects)– C (fanout of disk pages)– r (radius of range query - BIASED model)15-826 Copyright: C. Faloutsos (2005) 6CMU SCSMetric trees - analysis• Problem: How many disk accesses, for an M-tree?• Given:– N (# of objects)– C (fanout of disk pages)– r (radius of range query - BIASED model)• NOT ENOUGH - what else do we need?C. Faloutsos215-826 Copyright: C. Faloutsos (2005) 7CMU SCSMetric trees - analysis• A: something about the distribution15-826 Copyright: C. Faloutsos (2005) 8CMU SCSMetric trees - analysis• A: something about the distribution[Zezula+, PODS98]: assumed that the distance distribution is the same, for every object:F1(d) = Prob(an object is within d from object #1)= F2(d) = ... = F(d)15-826 Copyright: C. Faloutsos (2005) 9CMU SCSMetric trees - analysis• A: something about the distribution• Given our ‘fractal’ tools, we could try them - which one?15-826 Copyright: C. Faloutsos (2005) 10CMU SCSMetric trees - analysis• A: something about the distribution• Given our ‘fractal’ tools, we could try them -which one?• A: Correlation integral [Traina+, ICDE2000]15-826 Copyright: C. Faloutsos (2005) 11CMU SCSMetric trees - analysislog(r)0 0.5 1 1.5 2 2.5 3 3.505101520Slope = 4.754EnglishWords datasetEnglish dictionarylog(r)0 0 .5 1 1.5 2 2 .5 3 3.505101520Slope = 6.686 64Portuguese datasetPortuguese dictionarylog(d)log(#pairs)log(d)log(#pairs)15-826 Copyright: C. Faloutsos (2005) 12CMU SCSlog(r)7 8 9 10 11 1 20246810121416Slope = 5.267Eigenfaces datasete)Metric trees - analysislog(r)0 0.5 1 1.5 2 2 .5 305101520Slope = 4.828Divina Comedia datasetlog(d)log(d)log(#pairs)log(#pairs)Divina ComediaEigenfacesC. Faloutsos315-826 Copyright: C. Faloutsos (2005) 13CMU SCSlog(r)0 0.5 1 1.5 2 2.5 305101520Slope = 5.124Decamero n datas etlog(r)0 0 .5 1 1 .5 2 2. 5 3 3.505101520Slope = 4.754EnglishWords d atasetlog(r)-10 -8 -6 -4 -2 00510152025Slope = 1.764MGCounty datase tlog(r)-3.5 -3 -2.5 -2 -1.5 - 1 -0.5 0 0.5024681012Slope = 5.30FaceIt data setlog(r)-10 -8 -6 -4 -2 005101520Slope = 1. 585Sierpinsky datas etlog(r)7 8 9 10 11 120246810121416Slope = 5. 267Eigenfaces dat asete)log(r)-10 -8 -6 -4 -2 005101520Slope = 0.989Line 2D data setlog(r)0 0.5 1 1.5 2 2.5 3 3.505101520Slope = 6.68664Portuguese datasetlog(r)0 0.5 1 1.5 2 2.5 305101520Slope = 4.8 28Divina Comedia dataseta)b) c)d)f)g)h)i)log(r)-8 -7 -6 -5 -4 -3 -2 -1 005101520Slope = 1.947Uniform 2D datasetj)15-826 Copyright: C. Faloutsos (2005) 14CMU SCSMetric trees - analysisData Set N(# Objects)Dimension DistanceFunctionDistanceExponent DReal Metric datasetsEnglish 25,143 NA LEdit 4.753DivinaCommedia12,701 NA LEdit 4.827Decamerone 18,719 NA LEdit 5.124Portuguese 21,473 NA LEdit 6.686FaceIt 1,056 NA Not divulged 6.821Real vector datasetsMGCounty 15,559 2 L2 1.752Eigenfaces 11,900 16 L25.267Synthetic datasetsSierpinsky 9,841 2 L21.5842D Line 20,000 2 L20.989Uniform 2D 10,000 2 L21.947Table 2 - Datasets used in the experiments. 15-826 Copyright: C. Faloutsos (2005) 15CMU SCSMetric trees - analysis• So, what is the # of disk accesses, for a node of radius rd, on a query of radius rq?rdrq15-826 Copyright: C. Faloutsos (2005) 16CMU SCSMetric trees - analysis• So, what is the # of disk accesses, for a node of radius rd, on a query of radius rq?• A: ~ (rd+rq)....rdrq15-826 Copyright: C. Faloutsos (2005) 17CMU SCSMetric trees - analysis• So, what is the # of disk accesses, for a node of radius rd, on a query of radius rq?• A: ~ (rd+rq)^Drdrq15-826 Copyright: C. Faloutsos (2005) 18CMU SCSAccuracy of selectivity formulasLegend:  D A Measured ± SD DA for D measured DA for D measured from M-Tr eeRadius0.0001 0.001 0.01 0.11101001000Line20KRadius0.0001 0 .001 0.01 0.11101001000Sierpinskya) b)Radius0.0001 0. 001 0.01 0.1 110100FaceIt (1 056 points)Radius0.0001 0.001 0.01 0.1 110100100010000Eigenfacesd)Radius0.0001 0.001 0.01 0.1 10.11101001000MGCountyRadius0.0001 0.001 0.01 0.1 11101001000Uniform 2Dc)f)e)log(rq)log(#d.a.)C. Faloutsos415-826 Copyright: C. Faloutsos (2005) 19CMU SCSFast estimation of D• Normally, D takes O(N^2) time• Anything faster? suppose we have already built an M-tree15-826 Copyright: C. Faloutsos (2005) 20CMU SCSFast estimation of D• Hint:15-826 Copyright: C. Faloutsos (2005) 21CMU SCSFast estimation of D• Hint:ratio of radii:r1^D * C = r2^Dr1r215-826 Copyright: C. Faloutsos (2005) 22CMU SCSIndexing - Detailed outline• fractals– intro– applications• disk accesses for R-trees (range queries)• dimensionality reduction• selectivity in M-trees• dim. curse revisited• “fat fractals”• quad-tree analysis [Gaede+]15-826 Copyright: C. Faloutsos (2005) 23CMU SCSDim. curse revisited• (Q: how serious is the dim. curse, e.g.:)• Q: what is the search effort for k-nn?– given N points, in E dimensions, in an R-tree, with k-nn queries (‘biased’ model)[Pagel, Korn + ICDE 2000]15-826 Copyright: C. Faloutsos (2005) 24CMU SCSReminder: Hausdorff Dimension (D0)• r = side length (each dimension)• B(r) = # boxes containing points ∝ rD0r = 1/2 B = 2 r = 1/4 B = 4 r = 1/8 B = 8logr = -1 logB = 1logr = -2 logB = 2logr = -3 logB = 3C. Faloutsos515-826 Copyright: C. Faloutsos (2005) 25CMU SCSReminder: Correlation Dimension (D2)• S(r) = ∑ pi2(squared % pts in box) ∝ rD2 ∝ #pairs( within <= r )r = 1/2 S = 1/2 r = 1/4 S = 1/4 r = 1/8 S = 1/8logr = -1 logS = -1logr = -2 logS = -2logr = -3 logS = -315-826 Copyright: C. Faloutsos (2005) 26CMU SCSObservation


View Full Document

CMU CS 15826 - Multimedia Databases and Data Mining

Download Multimedia Databases and Data Mining
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Multimedia Databases and Data Mining and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Multimedia Databases and Data Mining 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?