C. Faloutsos1CMU SCS15-826: Multimedia Databases and Data MiningFractals - case studies - IIC. Faloutsos15-826 Copyright: C. Faloutsos (2005) 2CMU SCSOutlineGoal: ‘Find similar / interesting things’• Intro to DB• Indexing - similarity search• Data Mining15-826 Copyright: C. Faloutsos (2005) 3CMU SCSIndexing - Detailed outline• primary key indexing• secondary key / multi-key indexing• spatial access methods– z-ordering– R-trees– misc• fractals– intro– applications• text• ...15-826 Copyright: C. Faloutsos (2005) 4CMU SCSIndexing - Detailed outline• fractals– intro– applications• disk accesses for R-trees (range queries)• dimensionality reduction• selectivity in M-trees• dim. curse revisited• “fat fractals”• quad-tree analysis [Gaede+]15-826 Copyright: C. Faloutsos (2005) 5CMU SCSMetric trees - analysis• Problem: How many disk accesses, for an M-tree?• Given:– N (# of objects)– C (fanout of disk pages)– r (radius of range query - BIASED model)15-826 Copyright: C. Faloutsos (2005) 6CMU SCSMetric trees - analysis• Problem: How many disk accesses, for an M-tree?• Given:– N (# of objects)– C (fanout of disk pages)– r (radius of range query - BIASED model)• NOT ENOUGH - what else do we need?C. Faloutsos215-826 Copyright: C. Faloutsos (2005) 7CMU SCSMetric trees - analysis• A: something about the distribution15-826 Copyright: C. Faloutsos (2005) 8CMU SCSMetric trees - analysis• A: something about the distribution[Zezula+, PODS98]: assumed that the distance distribution is the same, for every object:F1(d) = Prob(an object is within d from object #1)= F2(d) = ... = F(d)15-826 Copyright: C. Faloutsos (2005) 9CMU SCSMetric trees - analysis• A: something about the distribution• Given our ‘fractal’ tools, we could try them - which one?15-826 Copyright: C. Faloutsos (2005) 10CMU SCSMetric trees - analysis• A: something about the distribution• Given our ‘fractal’ tools, we could try them -which one?• A: Correlation integral [Traina+, ICDE2000]15-826 Copyright: C. Faloutsos (2005) 11CMU SCSMetric trees - analysislog(r)0 0.5 1 1.5 2 2.5 3 3.505101520Slope = 4.754EnglishWords datasetEnglish dictionarylog(r)0 0 .5 1 1.5 2 2 .5 3 3.505101520Slope = 6.686 64Portuguese datasetPortuguese dictionarylog(d)log(#pairs)log(d)log(#pairs)15-826 Copyright: C. Faloutsos (2005) 12CMU SCSlog(r)7 8 9 10 11 1 20246810121416Slope = 5.267Eigenfaces datasete)Metric trees - analysislog(r)0 0.5 1 1.5 2 2 .5 305101520Slope = 4.828Divina Comedia datasetlog(d)log(d)log(#pairs)log(#pairs)Divina ComediaEigenfacesC. Faloutsos315-826 Copyright: C. Faloutsos (2005) 13CMU SCSlog(r)0 0.5 1 1.5 2 2.5 305101520Slope = 5.124Decamero n datas etlog(r)0 0 .5 1 1 .5 2 2. 5 3 3.505101520Slope = 4.754EnglishWords d atasetlog(r)-10 -8 -6 -4 -2 00510152025Slope = 1.764MGCounty datase tlog(r)-3.5 -3 -2.5 -2 -1.5 - 1 -0.5 0 0.5024681012Slope = 5.30FaceIt data setlog(r)-10 -8 -6 -4 -2 005101520Slope = 1. 585Sierpinsky datas etlog(r)7 8 9 10 11 120246810121416Slope = 5. 267Eigenfaces dat asete)log(r)-10 -8 -6 -4 -2 005101520Slope = 0.989Line 2D data setlog(r)0 0.5 1 1.5 2 2.5 3 3.505101520Slope = 6.68664Portuguese datasetlog(r)0 0.5 1 1.5 2 2.5 305101520Slope = 4.8 28Divina Comedia dataseta)b) c)d)f)g)h)i)log(r)-8 -7 -6 -5 -4 -3 -2 -1 005101520Slope = 1.947Uniform 2D datasetj)15-826 Copyright: C. Faloutsos (2005) 14CMU SCSMetric trees - analysisData Set N(# Objects)Dimension DistanceFunctionDistanceExponent DReal Metric datasetsEnglish 25,143 NA LEdit 4.753DivinaCommedia12,701 NA LEdit 4.827Decamerone 18,719 NA LEdit 5.124Portuguese 21,473 NA LEdit 6.686FaceIt 1,056 NA Not divulged 6.821Real vector datasetsMGCounty 15,559 2 L2 1.752Eigenfaces 11,900 16 L25.267Synthetic datasetsSierpinsky 9,841 2 L21.5842D Line 20,000 2 L20.989Uniform 2D 10,000 2 L21.947Table 2 - Datasets used in the experiments. 15-826 Copyright: C. Faloutsos (2005) 15CMU SCSMetric trees - analysis• So, what is the # of disk accesses, for a node of radius rd, on a query of radius rq?rdrq15-826 Copyright: C. Faloutsos (2005) 16CMU SCSMetric trees - analysis• So, what is the # of disk accesses, for a node of radius rd, on a query of radius rq?• A: ~ (rd+rq)....rdrq15-826 Copyright: C. Faloutsos (2005) 17CMU SCSMetric trees - analysis• So, what is the # of disk accesses, for a node of radius rd, on a query of radius rq?• A: ~ (rd+rq)^Drdrq15-826 Copyright: C. Faloutsos (2005) 18CMU SCSAccuracy of selectivity formulasLegend: D A Measured ± SD DA for D measured DA for D measured from M-Tr eeRadius0.0001 0.001 0.01 0.11101001000Line20KRadius0.0001 0 .001 0.01 0.11101001000Sierpinskya) b)Radius0.0001 0. 001 0.01 0.1 110100FaceIt (1 056 points)Radius0.0001 0.001 0.01 0.1 110100100010000Eigenfacesd)Radius0.0001 0.001 0.01 0.1 10.11101001000MGCountyRadius0.0001 0.001 0.01 0.1 11101001000Uniform 2Dc)f)e)log(rq)log(#d.a.)C. Faloutsos415-826 Copyright: C. Faloutsos (2005) 19CMU SCSFast estimation of D• Normally, D takes O(N^2) time• Anything faster? suppose we have already built an M-tree15-826 Copyright: C. Faloutsos (2005) 20CMU SCSFast estimation of D• Hint:15-826 Copyright: C. Faloutsos (2005) 21CMU SCSFast estimation of D• Hint:ratio of radii:r1^D * C = r2^Dr1r215-826 Copyright: C. Faloutsos (2005) 22CMU SCSIndexing - Detailed outline• fractals– intro– applications• disk accesses for R-trees (range queries)• dimensionality reduction• selectivity in M-trees• dim. curse revisited• “fat fractals”• quad-tree analysis [Gaede+]15-826 Copyright: C. Faloutsos (2005) 23CMU SCSDim. curse revisited• (Q: how serious is the dim. curse, e.g.:)• Q: what is the search effort for k-nn?– given N points, in E dimensions, in an R-tree, with k-nn queries (‘biased’ model)[Pagel, Korn + ICDE 2000]15-826 Copyright: C. Faloutsos (2005) 24CMU SCSReminder: Hausdorff Dimension (D0)• r = side length (each dimension)• B(r) = # boxes containing points ∝ rD0r = 1/2 B = 2 r = 1/4 B = 4 r = 1/8 B = 8logr = -1 logB = 1logr = -2 logB = 2logr = -3 logB = 3C. Faloutsos515-826 Copyright: C. Faloutsos (2005) 25CMU SCSReminder: Correlation Dimension (D2)• S(r) = ∑ pi2(squared % pts in box) ∝ rD2 ∝ #pairs( within <= r )r = 1/2 S = 1/2 r = 1/4 S = 1/4 r = 1/8 S = 1/8logr = -1 logS = -1logr = -2 logS = -2logr = -3 logS = -315-826 Copyright: C. Faloutsos (2005) 26CMU SCSObservation
View Full Document