Birch: An Efficient Data Clustering Method for Very Large Databases
By Tian Zhang, Raghu Ramakrishnan, and Miron Livny
Presented by Hung Lai

Outline
- What is data clustering
- Data clustering applications
- Previous approaches and their problems
- Birch's goals
- Clustering Features
- The Birch clustering algorithm
- Experimental results and conclusion

What is Data Clustering?
- A cluster is a closely packed group: a collection of data objects that are similar to one another and treated collectively as a group.
- Data clustering is the partitioning of a dataset into clusters.

Data Clustering
- Helps us understand the natural grouping or structure in a dataset.
- Given a large set of multidimensional data:
  - The data space is usually not uniformly occupied.
  - Clustering identifies the sparse and the crowded regions.
  - It also helps visualization.

Some Clustering Applications
- Biology: building groups of genes with related patterns.
- Marketing: partitioning the population of consumers into market segments.
- Web: dividing WWW pages into genres.
- Image segmentation: grouping image regions for object recognition.
- Land use: identifying areas of similar land use from satellite images.
- Insurance: identifying groups of policy holders with a high average claim cost.

Data Clustering – Previous Approaches
- Probability-based (machine learning):
  - Makes the often-wrong assumption that the distributions of different attributes are independent of each other.
  - Probabilistic representations of clusters are expensive.

Previous Approaches (continued)
- Distance-based (statistics):
  - Requires a distance metric between any two items.
  - Assumes that all data points fit in memory and can be scanned frequently.
  - Ignores the fact that not all data points are equally important: close data points are not treated collectively, and every data point is inspected on multiple iterations.
- These approaches do not deal with dataset and memory size issues!

Clustering Parameters
- Centroid: the Euclidean center of the cluster.
- Radius: the average distance of member points to the centroid.
- Diameter: the average pairwise distance within the cluster.
- Radius and diameter are measures of the tightness of a cluster around its center. We wish to keep them low.
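These definitions are easy to state in code. Below is a minimal Python sketch (the function name and sample data are mine, not from the slides) that computes all three parameters; it uses the root-mean-square forms of radius and diameter from the BIRCH paper, which is what later allows them to be derived from a Clustering Feature alone.

```python
import numpy as np

def cluster_parameters(X):
    """Centroid, radius, and diameter of a cluster of points.

    X is an (N, d) array of N d-dimensional points. Radius and
    diameter use the root-mean-square forms from the BIRCH paper.
    """
    N = len(X)
    centroid = X.mean(axis=0)  # Euclidean center of the cluster
    # Radius: RMS distance of the member points to the centroid.
    radius = np.sqrt(((X - centroid) ** 2).sum(axis=1).mean())
    # Diameter: RMS pairwise distance within the cluster
    # (sum over all ordered pairs, divided by N(N-1)).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    diameter = np.sqrt(sq_dists.sum() / (N * (N - 1)))
    return centroid, radius, diameter

# Made-up 2-D example: a tight cluster has a small radius and diameter.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
c, R, D = cluster_parameters(X)
print(c, R, D)
```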
Clustering Parameters (continued)
- Other measurements, such as the Euclidean distance between the centroids of two clusters, measure how far apart two clusters are.
- A good-quality clustering produces high intra-cluster similarity and low inter-cluster similarity.
- A good-quality clustering can help find hidden patterns.

Birch's Goals
- Minimize running time and the number of data scans, thus formulating the clustering problem for large databases.
- Make clustering decisions without scanning the whole dataset.
- Exploit the non-uniformity of the data: treat dense areas as single units, and remove outliers (noise).

Clustering Features (CF)
- A CF is a compact summary of the data points in a cluster.
- It holds enough information to calculate the intra-cluster distances.
- The additivity theorem allows us to merge sub-clusters (see the sketch after the additivity theorem below).

Clustering Feature (CF)
- Given N d-dimensional data points {Xi}, i = 1, 2, ..., N, in a cluster:
  CF = (N, LS, SS), where
  - N is the number of data points in the cluster,
  - LS is the linear sum of the N data points, and
  - SS is the square sum of the N data points.

CF Additivity Theorem
- If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, then the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is
  CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2).
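Here is a minimal Python sketch of a CF entry (the class layout and helper names are illustrative, not code from the paper). Merging two CFs is component-wise addition, exactly as the additivity theorem states, and the centroid and radius can be recovered from (N, LS, SS) alone, so the raw points never need to be revisited. The same radius test against a threshold T is what decides whether a leaf entry can absorb a new point during the CF-tree insertion described next.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CF:
    n: int          # N: number of points summarized
    ls: np.ndarray  # LS: linear sum of the points (d-vector)
    ss: float       # SS: square sum of the points (scalar)

    @classmethod
    def from_point(cls, x):
        x = np.asarray(x, dtype=float)
        return cls(1, x.copy(), float(x @ x))

    def __add__(self, other):
        # CF additivity theorem: merging disjoint sub-clusters is
        # component-wise addition of their CF entries.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    @property
    def centroid(self):
        return self.ls / self.n

    @property
    def radius(self):
        # RMS distance to the centroid, from (N, LS, SS) alone:
        # R^2 = SS/N - ||LS/N||^2  (clamped at 0 against roundoff)
        return np.sqrt(max(self.ss / self.n - self.centroid @ self.centroid, 0.0))

def can_absorb(entry: CF, x, T: float) -> bool:
    """Leaf-absorption test: absorb x only if the merged radius stays <= T."""
    return (entry + CF.from_point(x)).radius <= T

# Example: build CFs from two made-up points and merge them.
a = CF.from_point([1.0, 2.0])
b = CF.from_point([3.0, 4.0])
merged = a + b
print(merged.n, merged.centroid, merged.radius)
```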
Properties of the CF-Tree
- Each non-leaf node has at most B entries.
- Each leaf node has at most L CF entries, each of which must satisfy the threshold T.
- Node size is determined by the dimensionality of the data space and the input parameter P (page size).

CF-Tree Insertion
- Identify the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric.
- Modify the leaf: test whether the leaf can absorb the new point without violating the threshold; if there is no room, split the node.
- Modify the path: update the CF information in each node on the path up to the root.

Birch Clustering Algorithm
- Phase 1: scan all data and build an initial in-memory CF tree.
- Phase 2: condense the tree to a desirable size by building a smaller CF tree.
- Phase 3: global clustering.
- Phase 4: cluster refining; this is optional and requires more passes over the data to refine the results.

Birch – Phase 1
- Start with an initial threshold and insert points into the tree.
- If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the entries of the old tree, followed by the remaining data.
- A good initial threshold is important but hard to figure out.
- Outlier removal: outliers can be discarded while the tree is rebuilt.

Birch – Phase 2
- Optional.
- Phase 3 sometimes has a minimum input size at which it performs well, so Phase 2 prepares the tree for Phase 3.
- It removes more outliers and groups crowded sub-clusters.

Birch – Phase 3
- Problems remaining after Phase 1:
  - The input order affects the results.
  - Splitting is triggered by node size rather than by natural cluster boundaries.
- Phase 3 clusters all leaf entries by their CF values using an existing algorithm; the algorithm used here is agglomerative hierarchical clustering.

Birch – Phase 4
- Optional.
- Do additional passes over the dataset, reassigning each data point to the closest of the centroids produced by Phase 3, then recalculating the centroids and redistributing the items.
- It always converges, no matter how many times Phase 4 is repeated.

Experimental Results
- Three synthetic datasets were created for testing, along with an ordered copy of each (marked "o" below) to test sensitivity to input order.
- KMEANS and CLARANS require the entire dataset to be in memory: the initial scan is from disk, and subsequent scans are of memory.

Experimental Results: Intended Clustering
[Figure: the intended clustering of the synthetic datasets.]

Experimental Results: KMEANS Clustering
(D = weighted average cluster diameter, lower is better; #Scan = number of data scans)

DS  Time  D     #Scan    DS  Time  D     #Scan
1   43.9  2.09  289      1o  33.8  1.97  197
2   13.2  4.43  51       2o  12.7  4.20  29
3   32.9  3.66  187      3o  36.0  4.35  241

Experimental Results: CLARANS Clustering

DS  Time  D     #Scan    DS  Time  D     #Scan
1   932   2.10  3307     1o  794   2.11  2854
2   758   2.63  2661     2o  816   2.31  2933
3   835   3.39  2959     3o  924   3.28  3369

Experimental Results: BIRCH Clustering

DS  Time  D     #Scan    DS  Time  D     #Scan
1   11.5  1.87  2        1o  13.6  1.87  2
2   10.7  1.99  2        2o  12.1  1.99  2
3   11.4  3.95  2        3o  12.2  3.99  2
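The authors' original implementation is not shown in these slides, but for readers who want to experiment with BIRCH today, scikit-learn ships an implementation whose parameters map onto the concepts above: threshold plays the role of T, branching_factor bounds the entries per node (B and L), and n_clusters drives the Phase 3 global clustering. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.cluster import Birch

# Made-up data: three Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.4, size=(200, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])

# threshold ~ T, branching_factor ~ B/L, n_clusters -> global clustering.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print(len(model.subcluster_centers_), "sub-clusters in the CF tree")
print(np.bincount(labels))  # points assigned to each final cluster
```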