FSU CIS 5930r - Cluster Analysis


Cluster Analysis

- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary

Hierarchical Clustering

- Uses the distance matrix as the clustering criterion.
- Does not require the number of clusters k as an input, but needs a termination condition.

[Figure: agglomerative clustering (AGNES) merges objects a, b, c, d, e over Steps 0-4 into a single cluster; divisive clustering (DIANA) runs the same steps in reverse order.]

AGNES (Agglomerative Nesting)

- Implemented in statistical analysis packages, e.g., S-Plus.
- Uses the single-link method and the dissimilarity matrix.
- Merges the objects (clusters) that have the least dissimilarity.
- Continues in a non-descending fashion: successive merge distances never decrease.
- Eventually all objects belong to the same cluster.
- Single-link: at each step, merge the pair of clusters (C1, C2) connected by the shortest single link between their objects, i.e., the pair minimizing min_{p ∈ C1, q ∈ C2} dist(p, q).

[Figure: three 10 x 10 scatter plots showing successive single-link merges.]

A Dendrogram Shows How the Clusters Are Merged Hierarchically

- Decomposes the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
- E.g., level 1 gives 4 clusters: {a,b}, {c}, {d}, {e}; level 2 gives 3 clusters: {a,b}, {c}, {d,e}; level 3 gives 2 clusters: {a,b}, {c,d,e}; etc.

[Figure: dendrogram over objects a, b, c, d, e with cut levels 1 through 4.]
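The single-link rule and dendrogram cutting are easy to try out with SciPy's hierarchical-clustering routines. The sketch below is illustrative only: the five 2-D points are invented to loosely mirror the a-e example above and are not from the slides.

```python
# Minimal sketch of single-link (AGNES-style) agglomerative clustering
# with SciPy. The five 2-D points are made up for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = np.array([[1.0, 1.0], [1.5, 1.0],   # a, b (close together)
                   [5.0, 5.0],               # c
                   [8.0, 8.0], [8.5, 8.5]])  # d, e (close together)

# Condensed pairwise dissimilarity matrix, then the single-link merge tree.
dists = pdist(points)
merges = linkage(dists, method='single')

# Each row of `merges` records one merge: (cluster_i, cluster_j,
# merge distance, new cluster size). Merge distances are non-descending.
print(merges)

# "Cutting the dendrogram": ask for 3 clusters, analogous to level 2.
labels = fcluster(merges, t=3, criterion='maxclust')
print(labels)   # a,b share a label; d,e share a label; c stands alone
```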
DIANA (Divisive Analysis)

- Implemented in statistical analysis packages, e.g., S-Plus.
- Inverse order of AGNES: starts with all objects in one cluster and splits.
- Eventually each node forms a cluster on its own.

[Figure: three 10 x 10 scatter plots showing successive divisive splits.]

More on Hierarchical Clustering Methods

- Major weaknesses of agglomerative clustering methods:
  - They do not scale well: time complexity is at least O(n²), where n is the total number of objects.
  - They can never undo what was done previously.
- Integration of hierarchical with distance-based clustering:
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
  - CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.
  - CHAMELEON (1999): hierarchical clustering using dynamic modeling.

BIRCH (1996)

- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD'96).
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:
  - Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure).
  - Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans.
- Weaknesses: handles only numeric data, is sensitive to the order of the data records, and performs poorly when clusters are not spherical.

Clustering Feature Vector

- Clustering Feature: CF = (N, LS, SS), where
  - N: number of data points;
  - LS: linear sum of the N points, Σ_{i=1..N} X_i;
  - SS: square sum of the N points, Σ_{i=1..N} X_i².
- Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) have CF = (5, (16,30), 244).

Some Characteristics of CFVs

- Two CFVs can be aggregated: given CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2), the cluster formed by combining the two has CF = (N1 + N2, LS1 + LS2, SS1 + SS2).
- The centroid and radius can both be computed from the CF:
  - centroid (the center of the cluster): x0 = (Σ_{i=1..N} x_i) / N = LS / N;
  - radius (the average distance between an object and the centroid): R = sqrt( Σ_{i=1..N} (x_i − x0)² / N ).
- Other statistical features can be derived from the CF as well.

CF-Tree in BIRCH

- A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
- A nonleaf node in the tree has descendants ("children"); the nonleaf nodes store the sums of the CFs of their children.
- A CF tree has two parameters:
  - branching factor: the maximum number of children per node;
  - threshold T: the maximum radius of the sub-clusters stored at the leaf nodes.

[Figure: a CF tree (a multiway tree, like the B-tree) with a root and non-leaf nodes holding (CF_i, child_i) entries, and leaf nodes holding CF entries chained by prev/next pointers.]

CF-Tree Construction

- Scan through the database once; for each object, insert it into the CF-tree as follows (a sketch of the leaf step appears below):
  - At each nonleaf level, choose the sub-tree whose centroid is closest to the object.
  - In a leaf page, choose a cluster that can absorb the object (new radius < T).
  - If no cluster can absorb it, create a new cluster.
  - Update the upper-level CF entries on the path back to the root.
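To make the CF arithmetic concrete, here is a minimal Python sketch of the CF bookkeeping and the leaf step of insertion, assuming 2-D points; the names `CF` and `insert_into_leaf` are illustrative, not from the BIRCH paper. It reproduces the slide's example, CF = (5, (16,30), 244).

```python
# Minimal sketch of BIRCH-style CF bookkeeping for 2-D points.
# Class and function names are illustrative, not from the BIRCH paper.
import math

class CF:
    """Clustering Feature: CF = (N, LS, SS)."""
    def __init__(self, n=0, ls=(0.0, 0.0), ss=0.0):
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def of_point(cls, p):
        # A single point contributes N = 1, LS = p, SS = ||p||^2.
        return cls(1, p, p[0] ** 2 + p[1] ** 2)

    def merged(self, other):
        # Additivity: the CFs of two combined clusters add component-wise.
        return CF(self.n + other.n,
                  (self.ls[0] + other.ls[0], self.ls[1] + other.ls[1]),
                  self.ss + other.ss)

    def centroid(self):
        # x0 = LS / N
        return (self.ls[0] / self.n, self.ls[1] / self.n)

    def radius(self):
        # R = sqrt( sum ||x_i - x0||^2 / N ) = sqrt( SS/N - ||x0||^2 ),
        # by expanding the square sum around the centroid.
        cx, cy = self.centroid()
        return math.sqrt(max(self.ss / self.n - (cx * cx + cy * cy), 0.0))

def insert_into_leaf(leaf_cfs, point, threshold):
    """Leaf step of CF-tree insertion: absorb the point into the closest
    sub-cluster if the new radius stays below T, else start a new one."""
    entry = CF.of_point(point)
    best = None
    for cf in leaf_cfs:
        cx, cy = cf.centroid()
        d = math.hypot(point[0] - cx, point[1] - cy)
        if best is None or d < best[0]:
            best = (d, cf)
    if best is not None:
        candidate = best[1].merged(entry)
        if candidate.radius() < threshold:            # absorb
            leaf_cfs[leaf_cfs.index(best[1])] = candidate
            return leaf_cfs
    leaf_cfs.append(entry)                            # new sub-cluster
    return leaf_cfs

# Reproduce the slide's example: five points -> CF = (5, (16,30), 244).
points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
total = CF()
for p in points:
    total = total.merged(CF.of_point(p))
print(total.n, total.ls, total.ss)       # 5 (16, 30) 244
print(total.centroid(), total.radius())  # (3.2, 6.0) 1.6
```

Note the radius never touches the raw points: N, LS, and SS are sufficient, which is what lets BIRCH compress the data into a CF tree in a single scan.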


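For completeness: scikit-learn ships a BIRCH implementation, so the CF-tree machinery above rarely needs to be hand-rolled. Below is a hedged usage sketch on synthetic numeric data; the threshold and branching_factor values are illustrative choices, not recommendations from the slides.

```python
# Hedged usage sketch of scikit-learn's BIRCH implementation.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Three synthetic numeric blobs (BIRCH handles numeric data only).
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                  for c in [(0, 0), (5, 5), (0, 5)]])

# threshold plays the role of T (max sub-cluster radius at the leaves);
# branching_factor caps the number of CF children per node; n_clusters
# applies a global clustering step to the leaf sub-clusters (Phase 2).
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(data)
print(np.bincount(labels))   # roughly 100 points per cluster
```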