Cluster Analysis
  - What is Cluster Analysis?
  - Types of Data in Cluster Analysis
  - A Categorization of Major Clustering Methods
  - Partitioning Methods
  - Hierarchical Methods
  - Density-Based Methods
  - Grid-Based Methods
  - Model-Based Clustering Methods
  - Outlier Analysis
  - Summary

What is Cluster Analysis?
  - Cluster: a collection of data objects
      - Similar to one another within the same cluster
      - Dissimilar to the objects in other clusters
  - Cluster analysis: grouping a set of data objects into clusters
  - Clustering is unsupervised classification: no predefined classes
  - Clustering is used:
      - As a stand-alone tool to get insight into data distribution
        (visualization of clusters may unveil important information)
      - As a preprocessing step for other algorithms
        (efficient indexing or compression often relies on clustering)

General Applications of Clustering
  - Pattern Recognition
  - Spatial Data Analysis
      - create thematic maps in GIS by clustering feature spaces
      - detect spatial clusters and explain them in spatial data mining
  - Image Processing
      - cluster images based on their visual content
  - Economic Science (especially market research)
  - WWW and IR
      - document classification
      - cluster Weblog data to discover groups of similar access patterns

What Is Good Clustering?
  - A good clustering method will produce high quality clusters with
      - high intra-class similarity
      - low inter-class similarity
  - The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
  - The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of Clustering in Data Mining
  - Scalability
  - Ability to deal with different types of attributes
  - Discovery of clusters with arbitrary shape
  - Minimal requirements for domain knowledge to determine input parameters
  - Able to deal with noise and outliers
  - Insensitive to order of input records
  - High dimensionality
  - Incorporation of user-specified constraints
  - Interpretability and usability

Outliers
  - Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
  - In some applications we are interested in discovering outliers, not clusters (outlier analysis)
  [Figure: a cluster and its outliers, labeled "cluster" and "outliers"]

Data Structures
  - Data matrix (two modes): the "classic" data input; rows are tuples/objects, columns are attributes/dimensions
      $\begin{bmatrix}
        x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
        \vdots &        & \vdots &        & \vdots \\
        x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
        \vdots &        & \vdots &        & \vdots \\
        x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
      \end{bmatrix}$
  - Dissimilarity or distance matrix (one mode): the desired data input to some clustering algorithms; rows and columns are both objects
      $\begin{bmatrix}
        0      &        &        &        &   \\
        d(2,1) & 0      &        &        &   \\
        d(3,1) & d(3,2) & 0      &        &   \\
        \vdots & \vdots & \vdots & \ddots &   \\
        d(n,1) & d(n,2) & \cdots & \cdots & 0
      \end{bmatrix}$

Measuring Similarity in Clustering
  - Dissimilarity/Similarity metric: the dissimilarity d(i, j) between two objects i and j is expressed in terms of a distance function, which is typically a metric:
      - $d(i, j) \ge 0$ (non-negativity)
      - $d(i, i) = 0$ (isolation)
      - $d(i, j) = d(j, i)$ (symmetry)
      - $d(i, j) \le d(i, h) + d(h, j)$ (triangle inequality)
  - The definitions of distance functions are usually different for interval-scaled, boolean, categorical, ordinal and ratio-scaled variables.
  - Weights may be associated with different variables based on applications and data semantics.

Types of Data in Cluster Analysis
  - Interval-scaled variables, e.g., salary, height
  - Binary variables, e.g., gender (M/F), has_cancer (T/F)
  - Nominal (categorical) variables, e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
  - Ordinal variables, e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
  - Ratio-scaled variables, e.g., population growth (1, 10, 100, 1000, ...)
  - Variables of mixed types: multiple attributes with various types

Similarity and Dissimilarity Between Objects
  - Distance metrics are normally used to measure the similarity or dissimilarity between two data objects
  - The most popular conform to the Minkowski distance:
      $L_p(i, j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p}$
    where $i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jn})$ are two n-dimensional data objects, and p is a positive integer
  - If p = 1, $L_1$ is the Manhattan (or city block) distance:
      $L_1(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|$

Similarity and Dissimilarity Between Objects (Cont.)
  - If p = 2, $L_2$ is the Euclidean distance:
      $d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{in} - x_{jn}|^2 }$
  - Properties
      - $d(i, j) \ge 0$
      - $d(i, i) = 0$
      - $d(i, j) = d(j, i)$
      - $d(i, j) \le d(i, k) + d(k, j)$
  - Also one can use weighted distance:
      $d(i, j) = \sqrt{ w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2 }$

Binary Variables
  - A binary variable has two states: 0 absent, 1 present
  - A contingency table for binary data (a, b, c, d count the variables falling in each cell; p is the total number of variables):

                       object j
                     1      0      sum
      object i  1    a      b      a+b
                0    c      d      c+d
              sum   a+c    b+d      p

  - Simple matching coefficient distance (invariant, if the binary variable is symmetric):
      $d(i, j) = \frac{b + c}{a + b + c + d}$
  - Jaccard coefficient distance (noninvariant if the binary variable is asymmetric):
      $d(i, j) = \frac{b + c}{a + b + c}$

Binary Variables (Cont.)
  - Another approach is to define the similarity of two objects and not their distance.
    In that case we have the following:
      - Simple matching coefficient similarity:
          $s(i, j) = \frac{a + d}{a + b + c + d}$
      - Jaccard coefficient similarity:
          $s(i, j) = \frac{a}{a + b + c}$
  - Note that: $s(i, j) = 1 - d(i, j)$

Dissimilarity between Binary Variables
  - Example (Jaccard coefficient)
      - all attributes are asymmetric binary
      - 1 denotes presence or positive test
      - 0 denotes absence or negative test

      Name   Fever  Cough  Test-1  Test-2  Test-3  Test-4
      Jack   1      0      1       0       0       0
      Mary   1      0      1       0       1       0
      Jim    1      1      0       0       0       0

      $d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
      $d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
      $d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$

A simpler definition
  - Each object is mapped to a bitmap (binary vector), using the same table as above:
      - Jack: 101000
      - Mary: 101010
      - Jim: 110000
  - Simple match distance:
      $d(i, j) = \frac{\text{number of non-common bit positions}}{\text{total number of bits}}$
  - Jaccard coefficient:
      $d(i, j) = 1 - \frac{\text{number of 1's in } i \wedge j}{\text{number of 1's in } i \vee j}$

Variables of Mixed
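Looking back at the distance measures defined on the preceding slides, the following is a minimal Python sketch of the Minkowski family and the two binary-variable distances, checked against the Jack/Mary/Jim example. The function names (`minkowski`, `contingency`, `simple_matching_distance`, `jaccard_distance`) are illustrative and do not come from the slides. Two choices are assumptions rather than anything the slides state: weights are applied for any order p (the slides give the weighted form only for p = 2), and the Jaccard distance of two all-zero vectors is returned as 0.0.

```python
def minkowski(x, y, p=2, weights=None):
    """Minkowski distance of order p between two equal-length numeric vectors.

    p=1 gives the Manhattan distance, p=2 the Euclidean distance.
    Assumption: optional weights multiply each |x_k - y_k|**p term; the
    slides define the weighted form only for p = 2.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(x)
    total = sum(w * abs(a - b) ** p for w, a, b in zip(weights, x, y))
    return total ** (1.0 / p)


def contingency(i, j):
    """Return the contingency counts (a, b, c, d) for two 0/1 vectors.

    a: both 1, b: i is 1 and j is 0, c: i is 0 and j is 1, d: both 0.
    """
    if len(i) != len(j):
        raise ValueError("vectors must have the same length")
    a = sum(1 for u, v in zip(i, j) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(i, j) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(i, j) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(i, j) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching_distance(i, j):
    """(b + c) / (a + b + c + d), for symmetric binary variables."""
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def jaccard_distance(i, j):
    """(b + c) / (a + b + c), for asymmetric binary variables."""
    a, b, c, _ = contingency(i, j)
    if a + b + c == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 0.0
    return (b + c) / (a + b + c)


# Bitmaps from the example table: Fever, Cough, Test-1 .. Test-4.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    print(round(jaccard_distance(jack, mary), 2))  # 0.33
    print(round(jaccard_distance(jack, jim), 2))   # 0.67
    print(round(jaccard_distance(jim, mary), 2))   # 0.75
```

The Jaccard values reproduce the 0.33, 0.67 and 0.75 of the worked example. Because the Jaccard similarity is 1 minus the distance, `1 - jaccard_distance(i, j)` recovers a / (a + b + c) without a separate function.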