CSE 634 Data Mining Techniques Professor Anita Wasilewska SUNY Stony BrookReferencesOutlineWhat is Cluster Analysis?Slide 5Examples of Clustering ApplicationsSlide 7Data StructuresTypes of data in clustering analysisInterval-scaled attributesSimilarity and Dissimilarity Between ObjectsSimilarity and Dissimilarity Between Objects (Cont.)Binary AttributesDissimilarity between Binary AttributesNominal AttributesOrdinal AttributesRatio-Scaled AttributesAttributes of Mixed TypesSlide 19Clustering in Real DatabasesSlide 21Slide 22Clustering RequirementsMajor Clustering ApproachesPartitioning Algorithms: Basic ConceptThe K-Means Clustering MethodThe k-means algorithm:Slide 28Simple k-means Example(k=2)Slide 30Slide 31Slide 32Cluster of ObjectsWeaknesses of the K-Means MethodHierarchical ClusteringAGNES-ExploredAGNESSimilarity/Distance metricsSingle Linkage Hierarchical ClusteringSlide 40Slide 41Slide 42Slide 43DIANA (Divisive Analysis)OverviewOverview-contdDensity-Based Clustering MethodsDensity-Based Clustering: BackgroundDensity-Based Clustering: Background (II)DBSCAN: The AlgorithmDBSCAN: Density Based Spatial Clustering of Applications with NoiseGrid-Based Clustering MethodCLIQUE (CLustering In QUEst)CLIQUE: The Major StepsSlide 55Strength and Weakness of CLIQUESlide 57Outlier DiscoveryOutlier Discovery: Statistical ApproachesOutlier Discovery: Distance-Based ApproachOutlier Discovery: Deviation-Based ApproachSlide 62SummarySlide 64CSE 634 Data Mining Techniques Professor Anita WasilewskaSUNY Stony BrookCLUSTER ANALYSISBy: Arthy Krishnamurthy & Jing TunSpring 2005References•Jiawei Han and Michelle Kamber. Data Mining Concept and Techniques (Chapter8). Morgan Kaufman, 2002.•M. Ester, H.P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96. http://ifsc.ualr.edu/xwxu/publications/kdd-96.pdf•K-means and Hierachical Clustering. Statistical data mining tutorial slides by Andrew Moore: http://www-2.cs.cmu.edu/~awm/tutorials/kmeans.html•How to explain hierarchical clustering. http://www.analytictech.com/networks/hiclus.htm•Teknomo, Kardi. K-means Clustering Numerical Example. http://people.revoledu.com/kardi/tutorial/kMean/NumericalExample.htmOutline•What is Cluster Analysis?•Applications •Data Types and Distance Metrics•Clustering in Real Databases•Major Clustering Methods•Outlier Analysis•SummaryWhat is Cluster Analysis?•Cluster: a collection of data objects•Similar to the objects in the same cluster (Intraclass similarity)•Dissimilar to the objects in other clusters (Interclass dissimilarity)•Cluster analysis•Statistical method for grouping a set of data objects into clusters•A good clustering method produces high quality clusters with high intraclass similarity and low interclass similarity •Clustering is unsupervised classification•Can be a stand-alone tool or as a preprocessing step for other algorithmsOutline•What is Cluster Analysis?•Applications •Data Types and Distance Metrics•Clustering in Real Databases•Major Clustering Methods•Outlier Analysis•SummaryExamples of Clustering Applications•Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs•Insurance: Identifying groups of motor insurance policy holders with a high average claim cost•City-planning: Identifying groups of houses according to their house type, value, and geographical location•Earth-quake studies: Observed earth quake epicenters should be clustered along continent faultsOutline•What is Cluster Analysis?•Applications •Data Types and Distance Metrics•Clustering in Real Databases•Major Clustering Methods•Outlier Analysis•SummaryData Structures•Data matrix o1•p=attributes …•n=# of objects oi … •Dissimilarity matrix•d(i,j)=difference/dissimilaritybetween i and jnpx...nfx...n1x...............ipx...ifx...i1x...............1px...1fx...11x0...)2,()1,(:::)2,3()...ndnd0dd(3,10d(2,1)0Types of data in clustering analysis•Interval-scaled attributes:•Binary attributes:•Nominal, ordinal, and ratio attributes:•Attributes of mixed types:Interval-scaled attributes•Continuous measurements of a roughly linear scale•E.g. weight, height, temperature, etc.•Standardize data in preprocessing so that all attributes have equal weight•Exceptions: height may be a more important attribute associated with basketball playersSimilarity and Dissimilarity Between Objects•Distances are normally used to measure the similarity or dissimilarity between two data objects (objects=records)•Minkowski distance:where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer•If q = 1, d is Manhattan distanceqqppqqjxixjxixjxixjid )||...|||(|),(2211||...||||),(2211 ppjxixjxixjxixjid Similarity and Dissimilarity Between Objects (Cont.)•If q = 2, d is Euclidean distance:•Properties•d(i,j) 0•d(i,i) = 0•d(i,j) = d(j,i)•d(i,j) d(i,k) + d(k,j)•Can also use weighted distance, or other dissimilarity measures.)||...|||(|),(2222211 ppjxixjxixjxixjid Binary Attributes•A contingency table for binary data•Simple matching coefficient (if the binary attribute is symmetric):•Jaccard coefficient (if the binary attribute is asymmetric): dcbacb jid),(pdbcasumdcdcbabasum0101cbacb jid),(Object iObject jDissimilarity between Binary Attributes•Example i j•gender is a symmetric attribute•remaining attributes are asymmetric•let the values Y and P be set to 1, and the value N be set to 0Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4Jack M Y N P N N NMary F Y N P N P NJim M Y P N N N N75.021121),(67.011111),(33.010210),(maryjimdjimjackdmaryjackdNominal Attributes•A generalization of the binary attribute in that it can take more than 2 states, e.g., red, yellow, blue, green•Method 1: Simple matching•m: # of attributes that are same for both records, p: total # of attributes•Method 2: rewrite the database and create a new binary attribute for each of the m states•For an object with color yellow, the yellow attribute is set to 1, while the remaining attributes are
View Full Document