Unformatted text preview:

CSE 634 Data Mining Techniques Professor Anita Wasilewska SUNY Stony BrookReferencesOutlineWhat is Cluster Analysis?Slide 5Examples of Clustering ApplicationsSlide 7Data StructuresTypes of data in clustering analysisInterval-scaled attributesSimilarity and Dissimilarity Between ObjectsSimilarity and Dissimilarity Between Objects (Cont.)Binary AttributesDissimilarity between Binary AttributesNominal AttributesOrdinal AttributesRatio-Scaled AttributesAttributes of Mixed TypesSlide 19Clustering in Real DatabasesSlide 21Slide 22Clustering RequirementsMajor Clustering ApproachesPartitioning Algorithms: Basic ConceptThe K-Means Clustering MethodThe k-means algorithm:Slide 28Simple k-means Example(k=2)Slide 30Slide 31Slide 32Cluster of ObjectsWeaknesses of the K-Means MethodHierarchical ClusteringAGNES-ExploredAGNESSimilarity/Distance metricsSingle Linkage Hierarchical ClusteringSlide 40Slide 41Slide 42Slide 43DIANA (Divisive Analysis)OverviewOverview-contdDensity-Based Clustering MethodsDensity-Based Clustering: BackgroundDensity-Based Clustering: Background (II)DBSCAN: The AlgorithmDBSCAN: Density Based Spatial Clustering of Applications with NoiseGrid-Based Clustering MethodCLIQUE (CLustering In QUEst)CLIQUE: The Major StepsSlide 55Strength and Weakness of CLIQUESlide 57Outlier DiscoveryOutlier Discovery: Statistical ApproachesOutlier Discovery: Distance-Based ApproachOutlier Discovery: Deviation-Based ApproachSlide 62SummarySlide 64CSE 634 Data Mining Techniques Professor Anita WasilewskaSUNY Stony BrookCLUSTER ANALYSISBy: Arthy Krishnamurthy & Jing TunSpring 2005References•Jiawei Han and Michelle Kamber. Data Mining Concept and Techniques (Chapter8). Morgan Kaufman, 2002.•M. Ester, H.P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96. http://ifsc.ualr.edu/xwxu/publications/kdd-96.pdf•K-means and Hierachical Clustering. Statistical data mining tutorial slides by Andrew Moore: http://www-2.cs.cmu.edu/~awm/tutorials/kmeans.html•How to explain hierarchical clustering. http://www.analytictech.com/networks/hiclus.htm•Teknomo, Kardi. K-means Clustering Numerical Example. http://people.revoledu.com/kardi/tutorial/kMean/NumericalExample.htmOutline•What is Cluster Analysis?•Applications •Data Types and Distance Metrics•Clustering in Real Databases•Major Clustering Methods•Outlier Analysis•SummaryWhat is Cluster Analysis?•Cluster: a collection of data objects•Similar to the objects in the same cluster (Intraclass similarity)•Dissimilar to the objects in other clusters (Interclass dissimilarity)•Cluster analysis•Statistical method for grouping a set of data objects into clusters•A good clustering method produces high quality clusters with high intraclass similarity and low interclass similarity •Clustering is unsupervised classification•Can be a stand-alone tool or as a preprocessing step for other algorithmsOutline•What is Cluster Analysis?•Applications •Data Types and Distance Metrics•Clustering in Real Databases•Major Clustering Methods•Outlier Analysis•SummaryExamples of Clustering Applications•Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs•Insurance: Identifying groups of motor insurance policy holders with a high average claim cost•City-planning: Identifying groups of houses according to their house type, value, and geographical location•Earth-quake studies: Observed earth quake epicenters should be clustered along continent faultsOutline•What is Cluster Analysis?•Applications •Data Types and Distance Metrics•Clustering in Real Databases•Major Clustering Methods•Outlier Analysis•SummaryData Structures•Data matrix o1•p=attributes …•n=# of objects oi … •Dissimilarity matrix•d(i,j)=difference/dissimilaritybetween i and jnpx...nfx...n1x...............ipx...ifx...i1x...............1px...1fx...11x0...)2,()1,(:::)2,3()...ndnd0dd(3,10d(2,1)0Types of data in clustering analysis•Interval-scaled attributes:•Binary attributes:•Nominal, ordinal, and ratio attributes:•Attributes of mixed types:Interval-scaled attributes•Continuous measurements of a roughly linear scale•E.g. weight, height, temperature, etc.•Standardize data in preprocessing so that all attributes have equal weight•Exceptions: height may be a more important attribute associated with basketball playersSimilarity and Dissimilarity Between Objects•Distances are normally used to measure the similarity or dissimilarity between two data objects (objects=records)•Minkowski distance:where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer•If q = 1, d is Manhattan distanceqqppqqjxixjxixjxixjid )||...|||(|),(2211||...||||),(2211 ppjxixjxixjxixjid Similarity and Dissimilarity Between Objects (Cont.)•If q = 2, d is Euclidean distance:•Properties•d(i,j)  0•d(i,i) = 0•d(i,j) = d(j,i)•d(i,j)  d(i,k) + d(k,j)•Can also use weighted distance, or other dissimilarity measures.)||...|||(|),(2222211 ppjxixjxixjxixjid Binary Attributes•A contingency table for binary data•Simple matching coefficient (if the binary attribute is symmetric):•Jaccard coefficient (if the binary attribute is asymmetric): dcbacb jid),(pdbcasumdcdcbabasum0101cbacb jid),(Object iObject jDissimilarity between Binary Attributes•Example i j•gender is a symmetric attribute•remaining attributes are asymmetric•let the values Y and P be set to 1, and the value N be set to 0Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4Jack M Y N P N N NMary F Y N P N P NJim M Y P N N N N75.021121),(67.011111),(33.010210),(maryjimdjimjackdmaryjackdNominal Attributes•A generalization of the binary attribute in that it can take more than 2 states, e.g., red, yellow, blue, green•Method 1: Simple matching•m: # of attributes that are same for both records, p: total # of attributes•Method 2: rewrite the database and create a new binary attribute for each of the m states•For an object with color yellow, the yellow attribute is set to 1, while the remaining attributes are


View Full Document
Download Cluster Analysis
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Cluster Analysis and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Cluster Analysis 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?