Purdue CS 490D - Clustering - D1680612

Home> Schools> Purdue University> Computer Sciences (CS) > CS 490D> Clustering

Purdue CS 490D - Clustering

Course Cs 490d- Introduction to Data Mining

Pages 55

Download Save

Unformatted text preview:

1CS490D:Introduction to Data MiningProf. Chris CliftonFebruary 22, 2004ClusteringCS490D 2Cluster Analysis• What is Cluster Analysis?• Types of Data in Cluster Analysis• A Categorization of Major Clustering Methods• Partitioning Methods• Hierarchical Methods• Density-Based Methods• Grid-Based Methods• Model-Based Clustering Methods• Outlier Analysis• Summary2CS490D 3What is Cluster Analysis?• Cluster: a collection of data objects– Similar to one another within the same cluster– Dissimilar to the objects in other clusters• Cluster analysis– Grouping a set of data objects into clusters• Clustering is unsupervised classification: no predefined classes• Typical applications– As a stand-alone tool to get insight into data distribution – As a preprocessing step for other algorithmsCS490D 4General Applications of Clustering • Pattern Recognition• Spatial Data Analysis – create thematic maps in GIS by clustering feature spaces– detect spatial clusters and explain them in spatial data mining• Image Processing• Economic Science (especially market research)• WWW– Document classification– Cluster Weblog data to discover groups of similar access patterns3CS490D 5Examples of Clustering Applications• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketingprograms• Land use:Identification of areas of similar land use in an earth observation database• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost• City-planning: Identifying groups of houses according to their house type, value, and geographical location• Earth-quake studies:Observed earth quake epicenters should be clustered along continent faultsCS490D 6What Is Good Clustering?• A good clustering method will produce high quality clusters with– high intra-classsimilarity– low inter-class similarity • The qualityof a clustering result depends on both the similarity measure used by the method and its implementation.• The quality of a clustering method is also measured by its ability to discover some or all of the hiddenpatterns.4CS490D 7Requirements of Clustering in Data Mining • Scalability• Ability to deal with different types of attributes• Discovery of clusters with arbitrary shape• Minimal requirements for domain knowledge to determine input parameters• Able to deal with noise and outliers• Insensitive to order of input records• High dimensionality• Incorporation of user-specified constraints• Interpretability and usabilityCS490D 8Cluster Analysis• What is Cluster Analysis?• Types of Data in Cluster Analysis• A Categorization of Major Clustering Methods• Partitioning Methods• Hierarchical Methods• Density-Based Methods• Grid-Based Methods• Model-Based Clustering Methods• Outlier Analysis• Summary5CS490D 9Data Structures• Data matrix– (two modes)• Dissimilarity matrix– (one mode)npx...nfx...n1x...............ipx...ifx...i1x...............1px...1fx...11x0...)2,()1,(:::)2,3()...ndnd0dd(3,10d(2,1)0CS490D 10Measure the Quality of Clustering• Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically metric:d(i, j)• There is a separate “quality” function that measures the “goodness” of a cluster.• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.• Weights should be associated with different variables based on applications and data semantics.• It is hard to define “similar enough” or “good enough”– the answer is typically highly subjective.6CS490D 11Type of data in clustering analysis• Interval-scaled variables:• Binary variables:• Nominal, ordinal, and ratio variables:• Variables of mixed types:CS490D 12Interval-valued variables• Standardize data– Calculate the mean absolute deviation:where– Calculate the standardized measurement (z-score)• Using mean absolute deviation is more robust than using standard deviation .)...211nffffxx(xn m +++=|)|...|||(|121 fnffffffmxmxmxns −++−+−=ffififsmx z−=7CS490D 13Similarity and Dissimilarity Between Objects• Distances are normally used to measure the similarity or dissimilaritybetween two data objects• Some popular ones include: Minkowski distance:where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer• If q = 1, d is Manhattan distanceqqppqqjxixjxixjxixjid )||...|||(|),(2211−++−+−=||...||||),(2211 pp jxixjxixjxixjid−++−+−=CS490D 14Similarity and Dissimilarity Between Objects (Cont.)• If q = 2, d is Euclidean distance:– Properties• d(i,j) ≥ 0• d(i,i) = 0• d(i,j) = d(j,i)• d(i,j) ≤ d(i,k) + d(k,j)• Also, one can use weighted distance, parametric Pearson product moment correlation, or other disimilaritymeasures)||...|||(|),(2222211 pp jxixjxixjxixjid −++−+−=8CS490D 15Binary Variables• A contingency table for binary data• Simple matching coefficient (invariant, if the binary variable is symmetric):• Jaccard coefficient (noninvariant if the binary variable is asymmetric): dcbacb jid++++=),(pdbcasumdcdcbabasum++++0101cbacb jid+++=),(Object iObject j16Dissimilarity between Binary Variables• Example– gender is a symmetric attribute– the remaining attributes are asymmetric binary– let the values Y and P be set to 1, and the value N be set to 0NameGenderFeverCoughTest-1 Test-2 Test-3 Test-4Jack M Y N P N N NMary F Y N P N P NJim M Y P N N N N75.021121),(67.011111),(33.010210),(====++++++++++++========++++++++++++========++++++++++++====maryjimdjimjackdmaryjackd9CS490D 17Nominal Variables• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green• Method 1: Simple matching– m: # of matches, p: total # of variables• Method 2: use a large number of binary variables– creating a new binary variable for each of

View Full Document


School:
Email:
New Password:
Confirm Password:

Purdue CS 490D - Clustering

Sign up for free to view:

Please select your school