UIC CS 583 - Chapter 5-Clustering - D813689

Home> Schools> University of Illinois at Chicago> Computer Science (CS) > CS 583> Chapter 5-Clustering

DOC PREVIEW

UIC CS 583 - Chapter 5-Clustering

School name University of Illinois at Chicago

Course Cs 583- Data Mining and Text Mining

Pages 40

This preview shows page 1-2-3-19-20-38-39-40 out of 40 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 40 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 40 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 40 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 40 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 40 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 40 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 40 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 40 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 40 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Chapter 5: ClusteringSearching for groupsGroup similar points togetherAn IllustrationExamples of Clustering ApplicationsConcepts of ClusteringClusteringDissimilarity/Distance MeasureTypes of data in clustering analysisInterval-valued variablesSimilarity Between ObjectsSimilarity Between Objects (Cont.)Binary VariablesDissimilarity of Binary VariablesNominal VariablesOrdinal VariablesRatio-Scaled VariablesVariables of Mixed TypesMajor Clustering TechniquesPartitioning Algorithms: Basic ConceptThe K-Means ClusteringExampleComments on K-MeansVariations of the K-Means Methodk-Medoids clustering methodPartition Around MedoidsComments on PAMCLARA: Clustering Large ApplicationsHierarchical ClusteringAgglomerative ClusteringSlide 31Divisive ClusteringMore on Hierarchical MethodsSummaryOther Data Mining MethodsSequence analysisSequence analysis (cont …)Discovering holes in dataData and pattern visualizationScaling up data mining algorithmsUIC - CS 594 1Chapter 5: ClusteringUIC - CS 594 2Searching for groupsClustering is unsupervised or undirected.Unlike classification, in clustering, no pre-classified data.Search for groups or clusters of data points (records) that are similar to one another.Similar points may mean: similar customers, products, that will behave in similar ways.UIC - CS 594 3Group similar points togetherGroup points into classes using some distance measures.Within-cluster distance, and between cluster distanceApplications:As a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithmsUIC - CS 594 4An IllustrationUIC - CS 594 5Examples of Clustering ApplicationsMarketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programsInsurance: Identifying groups of motor insurance policy holders with some interesting characteristics. City-planning: Identifying groups of houses according to their house type, value, and geographical locationUIC - CS 594 6Concepts of ClusteringClustersDifferent ways of representing clustersDivision with boundariesSpheresProbabilisticDendrograms…1 2 3I1I2…In0.5 0.2 0.3UIC - CS 594 7ClusteringClustering qualityInter-clusters distance  maximizedIntra-clusters distance  minimizedThe quality of a clustering result depends on both the similarity measure used by the method and its application.The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns Clustering vs. classificationWhich one is more difficult? Why?There are a huge number of clustering techniques.UIC - CS 594 8Dissimilarity/Distance MeasureDissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically metric: d (i, j)The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.Weights should be associated with different variables based on applications and data semantics.It is hard to define “similar enough” or “good enough”. The answer is typically highly subjective.UIC - CS 594 9Types of data in clustering analysisInterval-scaled variablesBinary variablesNominal, ordinal, and ratio variablesVariables of mixed typesUIC - CS 594 10Interval-valued variablesContinuous measurements in a roughly linear scale, e.g., weight, height, temperature, etcStandardize data (depending on applications)Calculate the mean absolute deviation: whereCalculate the standardized measurement (z-score).)...211nffffxx(xn m |)|...|||(|121 fnffffffmxmxmxns ffififsmx zUIC - CS 594 11Similarity Between ObjectsDistance: Measure the similarity or dissimilarity between two data objectsSome popular ones include: Minkowski distance:where (xi1, xi2, …, xip) and (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integerIf q = 1, d is Manhattan distanceqqppqqjxixjxixjxixjid )||...|||(|),(2211||...||||),(2211 ppjxixjxixjxixjid UIC - CS 594 12Similarity Between Objects (Cont.)If q = 2, d is Euclidean distance:Propertiesd(i,j)  0d(i,i) = 0d(i,j) = d(j,i)d(i,j)  d(i,k) + d(k,j)Also, one can use weighted distance, and many other similarity/distance measures. )||...|||(|),(2222211 ppjxixjxixjxixjid UIC - CS 594 13Binary VariablesA contingency table for binary dataSimple matching coefficient (invariant, if the binary variable is symmetric):Jaccard coefficient (noninvariant if the binary variable is asymmetric): dcbacb jid),(pdbcasumdcdcbabasum0101cbacb jid),(Object iObject jUIC - CS 594 14Dissimilarity of Binary VariablesExamplegender is a symmetric attribute (not used below)the remaining attributes are asymmetric attributeslet the values Y and P be set to 1, and the value N be set to 0Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4Jack M Y N P N N NMary F Y N P N P NJim M Y P N N N N75.021121),(67.011111),(33.010210),(maryjimdjimja ckdmaryjackdUIC - CS 594 15Nominal VariablesA generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green, etcMethod 1: Simple matchingm: # of matches, p: total # of variablesMethod 2: use a large number of binary variablescreating a new binary variable for each of the M nominal statespmpjid),(UIC - CS 594 16Ordinal VariablesAn ordinal variable can be discrete or continuousOrder is important, e.g., rankCan be treated like interval-scaled (f is a variable)replace xif by their ranks map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable bycompute the dissimilarity using methods for interval-scaled variables11fififMrz},...,1{fifMr UIC - CS 594 17Ratio-Scaled VariablesRatio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as AeBt or Ae-Bt, e.g., growth of a bacteria population. Methods:treat them like interval-scaled variables—not a good idea! (why?—the scale can be distorted)apply logarithmic transformationyif = log(xif)treat them as continuous ordinal data and then treat their ranks

View Full Document