FSU CIS 5930r - Lecture 9 Cluster Analysis - D2502417

School name: Florida State University

Course: CIS 5930r - Selected Topics in Computer Science (13)

Pages: 31

This preview shows page 1-2-14-15-30-31 out of 31 pages.

Unformatted text preview:

Cluster Analysis

- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary

What is Cluster Analysis?

- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis
  - Grouping a set of data objects into clusters
- Clustering is unsupervised classification: no predefined classes
- Clustering is used:
  - As a stand-alone tool to get insight into data distribution
    - Visualization of clusters may unveil important information
  - As a preprocessing step for other algorithms
    - Efficient indexing or compression often relies on clustering

General Applications of Clustering

- Pattern Recognition
- Spatial Data Analysis
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image Processing
  - cluster images based on their visual content
- Economic Science (especially market research)
- WWW and IR
  - document classification
  - cluster Weblog data to discover groups of similar access patterns

What Is Good Clustering?

- A good clustering method will produce high quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of Clustering in Data Mining

- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Able to deal with noise and outliers
- Insensitive to order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Outliers

- Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
- In some applications we are interested in discovering outliers, not clusters (outlier analysis)

[Figure: a cluster and outliers]

Data Structures

- Data matrix (two modes): the "classic" data input. Rows are tuples/objects, columns are attributes/dimensions.

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

- Dissimilarity or distance matrix (one mode): the desired data input to some clustering algorithms. Both rows and columns are objects.

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Measuring Similarity in Clustering

- Dissimilarity/Similarity metric:
  - The dissimilarity d(i, j) between two objects i and j is expressed in terms of a distance function, which is typically a metric:
    - $d(i, j) \ge 0$ (non-negativity)
    - $d(i, i) = 0$ (isolation)
    - $d(i, j) = d(j, i)$ (symmetry)
    - $d(i, j) \le d(i, h) + d(h, j)$ (triangle inequality)
- The definitions of distance functions are usually different for interval-scaled, boolean, categorical, ordinal and ratio-scaled variables.
- Weights may be associated with different variables based on applications and data semantics.

Type of data in cluster analysis

- Interval-scaled variables
  - e.g., salary, height
- Binary variables
  - e.g., gender (M/F), has_cancer (T/F)
- Nominal (categorical) variables
  - e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
- Ordinal variables
  - e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
- Ratio-scaled variables
  - e.g., population growth (1, 10, 100, 1000, ...)
- Variables of mixed types
  - multiple attributes with various types

Similarity and Dissimilarity Between Objects

- Distance metrics are normally used to measure the similarity or
  dissimilarity between two data objects
- The most popular ones conform to the Minkowski distance:

$$
L_p(i, j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p}
$$

  where $i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jn})$ are two n-dimensional data objects, and p is a positive integer
- If p = 1, $L_1$ is the Manhattan (or city block) distance:

$$
L_1(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|
$$

Similarity and Dissimilarity Between Objects (Cont.)

- If p = 2, $L_2$ is the Euclidean distance:

$$
d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{in} - x_{jn}|^2 }
$$

- Properties
  - $d(i, j) \ge 0$
  - $d(i, i) = 0$
  - $d(i, j) = d(j, i)$
  - $d(i, j) \le d(i, k) + d(k, j)$
- Also one can use weighted distance:

$$
d(i, j) = \sqrt{ w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2 }
$$

Binary Variables

- A binary variable has two states: 0 absent, 1 present
- A contingency table for binary data:

                     object j
                  1      0      sum
  object i   1    a      b      a+b
             0    c      d      c+d
           sum    a+c    b+d    p

- Simple matching coefficient distance (invariant, if the binary variable is symmetric):

$$
d(i, j) = \frac{b + c}{a + b + c + d}
$$

- Jaccard coefficient distance (noninvariant if the binary variable is asymmetric):

$$
d(i, j) = \frac{b + c}{a + b + c}
$$

Binary Variables

- Another approach is to define the similarity of two objects and not their distance. In that case we have the following:
  - Simple matching coefficient similarity:

$$
s(i, j) = \frac{a + d}{a + b + c + d}
$$

  - Jaccard coefficient similarity:

$$
s(i, j) = \frac{a}{a + b + c}
$$

- Note that: $s(i, j) = 1 - d(i, j)$

Dissimilarity between Binary Variables

- Example (Jaccard coefficient)
  - all attributes are asymmetric binary
  - 1 denotes presence or positive test
  - 0 denotes absence or negative test

  Name  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  1      0      1       0       0       0
  Mary  1      0      1       0       1       0
  Jim   1      1      0       0       0       0

$$
d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
$$

$$
d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
$$

$$
d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
$$

A simpler definition

- Each object in the table above is mapped to a bitmap (binary vector)
  - Jack: 101000
  - Mary: 101010
  - Jim: 110000
- Simple match distance:

$$
d(i, j) = \frac{\text{number of non-common bit positions}}{\text{total number of bits}}
$$

- Jaccard coefficient:

$$
d(i, j) = 1 - \frac{\text{number of 1's in } i \wedge j}{\text{number of 1's in } i \vee j}
$$

Variables of Mixed
