Chapter 4: Unsupervised Learning
CS583, Bing Liu, UIC

Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Discovering holes and data regions
- Summary

Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute. These patterns are then used to predict the values of the target attribute in future data instances.
Unsupervised learning: the data have no target attribute. We want to explore the data to find some intrinsic structures in them.

Clustering
Clustering is a technique for finding similarity groups in data, called clusters. That is, it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different from (far away from) each other into different clusters.
Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning.
For historical reasons, clustering is often considered synonymous with unsupervised learning. In fact, association rule mining is also unsupervised. This chapter focuses on clustering.

An illustration
The data set has three natural groups of data points, i.e., 3 natural clusters.

What is clustering for?
Let us see some real-life examples.
Example 1: group people of similar sizes together to make "small", "medium" and "large" T-shirts.
- Tailor-made for each person: too expensive.
- One-size-fits-all: does not fit all.
Example 2: in marketing, segment customers according to their similarities, to do targeted marketing.

What is clustering for? (cont...)
Example 3: given a collection of text documents, organize them according to their content similarities, to produce a topic hierarchy.
In fact, clustering is one of the most widely used data mining techniques. It has a long history and has been used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, and libraries. In recent years, due to the rapid growth of online documents, text clustering has become important.

Aspects of clustering
- A clustering algorithm: partitional clustering, hierarchical clustering, ...
- A distance (similarity, or dissimilarity) function
- Clustering quality: inter-cluster distance maximized; intra-cluster distance minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.

K-means clustering
K-means is a partitional clustering algorithm.
Let the set of data points (or instances) D be {x1, x2, ..., xn}, where xi = (xi1, xi2, ..., xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
The k-means algorithm partitions the given data into k clusters.
Each cluster has a cluster center, called the centroid.
k is specified by the user.

K-means algorithm
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).

Stopping/convergence criterion
1. no (or minimal) re-assignment of data points to different clusters,
2. no (or minimal) change of centroids, or
3. minimal decrease in the sum of squared error (SSE):
   SSE = sum over j = 1..k of ( sum over x in Cj of dist(x, mj)^2 ),
   where Cj is the j-th cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj.
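The four steps above can be sketched directly in code. The following is a minimal illustrative implementation, not the textbook's own code; it assumes squared Euclidean distance and uses the SSE-based stopping criterion (an empty cluster simply keeps its old centroid, one common way to handle that corner case):

```python
import random

def dist2(p, c):
    # Squared Euclidean distance between a point p and a centroid c.
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def kmeans(D, k, max_iter=100, tol=1e-6, seed=0):
    """D: list of r-dimensional points (tuples/lists of floats).
    Returns (labels, centroids, sse)."""
    rnd = random.Random(seed)
    # 1) Randomly choose k data points (seeds) as the initial centroids.
    centroids = [list(p) for p in rnd.sample(list(D), k)]
    prev_sse = float("inf")
    labels = [0] * len(D)
    for _ in range(max_iter):
        # 2) Assign each data point to the closest centroid.
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in D]
        # 3) Re-compute each centroid as the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(D, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # 4) Convergence check: stop when the SSE no longer decreases.
        sse = sum(dist2(p, centroids[lab]) for p, lab in zip(D, labels))
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return labels, centroids, sse
```

For example, on four points forming two well-separated groups, the algorithm recovers the two natural clusters:

```python
D = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]
labels, centroids, sse = kmeans(D, k=2)
# labels groups the first two points together and the last two together
```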