1Eric Xing 1Machine LearningMachine Learning1010--701/15701/15--781, Spring 2008781, Spring 2008ClusteringClusteringEric XingEric XingLecture 15, March 17, 2008Reading: Chap. 9, C.B bookEric Xing 2What is clustering?z Clustering: the process of grouping a set of objects into classes of similar objectsz high intra-class similarityz low inter-class similarityz It is the commonest form of unsupervised learningz Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data where a classification of examples is givenz A common and important task that finds many applications in Science, Engineering, information Science, and other placesz Group genes that perform the same functionz Group individuals that has similar political viewz Categorize documents of similar topics z Ideality similar objects from pictures2Eric Xing 3Examplesz Peoplez Imagesz Languagez speciesEric Xing 4Issues for clusteringz What is a natural grouping among these objects?z Definition of "groupness"z What makes objects “related”?z Definition of "similarity/distance"z Representation for objectsz Vector space? Normalization?z How many clusters?z Fixed a priori?z Completely data driven?z Avoid “trivial” clusters - too large or smallz Clustering Algorithmsz Partitional algorithmsz Hierarchical algorithmsz Formal foundation and convergence3Eric Xing 5What is a natural grouping among these objects?Eric Xing 6What is Similarity?z The real meaning of similarity is a philosophical question. We will take a more pragmatic approachz Depends on representation and algorithm. For many rep./alg., easier to think in terms of a distance (rather than similarity) between vectors.Hard to define! Hard to define! ButButwe know it we know it when we see itwhen we see it4Eric Xing 7What properties should a distance measure have?z D(A,B) = D(B,A) Symmetryz D(A,A) = 0 Constancy of Self-Similarityz D(A,B) = 0 IIf A= B Positivity Separationz D(A,B) ≤ D(A,C) + D(B,C) Triangular InequalityEric Xing 8z D(A,B) = D(B,A) Symmetryz Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex"z D(A,A) = 0 Constancy of Self-Similarityz Otherwise you could claim "Alex looks more like Bob, than Bob does"z D(A,B) = 0 IIf A= B Positivity Separationz Otherwise there are objects in your world that are different, but you cannot tell apart.z D(A,B) ≤ D(A,C) + D(B,C) Triangular Inequalityz Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl"Intuitions behind desirable distance measure properties5Eric Xing 9Distance Measures: MinkowskiMetricz Suppose two object x and y both have p featuresz The Minkowski metric is defined byz Most Common Minkowski Metricsrpiiiryxyxd ||),(∑=−=1),,,(),,,(ppyyyyxxxxLL2121==||max),( ) distance sup"(" 3,||),( distance) (Manhattan 2,||),( ) distance (Euclidean 1,iipipiiipiiiyxyxd ryxyxd ryxyxd r−=+∞=−==−==≤≤==∑∑1122112Eric Xing 10.},{max :distance sup"" :3. :distanceManhattan :2. :distanceEuclidean :1434734534222==+=+An Example43xy6Eric Xing 1111011111100001110100111001001001101716151413121110987654321GeneBGeneA. :Distance Hamming 5141001 =+=+ )#()#(z Manhattan distance is called Hamming distance when all features are binary.z Gene Expression Levels Under 17 Conditions (1-High,0-Low)Hamming distanceEric Xing 12. and where)()())((),(∑∑∑∑∑=======−×−−−=piippiippipiiipiiiyyxxyyxxyyxxyxs1111112211≤),( yxsSimilarity Measures: Correlation Coefficientz Pearson correlation coefficientz Special case: cosine distanceyxyxyxsrrrr⋅⋅=),(7Eric Xing 13Similarity Measures: Correlation CoefficientTimeGene AGene BGene ATimeGene BExpression LevelExpression LevelExpression LevelTimeGene AGene BEric Xing 14A generic technique for measuring similarityz To measure the similarity between two objects, transform one of the objects into the other, and measure how much effort it took. The measure of effort becomes the distance measure.The distance between Patty and Selma.Change dress color, 1 pointChange earring shape, 1 pointChange hair part, 1 pointD(Patty,Selma) = 3The distance between Marge and Selma.Change dress color, 1 pointAdd earrings, 1 pointDecrease height, 1 pointTake up smoking, 1 pointLose weight, 1 pointDPMarge,Selma) = 5This is called the Edit distance or theTransformation distance8Eric Xing 15Clustering Algorithmsz Partitional algorithmsz Usually start with a random (partial) partitioningz Refine it iterativelyz K means clusteringz Mixture-Model based clusteringz Hierarchical algorithmsz Bottom-up, agglomerativez Top-down, divisiveEric Xing 16Hierarchical Clusteringz Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.z Note that hierarchies are commonly used to organize information, for example in a web portal.z Yahoo! is hierarchy is manually created, we will focus on automatic creation of hierarchies in data mining.9Eric Xing 17Dendogramz A Useful Tool for Summarizing Similarity Measurementz The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.z Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.Eric Xing 18Hierarchical Clusteringz Bottom-Up Agglomerative Clusteringz Starts with each obj in a separate cluster z then repeatedly joins the closest pair of clusters, z until there is only one cluster.The history of merging forms a binary tree or hierarchy.z Top-Down divisive z Starting with all the data in a single cluster, z Consider every possible way to divide the cluster into two. Choose the best division z And recursively operate on both sides.10Eric Xing 19Closest pair of clustersThe distance between two clusters is defined as the distance betweenz Single-Link z Nearest Neighbor: their closest members.z Complete-Link z Furthest Neighbor: their furthest members.z Centroid: z Clusters whose centroids (centers of gravity) are the most cosine-similarz Average: z average of all cross-cluster pairs.Eric Xing 20ba453652cbadcbDistance MatrixEuclidean Distance453,cbadc453652cbadcb4,, cbad(1)(2)(3)a,b,cccda,bd da,b,c,dSingle-Link Method11Eric Xing 21ba453652cbadcbDistance MatrixEuclidean Distance465,cbadc453652cbadcb6,,badc(1)(2)(3)a,bccda,bdc,da,b,c,dComplete-Link MethodEric Xing 22 a b c d a b c d2460Single-Link Complete-LinkDendrograms12Eric Xing 23Another ExampleEric Xing 24Computational Complexityz
View Full Document