Clustering CS4780 Machine Learning Fall 2009 Thorsten Joachims Cornell University Reading Manning Schuetze Chapter 14 not 14 1 3 14 1 4 Based on slides from Prof Claire Cardie Prof Ray Mooney Prof Yiming Yang Outline Supervised vs Unsupervised Learning Hierarchical Clustering Hierarchical Agglomerative Clustering HAC Non Hierarchical Clustering K means EM Algorithm Supervised vs Unsupervised Learning Supervised Learning Classification partition examples into groups according to pre defined categories Regression assign value to feature vectors Requires labeled data for training Unsupervised Learning Clustering partition examples into groups when no pre defined categories classes are available Novelty detection find changes in data Outlier detection find unusual events e g hackers Only instances required but no labels Clustering Partition unlabeled examples into disjoint subsets of clusters such that Examples within a cluster are similar Examples in different clusters are different Discover new categories in an unsupervised manner no sample category labels provided Applications of Clustering Cluster retrieved documents e g Teoma to present more organized and understandable results to user Detecting near duplicates Entity resolution E g Thorsten Joachims Thorsten B Joachims Cheating detection Exploratory data analysis Automated or semi automated creation of taxonomies e g Yahoo style Compression Clustering Example Clustering Example Clustering Example Similarity Distance Measures Euclidian distance L2 norm L2 x x m xi xi 2 i 1 L1 norm L1 x x m xi xi i 1 Cosine similarity cos x x Kernels x x x x Hierarchical Clustering Build a tree based hierarchical taxonomy from a set of unlabeled examples animal vertebrate fish reptile amphib mammal invertebrate worm insect crustacean Recursive application of a standard clustering algorithm can produce a hierarchical clustering Agglomerative vs Divisive Clustering Agglomerative bottom up methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters Divisive top down separate all examples immediately into clusters animal vertebrate fish reptile amphib mammal invertebrate worm insect crustacean Hierarchical Agglomerative Clustering HAC Assumes a similarity function for determining the similarity of two clusters Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster The history of merging forms a binary tree or hierarchy Basic algorithm Start with all instances in their own cluster Until there is only one cluster Among the current clusters determine the two clusters ci and cj that are most similar Replace ci and cj with a single cluster ci cj Cluster Similarity How to compute similarity of two clusters each possibly containing multiple instances Single link Similarity of two most similar members Complete link Similarity of two least similar members Group average Average similarity between members Single Link Agglomerative Clustering When computing cluster similarity use maximum similarity of pairs sim ci c j max sim x y x ci y c j Can result in straggly long and thin clusters due to chaining effect Single Link Example 1 2 5 6 3 4 7 8 Complete Link Agglomerative Clustering When computing cluster similarity use minimum similarity of pairs sim ci c j min sim x y x ci y c j Makes more tight spherical clusters Complete Link Example 1 2 5 6 3 4 7 8 Computational Complexity of HAC In the first iteration all HAC methods need to compute similarity of all pairs of n individual instances which is O n2 In each of the subsequent n 2 merging iterations it must compute the distance between the most recently created cluster and all other existing clusters In order to maintain the similarity matrix in O n2 overall computing the similarity to any other cluster must each be done in constant time Maintain e g Heap to find smallest pair Computing Cluster Similarity After merging ci and cj the similarity of the resulting cluster to any other cluster ck can be computed by Single Link sim ci c j ck max sim ci ck sim c j ck Complete Link sim ci c j ck min sim ci ck sim c j ck Group Average Agglomerative Clustering Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters sim ci c j ci 1 c j ci cj 1 x sim x y ci c j y ci c j y x Compromise between single and complete link Computing Group Average Similarity Assume cosine similarity and normalized vectors with unit length Always maintain sum of vectors in each cluster s c j x cj x Compute similarity of clusters in constant time sim ci c j s ci s c j s ci s c j ci ci ci ci ci ci 1 Non Hierarchical Clustering Single pass clustering K means clustering hard Expectation maximization soft Clustering Criterion Evaluation function that assigns a usually realvalued value to a clustering Clustering criterion typically function of within cluster similarity and between cluster dissimilarity Optimization Find clustering that maximizes the criterion Global optimization often intractable Greedy search Approximation algorithms Centroid Based Clustering Assumes instances are real valued vectors Clusters represented via centroids i e mean of points in a cluster c c 1 x c x c Reassignment of instances to clusters is based on distance to the current cluster centroids K Means Algorithm Input k number of clusters distance measure d Select k random instances s1 s2 sk as seeds Until clustering converges or other stopping criterion For each instance xi Assign xi to the cluster cj such that d xi sj is min For each cluster cj update the centroid of each cluster sj cj K means Example k 2 Pick seeds Reassign clusters Compute centroids Reasssign clusters x x x x Compute centroids Reassign clusters Converged Time Complexity Assume computing distance between two instances is O m where m is the dimensionality of the vectors Reassigning clusters for n points O kn distance computations or O knm Computing centroids Each instance gets added once to some centroid O nm Assume these two steps are each done once for i iterations O iknm Linear in all relevant factors assuming a fixed number of iterations more efficient than HAC Buckshot Algorithm Problem Results can vary based on random seed selection especially for high dimensional data Some seeds can result in poor convergence rate or convergence to sub optimal clusterings Idea Combine HAC and K means clustering First
View Full Document
Unlocking...