Flat Clustering
Adapted from slides by Prabhakar Raghavan, Christopher Manning, Ray Mooney, and Soumen Chakrabarti
Prasad, L16FlatCluster

Today's Topic: Clustering
- Document clustering
  - Motivations
  - Document representations
  - Success criteria
- Clustering algorithms
  - Partitional (flat)
  - Hierarchical (tree)

What is clustering?
- Clustering: the process of grouping a set of objects into classes of similar objects
  - Documents within a cluster should be similar.
  - Documents from different clusters should be dissimilar.
- The most common form of unsupervised learning
  - Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task with many applications in IR and elsewhere

A data set with clear cluster structure
- How would you design an algorithm for finding the three clusters in this case?

Why cluster documents?
- Whole-corpus analysis and navigation: better user interface
- Improving recall in search applications: better search results (pseudo relevance feedback)
- Better navigation of search results: effective "user recall" will be higher
- Speeding up vector space retrieval: faster search

Yahoo! Hierarchy isn't clustering, but it is the kind of output you want from clustering
[Figure: part of the www.yahoo.com/Science hierarchy, showing top-level categories (agriculture, biology, physics, CS, space, ... (30)) and subcategories such as dairy, crops, agronomy, forestry, AI, HCI, craft, missions, botany, evolution, cell, magnetism, relativity, and courses]
Google News: automatic clustering gives an effective news presentation metaphor

Scatter/Gather: Cutting, Karger, and Pedersen

For visualizing a document collection and its themes
- Wise et al., "Visualizing the non-visual", PNNL
- ThemeScapes, Cartia
- [Mountain height = cluster size]

For improving search recall
- Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster docs in the corpus a priori
  - When a query matches a doc D, also return other docs in the cluster containing D
- Example: the query "car" will also return docs containing "automobile", because clustering grouped docs containing "car" together with those containing "automobile". Why might this happen?

For better navigation of search results
- For grouping search results thematically
- clusty.com / Vivisimo

Issues for clustering
- Representation for clustering
  - Document representation: vector space? Normalization?
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data-driven?
  - Avoid "trivial" clusters, i.e., clusters that are too large or too small

What makes docs "related"?
- Ideal: semantic similarity
- Practical: statistical similarity
  - Docs as vectors
  - For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs
  - We can use cosine similarity (or, alternatively, Euclidean distance)

Clustering Algorithms
- Partitional (flat) algorithms
  - Usually start with a random (partial) partition and refine it iteratively
  - K-means clustering
  - (Model-based clustering)
- Hierarchical (tree) algorithms
  - Bottom-up, agglomerative
  - (Top-down, divisive)

Hard vs. soft clustering
- Hard clustering: each document belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: a document can belong to more than one cluster
  - Makes more sense for applications like creating browsable hierarchies
  - You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes

Partitioning Algorithms
- Partitioning method: construct a partition of n documents into a set of K clusters
- Given: a set of documents and the number K
- Find: a partition into K clusters that optimizes the chosen partitioning criterion
  - Globally optimal: exhaustively enumerate all partitions
  - Effective heuristic methods: the K-means and K-medoids algorithms

K-Means
- Assumes documents are real-valued vectors
- Clusters are based on centroids (aka the center of gravity, or mean) of the points in a cluster c:
  μ(c) = (1/|c|) Σ_{x ∈ c} x
- Reassignment of instances to clusters is based on distance to the current cluster centroids
  - (Or one can equivalently phrase it in terms of similarities)

K-Means Algorithm
- Select K random docs {s1, s2, ..., sK} as seeds.
- Until clustering converges, or another stopping criterion is met:
  - For each doc di: assign di to the cluster cj such that dist(di, sj) is minimal.
  - (Then update the seeds: for each cluster cj, set sj = μ(cj).)
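The choice between cosine similarity and Euclidean distance mentioned under "What makes docs related?" can be sketched in a few lines. This is a minimal illustration with made-up three-term weight vectors; the function names are ours, not from the slides:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    """Straight-line distance between two document vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term-weight vectors for two documents over three terms.
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]  # same direction as d1, but twice the length
print(cosine_similarity(d1, d2))   # parallel vectors: cosine similarity is 1.0
print(euclidean_distance(d1, d2))  # yet the Euclidean distance is nonzero
```

The contrast illustrates why both options come up: cosine similarity ignores vector length (useful when documents differ in size), while Euclidean distance does not; for length-normalized vectors the two induce the same ranking of neighbors.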
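The K-means loop described on these slides (K random docs as seeds, reassignment of each doc to the cluster with the nearest centroid, recomputation of each centroid as the mean of its cluster) can be sketched as follows. This is an illustrative implementation under our own naming, with toy 2-D points standing in for document vectors:

```python
import random

def kmeans(docs, k, max_iters=100, seed=0):
    """Minimal K-means sketch: docs is a list of equal-length float vectors."""
    rng = random.Random(seed)
    # Select K random docs as seeds (initial centroids).
    centroids = [list(d) for d in rng.sample(docs, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each doc goes to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for d in docs:
            j = min(range(k),
                    key=lambda i: sum((x - c) ** 2 for x, c in zip(d, centroids[i])))
            clusters[j].append(d)
        # Update step: new centroid = mean (center of gravity) of each cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_centroids.append(
                    [sum(d[t] for d in cluster) / len(cluster) for t in range(dim)])
            else:
                new_centroids.append(centroids[i])  # keep the old seed for an empty cluster
        # Termination: stop once the centroids (hence the assignments) no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated toy "documents" per cluster.
pts = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]]
centroids, clusters = kmeans(pts, k=2)
```

With well-separated data like this the loop converges to centroids near (0.05, 0) and (10.05, 10) no matter which two seeds are drawn; with less separated data, seed choice matters, which is the "Seed Choice" issue the lecture outline raises.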