Slide 1Today’s Topic: ClusteringWhat is clustering?A data set with clear cluster structureApplications of clustering in IRYahoo! Hierarchy isn’t clustering but is the kind of output you want from clusteringGoogle News: automatic clustering gives an effective news presentation metaphorScatter/Gather: Cutting, Karger, and PedersenFor visualizing a document collection and its themesFor improving search recallSlide 11Issues for clusteringNotion of similarity/distanceClustering AlgorithmsHard vs. soft clusteringPartitioning AlgorithmsK-MeansK-Means AlgorithmK Means Example (K=2)Termination conditionsConvergenceConvergence of K-MeansSlide 23Time ComplexitySeed ChoiceK-means issues, variations, etc.How Many Clusters?K not specified in advanceSlide 29Penalize lots of clustersHierarchical ClusteringDendrogram: Hierarchical ClusteringHierarchical Agglomerative Clustering (HAC)Closest pair of clustersSingle Link Agglomerative ClusteringSingle Link ExampleComplete LinkComplete Link ExampleComputational ComplexityGroup AverageComputing Group Average SimilarityWhat Is A Good Clustering?External criteria for clustering qualityExternal Evaluation of Cluster QualityPurity exampleRand Index measures between pair decisions. Here RI = 0.68Rand index and Cluster F-measureFinal word and resourcesIntroduction to Information RetrievalIntroduction to Information Retrieval Introduction toInformation RetrievalCS276: Information Retrieval and Web SearchPandu Nayak and Prabhakar RaghavanLecture 12: ClusteringIntroduction to Information RetrievalIntroduction to Information Retrieval Today’s Topic: ClusteringDocument clusteringMotivationsDocument representationsSuccess criteriaClustering algorithmsPartitionalHierarchicalIntroduction to Information RetrievalIntroduction to Information Retrieval What is clustering?Clustering: the process of grouping a set of objects into classes of similar objectsDocuments within a cluster should be similar.Documents from different clusters should be dissimilar.The commonest form of unsupervised learningUnsupervised learning = learning from raw data, as opposed to supervised data where a classification of examples is givenA common and important task that finds many applications in IR and other placesCh. 16Introduction to Information RetrievalIntroduction to Information Retrieval A data set with clear cluster structureHow would you design an algorithm for finding the three clusters in this case?Ch. 16Introduction to Information RetrievalIntroduction to Information Retrieval Applications of clustering in IRWhole corpus analysis/navigationBetter user interface: search without typingFor improving recall in search applicationsBetter search results (like pseudo RF)For better navigation of search resultsEffective “user recall” will be higherFor speeding up vector space retrievalCluster-based retrieval gives faster searchSec. 16.1Introduction to Information RetrievalIntroduction to Information Retrieval Yahoo! Hierarchy isn’t clustering but is the kind of output you want from clusteringdairycropsagronomyforestryAIHCIcraftmissionsbotanyevolutioncellmagnetismrelativitycoursesagriculture biology physics CS space...... ...… (30)www.yahoo.com/Science... ...Introduction to Information RetrievalIntroduction to Information Retrieval Google News: automatic clustering gives an effective news presentation metaphorIntroduction to Information RetrievalIntroduction to Information Retrieval Scatter/Gather: Cutting, Karger, and PedersenSec. 16.1Introduction to Information RetrievalIntroduction to Information Retrieval For visualizing a document collection and its themesWise et al, “Visualizing the non-visual” PNNLThemeScapes, Cartia[Mountain height = cluster size]Introduction to Information RetrievalIntroduction to Information Retrieval For improving search recallCluster hypothesis - Documents in the same cluster behave similarly with respect to relevance to information needsTherefore, to improve search recall:Cluster docs in corpus a prioriWhen a query matches a doc D, also return other docs in the cluster containing DHope if we do this: The query “car” will also return docs containing automobileBecause clustering grouped together docs containing car with those containing automobile.Why might this happen?Sec. 16.1Introduction to Information RetrievalIntroduction to Information Retrieval 11yippy.com – grouping search resultsIntroduction to Information RetrievalIntroduction to Information Retrieval Issues for clusteringRepresentation for clusteringDocument representationVector space? Normalization?Centroids aren’t length normalizedNeed a notion of similarity/distanceHow many clusters?Fixed a priori?Completely data driven?Avoid “trivial” clusters - too large or smallIf a cluster's too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much.Sec. 16.2Introduction to Information RetrievalIntroduction to Information Retrieval Notion of similarity/distanceIdeal: semantic similarity.Practical: term-statistical similarityWe will use cosine similarity.Docs as vectors.For many algorithms, easier to think in terms of a distance (rather than similarity) between docs.We will mostly speak of Euclidean distanceBut real implementations use cosine similarityIntroduction to Information RetrievalIntroduction to Information Retrieval Clustering AlgorithmsFlat algorithmsUsually start with a random (partial) partitioningRefine it iterativelyK means clustering(Model based clustering)Hierarchical algorithmsBottom-up, agglomerative(Top-down, divisive)Introduction to Information RetrievalIntroduction to Information Retrieval Hard vs. soft clusteringHard clustering: Each document belongs to exactly one clusterMore common and easier to doSoft clustering: A document can belong to more than one cluster.Makes more sense for applications like creating browsable hierarchiesYou may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoesYou can only do that with a soft clustering approach.We won’t do soft clustering today. See IIR 16.5, 18Introduction to Information RetrievalIntroduction to Information Retrieval Partitioning AlgorithmsPartitioning method: Construct a partition of
View Full Document