1LBSC 796/INFM 718R: Week 10Clustering, Classifying, and FilteringJimmy LinCollege of Information StudiesUniversity of MarylandMonday, April 10, 2006The Information Retrieval CycleSourceSelectionSearchQuerySelectionRanked ListExaminationDocumentsDeliveryDocumentsQueryFormulationResourcesource reselectionSystem discoveryVocabulary discoveryConcept discoveryDocument discoveryOrganizing Search ResultsSingle documentCollection of documentsEnhanced presentation(syntax)Concepts and relationships(semantics)Last weekTodayToday’s Focus: Clustering| Can we do better than a ranked list?| How do we automatically group documents into clusters?| What are the issues to consider?Related Topics| Using the tools in your toolbox to tackle related problems:z Classification: automatically assign labels to documentsz Filtering: automatically decide if a document matches my information needsText Clustering| Automatically partition documents into clusters based on contentz Documents within each cluster should be similarz Documents in different clusters should be different| Discover categories in an unsupervised mannerz No sample category labels provided by humans2Visualizing ClustersCentroidsThe Cluster Hypothesis“Closely associated documents tend to be relevant to the same requests.”van Rijsbergen 1979“… I would claim that document clustering can lead to more effective retrieval than linearsearch [which] ignores the relationships thatexist between documents.”van Rijsbergen 1979Outline of Clustering| How do you actually do it?| Why would you want to do it?| How can you build interfaces that support clustering?Two Strategies| Aglommerative (bottom-up) methodsz Start with each document in its own clusterz Iteratively combine smaller clusters to form larger clusters| Divisive (partitional, top-down) methodsz Directly separate documents into clustersHAC| HAC = Hierarchical Agglomerative Clustering| Start with each document in its own cluster| Until there is only one cluster:z Among the current clusters, determine the two clusters ciand cj, that are most similarz Replace ciand cjwith a single cluster ci∪ cj| The history of merging forms the hierarchyHACABCDEF GH3What’s going on geometrically? Cluster Similarity| Assume a similarity function that determines the similarity of two instances: sim(x,y)z What’s appropriate for documents?| What’s the similarity between two clusters?z Single Link: similarity of two most similar membersz Complete Link: similarity of two least similar membersz Group Average: average similarity between membersDifferent Similarity Functions| Single link:z Uses maximum similarity of pairs:z Can result in “straggly” (long and thin) clusters due to chaining effect| Complete link:z Use minimum similarity of pairs:z Makes more “tight,” spherical clusters),(max),(,yxsimccsimjicycxji∈∈=),(min),(,yxsimccsimjicycxji∈∈=Non-Hierarchical Clustering| Typically, must provide the number of desired clusters, k| Randomly choose k instances as seeds, one per cluster| Form initial clusters based on these seeds| Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering| Stop when clustering converges or after a fixed number of iterationsK-Means| Clusters are determined by centroids (center of gravity) of documents in a cluster:| Reassignment of documents to clusters is based on distance to the current cluster centroids∑∈=cxxcrrr||1(c)µK-Means Algorithm| Let d be the distance measure between documents| Select k random instances {s1, s2,… sk} as seeds.| Until clustering converges or other stopping criterion:z Assign each instance xito the cluster cjsuch that d(xi, sj) is minimalz Update the seeds to the centroid of each clusterz For each cluster cj, sj= µ(cj)4K-Means Clustering ExamplePick seedsReassign clustersCompute centroidsxxReasssign clustersxxxxCompute centroidsReassign clustersConverged!K-Means: Discussion| How do you select k?| Results can vary based on random seed selectionz Some seeds can result in poor convergence rate, or convergence to sub-optimal clustersWhy cluster for IR?| Cluster the collectionz Retrieve clusters instead of documents| Cluster the results“Closely associated documents tend to be relevant to the same requests.”“… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents.”From Clusters to CentroidsCentroidsClustering the Collection| Basic idea:z Cluster the document collection z Find the centroid of each clusterz Search only on the centroids, but retrieve clusters| If the cluster hypothesis is true, then this should perform better| Why would you want to do this?| Why doesn’t it work?Clustering the Results| Scatter/Gather| Swish(Hearst and Pedersen, 1996)(Chen and Dumais, 2000)5Scatter/Gather| How it worksz The system clusters documents into general “themes”z The system displays the contents of the clusters by showing topical terms and typical titlesz User chooses a subset of the clustersz The system automatically re-clusters documents within selected clusterz The new clusters have different, more refined, “themes”| Originally used to give collection overview| Evidence suggests more appropriate for displaying retrieval results in contextMarti A. Hearst and Jan O. Pedersen. (1996) Reexaming the Cluster Hypothesis: Scatter/Gather on Retrieval Results. Proceedings of SIGIR 1996.symbols 8 docsfilm, tv 68 docsastrophysics 97 docsastronomy 67 docsflora/fauna 10 docsClustering and re-clustering is entirely automatedsports 14 docsfilm, tv 47 docsmusic 7 docsstellar phenomena 12 docsgalaxies, stars 49 docsconstellations 29 docsmiscellaneous 7 docsQuery = “star” on encyclopedic textScatter/Gather ExampleClustering Result Sets| Advantages:z Topically coherent sets of documents are presented to the user togetherz User gets a sense for the range of themes in the result setz Supports exploration and browsing of retrieved hits| Disadvantage:z Clusters might not “make sense”z May be difficult to understand the theme of a cluster based on summary termsz Additional computational processing required| Things to ponder:z What is the relationship between clusters and classification systems?z Why does this work?6Two Queries: Two ClusteringsAUTO, CAR, ELECTRICAUTO, CAR, SAFETYThe main differences are the clusters that are central to the query8 control drive accident …25 battery california
View Full Document