UMD LBSC 796 - Clustering, Classifying, and Filtering - D1975724

Home> Schools> University of Maryland, College Park> Library Science (LBSC) > LBSC 796> Clustering, Classifying, and Filtering

DOC PREVIEW

UMD LBSC 796 - Clustering, Classifying, and Filtering

School name University of Maryland, College Park

Course Lbsc 796- Information Retrieval Systems

Pages 13

This preview shows page 1-2-3-4 out of 13 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 13 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 13 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 13 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 13 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 13 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

1LBSC 796/INFM 718R: Week 10Clustering, Classifying, and FilteringJimmy LinCollege of Information StudiesUniversity of MarylandMonday, April 10, 2006The Information Retrieval CycleSourceSelectionSearchQuerySelectionRanked ListExaminationDocumentsDeliveryDocumentsQueryFormulationResourcesource reselectionSystem discoveryVocabulary discoveryConcept discoveryDocument discoveryOrganizing Search ResultsSingle documentCollection of documentsEnhanced presentation(syntax)Concepts and relationships(semantics)Last weekTodayToday’s Focus: Clustering| Can we do better than a ranked list?| How do we automatically group documents into clusters?| What are the issues to consider?Related Topics| Using the tools in your toolbox to tackle related problems:z Classification: automatically assign labels to documentsz Filtering: automatically decide if a document matches my information needsText Clustering| Automatically partition documents into clusters based on contentz Documents within each cluster should be similarz Documents in different clusters should be different| Discover categories in an unsupervised mannerz No sample category labels provided by humans2Visualizing ClustersCentroidsThe Cluster Hypothesis“Closely associated documents tend to be relevant to the same requests.”van Rijsbergen 1979“… I would claim that document clustering can lead to more effective retrieval than linearsearch [which] ignores the relationships thatexist between documents.”van Rijsbergen 1979Outline of Clustering| How do you actually do it?| Why would you want to do it?| How can you build interfaces that support clustering?Two Strategies| Aglommerative (bottom-up) methodsz Start with each document in its own clusterz Iteratively combine smaller clusters to form larger clusters| Divisive (partitional, top-down) methodsz Directly separate documents into clustersHAC| HAC = Hierarchical Agglomerative Clustering| Start with each document in its own cluster| Until there is only one cluster:z Among the current clusters, determine the two clusters ciand cj, that are most similarz Replace ciand cjwith a single cluster ci∪ cj| The history of merging forms the hierarchyHACABCDEF GH3What’s going on geometrically? Cluster Similarity| Assume a similarity function that determines the similarity of two instances: sim(x,y)z What’s appropriate for documents?| What’s the similarity between two clusters?z Single Link: similarity of two most similar membersz Complete Link: similarity of two least similar membersz Group Average: average similarity between membersDifferent Similarity Functions| Single link:z Uses maximum similarity of pairs:z Can result in “straggly” (long and thin) clusters due to chaining effect| Complete link:z Use minimum similarity of pairs:z Makes more “tight,” spherical clusters),(max),(,yxsimccsimjicycxji∈∈=),(min),(,yxsimccsimjicycxji∈∈=Non-Hierarchical Clustering| Typically, must provide the number of desired clusters, k| Randomly choose k instances as seeds, one per cluster| Form initial clusters based on these seeds| Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering| Stop when clustering converges or after a fixed number of iterationsK-Means| Clusters are determined by centroids (center of gravity) of documents in a cluster:| Reassignment of documents to clusters is based on distance to the current cluster centroids∑∈=cxxcrrr||1(c)µK-Means Algorithm| Let d be the distance measure between documents| Select k random instances {s1, s2,… sk} as seeds.| Until clustering converges or other stopping criterion:z Assign each instance xito the cluster cjsuch that d(xi, sj) is minimalz Update the seeds to the centroid of each clusterz For each cluster cj, sj= µ(cj)4K-Means Clustering ExamplePick seedsReassign clustersCompute centroidsxxReasssign clustersxxxxCompute centroidsReassign clustersConverged!K-Means: Discussion| How do you select k?| Results can vary based on random seed selectionz Some seeds can result in poor convergence rate, or convergence to sub-optimal clustersWhy cluster for IR?| Cluster the collectionz Retrieve clusters instead of documents| Cluster the results“Closely associated documents tend to be relevant to the same requests.”“… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents.”From Clusters to CentroidsCentroidsClustering the Collection| Basic idea:z Cluster the document collection z Find the centroid of each clusterz Search only on the centroids, but retrieve clusters| If the cluster hypothesis is true, then this should perform better| Why would you want to do this?| Why doesn’t it work?Clustering the Results| Scatter/Gather| Swish(Hearst and Pedersen, 1996)(Chen and Dumais, 2000)5Scatter/Gather| How it worksz The system clusters documents into general “themes”z The system displays the contents of the clusters by showing topical terms and typical titlesz User chooses a subset of the clustersz The system automatically re-clusters documents within selected clusterz The new clusters have different, more refined, “themes”| Originally used to give collection overview| Evidence suggests more appropriate for displaying retrieval results in contextMarti A. Hearst and Jan O. Pedersen. (1996) Reexaming the Cluster Hypothesis: Scatter/Gather on Retrieval Results. Proceedings of SIGIR 1996.symbols 8 docsfilm, tv 68 docsastrophysics 97 docsastronomy 67 docsflora/fauna 10 docsClustering and re-clustering is entirely automatedsports 14 docsfilm, tv 47 docsmusic 7 docsstellar phenomena 12 docsgalaxies, stars 49 docsconstellations 29 docsmiscellaneous 7 docsQuery = “star” on encyclopedic textScatter/Gather ExampleClustering Result Sets| Advantages:z Topically coherent sets of documents are presented to the user togetherz User gets a sense for the range of themes in the result setz Supports exploration and browsing of retrieved hits| Disadvantage:z Clusters might not “make sense”z May be difficult to understand the theme of a cluster based on summary termsz Additional computational processing required| Things to ponder:z What is the relationship between clusters and classification systems?z Why does this work?6Two Queries: Two ClusteringsAUTO, CAR, ELECTRICAUTO, CAR, SAFETYThe main differences are the clusters that are central to the query8 control drive accident …25 battery california

View Full Document