CS6322: Information Retrieval
Sanda Harabagiu
Lecture 13: Clustering

Today's Topic: Clustering
- Document clustering
  - Motivations
  - Document representations
  - Success criteria
- Clustering algorithms
  - Partitional
  - Hierarchical

What is clustering? (Ch. 16)
- Clustering: the process of grouping a set of objects into classes of similar objects
  - Documents within a cluster should be similar.
  - Documents from different clusters should be dissimilar.
- The commonest form of unsupervised learning
  - Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of the examples is given
- A common and important task that finds many applications in IR and other places

A data set with clear cluster structure (Ch. 16)
[Figure: a scatter plot of points forming three well-separated groups]
- How would you design an algorithm for finding the three clusters in this case?

Applications of clustering in IR (Sec. 16.1)
- Whole corpus analysis/navigation
  - Better user interface: search without typing
- For improving recall in search applications
  - Better search results (like pseudo RF)
- For better navigation of search results
  - Effective "user recall" will be higher
- For speeding up vector space retrieval
  - Cluster-based retrieval gives faster search

The Yahoo! hierarchy isn't clustering, but it is the kind of output you want from clustering
[Figure: a portion of the Yahoo! directory at www.yahoo.com/Science, with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry, botany, evolution, cell, AI, HCI, magnetism, relativity, courses, craft, and missions]
Google News: automatic clustering gives an effective news presentation metaphor

Scatter/Gather: Cutting, Karger, and Pedersen (Sec. 16.1)

For visualizing a document collection and its themes
- Wise et al., "Visualizing the non-visual": PNNL
- ThemeScapes, Cartia
  - [Mountain height = cluster size]

For improving search recall (Sec. 16.1)
- Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster docs in the corpus a priori
  - When a query matches a doc D, also return other docs in the cluster containing D
- Hope if we do this:
  - The query "car" will also return docs containing "automobile"
  - Because clustering grouped together docs containing "car" with those containing "automobile"
- Why might this happen?

For better navigation of search results (Sec. 16.1)
- For grouping search results thematically
  - clusty.com / Vivisimo

Problem Statement
- Defines the goal of hard flat clustering. Given:
  1. A set of documents D = {d1, d2, …, dN}
  2. A desired number of clusters K
  3. An objective function that evaluates the quality of a clustering
  compute an assignment γ: D → {1, 2, …, K} that minimizes the objective function.
- Sometimes we want γ to be surjective (none of the K clusters is empty).
- The objective function is defined in terms of similarity or distance between documents.

Issues for clustering
- Representation for clustering
  - Document representation: vector space? Normalization?
    - Centroids aren't length normalized
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data driven?
  - Avoid "trivial" clusters, whether too large or too small: if a cluster is too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much. (Sec. 16.2)

Notion of similarity/distance
- Ideal: semantic similarity
- Practical: term-statistical similarity
  - We will use cosine similarity.
  - Docs as vectors.
  - For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
  - We will mostly speak of Euclidean distance, but real implementations use cosine similarity.

Clustering Algorithms
- Flat algorithms
  - Usually start with a random (partial) partitioning and refine it iteratively
  - K-means clustering
  - (Model-based clustering)
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - (Top-down, divisive)

Hard vs. soft clustering
- Hard clustering: each document belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: a document can belong to more than one cluster
  - Makes more sense for applications like creating browsable hierarchies
  - You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
  - You can only do that with a soft clustering approach.
- We won't do soft clustering today. See IIR 16.5, 18.

Partitioning Algorithms
- Partitioning method: construct a partition of n documents into a set of K clusters
- Given: a set of documents and the number K
- Find: a partition into K clusters that optimizes the chosen partitioning criterion
  - Globally optimal: exhaustively enumerating all partitions is intractable for many objective functions
  - Effective heuristic methods: the K-means and K-medoids algorithms

K-Means
- Assumes documents are real-valued vectors.
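Since documents here are real-valued vectors and real implementations use cosine similarity, a minimal sketch of that measure is shown below (plain Python over dense lists; the function name and toy vectors are illustrative, not from any IR toolkit):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)
```

Note that for length-normalized vectors, squared Euclidean distance is a monotone function of cosine similarity (|u − v|² = 2(1 − cos(u, v))), which is why the lecture can speak of Euclidean distance while implementations use cosine.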
- Clusters are based on centroids (aka the center of gravity, or mean) of the points in a cluster c:
  µ(c) = (1/|c|) · Σ_{x ∈ c} x
- Reassignment of instances to clusters is based on distance to the current cluster centroids.
  - (Or one can equivalently phrase it in terms of similarities.) (Sec. 16.4)

K-Means Algorithm
1. Select K random docs {s1, s2, …, sK} as seeds.
2. Until clustering converges (or another stopping criterion is met):
   a) For each doc di: assign di to the cluster cj such that dist(di, sj) is minimal.
   b) (Next, update the seeds to the centroid of each cluster.) For each cluster cj: sj = µ(cj).
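The two-step loop above can be sketched in plain Python (a minimal illustration, not a production implementation: dense list vectors, squared Euclidean distance, and a "no document changed cluster" convergence test are all simplifications):

```python
import random

def centroid(cluster):
    """µ(c) = (1/|c|) * sum of the vectors in c, i.e. the component-wise mean."""
    n = len(cluster)
    return [sum(x[i] for x in cluster) / n for i in range(len(cluster[0]))]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def k_means(docs, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: select K random docs as seeds.
    seeds = [list(d) for d in rng.sample(docs, k)]
    assignment = None
    # Step 2: iterate until convergence (or max_iters as a fallback stop).
    for _ in range(max_iters):
        # a) Assign each doc to the cluster with the nearest current centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(d, seeds[j]))
            for d in docs
        ]
        if new_assignment == assignment:  # no doc moved -> converged
            break
        assignment = new_assignment
        # b) Recompute each seed as the centroid µ(cj) of its cluster.
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:  # keep the old seed if a cluster went empty
                seeds[j] = centroid(members)
    return assignment, seeds
```

On a toy set of four 2-D points in two well-separated groups, `k_means(docs, 2)` recovers the two groups within a couple of iterations regardless of which points are drawn as seeds.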