UT Dallas CS 6359 - Lecture13

CS6322: Information Retrieval
Sanda Harabagiu
Lecture 13: Clustering

Today's Topic: Clustering
- Document clustering
  - Motivations
  - Document representations
  - Success criteria
- Clustering algorithms
  - Partitional
  - Hierarchical

What is clustering? [Ch. 16]
- Clustering: the process of grouping a set of objects into classes of similar objects
  - Documents within a cluster should be similar.
  - Documents from different clusters should be dissimilar.
- The commonest form of unsupervised learning
  - Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task that finds many applications in IR and other places

A data set with clear cluster structure [Ch. 16]
[Figure: scatter plot of points forming three visually distinct clusters]
- How would you design an algorithm for finding the three clusters in this case?

Applications of clustering in IR [Sec. 16.1]
- Whole corpus analysis/navigation
  - Better user interface: search without typing
- For improving recall in search applications
  - Better search results (like pseudo RF)
- For better navigation of search results
  - Effective "user recall" will be higher
- For speeding up vector space retrieval
  - Cluster-based retrieval gives faster search

Yahoo! hierarchy
- The Yahoo! hierarchy isn't clustering, but it is the kind of output you want from clustering
[Figure: portion of the Yahoo! directory at www.yahoo.com/Science, with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories including dairy, crops, agronomy, forestry, botany, evolution, cell, magnetism, relativity, AI, HCI, craft, missions, and courses]
Google News
- Automatic clustering gives an effective news presentation metaphor

Scatter/Gather: Cutting, Karger, and Pedersen [Sec. 16.1]

For visualizing a document collection and its themes
- Wise et al., "Visualizing the non-visual", PNNL
- ThemeScapes, Cartia
  - [Mountain height = cluster size]

For improving search recall [Sec. 16.1]
- Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster docs in corpus a priori
  - When a query matches a doc D, also return other docs in the cluster containing D
- Hope if we do this: the query "car" will also return docs containing "automobile"
  - Because clustering grouped together docs containing "car" with those containing "automobile". Why might this happen?

For better navigation of search results [Sec. 16.1]
- For grouping search results thematically
  - clusty.com / Vivisimo

Problem Statement
- Define the goal of hard flat clustering. Given:
  1. A set of documents D = {d1, d2, ..., dN}
  2. A desired number of clusters K
  3. An objective function that evaluates the quality of a clustering
- Compute an assignment γ: D → {1, 2, ..., K} that minimizes the objective function
  - Sometimes we want γ to be surjective (none of the K clusters is empty!)
- The objective function is defined in terms of similarity or distance between documents.

Issues for clustering [Sec. 16.2]
- Representation for clustering
  - Document representation
    - Vector space?
    - Normalization? Centroids aren't length-normalized
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori? Completely data-driven?
  - Avoid "trivial" clusters: too large or too small
    - If a cluster is too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much.

Notion of similarity/distance
- Ideal: semantic similarity
- Practical: term-statistical similarity
  - We will use cosine similarity.
  - Docs as vectors.
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
- We will mostly speak of Euclidean distance
  - But real implementations use cosine similarity

Clustering Algorithms
- Flat algorithms
  - Usually start with a random (partial) partitioning
  - Refine it iteratively
    - K-means clustering
    - (Model-based clustering)
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - (Top-down, divisive)

Hard vs. soft clustering
- Hard clustering: each document belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: a document can belong to more than one cluster.
  - Makes more sense for applications like creating browsable hierarchies
  - You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
  - You can only do that with a soft clustering approach.
- We won't do soft clustering today.
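The term-statistical similarity discussed above can be made concrete. Below is a minimal sketch of cosine similarity between documents represented as sparse term-weight dictionaries; the function name and example vocabularies are illustrative, not from the lecture:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors (dicts: term -> weight)."""
    # Dot product over the terms the two vectors share.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

d1 = {"car": 2.0, "engine": 1.0}
d2 = {"automobile": 1.0, "engine": 1.0}
print(cosine_similarity(d1, d1))  # close to 1.0 for identical documents
print(cosine_similarity(d1, d2))  # lower: only "engine" is shared
```

Note that because cosine similarity depends only on the angle between vectors, document length normalization comes for free, which is one reason real implementations prefer it over raw Euclidean distance.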
  - See IIR 16.5, 18

Partitioning Algorithms
- Partitioning method: construct a partition of N documents into a set of K clusters
- Given: a set of documents and the number K
- Find: a partition into K clusters that optimizes the chosen partitioning criterion
  - Globally optimal solutions are intractable for many objective functions, since they would require exhaustively enumerating all partitions
  - Effective heuristic methods: K-means and K-medoids algorithms

K-Means [Sec. 16.4]
- Assumes documents are real-valued vectors.
- Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c:
    μ(c) = (1/|c|) · Σ_{x ∈ c} x
- Reassignment of instances to clusters is based on distance to the current cluster centroids.
  - (Or one can equivalently phrase it in terms of similarities)

K-Means Algorithm
1. Select K random docs {s1, s2, ..., sK} as seeds.
2. Until clustering converges (or other stopping criterion):
   a) For each doc di: assign di to the cluster cj such that dist(di, sj) is minimal.
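The K-means algorithm above can be sketched as follows. This is a minimal illustration, not the lecture's reference implementation: documents are assumed to be dense real-valued vectors (tuples), distance is Euclidean as in the slides (real IR implementations would use cosine similarity), and the function names are hypothetical:

```python
import random

def centroid(cluster):
    """mu(c) = (1/|c|) * sum of the vectors in c (vectors as tuples of floats)."""
    n = len(cluster)
    return tuple(sum(x[i] for x in cluster) / n for i in range(len(cluster[0])))

def squared_dist(a, b):
    """Squared Euclidean distance (monotone in distance, so fine for argmin)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(docs, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # 1. Select K random docs as seeds.
    centers = rng.sample(docs, k)
    assignment = None
    for _ in range(max_iters):
        # 2a. Assign each doc to the cluster with the nearest current centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_dist(d, centers[j]))
            for d in docs
        ]
        if new_assignment == assignment:
            break  # converged: no assignment changed
        assignment = new_assignment
        # 2b. Recompute each centroid from its cluster (keep old center if empty).
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:
                centers[j] = centroid(members)
    return assignment, centers

docs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(docs, k=2)
print(labels)  # the first two docs share one label, the last two the other
```

Stopping when assignments no longer change is one common convergence test; fixing the number of iterations or thresholding the decrease in the objective are equally standard alternatives.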

