CS6322: Information Retrieval
Sanda Harabagiu
Lecture 13: Clustering

Today's Topic: Clustering
- Document clustering
  - Motivations
  - Document representations
  - Success criteria
- Clustering algorithms
  - Partitional
  - Hierarchical

What is clustering? (Ch. 16)
- Clustering: the process of grouping a set of objects into classes of similar objects
  - Documents within a cluster should be similar.
  - Documents from different clusters should be dissimilar.
- The commonest form of unsupervised learning
  - Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of the examples is given
- A common and important task that finds many applications in IR and other places

A data set with clear cluster structure (Ch. 16)
[Figure: a scatter plot of points forming three well-separated groups]
- How would you design an algorithm for finding the three clusters in this case?

Applications of clustering in IR (Sec. 16.1)
- Whole corpus analysis/navigation
  - Better user interface: search without typing
- For improving recall in search applications
  - Better search results (like pseudo RF)
- For better navigation of search results
  - Effective "user recall" will be higher
- For speeding up vector space retrieval
  - Cluster-based retrieval gives faster search

The Yahoo! hierarchy isn't clustering, but it is the kind of output you want from clustering
[Figure: a portion of the Yahoo! directory at www.yahoo.com/Science, with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry, botany, evolution, cell, AI, HCI, magnetism, relativity, courses, craft, and missions]
Google News: automatic clustering gives an effective news presentation metaphor

Scatter/Gather: Cutting, Karger, and Pedersen (Sec. 16.1)

For visualizing a document collection and its themes
- Wise et al., "Visualizing the non-visual": PNNL
- ThemeScapes, Cartia
  - [Mountain height = cluster size]

For improving search recall (Sec. 16.1)
- Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster docs in the corpus a priori
  - When a query matches a doc D, also return other docs in the cluster containing D
- Hope if we do this:
  - The query "car" will also return docs containing "automobile"
  - Because clustering grouped together docs containing "car" with those containing "automobile"
- Why might this happen?

For better navigation of search results (Sec. 16.1)
- For grouping search results thematically
  - clusty.com / Vivisimo

Problem Statement
- Defines the goal of hard flat clustering. Given:
  1. A set of documents D = {d1, d2, …, dN}
  2. A desired number of clusters K
  3. An objective function that evaluates the quality of a clustering
  compute an assignment γ: D → {1, 2, …, K} that minimizes the objective function.
- Sometimes we want γ to be surjective (none of the K clusters is empty).
- The objective function is defined in terms of similarity or distance between documents.

Issues for clustering
- Representation for clustering
  - Document representation: vector space? Normalization?
    - Centroids aren't length normalized
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data driven?
  - Avoid "trivial" clusters, whether too large or too small: if a cluster is too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much. (Sec. 16.2)

Notion of similarity/distance
- Ideal: semantic similarity
- Practical: term-statistical similarity
  - We will use cosine similarity.
  - Docs as vectors.
  - For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
  - We will mostly speak of Euclidean distance, but real implementations use cosine similarity.

Clustering Algorithms
- Flat algorithms
  - Usually start with a random (partial) partitioning and refine it iteratively
  - K-means clustering
  - (Model-based clustering)
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - (Top-down, divisive)

Hard vs. soft clustering
- Hard clustering: each document belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: a document can belong to more than one cluster
  - Makes more sense for applications like creating browsable hierarchies
  - You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
  - You can only do that with a soft clustering approach.
- We won't do soft clustering today. See IIR 16.5, 18.

Partitioning Algorithms
- Partitioning method: construct a partition of n documents into a set of K clusters
- Given: a set of documents and the number K
- Find: a partition into K clusters that optimizes the chosen partitioning criterion
  - Globally optimal: exhaustively enumerating all partitions is intractable for many objective functions
  - Effective heuristic methods: the K-means and K-medoids algorithms

K-Means
- Assumes documents are real-valued vectors.
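Since documents here are real-valued vectors and real implementations use cosine similarity, a minimal sketch of that measure is shown below (plain Python over dense lists; the function name and toy vectors are illustrative, not from any IR toolkit):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)
```

Note that for length-normalized vectors, squared Euclidean distance is a monotone function of cosine similarity (|u − v|² = 2(1 − cos(u, v))), which is why the lecture can speak of Euclidean distance while implementations use cosine.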
- Clusters are based on centroids (aka the center of gravity, or mean) of the points in a cluster c:
  µ(c) = (1/|c|) · Σ_{x ∈ c} x
- Reassignment of instances to clusters is based on distance to the current cluster centroids.
  - (Or one can equivalently phrase it in terms of similarities.) (Sec. 16.4)

K-Means Algorithm
1. Select K random docs {s1, s2, …, sK} as seeds.
2. Until clustering converges (or another stopping criterion is met):
   a) For each doc di: assign di to the cluster cj such that dist(di, sj) is minimal.
   b) (Next, update the seeds to the centroid of each cluster.) For each cluster cj: sj = µ(cj).
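The two-step loop above can be sketched in plain Python (a minimal illustration, not a production implementation: dense list vectors, squared Euclidean distance, and a "no document changed cluster" convergence test are all simplifications):

```python
import random

def centroid(cluster):
    """µ(c) = (1/|c|) * sum of the vectors in c, i.e. the component-wise mean."""
    n = len(cluster)
    return [sum(x[i] for x in cluster) / n for i in range(len(cluster[0]))]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def k_means(docs, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: select K random docs as seeds.
    seeds = [list(d) for d in rng.sample(docs, k)]
    assignment = None
    # Step 2: iterate until convergence (or max_iters as a fallback stop).
    for _ in range(max_iters):
        # a) Assign each doc to the cluster with the nearest current centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(d, seeds[j]))
            for d in docs
        ]
        if new_assignment == assignment:  # no doc moved -> converged
            break
        assignment = new_assignment
        # b) Recompute each seed as the centroid µ(cj) of its cluster.
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:  # keep the old seed if a cluster went empty
                seeds[j] = centroid(members)
    return assignment, seeds
```

On a toy set of four 2-D points in two well-separated groups, `k_means(docs, 2)` recovers the two groups within a couple of iterations regardless of which points are drawn as seeds.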