Stanford CS 276 - Introduction to Information Retrieval


Slide titles:
Today’s Topic: Clustering
What is clustering?
A data set with clear cluster structure
Applications of clustering in IR
Yahoo! Hierarchy isn’t clustering but is the kind of output you want from clustering
Google News: automatic clustering gives an effective news presentation metaphor
Scatter/Gather: Cutting, Karger, and Pedersen
For visualizing a document collection and its themes
For improving search recall
Issues for clustering
Notion of similarity/distance
Clustering Algorithms
Hard vs. soft clustering
Partitioning Algorithms
K-Means
K-Means Algorithm
K Means Example (K=2)
Termination conditions
Convergence
Convergence of K-Means
Time Complexity
Seed Choice
K-means issues, variations, etc.
How Many Clusters?
K not specified in advance
Penalize lots of clusters
Hierarchical Clustering
Dendrogram: Hierarchical Clustering
Hierarchical Agglomerative Clustering (HAC)
Closest pair of clusters
Single Link Agglomerative Clustering
Single Link Example
Complete Link
Complete Link Example
Computational Complexity
Group Average
Computing Group Average Similarity
What Is A Good Clustering?
External criteria for clustering quality
External Evaluation of Cluster Quality
Purity example
Rand Index measures between pair decisions. Here RI = 0.68
Rand index and Cluster F-measure
Final word and resources

Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 12: Clustering

Today’s Topic: Clustering
  Document clustering
    Motivations
    Document representations
    Success criteria
  Clustering algorithms
    Partitional
    Hierarchical

What is clustering? (Ch. 16)
  Clustering: the process of grouping a set of objects into classes of similar objects
    Documents within a cluster should be similar.
    Documents from different clusters should be dissimilar.
  The commonest form of unsupervised learning
    Unsupervised learning = learning from raw data, as opposed to supervised data where a classification of examples is given
  A common and important task that finds many applications in IR and other places

A data set with clear cluster structure (Ch. 16)
  How would you design an algorithm for finding the three clusters in this case?

Applications of clustering in IR (Sec. 16.1)
  Whole corpus analysis/navigation
    Better user interface: search without typing
  For improving recall in search applications
    Better search results (like pseudo RF)
  For better navigation of search results
    Effective “user recall” will be higher
  For speeding up vector space retrieval
    Cluster-based retrieval gives faster search

Yahoo! Hierarchy isn’t clustering but is the kind of output you want from clustering
  [Figure: www.yahoo.com/Science hierarchy. Top-level categories: agriculture, biology, physics, CS, space, ... (30). Subcategories shown: dairy, crops, agronomy, forestry, AI, HCI, craft, missions, botany, evolution, cell, magnetism, relativity, courses.]

Google News: automatic clustering gives an effective news presentation metaphor

Scatter/Gather: Cutting, Karger, and Pedersen (Sec. 16.1)

For visualizing a document collection and its themes
  Wise et al., “Visualizing the non-visual” PNNL
  ThemeScapes, Cartia
  [Mountain height = cluster size]

For improving search recall (Sec. 16.1)
  Cluster hypothesis - Documents in the same cluster behave similarly with respect to relevance to information needs
  Therefore, to improve search recall:
    Cluster docs in corpus a priori
    When a query matches a doc D, also return other docs in the cluster containing D
  Hope if we do this: The query “car” will also return docs containing automobile
    Because clustering grouped together docs containing car with those containing automobile.
  Why might this happen?

yippy.com – grouping search results

Issues for clustering (Sec. 16.2)
  Representation for clustering
    Document representation
      Vector space? Normalization?
      Centroids aren’t length normalized
    Need a notion of similarity/distance
  How many clusters?
    Fixed a priori?
    Completely data driven?
      Avoid “trivial” clusters - too large or small
        If a cluster's too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much.

Notion of similarity/distance
  Ideal: semantic similarity.
  Practical: term-statistical similarity
    We will use cosine similarity.
    Docs as vectors.
    For many algorithms, easier to think in terms of a distance (rather than similarity) between docs.
    We will mostly speak of Euclidean distance
      But real implementations use cosine similarity

Clustering Algorithms
  Flat algorithms
    Usually start with a random (partial) partitioning
    Refine it iteratively
      K means clustering
      (Model based clustering)
  Hierarchical algorithms
    Bottom-up, agglomerative
    (Top-down, divisive)

Hard vs. soft clustering
  Hard clustering: Each document belongs to exactly one cluster
    More common and easier to do
  Soft clustering: A document can belong to more than one cluster.
    Makes more sense for applications like creating browsable hierarchies
    You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
    You can only do that with a soft clustering approach.
  We won’t do soft clustering today. See IIR 16.5, 18

Partitioning Algorithms
  Partitioning method: Construct a partition of
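
The "Notion of similarity/distance" slide speaks of Euclidean distance while noting that real implementations use cosine similarity. The sketch below, which assumes documents are length-normalized term-weight vectors, illustrates why the two views agree: for unit vectors, squared Euclidean distance equals 2 * (1 - cosine similarity), so both rank neighbors in the same order. It also checks the remark that centroids aren't length normalized. The vectors and helper names are hypothetical and not taken from the lecture.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical term-weight vectors for three documents (illustration only).
doc_a = l2_normalize([3.0, 1.0, 0.0])
doc_b = l2_normalize([2.0, 2.0, 0.0])
doc_c = l2_normalize([0.0, 1.0, 4.0])

for other in (doc_b, doc_c):
    cos = cosine_similarity(doc_a, other)
    dist = euclidean_distance(doc_a, other)
    # For unit-length vectors: dist^2 == 2 * (1 - cos).
    assert math.isclose(dist ** 2, 2 * (1 - cos), abs_tol=1e-12)

# The mean of distinct unit vectors is shorter than unit length.
mu = centroid([doc_a, doc_b, doc_c])
mu_length = math.sqrt(sum(x * x for x in mu))
```

Because of this identity, an algorithm described in terms of Euclidean distance can be run on normalized vectors with cosine similarity and pick the same nearest neighbors. The centroid must be re-normalized if a unit-length representative is needed.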

