Slide outline:
- Flat Clustering
- Today's Topic: Clustering
- What is clustering?
- A data set with clear cluster structure
- Why cluster documents?
- Yahoo! hierarchy isn't clustering, but is the kind of output you want from clustering
- Google News: automatic clustering gives an effective news presentation metaphor
- Scatter/Gather: Cutting, Karger, and Pedersen
- For visualizing a document collection and its themes
- For improving search recall
- For better navigation of search results
- Issues for clustering
- What makes docs "related"?
- Clustering algorithms
- Hard vs. soft clustering
- Partitioning algorithms
- K-Means
- K-Means algorithm
- K-Means example (K = 2)
- Termination conditions
- Convergence
- Convergence of K-Means
- Time complexity
- Seed choice
- K-means issues, variations, etc.
- How many clusters?
- Hierarchical clustering
- "The curse of dimensionality"
- Hierarchical clustering algorithms
- Hierarchical Agglomerative Clustering (HAC) algorithm
- Dendrogram: document example
- Key notion: cluster representative
- Example: n = 6, k = 3, closest pair of centroids
- Outliers in centroid computation
- Closest pair of clusters
- Single-link agglomerative clustering
- Single-link example
- Complete-link agglomerative clustering
- Complete-link example
- Computational complexity
- Group-average agglomerative clustering
- Computing group-average similarity
- Medoid as cluster representative
- Efficiency: "using approximations"
- Major issue: labeling
- How to label clusters
- Labeling
- What is a good clustering?
- External criteria for clustering quality
- External evaluation of cluster quality
- Purity example
- Rand index
- Rand index: symmetric version
- Rand index example: 0.68
- Final word and resources
- Skip what follows: term vs. document space; multilingual docs; feature selection; evaluation of clustering; approaches to evaluating; anecdotal evaluation; ground-"truth" comparison; microeconomic viewpoint; evaluation example: cluster retrieval; evaluation

Flat Clustering
Prasad, L16FlatCluster
Adapted from slides by Prabhakar Raghavan, Christopher Manning, Ray Mooney, and Soumen Chakrabarti

Today's Topic: Clustering
- Document clustering
  - Motivations
  - Document representations
  - Success criteria
- Clustering algorithms
  - Partitional (flat)
  - Hierarchical (tree)

What is clustering?
- Clustering: the process of grouping a set of objects into classes of similar objects
  - Documents within a cluster should be similar
  - Documents from different clusters should be dissimilar
- The commonest form of unsupervised learning
  - Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task that finds many applications in IR and elsewhere

A data set with clear cluster structure
- How would you design an algorithm for finding the three clusters in this case? [Figure: a scatter plot with three visually distinct clusters]

Why cluster documents?
- Whole-corpus analysis/navigation: better user interface
- Improving recall in search applications: better search results (pseudo relevance feedback)
- Better navigation of search results: effective "user recall" will be higher
- Speeding up vector space retrieval: faster search

Yahoo! hierarchy isn't clustering, but is the kind of output you want from clustering
[Figure: a fragment of the www.yahoo.com/Science directory, with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry, AI, HCI, craft, missions, botany, evolution, cell, magnetism, relativity, and courses]
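The definition above asks that documents within a cluster be similar and documents across clusters dissimilar. As a minimal sketch of how that similarity is commonly measured over bag-of-words vectors (the toy documents and term counts below are invented for illustration, not taken from the slides):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors,
    represented as dicts mapping term -> count."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical toy "documents" as bags of words.
d1 = {"car": 2, "engine": 1}
d2 = {"automobile": 1, "engine": 2}
d3 = {"soccer": 2, "goal": 1}

# Topically related docs score higher than unrelated ones.
print(cosine_similarity(d1, d2) > cosine_similarity(d1, d3))  # True
```

Documents sharing no terms (d1 and d3) get similarity 0, while overlapping vocabularies score higher; this is the statistical, rather than semantic, notion of relatedness that the clustering algorithms below rely on.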
Google News: automatic clustering gives an effective news presentation metaphor

Scatter/Gather: Cutting, Karger, and Pedersen

For visualizing a document collection and its themes
- Wise et al., "Visualizing the non-visual", PNNL
- ThemeScapes, Cartia (mountain height = cluster size)

For improving search recall
- Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster the docs in the corpus a priori
  - When a query matches a doc D, also return other docs in the cluster containing D
- Example: the query "car" will also return docs containing "automobile", because clustering grouped docs containing "car" together with those containing "automobile"
- Why might this happen?

For better navigation of search results
- For grouping search results thematically
- clusty.com / Vivisimo

Issues for clustering
- Representation for clustering
  - Document representation: vector space? normalization?
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori, or completely data-driven?
  - Avoid "trivial" clusters: too large or too small

What makes docs "related"?
- Ideal: semantic similarity; practical: statistical similarity
- Docs as vectors
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs
- We can use cosine similarity (or, alternatively, Euclidean distance)

Clustering algorithms
- Partitional (flat) algorithms
  - Usually start with a random (partial) partition and refine it iteratively
  - K-means clustering; (model-based clustering)
- Hierarchical (tree) algorithms
  - Bottom-up, agglomerative
  - (Top-down, divisive)

Hard vs. soft clustering
- Hard clustering: each document belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: a document can belong to more than one cluster
  - Makes more sense for applications like creating browsable hierarchies
  - You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes

Partitioning algorithms
- Partitioning method: construct a partition of n documents into a set of K clusters
- Given: a set of documents and the number K
- Find: a partition into K clusters that optimizes the chosen partitioning criterion
  - Globally optimal: exhaustively enumerate all partitions
  - Effective heuristic methods: K-means and K-medoids algorithms

K-Means
- Assumes documents are real-valued vectors
- Clusters based on centroids (a.k.a. the center of gravity, or mean) of the points in a cluster c: μ(c) = (1/|c|) · Σ_{x ∈ c} x
- Reassignment of instances to clusters is based on distance to the current cluster centroids
  - (Or one can equivalently phrase it in terms of similarities)

K-Means algorithm
- Select K random docs {s1, s2, ..., sK} as seeds
- Until the clustering converges, or another stopping criterion is met:
  - For each doc di: assign di to the cluster cj whose current centroid is closest to di



Wright CS 707 - Flat Clustering
