Flat Clustering
Adapted from slides by Prabhakar Raghavan, Christopher Manning, Ray Mooney, and Soumen Chakrabarti
Prasad, L16FlatCluster

Today's Topic: Clustering
- Document clustering
  - Motivations
  - Document representations
  - Success criteria
- Clustering algorithms
  - Partitional (flat)
  - Hierarchical (tree)

What is clustering?
- Clustering: the process of grouping a set of objects into classes of similar objects
  - Documents within a cluster should be similar.
  - Documents from different clusters should be dissimilar.
- The most common form of unsupervised learning
  - Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task with many applications in IR and elsewhere

A data set with clear cluster structure
- How would you design an algorithm for finding the three clusters in this case?

Why cluster documents?
- Whole-corpus analysis and navigation: better user interface
- Improving recall in search applications: better search results (pseudo relevance feedback)
- Better navigation of search results: effective "user recall" will be higher
- Speeding up vector space retrieval: faster search

Yahoo! Hierarchy isn't clustering, but it is the kind of output you want from clustering
[Figure: part of the www.yahoo.com/Science hierarchy, showing top-level categories (agriculture, biology, physics, CS, space, ... (30)) and subcategories such as dairy, crops, agronomy, forestry, AI, HCI, craft, missions, botany, evolution, cell, magnetism, relativity, and courses]
Google News: automatic clustering gives an effective news presentation metaphor

Scatter/Gather: Cutting, Karger, and Pedersen

For visualizing a document collection and its themes
- Wise et al., "Visualizing the non-visual", PNNL
- ThemeScapes, Cartia
- [Mountain height = cluster size]

For improving search recall
- Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster docs in the corpus a priori
  - When a query matches a doc D, also return other docs in the cluster containing D
- Example: the query "car" will also return docs containing "automobile", because clustering grouped docs containing "car" together with those containing "automobile". Why might this happen?

For better navigation of search results
- For grouping search results thematically
- clusty.com / Vivisimo

Issues for clustering
- Representation for clustering
  - Document representation: vector space? Normalization?
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data-driven?
  - Avoid "trivial" clusters, i.e., clusters that are too large or too small

What makes docs "related"?
- Ideal: semantic similarity
- Practical: statistical similarity
  - Docs as vectors
  - For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs
  - We can use cosine similarity (or, alternatively, Euclidean distance)

Clustering Algorithms
- Partitional (flat) algorithms
  - Usually start with a random (partial) partition and refine it iteratively
  - K-means clustering
  - (Model-based clustering)
- Hierarchical (tree) algorithms
  - Bottom-up, agglomerative
  - (Top-down, divisive)

Hard vs. soft clustering
- Hard clustering: each document belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: a document can belong to more than one cluster
  - Makes more sense for applications like creating browsable hierarchies
  - You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes

Partitioning Algorithms
- Partitioning method: construct a partition of n documents into a set of K clusters
- Given: a set of documents and the number K
- Find: a partition into K clusters that optimizes the chosen partitioning criterion
  - Globally optimal: exhaustively enumerate all partitions
  - Effective heuristic methods: the K-means and K-medoids algorithms

K-Means
- Assumes documents are real-valued vectors
- Clusters are based on centroids (aka the center of gravity, or mean) of the points in a cluster c:
  μ(c) = (1/|c|) Σ_{x ∈ c} x
- Reassignment of instances to clusters is based on distance to the current cluster centroids
  - (Or one can equivalently phrase it in terms of similarities)

K-Means Algorithm
- Select K random docs {s1, s2, ..., sK} as seeds.
- Until clustering converges, or another stopping criterion is met:
  - For each doc di: assign di to the cluster cj such that dist(di, sj) is minimal.
  - (Then update the seeds: for each cluster cj, set sj = μ(cj).)
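The choice between cosine similarity and Euclidean distance mentioned under "What makes docs related?" can be sketched in a few lines. This is a minimal illustration with made-up three-term weight vectors; the function names are ours, not from the slides:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    """Straight-line distance between two document vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term-weight vectors for two documents over three terms.
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]  # same direction as d1, but twice the length
print(cosine_similarity(d1, d2))   # parallel vectors: cosine similarity is 1.0
print(euclidean_distance(d1, d2))  # yet the Euclidean distance is nonzero
```

The contrast illustrates why both options come up: cosine similarity ignores vector length (useful when documents differ in size), while Euclidean distance does not; for length-normalized vectors the two induce the same ranking of neighbors.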
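The K-means loop described on these slides (K random docs as seeds, reassignment of each doc to the cluster with the nearest centroid, recomputation of each centroid as the mean of its cluster) can be sketched as follows. This is an illustrative implementation under our own naming, with toy 2-D points standing in for document vectors:

```python
import random

def kmeans(docs, k, max_iters=100, seed=0):
    """Minimal K-means sketch: docs is a list of equal-length float vectors."""
    rng = random.Random(seed)
    # Select K random docs as seeds (initial centroids).
    centroids = [list(d) for d in rng.sample(docs, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each doc goes to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for d in docs:
            j = min(range(k),
                    key=lambda i: sum((x - c) ** 2 for x, c in zip(d, centroids[i])))
            clusters[j].append(d)
        # Update step: new centroid = mean (center of gravity) of each cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_centroids.append(
                    [sum(d[t] for d in cluster) / len(cluster) for t in range(dim)])
            else:
                new_centroids.append(centroids[i])  # keep the old seed for an empty cluster
        # Termination: stop once the centroids (hence the assignments) no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated toy "documents" per cluster.
pts = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]]
centroids, clusters = kmeans(pts, k=2)
```

With well-separated data like this the loop converges to centroids near (0.05, 0) and (10.05, 10) no matter which two seeds are drawn; with less separated data, seed choice matters, which is the "Seed Choice" issue the lecture outline raises.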