clustering-lecture

15-505 Internet Search Technologies
Lecture 8: Clustering
Kamal Nigam

Slides adapted from Chris Manning, Prabhakar Raghavan, and Hinrich Schütze (http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html), William Cohen (www.cs.cmu.edu/~wcohen/Matching-2.ppt), and Ray Mooney (http://www.cs.utexas.edu/~mooney/cs391L/slides/clustering.ppt).

Outline (62 slides): What is clustering?; Clustering – Reference matching; Citation ranking; Citation graph browsing; Clustering: Navigation of search results; Clustering: Corpus browsing; Clustering considerations; What makes docs "related"?; What are we optimizing?; Clustering Algorithms; Partitioning Algorithms; K-Means; K-Means Algorithm; K-Means Example (K=2); Termination conditions; Convergence; Time Complexity; Seed Choice; How Many Clusters?; K not specified in advance; Penalize lots of clusters; Hierarchical Clustering; Hierarchical Clustering algorithms; Hierarchical Agglomerative Clustering (HAC); Closest pair of clusters; Single Link Agglomerative Clustering; Single Link Example; Complete Link Agglomerative Clustering; Complete Link Example; Key notion: cluster representative; Centroid-based Similarity; Computational Complexity; Major issue – labeling; How to Label Clusters; Labeling; Scaling up to large datasets; Efficient large-scale clustering; Expensive Distance Metric for Text; String edit (Levenshtein) distance; Levenshtein distance – example; Computing Levenshtein distance; Large Clustering Problems; The Canopies Approach; Illustrating Canopies; Overlapping Canopies; Creating canopies with two thresholds; Using canopies with HAC; Inexpensive Distance Metric for Text; Computational Savings; Canopies; Preserving Good Clustering.

What is clustering?
- Clustering: the process of grouping a set of objects into classes of similar objects.
- The most common form of unsupervised learning.
- Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given.

(Two figure slides, each titled "Clustering", illustrate the idea.)

Clustering – Reference matching
Example references to be matched:
- Fahlman, Scott & Lebiere, Christian (1989). The cascade-correlation learning architecture. In Touretzky, D., editor, Advances in Neural Information Processing Systems (volume 2), (pp. 524-532), San Mateo, CA. Morgan Kaufmann.
- Fahlman, S.E. and Lebiere, C., "The Cascade Correlation Learning Architecture," NIPS, Vol. 2, pp. 524-532, Morgan Kaufmann, 1990.
- Fahlman, S. E. (1991) The recurrent cascade-correlation learning architecture. In Lippman, R.P., Moody, J.E., and Touretzky, D.S., editors, NIPS 3, 190-205.

(Figure slides: Citation ranking; Citation graph browsing.)

Clustering: Navigation of search results
- For grouping search results thematically, e.g. clusty.com / Vivisimo.

Clustering: Corpus browsing
- (Figure: the www.yahoo.com/Science directory hierarchy, with top-level categories such as agriculture, biology, physics, CS, and space, each subdivided, e.g. agriculture: dairy, crops, agronomy, forestry; biology: botany, cell, evolution; physics: magnetism, relativity; CS: AI, HCI, courses; space: craft, missions.)

Clustering considerations
- What does it mean for objects to be similar?
- What algorithm and approach do we take? Top-down: k-means. Bottom-up: hierarchical agglomerative clustering.
- Do we need a hierarchical arrangement of clusters?
- How many clusters?
- Can we label or name the clusters?
- How do we make it efficient and scalable?

What makes docs "related"?
- Ideal: semantic similarity. Practical: statistical similarity.
- Treat documents as vectors.
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
- Think of either cosine similarity or Euclidean distance.

What are we optimizing?
- Given: the final number of clusters.
- Optimize the "tightness" of clusters: the {average/min/max} distance of points to each other in the same cluster? The {average/min/max} distance of points to their cluster's center?
- Usually clustering finds heuristic approximations.

Clustering Algorithms
- Partitional algorithms: usually start with a random (partial) partitioning and refine it iteratively. Examples: K-means clustering, model-based clustering.
- Hierarchical algorithms: bottom-up (agglomerative) or top-down (divisive).

Partitioning Algorithms
- Partitioning method: construct a partition of n documents into a set of K clusters.
- Given: a set of documents and the number K. Find: a partition into K clusters that optimizes the chosen partitioning criterion.
- Globally optimal: exhaustively enumerate all partitions.
- Effective heuristic method: the K-means algorithm.

K-Means
- Assumes documents are real-valued vectors.
- Clusters are based on centroids (aka the center of gravity, or mean) of the points in a cluster c: μ(c) = (1/|c|) Σ_{x ∈ c} x
- Reassignment of instances to clusters is based on distance to the current cluster centroids.

K-Means Algorithm
Select K random docs {s1, s2, ..., sK} as seeds.
Until clustering converges (or other stopping criterion):
  For each doc di: assign di to the cluster cj such that dist(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster.) For each cluster cj: sj = μ(cj).

K-Means Example (K=2)
(Figure slide animating the algorithm on a set of points: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!)

Termination conditions
- Several possibilities, e.g.: a fixed number of iterations; the doc partition is unchanged; the centroid positions don't change.
- Does the last condition mean that the docs in a cluster are unchanged?

Convergence
- Why should the K-means algorithm ever reach a fixed point, i.e. a state in which clusters don't change?
- K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm, and EM is known to converge.
- Theoretically the number of iterations could be large; typically it converges quickly.

Time Complexity
- Computing the distance between a doc and a centroid is O(m), where m is the dimensionality of the vectors.
- Reassigning clusters: O(Kn) distance computations, i.e. O(Knm).
- Computing centroids: each doc gets added once to some centroid: O(nm).
- Assume these two steps are each done once in each of I iterations: O(IKnm) overall.
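The K-means loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's code: the function name and signature are my own, it uses squared Euclidean distance for the reassignment step, and it terminates when the doc partition is unchanged (one of the conditions listed above).

```python
import random

def kmeans(docs, k, max_iters=100, seed=0):
    """Minimal K-means sketch: docs are equal-length lists of floats."""
    rng = random.Random(seed)
    # Select K random docs as seeds (initial centroids).
    centroids = [list(d) for d in rng.sample(docs, k)]
    assignment = [-1] * len(docs)
    for _ in range(max_iters):
        # Reassignment step: O(Kn) distance computations, O(m) each.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(d, centroids[j])))
            for d in docs
        ]
        if new_assignment == assignment:  # doc partition unchanged -> converged
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its cluster, O(nm) total.
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return assignment, centroids
```

For example, `kmeans([[0.0], [0.1], [5.0], [5.1]], k=2)` groups the first two and last two docs together, with centroids near 0.05 and 5.05, regardless of which docs are drawn as seeds.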

