UCSD CSE 254 - Document Clustering Using Word Clusters (31 pages)

Pages: 31
School: University of California, San Diego
Course: CSE 254 - Seminar on Learning Algorithms

Document Clustering Using Word Clusters via the Information Bottleneck Method
Noam Slonim and Naftali Tishby
Conference on Research and Development in Information Retrieval (SIGIR), 2000
Presented by Bret Ehlert, May 14, 2002

Document Clustering
- Document clustering is closely related to text classification.
- Traditional Clustering Methods
  - Represent a document as a vector of weights for the terms that occur in the document.
  (Example table: documents doc1 and doc2 as rows; terms w1, w2, w3, ..., w124080, w124081, ..., wordn as columns. Recoverable entries: doc1 has w1 = 0.0 and w2 = 0.75; doc2 has w1 = 0.6 and w2 = 0.21. The remaining columns are mostly 0.0, with isolated weights of 0.13 and 0.36.)
- This representation has many disadvantages:
  - High dimensionality
  - Sparseness
  - Loss of word ordering information
- Clustering documents using the distances between pairs of vectors is troublesome.
- The Information Bottleneck is an alternative method that does not rely on vector distances.

Dimensionality Reduction
- Dimensionality reduction is beneficial for improved accuracy and efficiency when clustering documents.
  - Latent semantic indexing (LSI)
  - Information Gain and Mutual Information Measures
  - Chi-Squared Statistic
  - Term Strength Algorithm
  - Distributional Clustering
    - Cluster words based on their distribution across documents.
    - The Information Bottleneck is a distributional clustering method.

The Information Bottleneck
- A distributional clustering method
- Used to cluster words, reducing the dimensionality of document representations
- Used to cluster documents
- The agglomerative algorithm presented in the paper is a special case of a general approach:
  Tishby, Pereira, and Bialek, "The Information Bottleneck Method," 37th Annual Allerton Conference on Communication, 1999.

The Information Bottleneck
$X \to \tilde{X}$
- Find a mapping between $x \in X$ and $\tilde{x} \in \tilde{X}$, characterized by a conditional probability distribution $p(\tilde{x} \mid x)$.
  - For example, if $X$ is the set of words, $\tilde{X}$ is a new representation of words where $|\tilde{X}| < |X|$.
- This mapping induces a soft partitioning of $X$: each $x \in X$ maps to $\tilde{x} \in \tilde{X}$ with probability $p(\tilde{x} \mid x)$.

  $p(\tilde{x}_1 \mid x_1) = 0.8$    $p(\tilde{x}_2 \mid x_1) = 0.2$
  $p(\tilde{x}_1 \mid x_2) = 0.0$    $p(\tilde{x}_2 \mid x_2) = 1.0$
  $p(\tilde{x}_1 \mid x_3) = 0.6$    $p(\tilde{x}_2 \mid x_3) = 0.4$

The Information Bottleneck
$Y$, $X$, $\tilde{X}$
Suppose
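
The soft-partition example from the slides (three elements x1, x2, x3 mapped to two clusters with probabilities 0.8/0.2, 0.0/1.0, and 0.6/0.4) can be checked with a short sketch. The following is only an illustration, not the presenter's or the paper's code: the names `p_xt_given_x`, `is_soft_partition`, and `cluster_marginal` are hypothetical, and the uniform prior over x is an assumption, since the slides do not state one.

```python
# Conditional distribution p(x~ | x) taken from the slide example.
# Keys "xt1" and "xt2" stand for the two clusters x~1 and x~2.
p_xt_given_x = {
    "x1": {"xt1": 0.8, "xt2": 0.2},
    "x2": {"xt1": 0.0, "xt2": 1.0},
    "x3": {"xt1": 0.6, "xt2": 0.4},
}

# ASSUMPTION: a uniform prior p(x). The slides do not give one.
p_x = {x: 1.0 / len(p_xt_given_x) for x in p_xt_given_x}


def is_soft_partition(cond, tol=1e-9):
    """Return True if every row p(. | x) is a valid probability distribution."""
    return all(
        all(p >= 0.0 for p in row.values())
        and abs(sum(row.values()) - 1.0) <= tol
        for row in cond.values()
    )


def cluster_marginal(cond, prior):
    """Compute p(x~) = sum over x of p(x) * p(x~ | x)."""
    marginal = {}
    for x, row in cond.items():
        for xt, p in row.items():
            marginal[xt] = marginal.get(xt, 0.0) + prior[x] * p
    return marginal


p_xt = cluster_marginal(p_xt_given_x, p_x)
print(is_soft_partition(p_xt_given_x))
print(p_xt)
```

In this example x2 is assigned to a single cluster with probability 1.0, which is the hard-clustering special case of a soft assignment, while x1 and x3 are split between both clusters.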


