Document Clustering Using Word Clusters via the Information Bottleneck Method
Noam Slonim and Naftali Tishby
Conference on Research and Development in Information Retrieval (SIGIR), 2000
Presented by Bret Ehlert, May 14, 2002

Document Clustering
• Document clustering is closely related to text classification.

Traditional Clustering Methods
• Represent a document as a vector of weights for the terms that occur in the document.

            w1    w2    w3   ...  w124080  w124081  ...  wordn
    doc1:   0.0   0.75  0.0  ...  0.0      0.13     ...  0.0
    doc2:   0.6   0.21  0.0  ...  0.36     0.0      ...  0.0

• This representation has many disadvantages:
  – High dimensionality
  – Sparseness
  – Loss of word ordering information
• Clustering documents using the distances between pairs of vectors is troublesome.
  – The Information Bottleneck is an alternative method that does not rely on vector distances.

Dimensionality Reduction
• Dimensionality reduction improves both the accuracy and the efficiency of document clustering. Common techniques include:
  – Latent semantic indexing (LSI)
  – Information gain and mutual information measures
  – Chi-squared statistic
  – Term strength algorithm
  – Distributional clustering
    • Cluster words based on their distribution across documents
    • The Information Bottleneck is a distributional clustering method

The Information Bottleneck
• A distributional clustering method
  – Used to cluster words, reducing the dimensionality of document representations.
  – Used to cluster documents.
• The agglomerative algorithm presented in the paper is a special case of a general approach:
  – Tishby, Pereira, and Bialek. The Information Bottleneck Method. 37th Annual Allerton Conference on Communication, Control, and Computing, 1999.

The Information Bottleneck: X → X̃
• Find a mapping between x ∈ X and x̃ ∈ X̃, characterized by a conditional probability distribution p(x̃ | x).
  – For example, if X is the set of words, X̃ is a new representation of words where |X̃| < |X|.
• This mapping induces a soft partitioning of X: each x ∈ X maps to x̃ ∈ X̃ with probability p(x̃ | x).
  – Example: p(x̃1 | x1) = 0.8, p(x̃2 | x1) = 0.2; p(x̃1 | x2) = 0.0, p(x̃2 | x2) = 1.0; p(x̃1 | x3) = 0.6, p(x̃2 | x3) = 0.4

The Information Bottleneck: Y → X → X̃
• Suppose the variable X is an observation of Y, where Y is the variable of interest.
  – x ∈ X is evidence concerning the outcome y ∈ Y.
  – For example, x ∈ X is a word and y ∈ Y is a document.
• We want the mapping from x ∈ X to x̃ ∈ X̃ to preserve as much information about Y as possible.

Entropy
• Entropy measures the uncertainty about a discrete random variable X:
    H(X) = -Σ_{x∈X} p(X = x) log2 p(X = x)
• Entropy defines the lower bound on the number of bits needed to accurately encode X.
• The conditional entropy of X given Y describes the amount of remaining uncertainty about X after an observation of Y:
    H(X | Y) = E_y[H(X | y)] = -Σ_x Σ_y p(x, y) log p(x | y)
• Relative entropy, or Kullback-Leibler (KL) distance, measures the distance between two distributions:
    D_KL(p, q) = Σ_x p(x) log [p(x) / q(x)]

Mutual Information
• The mutual information of X and Y measures the amount of uncertainty about X that is resolved by observing Y:
    I(X, Y) = H(X) - H(X | Y) = Σ_x Σ_y p(x, y) log [p(x, y) / (p(x) p(y))]
• This is also the relative entropy between the joint distribution of X and Y and the product of the marginal distributions of X and Y.
• Note that I(X, Y) = I(Y, X).

Information Theory Examples
Examples using two different joint probability distributions p(x, y).

Distribution 1:
          x1     x2
    y1    0.25   0.25
    y2    0.25   0.25

  H(X) = -0.5 log 0.5 - 0.5 log 0.5 = 1.0
  H(Y) = 1.0
  H(X | Y) = -0.25 log 0.5 - 0.25 log 0.5 - 0.25 log 0.5 - 0.25 log 0.5 = 1.0
  H(Y | X) = 1.0
  I(X, Y) = H(X) - H(X | Y) = 1.0 - 1.0 = 0

Distribution 2:
          x1     x2
    y1    0.75   0.01
    y2    0.05   0.19

  H(X) = -0.8 log 0.8 - 0.2 log 0.2 = 0.72
  H(Y) = -0.76 log 0.76 - 0.24 log 0.24 = 0.79
  H(X | Y) = -0.75 log 0.98 - 0.01 log 0.01 - 0.05 log 0.21 - 0.19 log 0.79 = 0.25
  H(Y | X) = -0.75 log 0.94 - 0.01 log 0.05 - 0.05 log 0.06 - 0.19 log 0.95 = 0.32
  I(X, Y) = H(X) - H(X | Y) = 0.72 - 0.25 = 0.47
  I(X, Y) = H(Y) - H(Y | X) = 0.79 - 0.32 = 0.47
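These quantities are easy to verify numerically. The following is a minimal sketch (not from the paper; the helper function and variable names are my own) that recomputes the entropies and mutual information of the second joint distribution directly from the definitions above, using the chain rule H(X | Y) = H(X, Y) - H(Y):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Second joint distribution from the example (rows are y1, y2; columns are x1, x2).
joint = np.array([[0.75, 0.01],
                  [0.05, 0.19]])

p_x = joint.sum(axis=0)            # p(x) = [0.80, 0.20]
p_y = joint.sum(axis=1)            # p(y) = [0.76, 0.24]

h_x = entropy(p_x)                 # H(X)   ~ 0.72
h_y = entropy(p_y)                 # H(Y)   ~ 0.79
h_xy = entropy(joint.ravel())      # H(X,Y) ~ 1.05
h_x_given_y = h_xy - h_y           # H(X|Y) ~ 0.25
mi = h_x - h_x_given_y             # I(X,Y) ~ 0.47  (also H(Y) - H(Y|X))

print(round(h_x, 2), round(h_y, 2), round(h_x_given_y, 2), round(mi, 2))
```

Running the same computation on the uniform joint distribution of the first example gives I(X, Y) = 0, as on the slide: knowing Y tells us nothing about X.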
Lossy Compression
• We want to transmit signal X using a compressed version X̃.
  – Lower the bit-rate R (compression) by sending X̃ with some acceptable distortion D.
• Two compression problems:
  – Given a maximum rate R, minimize the distortion D.
  – Given an acceptable amount of distortion D, minimize the rate R.
• Solve both simultaneously using an unconstrained Lagrangian cost function:
    J = D + λR
• The Information Bottleneck method measures distortion using I(Y, X̃) and compression using I(X, X̃).

Rate and Distortion Using Mutual Information
• Maximizing I(Y, X̃) minimizes distortion.
  – Recall that X is an observation of the variable of interest, Y.
  – I(Y, X̃) measures the quality of the compression.
• Minimizing I(X, X̃) maximizes compression.
  – Recall that I(X, X̃) = H(X) - H(X | X̃).
  – H(X) is constant, so minimizing I(X, X̃) maximizes H(X | X̃).
  – H(X | X̃) is the minimum additional number of bits needed on average to represent x ∈ X after observing x̃ ∈ X̃.
• (Diagram: H(X), the total number of bits needed to encode X, splits into I(X, X̃), the number of bits transmitted, and H(X | X̃), the remaining number of bits needed to encode X.)

The Information Bottleneck Cost Function
Minimize the Lagrangian cost function:
    L = I(X, X̃) - β I(Y, X̃)
where the first term is minimized (compression) and the second term is maximized (preserved information).
• β is the tradeoff between maximizing compression and minimizing distortion.
  – If β = 0, every x ∈ X maps to every x̃ ∈ X̃ with the same probability.
    • No information is being transmitted.
  – In the limit β → ∞, I(Y, X̃) is maximized, given the capacity of X̃.
    • This results in hard clustering.
• Note that to find a solution we must fix the cardinality of X̃.

Solving the Cost Function
Minimize the Lagrangian cost function:
    L = I(X, X̃) - β I(Y, X̃)
• Remember, we want to find a mapping between x ∈ X and x̃ ∈ X̃, characterized by a conditional probability distribution p(x̃ | x).
• Solve for p(x̃ | x) by setting ∂L / ∂p(x̃ | x) = 0.

The Solution
    p(x̃ | x) = p(x̃) exp(-β Σ_y p(y | x) log [p(y | x) / p(y | x̃)]) / Σ_x̃ p(x̃) exp(-β Σ_y p(y | x) log [p(y | x) / p(y | x̃)])

Equivalently:
    p(x̃ | x) = [p(x̃) / Z(x, β)] exp(-β D_KL[p(y | x), p(y | x̃)])
where Z(x, β) is the normalization over all x̃, β is the tradeoff parameter, and D_KL[p(y | x), p(y | x̃)] is the relative entropy between p(y | x) and p(y | x̃).

Equations for p(x̃) and p(y | x̃)
We also need equations for p(x̃) and p(y | x̃).
    D_KL[p(y | x), p(y | x̃)] = ...
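Below is a minimal sketch of how the self-consistent equations above can be iterated for a fixed cardinality of X̃ and a fixed β. This is only an illustration of the soft solution, not the agglomerative (hard-merging) algorithm the paper actually uses for document clustering. The complementary updates it assumes, p(x̃) = Σ_x p(x) p(x̃ | x) and p(y | x̃) = (1/p(x̃)) Σ_x p(y | x) p(x̃ | x) p(x), follow from the Markov chain Y → X → X̃; all function and variable names are my own.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Relative entropy D_KL(p || q) in bits between two distributions over Y."""
    return np.sum(p * np.log2((p + eps) / (q + eps)))

def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Iterate the self-consistent equations for p(xt|x), p(xt) and p(y|xt).

    p_xy is the joint distribution p(x, y), shape (|X|, |Y|); 'xt' stands for
    x-tilde.  Returns the soft assignment matrix p(xt|x)."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                    # p(x)
    p_y_given_x = p_xy / p_x[:, None]         # p(y|x)

    # Random soft initialisation of p(xt|x).
    q = rng.random((p_xy.shape[0], n_clusters))
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_xt = q.T @ p_x                                    # p(xt) = sum_x p(x) p(xt|x)
        p_y_given_xt = (q * p_x[:, None]).T @ p_y_given_x   # sum_x p(xt|x) p(x) p(y|x)
        p_y_given_xt /= p_xt[:, None]                       # ... / p(xt) = p(y|xt)

        # p(xt|x) = [p(xt) / Z(x, beta)] exp(-beta * D_KL[p(y|x) || p(y|xt)])
        d = np.array([[kl(p_y_given_x[i], p_y_given_xt[t]) for t in range(n_clusters)]
                      for i in range(p_xy.shape[0])])
        q = p_xt[None, :] * np.exp(-beta * d)
        q /= q.sum(axis=1, keepdims=True)                   # normalisation Z(x, beta)
    return q

# Tiny illustrative joint distribution: rows are words x, columns are documents y.
p_xy = np.array([[0.20, 0.05, 0.00],
                 [0.15, 0.10, 0.00],
                 [0.00, 0.05, 0.20],
                 [0.05, 0.00, 0.20]])
print(information_bottleneck(p_xy, n_clusters=2, beta=20.0))
```

Consistent with the cost-function slide, a small β leaves the assignments p(x̃ | x) close to uniform (little information is transmitted), while a large β drives them toward a hard clustering of the words.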

