10-701 Project Final Report

Hierarchical Bayesian Models for Text Classification

Yangbo Zhu
yangboz@cs.cmu.edu

1 Introduction

Hierarchical structure is a natural and effective way of organizing information. Well-known examples include the Dewey Decimal System, the Yahoo! Directory, and computer file systems. We view such a hierarchy as a tree consisting of a root node, some levels of intermediate nodes, and leaf nodes. Suppose the documents to be classified can be fitted into a topic tree. Intuitively, if we know the parents of a leaf node, we can describe the leaf more accurately. We can therefore use hierarchical information to improve the performance of text classification. In this project we propose a new method called General Hierarchical Shrinkage (GHS) and compare it with the original Hierarchical Shrinkage (HS) method and with Naive Bayes.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the GHS algorithm and some technical details. Section 4 presents experimental results and compares GHS with HS and Naive Bayes, along with some preliminary discussion. Section 5 concludes the paper.

2 Related Work

This project is inspired by the Hierarchical Shrinkage method of McCallum (1998). In HS we first train a Naive Bayes model for each class of documents; each class is represented by a leaf node in the topic tree. Given a non-leaf node A, the model of A is the mean of all leaf nodes in the subtree rooted at A, so the model of the root node is the mean of all classes. Furthermore, we add a super-root on top of the original root node with a uniform conditional distribution. After building the tree, we assume the model of each class is a linear combination of all the nodes along the path from its leaf to the super-root. The weights of the linear combination can be optimized with a simple Expectation-Maximization (EM) procedure.

The HS method arises from a general parameter estimator called the shrinkage estimator, or James-Stein estimator, which was discovered by Stein (1956) and later extended by James and Stein (1961). The basic idea of the shrinkage estimator is as follows: when estimating a group of parameters $\theta_1, \ldots, \theta_n$, we can reduce the mean squared error (MSE) by shrinking the estimates $\hat{\theta}_i$ towards their common mean, even if the $\theta_i$ are completely independent. Since this statement contradicts common sense, it is called Stein's paradox. We briefly review it in Section 3.2.

Koller and Sahami (1997) propose another hierarchical method for text classification, called the Pachinko Machine (PM). PM also groups classes into a topic tree and computes the model of each node from all documents belonging to it. However, PM does not combine different nodes to produce mixture models. Instead, it takes a greedy top-down search strategy to locate the best leaf node for a document. The search starts at the root; at each node A, PM picks a sub-branch of A according to certain criteria and repeats this step until it reaches a leaf. The accuracy of the entire process is therefore the product of the accuracies at all levels. For example, if the accuracy at each level is 0.9 and there are three levels, then the final accuracy is $0.9^3 \approx 0.73$.

3 Methods

3.1 Naive Bayes

We assume a document is generated in two steps: first choose a class $c_j$ with probability $P(c_j)$, then generate its bag of words according to the conditional distribution $P(w \mid c_j)$. Based on this assumption, we use the algorithm in Table 6.2 of Mitchell (1997) to train Naive Bayes classifiers. Given a labeled document $d_i$, the probability that it belongs to $c_j$ is $P(c_j \mid d_i) \in \{0, 1\}$. We estimate the prior distribution of class $c_j$ as

$$P(c_j) = \frac{\sum_{i=1}^{|D|} P(c_j \mid d_i)}{|D|}, \qquad (1)$$

where $|D|$ is the number of documents. The conditional distribution is estimated by

$$P(w_t \mid c_j) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\, P(c_j \mid d_i)}, \qquad (2)$$

where $|V|$ is the vocabulary size and $N(w_t, d_i)$ is the term frequency (TF) of $w_t$ in $d_i$. After the classifier is built, we classify a future document as

$$c(d_i) = \arg\max_{c_j} P(c_j \mid d_i) = \arg\max_{c_j} P(c_j) \prod_{w_t \in d_i} P(w_t \mid c_j). \qquad (3)$$
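To make Eqs. (1)-(3) concrete, the following is a minimal Python sketch of multinomial Naive Bayes training and classification with Laplace smoothing. The class and method names are illustrative rather than part of the original report, and log-probabilities replace the raw product of Eq. (3) for numerical stability.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSketch:
    """Illustrative multinomial Naive Bayes with Laplace smoothing (Eqs. 1-3)."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: list of class names for each doc
        self.classes = sorted(set(labels))
        self.vocab = {w for d in docs for w in d}
        n_docs = len(docs)
        # Eq. (1): class prior = fraction of training documents in each class
        self.log_prior = {c: math.log(labels.count(c) / n_docs) for c in self.classes}
        # Eq. (2): smoothed per-class word probabilities
        counts = {c: Counter() for c in self.classes}
        for d, y in zip(docs, labels):
            counts[y].update(d)
        self.log_cond = {}
        for c in self.classes:
            denom = sum(counts[c].values()) + len(self.vocab)
            self.log_cond[c] = defaultdict(
                lambda d=denom: math.log(1.0 / d),  # unseen in-vocabulary word
                {w: math.log((n + 1) / denom) for w, n in counts[c].items()},
            )
        return self

    def predict(self, doc):
        # Eq. (3): argmax over classes of log P(c_j) + sum of log P(w_t | c_j)
        scores = {
            c: self.log_prior[c]
            + sum(self.log_cond[c][w] for w in doc if w in self.vocab)
            for c in self.classes
        }
        return max(scores, key=scores.get)

# Usage:
#   nb = NaiveBayesSketch().fit([["win", "cash"], ["meeting", "notes"]], ["spam", "ham"])
#   nb.predict(["cash", "prize"])  # -> "spam"
```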
3.2 James-Stein Estimator

The James-Stein estimator is simple to state but hard to believe at first glance. Assume we have a group of variables $x_i$, $i = 1, \ldots, n$, which jointly follow a Gaussian distribution $N(\theta, \sigma^2 I)$, where $I$ is the identity matrix. We are interested in estimating the parameter vector $\theta$ based on the observation $x = X$. A natural and intuitive estimate is the maximum likelihood estimate (MLE), $\hat{\theta}_{\mathrm{MLE}} = X$. Stein (1956) demonstrated that, in terms of minimizing the mean squared error $\mathrm{MSE} = E\,\|\hat{\theta} - \theta\|^2$, the James-Stein estimator is better than the MLE. The original James-Stein estimator shrinks towards a prior of $0$ when $n > 2$:

$$\hat{\theta} = \left(1 - \frac{(n-2)\,\sigma^2}{\|X\|^2}\right) X. \qquad (4)$$

Notice that when $n \le 2$ the MLE is the best. A generalized James-Stein estimator can shrink towards a non-zero prior, such as the mean $\bar{X} = \frac{1}{n}\sum_i x_i$, when $n > 3$:

$$\hat{\theta} = \bar{X} + \left(1 - \frac{(n-3)\,\sigma^2}{\|X - \bar{X}\|^2}\right)(X - \bar{X}). \qquad (5)$$

The reason people are shocked by Stein's claim is that each $\hat{\theta}_i$ is affected by all variables in $x$, even if they are completely independent. For example, let $\theta_1$ be the weight of cookies in a given box, $\theta_2$ be the height of Mount Everest, and $\theta_3$ be the speed of light, and assume the results of our measurement $x$ follow the Gaussian distribution described above. The James-Stein estimator still achieves a lower MSE than the maximum likelihood estimator. This means that the expected total MSE is reduced, while the MSE of each individual $\hat{\theta}_i$ could be better or worse.

3.3 General Hierarchical Shrinkage Model

Recall that in the HS method the final model $\hat{\theta}_j$ of class $c_j$ is a linear combination of all the nodes on the path from its leaf to the root, and the model of each intermediate node in the tree is again a linear combination of its children. Therefore $\hat{\theta}_j$ is actually a linear combination of all classes:

$$\hat{\theta}_j = \sum_{k=1}^{C} \lambda_{kj}\, \theta_k, \qquad \sum_{k=1}^{C} \lambda_{kj} = 1, \qquad (6)$$

where $C$ is the number of classes. The weights $\lambda_{kj}$ are constrained by the hierarchical structure. For example, if classes $c_{k_1}$ and $c_{k_2}$ are siblings of $c_j$ (i.e., they share the same parent node with $c_j$), then $\lambda_{k_1 j} = \lambda_{k_2 j}$. Based on this observation, a straightforward generalization of the HS method is to give the $\lambda_{kj}$ more freedom. The maximum freedom is to let them take any non-negative values as long as $\sum_{k=1}^{C} \lambda_{kj} = 1$. As in the HS method, we can still train the weights with the EM algorithm, although the number of weights increases from $C \cdot L$ in HS (where $L$ is the depth of the tree) to $C^2$ in GHS. The training algorithm is very similar to that for HS; a sketch of one EM iteration for these weights is given below. In practice, in order to use all …
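Since the EM update for the mixture weights is only summarized above, here is a minimal Python sketch of one EM iteration for the $\lambda_{kj}$ of Eq. (6), assuming the per-class word distributions $\theta_k$ have already been estimated (e.g., by Naive Bayes) and the weights for class $c_j$ are fitted on held-out word counts for that class. The function name and the use of held-out counts are illustrative assumptions; the exact procedure of McCallum (1998) may differ in its details.

```python
def ghs_em_step(lam_j, thetas, heldout_counts):
    """One EM iteration for the mixture weights lambda_{kj} of Eq. (6).

    lam_j          : list of C current weights for class c_j (sums to 1)
    thetas         : list of C dicts, thetas[k][w] = P(w | c_k)
    heldout_counts : dict mapping word -> count in held-out documents of class c_j
    Returns the updated weight vector (still summing to 1).
    """
    C = len(lam_j)
    expected = [0.0] * C  # expected number of tokens attributed to each component
    total = 0.0
    for w, n in heldout_counts.items():
        # E-step: responsibility of each component k for word w
        probs = [lam_j[k] * thetas[k].get(w, 0.0) for k in range(C)]
        norm = sum(probs)
        if norm == 0.0:
            continue  # word unseen by every component; skip it
        for k in range(C):
            expected[k] += n * probs[k] / norm
        total += n
    # M-step: renormalize the expected counts into new mixture weights
    return [e / total for e in expected] if total > 0 else lam_j

# Usage: iterate ghs_em_step until the weights converge, then form the class
# model as P(w | c_j) = sum_k lam_j[k] * thetas[k].get(w, 0.0)   (Eq. 6).
```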