10-701 Project Final Report
Hierarchical Bayesian Models for Text Classification
Yangbo Zhu (yangboz@cs.cmu.edu)

1 Introduction

Hierarchical structure is a natural and effective way of organizing information. Well-known examples include the Dewey Decimal System, the Yahoo! Directory, and computer file systems. We view such a hierarchy as a tree consisting of a root node, some levels of intermediate nodes, and leaf nodes. Suppose the documents to be classified fit into such a topic tree. Intuitively, if we know the parents of a leaf node, we can describe the leaf more accurately; we can therefore use hierarchical information to improve the performance of text classification. In this project we propose a new method called General Hierarchical Shrinkage (GHS) and compare it with the original Hierarchical Shrinkage (HS) method and with Naive Bayes.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the GHS algorithm and some technical details. Section 4 presents experimental results and compares GHS with HS and Naive Bayes, along with some preliminary discussion. Section 5 concludes the paper.

2 Related Work

This project is inspired by the Hierarchical Shrinkage method of McCallum (1998). In HS, we first train a Naive Bayes model for each class of documents; each class is represented by a leaf node in the topic tree. Given a non-leaf node $A$, the model of $A$ is the mean of the models of all leaf nodes in the subtree rooted at $A$, so the model of the root node is the mean of all classes. Furthermore, a super-root is added above the original root node, with a uniform conditional distribution. After building the tree, we assume the model of each class is a linear combination of all the nodes along the path from the leaf to the super-root. The weights of this linear combination can be optimized with a simple Expectation-Maximization (EM) procedure.

The HS method arises from a general parameter estimator called the shrinkage (or James-Stein) estimator, which was discovered by Stein (1956) and later extended by James and Stein (1961). The basic idea of the shrinkage estimator is as follows: when estimating a group of parameters $\theta_1, \ldots, \theta_n$, we can reduce the mean squared error (MSE) by shrinking each estimate $\hat\theta_i$ towards their mean $\bar\theta$, even if the $\theta_i$ are completely independent. Because this statement contradicts common intuition, it is known as Stein's Paradox. We briefly review it in Section 3.2.

Koller and Sahami (1997) propose another hierarchical method for text classification, called the Pachinko Machine (PM). PM also groups classes into a topic tree and computes the model of each node from all documents belonging to it. However, PM does not combine different nodes into mixture models. Instead, it uses a greedy top-down search to locate the best leaf node for a document: the search starts at the root, and at each node $A$, PM picks one sub-branch of $A$ according to certain criteria, repeating until it reaches a leaf. The accuracy of the entire process is therefore the product of the accuracies at all levels. For example, if the accuracy at each level is 0.9 and there are three levels, the final accuracy is $0.9^3 \approx 0.73$.

3 Methods

3.1 Naive Bayes

We assume a document is generated in two steps: first choose a class $c_j$ with probability $P(c_j)$, then generate its bag of words according to the conditional distribution $P(w \mid c_j)$. Based on this assumption, we use the algorithm in Table 6.2 of Mitchell (1997) to train Naive Bayes classifiers. For a labeled document $d_i$, the probability that it belongs to $c_j$ is $P(c_j \mid d_i) \in \{0, 1\}$. We estimate the prior distribution of class $c_j$ by

    P(c_j) = \frac{1}{|D|} \sum_{i=1}^{|D|} P(c_j \mid d_i)    (1)

where $|D|$ is the number of documents. The conditional distribution is estimated by

    P(w_t \mid c_j) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i) \, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i) \, P(c_j \mid d_i)}    (2)

where $|V|$ is the vocabulary size and $N(w_t, d_i)$ is the term frequency (TF) of $w_t$ in $d_i$. After the classifier is built, we classify future documents as

    c(d_i) = \arg\max_{c_j} P(c_j \mid d_i) = \arg\max_{c_j} P(c_j) \prod_{w_t \in d_i} P(w_t \mid c_j)    (3)
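To make equations (1)-(3) concrete, the following is a minimal sketch of the training and classification steps in Python, assuming documents are represented as lists of tokens with integer class labels; the function names and data structures are illustrative choices, not anything prescribed by the report.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, num_classes):
    """Estimate the class priors and word distributions of equations (1)-(2).

    docs   : list of documents, each a list of word tokens
    labels : list of integer class indices (one per document), so that
             P(c_j | d_i) is 1 for the labeled class and 0 otherwise
    """
    vocab = {w for d in docs for w in d}
    V, D = len(vocab), len(docs)

    # Equation (1): the prior of c_j is the fraction of documents labeled c_j.
    priors = [sum(1 for y in labels if y == j) / D for j in range(num_classes)]

    # Equation (2): Laplace-smoothed class-conditional word probabilities.
    cond = []
    for j in range(num_classes):
        counts = Counter()
        for d, y in zip(docs, labels):
            if y == j:
                counts.update(d)            # accumulates N(w_t, d_i) over class j
        total = sum(counts.values())
        cond.append({w: (1 + counts[w]) / (V + total) for w in vocab})

    return priors, cond, vocab

def classify(doc, priors, cond, vocab):
    """Equation (3): return argmax_j of P(c_j) * prod_{w in doc} P(w | c_j)."""
    best_j, best_score = 0, float("-inf")
    for j, prior in enumerate(priors):
        if prior == 0.0:
            continue
        # Sum of logs instead of a product of probabilities, to avoid underflow.
        score = math.log(prior) + sum(math.log(cond[j][w]) for w in doc if w in vocab)
        if score > best_score:
            best_j, best_score = j, score
    return best_j
```

Working in log space changes nothing mathematically; it only avoids floating-point underflow for long documents, which the product form of equation (3) would otherwise suffer from.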
3.2 James-Stein Estimator

The James-Stein estimator is simple to state but hard to believe at first glance. Assume we have a group of variables $x_i$, $i = 1, \ldots, n$, following the Gaussian distribution $N(\theta, \sigma^2 I)$, where $I$ is the identity matrix. We are interested in estimating the parameter vector $\theta$ from the observation $x = X$. A natural and intuitive estimate is the maximum likelihood estimate (MLE), $\hat\theta = X$. Stein (1956) demonstrated that, in terms of minimizing the mean squared error (MSE) $E\|\hat\theta - \theta\|^2$, the James-Stein estimator is better than the MLE. The original James-Stein estimator shrinks towards a prior of 0 when $n > 2$:

    \hat\theta = \left(1 - \frac{(n-2)\,\sigma^2}{\|X\|^2}\right) X    (4)

Notice that when $n \le 2$, the MLE is the best. A generalized James-Stein estimator can shrink towards a non-zero prior, such as the mean $\bar{X} = \frac{1}{n}\sum_i x_i$, when $n > 3$:

    \hat\theta = \bar{X} + \left(1 - \frac{(n-3)\,\sigma^2}{\|X - \bar{X}\|^2}\right)(X - \bar{X})    (5)

The reason people are shocked by Stein's claim is that each $\hat\theta_i$ is affected by all variables in $x$, even if they are completely independent. For example, let $\theta_1$ be the weight of cookies in a given box, $\theta_2$ the height of Mount Everest, and $\theta_3$ the speed of light, and assume the results of our measurements $x$ follow the Gaussian distribution described above. The James-Stein estimator still achieves lower MSE than the maximum likelihood estimator. Note that it is the expected total MSE that is reduced; the MSE of each individual $\theta_i$ could be better or worse.

3.3 General Hierarchical Shrinkage Model

Recall that in the HS method the final model $\hat\theta_j$ of class $c_j$ is a linear combination of all the nodes on the path from the leaf to the root, and the model of each intermediate node in the tree is again a linear combination of its children. Therefore $\hat\theta_j$ is actually a linear combination of all classes:

    \hat\theta_j = \sum_{k=1}^{|C|} \lambda_{kj} \theta_k, \qquad \sum_{k=1}^{|C|} \lambda_{kj} = 1    (6)

where $|C|$ is the number of classes. The weights $\lambda_{kj}$ are constrained by the hierarchical structure. For example, if classes $c_{k_1}$ and $c_{k_2}$ are siblings of $c_j$ (i.e., they share the same parent node as $c_j$), then $\lambda_{k_1 j} = \lambda_{k_2 j}$.

Based on this observation, a straightforward generalization of the HS method is to give the $\lambda_{kj}$ more freedom. The maximum freedom is to allow them to take any non-negative values as long as $\sum_{k=1}^{|C|} \lambda_{kj} = 1$. As in the HS method, we can still train the weights with the EM algorithm, although the number of weights increases from $|C| \cdot L$ in HS (where $L$ is the depth of the tree) to $|C|^2$ in GHS. The training algorithm is very similar to that for HS. In practice, in order to use all …
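The preview ends here. Stein's claim in Section 3.2 is nevertheless easy to check empirically; below is a small simulation sketch (not part of the original report) comparing the total squared error of the MLE with that of the generalized James-Stein estimator of equation (5). The parameter values, noise level, random seed, and trial count are arbitrary illustrative choices, and the noise scale σ is assumed known, as in Section 3.2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters for a handful of unrelated quantities.
theta = np.array([1.0, 2.0, 0.5, 1.5, 3.0])
n = len(theta)            # n > 3, so the generalized estimator of equation (5) applies
sigma = 1.0               # measurement noise, assumed known

trials = 20000
mse_mle = mse_js = 0.0
for _ in range(trials):
    X = theta + sigma * rng.standard_normal(n)     # one noisy measurement per theta_i

    # Maximum likelihood estimate: simply report the observations.
    mle = X

    # Generalized James-Stein estimator, equation (5): shrink towards the sample mean.
    X_bar = X.mean()
    shrink = 1.0 - (n - 3) * sigma**2 / np.sum((X - X_bar) ** 2)
    js = X_bar + shrink * (X - X_bar)

    mse_mle += np.sum((mle - theta) ** 2)
    mse_js += np.sum((js - theta) ** 2)

print("average total squared error, MLE        :", mse_mle / trials)
print("average total squared error, James-Stein:", mse_js / trials)
```

As the report emphasizes, only the total squared error summed over all coordinates is reduced in expectation; individual coordinates can come out better or worse, and the size of the gain depends on how closely the true parameters cluster around their common mean.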

