CS224N Final Project

Probabilistic Smoothing Models for Automatic Text Classification

Sep Kamvar and Carla Pinon

June 6, 2001

Contents

1 Introduction
  1.1 Practical Motivation
  1.2 Theoretical Motivation
  1.3 Outline of Task
2 Task
  2.1 Maximum Likelihood Estimate
  2.2 Bayesian Estimation with Uniform Prior
  2.3 Latent Semantic Indexing
  2.4 Probabilistic Latent Semantic Indexing
  2.5 Second-Order Context Representation
3 Experimental Methods
  3.1 Data
  3.2 Standard Classification
  3.3 Cross-Validation to Determine Optimal Number of Factors
  3.4 Performance Evaluation: Precision-Recall and F-measures
4 Results
5 Conclusions

1 Introduction

1.1 Practical Motivation

We seek to address the task of automatic classification of medical abstracts. Currently, the National Library of Medicine employs human taggers to read and hand-classify medical abstracts in the Medline digital database. This costs the NLM an estimated 1 million dollars per year. While some research is currently being done on automatically assigning MeSH terms to medical abstracts, much of the work in this area uses crude lexical or statistical methods, and we believe that using more sophisticated methods can substantially improve performance.

1.2 Theoretical Motivation

The fundamental problem of statistical learning theory is that of generalization performance (i.e., the error rate on test sets).
For a given learning task with a finite amount of training data, the best generalization performance is achieved when the right balance is struck between accuracy and capacity (where accuracy is defined as the error rate on the training set, and capacity is defined as the ability to learn any training set without error). The problem of too much capacity is often referred to as overfitting, and so the goal of machine learning can be said to be achieving high accuracy on training sets without overfitting. Standard Naive Bayes classifiers suffer from overfitting, which is especially detrimental to performance in light of the sparsity of data in natural language applications. This has led to the introduction of various smoothing methods for use with Naive Bayes classifiers. We argue that the poor performance of Naive Bayes classifiers in comparative evaluations is due to overfitting and to the crude nature of the probabilistic smoothing methods used with these classifiers, rather than to their "naive" independence assumptions. We show that, when used with sophisticated and theoretically sound smoothing techniques, Naive Bayes classifiers can perform competitively with some of the best-performing pattern classification algorithms (such as memory-based learning) for document classification tasks. We use these methods to assign MeSH terms to medical abstracts, and evaluate their performance in comparison with other methods for automatically assigning MeSH terms.

1.3 Outline of Task

We extend our work on the word-sense disambiguation project by developing, implementing, and comparing several different smoothing methods for classification tasks.
We compare a variety of different smoothing methods, including various Bayesian estimators (such as Laplace smoothing), generative models based on reduced-rank matrix approximations (such as truncated Singular Value Decomposition and Probabilistic Latent Semantic Indexing), and more linguistically motivated smoothing methods (such as second-order context representation). We compare the performance of Naive Bayes classifiers using these smoothed training probability distributions to variants of instance-based learning for both word-sense disambiguation and MeSH term indexing.

2 Task

Text classification is one of the more difficult machine learning problems, since it deals with very high-dimensional data sets with arbitrary patterns of missing data. Because of this inherent sparseness of data, choosing good statistical estimators for words is extremely important to the performance of a text classification system. This problem has also been presented and treated under the guise of term weighting, feature selection, and probabilistic smoothing. Here, we explore five different statistical estimators, and evaluate them according to classification performance using Naive Bayes. We explore the following five statistical estimators for word frequencies:

• Maximum Likelihood Estimate
• Bayesian Estimation with a uniform prior (Laplace's law)
• Truncated SVD (Latent Semantic Indexing)
• Aspect Model (Probabilistic Latent Semantic Indexing)
• Second-Order Context Representation

2.1 Maximum Likelihood Estimate

The maximum likelihood estimate uses the relative frequency as a probability estimate for each word. This is the estimate that gives the highest probability to the training corpus. The maximum likelihood estimate is given by:

P_MLE(w) = C(w) / N    (1)

where C(w) is the number of occurrences of word w and N is the total number of word tokens. MLE makes the probability of observed events as high as possible, subject to standard stochastic constraints.
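As a concrete sketch of the estimator in equation (1) — the toy corpus and function name below are purely illustrative, not drawn from our experiments — the MLE distribution falls directly out of token counts:

```python
from collections import Counter

def mle_estimates(tokens):
    """P_MLE(w) = C(w) / N, where C(w) is the count of word w
    and N is the total number of tokens (equation 1)."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in counts.items()}

# Toy training corpus.
tokens = "the cell the gene the cell".split()
probs = mle_estimates(tokens)
# "the" occurs 3 times out of 6 tokens, so probs["the"] is 0.5.
# Any word unseen in training receives probability 0 under MLE,
# which is exactly the sparse-data weakness that smoothing addresses.
```

Inside a Naive Bayes classifier these per-word estimates are multiplied together, so a single zero from an unseen word zeroes out an entire class score; this is why the choice of estimator matters so much here.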
While this will work well where the data is dense, it should not work so well for highly sparse data.

2.2 Bayesian Estimation with Uniform Prior

Bayesian estimation with a uniform prior, also known as Laplace's law, addresses the sparse data problem by assuming a uniform prior on events. It is given by:

P_Lap(w) = (C(w) + 1) / (N + B)    (2)

where B is the number of bins, i.e., the vocabulary size.

2.3 Latent Semantic Indexing

The Singular Value Decomposition is perhaps the most informative matrix decomposition in linear algebra. The factorization is given by

A = U Σ V^T    (3)

where U and V are constrained to be orthogonal matrices, and Σ is constrained to be a diagonal matrix. Every matrix A has an SVD, and its singular values are uniquely determined. The diagonal elements of Σ are called the singular values, and they are generally ordered in decreasing numerical order. One property of the SVD is that the truncated SVD (i.e., setting the smallest n − k singular values to 0) gives the best rank-k approximation to the original matrix in the least-squares (Frobenius-norm) sense.
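As a sketch of the truncation step just described — the small matrix below is a made-up toy example, not data from our task — NumPy's SVD routine makes the rank-k approximation a few lines:

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k approximation of A: compute A = U Sigma V^T
    (equation 3) and keep only the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy term-document-style matrix.
A = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.],
              [0., 2., 2.]])
A_k = truncated_svd(A, k=2)

# By the Eckart-Young theorem, the Frobenius error of the rank-k
# truncation equals the root-sum-of-squares of the discarded
# singular values, and no rank-k matrix approximates A more closely.
s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A_k, "fro")
```

Keeping all singular values (k equal to the full rank) reconstructs A exactly; dropping the small ones discards the "noise" directions, which is the smoothing effect LSI exploits.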