CS224N Final Project
Probabilistic Smoothing Models for Automatic Text Classification
Sep Kamvar and Carla Pinon
June 6, 2001

Contents

1 Introduction
  1.1 Practical Motivation
  1.2 Theoretical Motivation
  1.3 Outline of Task
2 Task
  2.1 Maximum Likelihood Estimate
  2.2 Bayesian Estimation with Uniform Prior
  2.3 Latent Semantic Indexing
  2.4 Probabilistic Latent Semantic Indexing
  2.5 Second-Order Context Representation
3 Experimental Methods
  3.1 Data
  3.2 Standard Classification
  3.3 Cross-Validation to Determine Optimal Number of Factors
  3.4 Performance Evaluation: Precision-Recall and F-measures
4 Results
5 Conclusions

1 Introduction

1.1 Practical Motivation

We seek to address the task of automatic classification of medical abstracts. Currently, the National Library of Medicine employs human taggers to read and hand-classify medical abstracts in the Medline digital database. This costs the NLM an estimated one million dollars per year. While some research is currently being done on automatically assigning MeSH terms to medical abstracts, much of the work in this area uses crude lexical or statistical methods, and we believe that more sophisticated methods can substantially improve performance.

1.2 Theoretical Motivation

The fundamental problem of statistical learning theory is generalization performance (i.e., the error rate on test sets). For a given learning task with a finite amount of training data, the best generalization performance is achieved when the right balance is struck between accuracy and capacity (where accuracy is defined as the error rate on the training set, and capacity as the ability to learn any training set without error). The problem of too much capacity is often referred to as overfitting, so the goal of machine learning can be stated as achieving high accuracy on training sets without overfitting. Standard Naive Bayes classifiers suffer from overfitting, which is especially detrimental to performance in light of the sparsity of data in natural language applications. This has led to the introduction of various smoothing methods for use with Naive Bayes classifiers.

We argue that the poor performance of Naive Bayes classifiers in comparative evaluations is due to overfitting and to the crude probabilistic smoothing methods used with them, rather than to the "naive" independence assumptions of these classifiers. We show that, when used with sophisticated and theoretically sound smoothing techniques, Naive Bayes classifiers can perform competitively with some of the best-performing pattern classification algorithms (such as memory-based learning) for document classification tasks.
We use these methods to assign MeSH terms to medical abstracts, and evaluate their performance in comparison to other methods for automatically assigning MeSH terms.

1.3 Outline of Task

We extend our work on the word sense disambiguation project by developing, implementing, and comparing several different smoothing methods for classification tasks. We compare a variety of smoothing methods, including Bayesian estimators (such as Laplace smoothing), generative models based on reduced-rank matrix approximations (such as truncated Singular Value Decomposition and Probabilistic Latent Semantic Indexing), and more linguistically motivated smoothing methods (such as second-order context representation). We compare the performance of Naive Bayes classifiers using these smoothed training probability distributions to variants of instance-based learning for both word-sense disambiguation and MeSH term indexing.

2 Task

Text classification is one of the more difficult machine learning problems, since it deals with very high-dimensional data sets with arbitrary patterns of missing data. Because of this inherent sparseness of data, choosing good statistical estimators for words is extremely important to the performance of a text classification system. The same problem has also been presented and treated under the guise of term weighting, feature selection, and probabilistic smoothing. Here, we explore five different statistical estimators and evaluate them according to classification performance using Naive Bayes. We explore the following five statistical estimators for word frequencies:

• Maximum Likelihood Estimate
• Bayesian Estimation with a uniform prior (Laplace's law)
• Truncated SVD (Latent Semantic Indexing)
• Aspect Model (Probabilistic Latent Semantic Indexing)
• Second-Order Context Representation

2.1 Maximum Likelihood Estimate

The maximum likelihood estimate uses the relative frequency as the probability estimate for each word. This is the estimate which gives the highest probability to the training corpus. The maximum likelihood estimate is given by:

    P_MLE(w) = C(w) / N    (1)

where C(w) is the number of times word w occurs in the training corpus and N is the total number of word tokens. MLE makes the probability of observed events as high as possible subject to the standard stochastic constraints. While this works well where the data is dense, it should not work so well for highly sparse data.

2.2 Bayesian Estimation with Uniform Prior

Bayesian estimation with a uniform prior, also known as Laplace's law, addresses the sparse data problem by assuming a uniform prior over events. It is given by:

    P_Lap(w) = (C(w) + 1) / (N + B)    (2)

where B is the number of distinct events (here, the vocabulary size).

2.3 Latent Semantic Indexing

The Singular Value Decomposition is perhaps the most informative matrix decomposition in linear algebra. The factorization is given by

    A = U Σ V^T    (3)

where U and V are constrained to be orthogonal matrices, and Σ is constrained to be a diagonal matrix. Every matrix A has an SVD, and its singular values are uniquely determined. The diagonal elements of Σ are called the singular values, and they are conventionally ordered in decreasing numerical order. One property of the SVD is that the truncated SVD (i.e., setting the smallest n - k singular values to 0) is the best rank-k approximation to the original matrix in the [...]
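To make equations (1) and (2) concrete, the following sketch computes both estimators from a toy token list. It is a minimal illustration rather than the implementation used in the project; the function names, the toy corpus, and the vocabulary are assumptions made purely for the example.

```python
from collections import Counter

def mle_estimate(tokens):
    """Equation (1): P_MLE(w) = C(w) / N, the relative frequency of each observed word."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in counts.items()}

def laplace_estimate(tokens, vocabulary):
    """Equation (2): P_Lap(w) = (C(w) + 1) / (N + B), add-one smoothing over B vocabulary items."""
    counts = Counter(tokens)
    n, b = len(tokens), len(vocabulary)
    return {w: (counts[w] + 1) / (n + b) for w in vocabulary}

# Hypothetical toy data: MLE assigns zero probability to the unseen word "gene",
# while the Laplace estimate reserves some probability mass for it.
tokens = ["cell", "tumor", "cell", "therapy"]
vocab = {"cell", "tumor", "therapy", "gene"}
print(mle_estimate(tokens))             # cell 0.5, tumor 0.25, therapy 0.25
print(laplace_estimate(tokens, vocab))  # cell 0.375, tumor 0.25, therapy 0.25, gene 0.125
```

The contrast between the last two lines is the sparse-data problem in miniature: under MLE, an abstract containing any word unseen in a class's training data would receive probability zero for that class, whereas the uniform prior keeps every class's score finite.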

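The truncated SVD of Section 2.3 can likewise be sketched with numpy. The post-processing shown here, clipping negative entries of the rank-k reconstruction and renormalizing each column into a word distribution, is an assumption of this sketch; the preview does not show how the reconstructed matrix is turned back into probabilities.

```python
import numpy as np

def truncated_svd_smoothing(A, k):
    """Rank-k approximation of a term-document count matrix A via the truncated SVD,
    renormalized so that each column is a probability distribution over terms."""
    # Thin SVD: A = U diag(s) V^T, with s in decreasing order (equation (3)).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values and zero out the rest.
    s_trunc = np.zeros_like(s)
    s_trunc[:k] = s[:k]
    A_k = U @ np.diag(s_trunc) @ Vt
    # The reconstruction may contain small negative values; clip before normalizing
    # (an assumption of this sketch, not a step stated in the paper).
    A_k = np.clip(A_k, 0.0, None)
    return A_k / A_k.sum(axis=0, keepdims=True)

# Hypothetical 5-term x 4-document count matrix.
A = np.array([[3., 0., 1., 0.],
              [2., 1., 0., 0.],
              [0., 4., 3., 1.],
              [0., 0., 2., 5.],
              [1., 0., 0., 2.]])
P = truncated_svd_smoothing(A, k=2)
print(P.round(3))  # smoothed word distributions, one column per document
```

The choice of k trades capacity against smoothing, which is why Section 3.3 cross-validates to determine the optimal number of factors.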

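Finally, whichever estimator is chosen, the smoothed per-class word distributions plug into the standard Naive Bayes decision rule used for classification. The scorer below is a generic multinomial Naive Bayes sketch, not the authors' code; the probability floor for out-of-vocabulary words is an assumption added so the logarithm stays defined.

```python
import math

def naive_bayes_classify(doc_tokens, class_priors, class_word_probs):
    """Return the class c maximizing log P(c) + sum over words w in the document of log P(w | c).

    class_priors:     dict mapping class -> prior probability P(c)
    class_word_probs: dict mapping class -> smoothed distribution P(w | c), e.g. the output
                      of laplace_estimate or a column of truncated_svd_smoothing above
    """
    best_class, best_score = None, float("-inf")
    for c, prior in class_priors.items():
        score = math.log(prior)
        for w in doc_tokens:
            # A smoothed estimator should already give unseen words nonzero mass;
            # the floor below only guards against genuinely out-of-vocabulary tokens.
            score += math.log(class_word_probs[c].get(w, 1e-12))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```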