CS224N: Final Programming Project
Three Pandas: Laura Back & Alexei Kosut
June 8, 2001

Abstract

This project explores the problem of categorizing documents into predetermined groups, based on rules learned from a training set of categorized documents. Our goal was to explore two ways that keyword recognition can be used to define rules: by associating the presence of a particular keyword with a specific category, and by associating a high density of a keyword with a category. Our work explored the questions of whether one of these models performed better than the other and of whether a hybrid model could achieve the best performance of all.

1 Introduction

The problem of document categorization involves teaching a machine to sort texts into a predetermined set of categories based on their content. The categories and texts should be such that a human reader who understands the subject matter can easily do the categorization. Most work in the field, rather than attempting to instill some high-level understanding of content in the machine, looks for heuristics that allow the categorization to be done based on simpler measures. One of the simplest measures is keyword recognition: associating words with particular categories based on information extracted from a training set of documents, and then using the words in a document to be categorized to decide which set of training documents it best fits into.

Much of the work in keyword-based document categorization focuses on developing statistical models that represent the information found in the training set and provide a framework for applying information found in the unknown document. Our research took a different angle, looking not at the statistical model for handling information but at the nature of the extracted information itself. A simple type of information to glean from keywords is binary: whether a given word tends to be present in a particular category of documents. A slightly more complicated approach uses measures of word frequency within documents, matching the frequency of a word in an unknown document with its frequency in documents of each category. Our project compares the simple binary approach with a frequency-based approach. The frequency-based approach we use is a simplified one: it recognizes words that tend to have particularly high frequency in a certain category, and chooses that category when those words have high frequency in the unknown document. (We do not test an approach which tracks the typical frequency of a word in each category and then attempts to find the closest match, rather than just matching high frequencies with each other.)

We compared these approaches with a program that tests the binary (presence-based) approach against the frequency-based approach by learning rules of one of two types. In the first case, rules say "If word x appears in a document, that indicates that the document has category y." In the second, they say "If word x appears with high frequency in a document, that document has category y." We also implemented an attempt to combine these two types of rules by making rules of the first type for some words and of the second type for others, as appropriate. Our hope was that the hybrid implementation would prove more effective than the implementations that used a single strategy.
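To make the two rule types concrete, the following is a minimal Python sketch of how learned rules might be applied to an unknown document. It is not the project's actual code: the function name categorize, the vote-counting way of combining multiple firing rules, and the frequency cutoff are all illustrative assumptions, since the report does not specify them.

    from collections import Counter

    # Illustrative cutoff for "high frequency"; the report does not give
    # the value actually used.
    HIGH_FREQ_THRESHOLD = 0.01  # word must make up at least 1% of the tokens

    def categorize(tokens, presence_rules, frequency_rules):
        """Pick the category whose rules fire most often on this document.

        tokens          -- list of words in the unknown document
        presence_rules  -- dict mapping word -> category (rule type 1)
        frequency_rules -- dict mapping word -> category (rule type 2)
        """
        counts = Counter(tokens)
        total = len(tokens)
        votes = Counter()

        # Type 1: "If word x appears in a document, that indicates that
        # the document has category y."
        for word, category in presence_rules.items():
            if counts[word] > 0:
                votes[category] += 1

        # Type 2: "If word x appears with high frequency in a document,
        # that document has category y."
        for word, category in frequency_rules.items():
            if total > 0 and counts[word] / total >= HIGH_FREQ_THRESHOLD:
                votes[category] += 1

        return votes.most_common(1)[0][0] if votes else None

For example, with presence_rules = {"touchdown": "sports"} and frequency_rules = {"bank": "finance"}, a nature article that mentions "bank" once casts no finance vote, while an article dominated by the word would. A single-strategy run corresponds to passing an empty dict for the other rule type.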
2 Motivation

This work was motivated by a desire to take a new approach to keyword-based categorization, exploring the nature of the keyword usage itself. We considered the simple presence-based approach and wanted to explore other information that might be gleaned from keywords to help categorize a document.

After a presence-based measure, frequency occurred to us as another obvious property of keywords that can be quantified, so we wondered if frequency measures could play any role in document categorization. Intuitively, it seemed as though they should, and we thought of several anecdotal cases where they might. In the case of ambiguous word sense, for instance, two meanings of a word might appear in entirely different categories of documents. It seemed likely to us that different senses, each occurring where it was relevant, would tend to occur with different frequencies.

As an example, take the classic "bank": a financial institution, or the side of a river. A financial article may make heavy use of the term, whereas it seemed feasible to us that a nature article might mention the bank of a river but would be unlikely to focus the article around the bank, and therefore unlikely to use the term repeatedly. Other words that seemed likely to have such properties would be those with both a topic-specific and a non-topic-related meaning: for instance, the word "general" in a military sense, and its generic use elsewhere. It seemed unlikely that the generic use would ever occur with high frequency, while the military use might in an article on a relevant topic.

We chose the simplest form of frequency check (high vs. low) rather than a more complicated analysis because we believed that the simple check should be sufficient to show us whether frequency measures were in fact useful; a more complicated model could then be implemented in an attempt to improve on the original results.

3 Our algorithm

Since our goal was to test the type of information extracted, rather than the statistical model used to handle the information, we tested all approaches using a simplistic statistical model chosen primarily for its straightforward implementation. We assume that a more intelligent model would give improved performance for all approaches.

To make the rules that we used to categorize documents, we simply calculated, from the training set, the probability of a word appearing in documents of a given type and its average frequency when it did appear. From this we extracted rules as follows (a code sketch of the extraction appears below):

(i) if a word appeared with high probability in documents of a certain category and low probability in documents of other categories, we made a presence-based rule.

(ii) if a word appeared with at least moderate probability in documents of a certain category, had high average frequency when it appeared in that category, and had low average frequency when it appeared in documents of other categories, we made a frequency-based rule.

(iii) in the hybrid implementation, if a word was a candidate for both types of rule, we made a presence-based rule.

The reasoning behind (i) should be straightforward, but (ii) and (iii) required a few design decisions that merit explanation. First, in
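The extraction in (i)-(iii) can be sketched as follows, continuing the illustrative Python above. This is a reconstruction rather than the project's code: the helper extract_rules and the threshold constants for "high", "moderate", and "low" probability and frequency are hypothetical placeholders, since the excerpt does not give the values the project used.

    from collections import Counter, defaultdict

    # Hypothetical thresholds; the report does not specify the actual values.
    HIGH_PROB, MODERATE_PROB, LOW_PROB = 0.5, 0.25, 0.1
    HIGH_FREQ, LOW_FREQ = 0.01, 0.002

    def extract_rules(training_docs):
        """training_docs: list of (tokens, category) pairs.

        Returns (presence_rules, frequency_rules), each mapping
        word -> category.
        """
        docs_in_cat = Counter(cat for _, cat in training_docs)
        appear = defaultdict(Counter)    # word -> {category: #docs containing it}
        freq_sum = defaultdict(Counter)  # word -> {category: summed relative freq}

        for tokens, cat in training_docs:
            counts = Counter(tokens)
            for word, n in counts.items():
                appear[word][cat] += 1
                freq_sum[word][cat] += n / len(tokens)

        presence_rules, frequency_rules = {}, {}
        categories = set(docs_in_cat)
        for word in appear:
            for cat in categories:
                others = categories - {cat}
                # Probability that a document of this category contains the
                # word, and the worst case over the other categories.
                prob = appear[word][cat] / docs_in_cat[cat]
                prob_other = max((appear[word][c] / docs_in_cat[c]
                                  for c in others), default=0.0)
                # Average relative frequency of the word in documents where
                # it appears, per category.
                avg_freq = (freq_sum[word][cat] / appear[word][cat]
                            if appear[word][cat] else 0.0)
                avg_freq_other = max((freq_sum[word][c] / appear[word][c]
                                      for c in others if appear[word][c]),
                                     default=0.0)

                # (i) present often in this category, rarely elsewhere
                if prob >= HIGH_PROB and prob_other <= LOW_PROB:
                    presence_rules[word] = cat
                # (ii) at least moderately likely to appear, and markedly more
                # frequent within this category's documents than within others'
                elif (prob >= MODERATE_PROB and avg_freq >= HIGH_FREQ
                      and avg_freq_other <= LOW_FREQ):
                    frequency_rules[word] = cat

        # (iii) hybrid tie-break: a word eligible for both rule types keeps
        # only its presence-based rule.
        for word in presence_rules:
            frequency_rules.pop(word, None)

        return presence_rules, frequency_rules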

