600.465 - Intro to NLP - J. Eisner

Slide 1: Text Categorization
(Actually, the methods apply to categorizing anything into fixed categories: tagging, WSD, PP attachment, ...)

Slide 2: Why Text Categorization?
- Is it spam?
- Is it Spanish?
- Is it interesting to this user?
  - News filtering
  - Helpdesk routing
- Is it interesting to this NLP program?
  - e.g., should my calendar system try to interpret this email as an appointment (using information extraction)?
- Where should it go in the directory?
  - Yahoo! / Open Directory / digital libraries
- Which mail folder? (work, friends, junk, urgent, ...)

Slide 3: Measuring Performance
- Classification accuracy: what % of messages were classified correctly?
- Is this what we care about?

              Overall accuracy | Accuracy on spam | Accuracy on gen
    System 1  95%              | 99.99%           | 90%
    System 2  95%              | 90%              | 99.99%

- Which system do you prefer?

Slide 4: Measuring Performance
- Precision = good messages kept / all messages kept
- Recall = good messages kept / all good messages
- Trade off precision vs. recall by setting a threshold
- Measure the curve on annotated dev data (or test data)
- Choose a threshold where the user is comfortable

Slide 5: Measuring Performance
(Annotations from the slide's precision-recall curve:)
- Low threshold: keep all the good stuff, but a lot of the bad too (OK for spam filtering and legal search)
- High threshold: all we keep is good, but we don't keep much (OK for search engines, maybe)
- Would prefer to be here! (high precision AND high recall)
- The point where precision = recall (sometimes reported)
- F-measure = 1 / average(1/precision, 1/recall), i.e., the harmonic mean of precision and recall

Slide 6: More Complicated Cases of Measuring Performance
- For multi-way classifiers:
  - Average accuracy (or precision or recall) of 2-way distinctions: Sports or not, News or not, etc.
  - Better: estimate the cost of different kinds of errors
  - e.g., how bad is each of the following?
    - putting Sports articles in the News section
    - putting Fashion articles in the News section
    - putting News articles in the Fashion section
  - Now tune the system to minimize total cost
- For ranking systems:
  - Correlate with human rankings?
  - Get active feedback from the user?
  - Measure the user's wasted time by tracking clicks?
(Slide annotations: "Which articles are most Sports-like?" for the multi-way case; "Which articles / webpages most relevant?" for ranking.)

Slide 7: How to Categorize?
The slide's example message (a spam email, reproduced verbatim in its tokenized form):

  Subject: would you like to . . . . . . drive a new vehicle for free ? ? ? this is not hype or a hoax , there are hundreds of people driving brand new cars , suvs , minivans , trucks , or rvs . it does not matter to us what type of vehicle you choose . if you qualify for our program , it is your choice of vehicle , color , and options . we don ' t care . just by driving the vehicle , you are promoting our program . if you would like to find out more about this exciting opportunity to drive a brand new vehicle for free , please go to this site : http : / / 209 . 134 . 14 . 131 / ntr to watch a short 4 minute audio / video presentation which gives you more information about our exciting new car program . if you do n't want to see the short video , but want us to send you our information package that explains our exciting opportunity for you to drive a new vehicle for free , please go here : http : / / 209 . 134 . 14 . 131 / ntr / form . htm we would like to add you the group of happy people driving a new vehicle for free . happy motoring .

Slide 8: How to Categorize? (supervised)
We've seen lots of options in this course!
1. Build an n-gram model of each category
   - Question: How to classify a test message?
   - Answer: Bayes' Theorem

Slide 9: How to Categorize? (supervised)
2. Represent each document as a vector
   (must choose a representation and a distance measure; use SVD?)
   - Question: How to classify a test message?
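Slide 8's recipe (build an n-gram model of each category, then classify a test message with Bayes' theorem) can be sketched before turning to slide 9's vector-space answers. This is a minimal unigram version with add-one smoothing; the toy training data and all names are invented for illustration:

```python
import math
from collections import Counter

# Toy supervised training data: (category, tokenized message). Invented.
train = [
    ("spam", "drive a new vehicle for free".split()),
    ("spam", "free free vehicle program".split()),
    ("gen",  "meeting tomorrow about the program".split()),
    ("gen",  "notes from the meeting".split()),
]

# 1. Build a (unigram) language model of each category.
counts = {}         # category -> Counter of word frequencies
priors = Counter()  # category -> number of training messages
for cat, words in train:
    counts.setdefault(cat, Counter()).update(words)
    priors[cat] += 1
vocab = {w for c in counts.values() for w in c}

def log_p_word(word, cat):
    """log p(word | cat), add-one smoothed over the shared vocabulary."""
    c = counts[cat]
    return math.log((c[word] + 1) / (sum(c.values()) + len(vocab)))

def classify(words):
    """Bayes' theorem: argmax_cat p(cat) * p(message | cat).
    The denominator p(message) is the same for every category, so it drops out."""
    def score(cat):
        logp = math.log(priors[cat] / sum(priors.values()))  # log prior
        return logp + sum(log_p_word(w, cat) for w in words if w in vocab)
    return max(counts, key=score)

print(classify("free vehicle".split()))  # -> spam (on this toy data)
```

A real system would use higher-order n-grams and better smoothing, as covered earlier in the course.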
   - Answer 1: Category whose centroid is most similar
     (may not work well if the category is diverse)
   - Answer 2: Cluster each category into subcategories
     (then use Answer 1 to pick a subcategory)
     (return the category that the subcategory is in)
     (this can also be useful for n-gram models)
   - Answer 3: Just look at the labels of nearby training docs
     (e.g., let the k nearest neighbors vote – flexible!)
     (maybe the closer ones get a bigger vote)

Slide 10: How to Categorize? (supervised)
3. Treat it like word-sense disambiguation:
   a) Vector model – use all the features (we just saw this)
   b) Decision list – use the single most indicative feature
   c) Naive Bayes – use all the features, weighted by how well they discriminate among the categories
   d) Decision tree – use some of the features in sequence
   e) Other options from machine learning, like the perceptron, Support Vector Machines (SVMs), logistic regression, ...
Features matter more than which machine learning method you use.

Slide 11: Review: Vector Model
These two documents are similar:
  (0, 0, 3, 1, 0, 7, ..., 1, 0)
  (0, 0, 1, 0, 0, 3, ..., 0, 1)
After normalizing vector length to 1, they are:
- Close in Euclidean space (similar endpoint)
- High in dot product (similar direction)
You can play lots of encoding games when creating the vector:
- Remove function words or reduce their weight
- Use features other than unigrams

Slide 12: Review: Decision Lists (slide courtesy of D. Yarowsky, modified)
To disambiguate a token of "lead":
- Scan down the sorted list
- The first cue that is found gets to make the decision all by itself
- Not as subtle as combining cues, but works well for WSD
A cue's score is its log-likelihood ratio:
  log [ p(cue | sense A) [smoothed] / p(cue | sense B) ]

Slide 13: Review: Combining Cues via Naive Bayes (slide courtesy of D. Yarowsky, modified)
These stats come from term papers of known authorship (i.e., supervised training).

Slide 14: Review: Combining Cues via Naive Bayes (slide courtesy of D. Yarowsky, modified)
The "Naive Bayes" model for classifying text (note the naive independence assumptions!):
Would this kind of sentence be more typical of a student A paper or a student B paper?
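Slide 9's Answer 3, combined with slide 11's vector model, can be sketched as k-nearest-neighbor voting over length-normalized unigram count vectors. The toy data and all names are invented; a real system would also down-weight function words and try richer features, as slide 11 suggests:

```python
import math
from collections import Counter

def unit_vector(tokens):
    """Unigram count vector, normalized to length 1 (slide 11's trick)."""
    v = Counter(tokens)
    norm = math.sqrt(sum(n * n for n in v.values()))
    return {w: n / norm for w, n in v.items()}

def dot(u, v):
    """Dot product of sparse vectors; on unit vectors this is cosine similarity."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())

def knn_classify(tokens, labeled_docs, k=3):
    """Let the k nearest training docs vote; closer docs get a bigger
    (similarity-weighted) vote, per slide 9's Answer 3."""
    q = unit_vector(tokens)
    sims = sorted(((dot(q, unit_vector(toks)), label)
                   for label, toks in labeled_docs), reverse=True)
    votes = Counter()
    for sim, label in sims[:k]:
        votes[label] += sim   # weighted vote
    return votes.most_common(1)[0][0]

# Invented toy training set.
docs = [
    ("sports", "the team won the game".split()),
    ("sports", "a great game for the home team".split()),
    ("news",   "the election results were announced".split()),
    ("news",   "officials announced the results".split()),
]
print(knn_classify("the team played a game".split(), docs))  # -> sports
```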



Source: Johns Hopkins EN 600.465, Text Categorization
