Text Categorization
600.465 Intro to NLP (Johns Hopkins) - J. Eisner

Contents: Text Categorization; Why Text Categorization?; Measuring Performance; More Complicated Cases of Measuring Performance; How to Categorize?; How to Categorize? (supervised); Review: Vector Model; Review: Decision Lists; Review: Combining Cues via Naive Bayes; Decision Trees; Features Besides Unigrams; SpamAssassin Features; How to Categorize? (unsupervised); How to Categorize? (semisupervised); How to Categorize? (adaptive); How to Categorize? (hierarchical)

Slide 1: Text Categorization
(Actually, the methods apply to categorizing anything into fixed categories: tagging, WSD, PP attachment, ...)

Slide 2: Why Text Categorization?
- Is it spam?
- Is it Spanish?
- Is it interesting to this user? (news filtering, helpdesk routing)
- Is it interesting to this NLP program? (e.g., should my calendar system try to interpret this email as an appointment, using information extraction?)
- Where should it go in the directory? (Yahoo! / Open Directory / digital libraries; which mail folder: work, friends, junk, urgent, ...)

Slide 3: Measuring Performance
Classification accuracy: what % of messages were classified correctly? Is this what we care about?

            Overall accuracy   Accuracy on spam   Accuracy on gen
  System 1  95%                99.99%             90%
  System 2  95%                90%                99.99%

Which system do you prefer?

Slide 4: Measuring Performance
Precision = good messages kept / all messages kept
Recall    = good messages kept / all good messages
- Trade off precision vs. recall by setting a threshold.
- Measure the curve on annotated dev data (or test data).
- Choose a threshold where the user is comfortable.
[Figure: Precision vs. Recall of Good (non-spam) Email; precision on one axis, recall on the other, each running 0% to 100%.]

Slide 5: Measuring Performance
[Same precision-recall figure, annotated:]
- Low threshold: keep all the good stuff, but a lot of the bad too (OK for spam filtering and legal search).
- High threshold: all we keep is good, but we don't keep much (OK for search engines, maybe).
- We would prefer to be at high precision AND high recall!
- The point where precision = recall is sometimes reported.
- F-measure = 1 / average(1/precision, 1/recall), i.e., the harmonic mean of precision and recall.
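
To make these definitions concrete, here is a minimal Python sketch of the precision/recall/F-measure computation. The message scores and gold labels are hypothetical, invented for illustration; the point is just that sweeping the threshold trades recall (low threshold) against precision (high threshold), as in the figure above.

    def precision_recall_f1(kept, good):
        """kept, good: sets of message ids (messages kept / truly good)."""
        good_kept = len(kept & good)
        precision = good_kept / len(kept) if kept else 0.0
        recall = good_kept / len(good) if good else 0.0
        # F-measure = 1 / average(1/precision, 1/recall), the harmonic mean
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # Hypothetical dev data: scores[m] = confidence that message m is good.
    scores = {"m1": 0.9, "m2": 0.8, "m3": 0.4, "m4": 0.2}
    good = {"m1", "m2", "m4"}   # gold labels

    for threshold in (0.1, 0.5):
        kept = {m for m, s in scores.items() if s >= threshold}
        print(threshold, precision_recall_f1(kept, good))
    # threshold 0.1 -> recall 1.0 but precision 0.75
    # threshold 0.5 -> precision 1.0 but recall 0.67
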
Slide 6: More Complicated Cases of Measuring Performance
For multi-way classifiers:
- Average accuracy (or precision or recall) of 2-way distinctions: Sports or not, News or not, etc.
- Better, estimate the cost of different kinds of errors. E.g., how bad is each of the following?
  - putting Sports articles in the News section
  - putting Fashion articles in the News section
  - putting News articles in the Fashion section
- Then tune the system to minimize total cost.
For ranking systems (which articles are most Sports-like? which articles/webpages are most relevant?):
- Correlate with human rankings?
- Get active feedback from the user?
- Measure the user's wasted time by tracking clicks?

Slide 7: How to Categorize?
[An example spam message, shown tokenized:]
Subject: would you like to . . .
. . drive a new vehicle for free ? ? ? this is not hype or a hoax , there are hundreds of people driving brand new cars , suvs , minivans , trucks , or rvs . it does not matter to us what type of vehicle you choose . if you qualify for our program , it is your choice of vehicle , color , and options . we don ' t care . just by driving the vehicle , you are promoting our program . if you would like to find out more about this exciting opportunity to drive a brand new vehicle for free , please go to this site : http : / / 209 . 134 . 14 . 131 / ntr to watch a short 4 minute audio / video presentation which gives you more information about our exciting new car program . if you do n't want to see the short video , but want us to send you our information package that explains our exciting opportunity for you to drive a new vehicle for free , please go here : http : / / 209 . 134 . 14 . 131 / ntr / form . htm we would like to add you the group of happy people driving a new vehicle for free . happy motoring .

Slide 8: How to Categorize? (supervised)
We've seen lots of options in this course!
1. Build an n-gram model of each category.
   Question: How do we classify a test message? Answer: Bayes' Theorem. (A sketch appears at the end of these notes.)

Slide 9: How to Categorize? (supervised)
We've seen lots of options in this course!
2. Represent each document as a vector (must choose a representation and a distance measure; use SVD?).
   Question: How do we classify a test message?
   - Answer 1: pick the category whose centroid is most similar (may not work well if the category is diverse).
   - Answer 2: cluster each category into subcategories, then use Answer 1 to pick a subcategory and return the category it belongs to (this can also be useful for n-gram models).
   - Answer 3: just look at the labels of nearby training docs, e.g., let the k nearest neighbors vote (flexible!), maybe with the closer ones getting a bigger vote. (Sketched at the end of these notes.)

Slide 10: How to Categorize? (supervised)
We've seen lots of options in this course!
3. Treat it like word-sense disambiguation.
   a) Vector model - use all the features (we just saw this)
   b) Decision list - use the single most indicative feature (sketched at the end of these notes)
   c) Naive Bayes - use all the features, weighted by how well they discriminate among the categories
   d) Decision tree - use some of the features in sequence
   e) Other options from machine learning: perceptron, Support Vector Machine (SVM), logistic regression, ...
   Features matter more than which machine learning method you use.

Slide 11: Review: Vector Model
These two documents are similar:
  (0, 0, 3, 1, 0, 7, ..., 1, 0)
  (0, 0, 1, 0, 0, 3, ..., 0, 1)
(each dimension counts one vocabulary word: aardvark, abacus, abandoned, abbot, abduct, above, ..., zygote, zymurgy)
- After normalizing vector length to 1: close in Euclidean space (similar endpoint) = high dot product (similar direction).
- Can play lots of encoding games when creating the vector: remove function words or reduce their weight; use features other than unigrams.

Slide 12: Review: Decision Lists
(slide courtesy of D. Yarowsky)
[Preview truncated here; the remaining slides are not included.]
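
Since the preview stops before any worked examples, here are three small, self-contained Python sketches of methods named above. First, option 1 from slide 8 (equivalently, the Naive Bayes idea from slide 10c with unigram features): build a model of each category and classify by Bayes' Theorem. The tiny training corpus and the choice of add-one smoothing are my assumptions, not taken from the slides.

    import math
    from collections import Counter

    # Hypothetical training set: (category, tokenized message) pairs.
    train = [
        ("spam", "drive a new vehicle for free".split()),
        ("spam", "exciting opportunity to drive for free".split()),
        ("gen",  "meeting moved to tuesday please confirm".split()),
    ]

    cat_docs = Counter(cat for cat, _ in train)          # p(category) counts
    cat_words = {cat: Counter() for cat in cat_docs}     # unigram counts
    for cat, words in train:
        cat_words[cat].update(words)
    vocab = {w for words in cat_words.values() for w in words}

    def classify(words):
        # Bayes' Theorem: argmax_c p(c | doc) = argmax_c p(c) * p(doc | c),
        # with p(doc | c) from a unigram model, add-one smoothed
        # (the extra +1 in the denominator reserves mass for unseen words).
        best = None
        for cat in cat_docs:
            logp = math.log(cat_docs[cat] / sum(cat_docs.values()))
            total = sum(cat_words[cat].values())
            for w in words:
                logp += math.log((cat_words[cat][w] + 1)
                                 / (total + len(vocab) + 1))
            if best is None or logp > best[0]:
                best = (logp, cat)
        return best[1]

    print(classify("a free vehicle".split()))   # -> 'spam'
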
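Next, a sketch of slides 9 and 11: bag-of-words vectors normalized to length 1, so that the dot product of two documents is their cosine similarity, combined with Answer 3's k-nearest-neighbor vote, weighted so closer neighbors count more. The labeled documents and the choice of k are hypothetical.

    import math
    from collections import Counter

    def unit_vector(tokens):
        # Bag-of-words count vector, normalized to length 1.
        v = Counter(tokens)
        norm = math.sqrt(sum(c * c for c in v.values()))
        return {w: c / norm for w, c in v.items()}

    def cosine(u, v):
        # Dot product of unit vectors = cosine similarity.
        return sum(u[w] * v.get(w, 0.0) for w in u)

    def knn_classify(doc, labeled_docs, k=3):
        # Let the k nearest training docs vote, with closer neighbors
        # getting a bigger (similarity-weighted) vote.
        q = unit_vector(doc)
        sims = sorted(((cosine(q, unit_vector(d)), cat)
                       for cat, d in labeled_docs), reverse=True)
        votes = Counter()
        for sim, cat in sims[:k]:
            votes[cat] += sim
        return votes.most_common(1)[0][0]

    labeled = [("spam", "win a free new car".split()),
               ("spam", "free vehicle offer".split()),
               ("gen",  "lunch on friday".split())]   # hypothetical
    print(knn_classify("free car for you".split(), labeled, k=2))  # -> 'spam'
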
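Finally, a sketch of option (b) from slide 10, the decision list: classify with the single most indicative feature present in the document. Ranking features by Yarowsky-style smoothed log-odds is one common recipe, my assumption rather than anything specified in the preview; the training pairs are again hypothetical.

    import math
    from collections import Counter

    def build_decision_list(train, default):
        counts = {}   # word -> Counter over categories it appears with
        for cat, words in train:
            for w in set(words):
                counts.setdefault(w, Counter())[cat] += 1
        rules = []
        for w, c in counts.items():
            cats = sorted(c, key=c.get, reverse=True)
            top, rest = cats[0], sum(c[cat] for cat in cats[1:])
            # Smoothed log-odds: how strongly w indicates its best category.
            strength = math.log((c[top] + 0.5) / (rest + 0.5))
            rules.append((strength, w, top))
        rules.sort(reverse=True)   # most indicative features first
        return rules, default

    def dl_classify(doc, decision_list):
        rules, default = decision_list
        words = set(doc)
        for strength, w, cat in rules:   # first matching rule wins
            if w in words:
                return cat
        return default

    train = [("spam", "free free offer".split()),
             ("spam", "free vehicle".split()),
             ("gen",  "staff meeting notes".split())]
    dl = build_decision_list(train, default="gen")
    print(dl_classify("a free offer".split(), dl))   # -> 'spam'
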

