MIT 9 520 - Text Classification

Text Classification
Jason Rennie

Text Classification
• Assign a text document a label based on its content.
• Examples:
  – E-mail filtering
  – Knowledge-base creation
  – E-commerce
  – Question answering
  – Information extraction

E-mail Filtering
• Filter e-mail into folders set up by the user.
• Aids searching for old e-mails.
• Can be used to prioritize incoming e-mails:
  – High priority to e-mails concerning your Ph.D. thesis
  – Low priority to “FREE Pre-Built Home Business”

Knowledge-Base Creation
• Company web sites provide large amounts of information about products, marketing contact persons, etc.
• Categorization can be used to find companies’ web pages and organize them by industrial sector.
• This information can be sold to, e.g., a person who wants to market “Flat Fixer” to a tire company.

E-Commerce
• Users locate products in two basic ways: search and browsing.
• Browsing is best when the user doesn’t know exactly what he/she wants.
• Text classification can be used to organize products into a hierarchy according to their descriptions.
• EBay: classification can be used to ensure that a product fits the category given by the user.

Question Answering
• “When did George Washington die?”
• Search a document database for short strings containing the answer.
• Rank the candidates.
• Many features (question type, proper nouns, noun overlap, verb overlap, etc.).
• Problem: learn whether a string is the answer based on its feature values.

Information Extraction
• Want to extract information from talk announcements (room, time, date, title, speaker, etc.).
• Many features may identify the information (keywords, punctuation, capitalization, numeric tokens, etc.).
• Problem: scan over the text of a document, filling buckets with the desired information.
• Freitag (1998) showed that this approach could identify speaker (63%), location (76%), start time (99%) and end time (96%).

Basics of Text Classification
• Canonical problem: a set of training documents, (d1, …, dn), with labels, (y1, …, yn), and a set of test documents, (x1, …, xn).
• Goal: assign correct labels to the test documents.

Representation
From: [email protected] (Steve Dyer)
Subject: Re: food-related seizures?

My comments about the Feingold Diet have no relevance to your daughter’s purported Frosted Flakes-related seizures. I can’t imagine why you included it.

↓

food      1
seizures  2
diet      1
catering  0
religion  0
...      ...

Representation
• Punctuation is removed, case is ignored, and words are separated into tokens. Known as the “feature vector” or “bag-of-words” representation.
• The vector length is the size of the vocabulary. A common vocabulary size is 10,000–100,000, so the classification problem is very high dimensional.

Why Is Text Different?
• Near independence of features.
• High dimensionality (the vocabulary is often larger than the number of examples!).
• Importance of speed.

The Word Vector Has Problems
• A longer document yields a larger vector.
• Words tend to occur either a little or a lot.
• Rare words have the same weight as common words.

Text Is Heavy-Tailed
[Figure: term frequency vs. probability on a log scale; a power-law fit matches the data much better than a multinomial.]

SMART “ltc” Transform
• new-tf_i = log(tf_i + 1.0)
  – Corresponds to a power-law distribution: p(tf_i) ∝ (1 + tf_i)^(log θ)
• new-wt_i = new-tf_i · log(num-docs / num-docs-with-term)   (“TFIDF”)
• norm-wt_i = new-wt_i / sqrt(Σ_i new-wt_i²)   (unit-length vectors)

Types of Classification Problems
• Binary: label each new document as positive or negative.
  Is this a news article Tommy would want to read?
• Multiclass: give one of m labels to each new document.
  Which customer support group should respond to this e-mail?
• Multitopic: assign zero to m topics to each new document.
  Who are good candidates for reviewing this research paper?
• Ranking: rank categories by relevance.
  Help the user annotate documents by suggesting good categories.

Multiclass Classification
• Decision theory: the minimum-error decision boundary lies where the densities of the top two classes are equal.
• Problem: learning densities is ineffective for classification.

Multiclass Classification
• Simple approach: construct one binary classifier to discriminate each class from the rest.
• Problem: we can’t say anything about the middle regions.

Multiclass Classification
• Better approach: construct many binary classifiers that, together, approximate the true boundaries.

Error-Correcting Output Coding
• Idea: represent each label as a length-l binary code. Learn one binary classifier for each of the l bits in the code.
• For each example, assign the label with the “closest” code.
• Motivation: errors can be corrected by using more bits than are needed to partition the labels.

Code matrix:
1   +1 +1 +1 +1 −1 −1 −1
2   +1 −1 −1 −1 +1 −1 −1
3   −1 +1 −1 −1 −1 +1 −1
4   −1 −1 +1 −1 −1 −1 +1    (1)

ECOC: The Loss Function
• ECOC works best when margin values are used:
  Ĥ(x) = arg min_{c ∈ {1,…,m}} Σ_{i=1}^{l} g(f_i(x) M_{ci})   (2)
• The loss function g is a transform on the classifier outputs.
[Figure: three loss functions plotted against the margin: Hamming, hinge (SVM), and logistic.]

ECOC: Some Results
• ECOC works better than the usual multiclass approach for decision trees and neural networks (Dietterich and Bakiri, 1995).
• Loss-based decoding works better than Hamming decoding with SVMs (Allwein et al., 2000).
• ECOC with loss-based decoding is very effective for text classification (Rennie and Rifkin, 2001).

Multiclass Classification: Interesting Questions
• Is a continuous code matrix useful? (Crammer & Singer, 2001)
• How do you construct the best code matrix? (Crammer & Singer, 2000; assumes the existence of binary classifiers)

Multitopic Classification
• A document may be composed of many different topics.
• Zero or many topics per document.
• The “label” is a bit vector of topic indicators.
[Figure: a document tagged with overlapping topics: London, Traffic, Taxes, Iraq, Oil, Politics, Business, Myanmar.]

Multitopic Classification
• Basic approach: learn a binary classifier for each topic.
  Iraq vs. non-Iraq
  Politics vs. non-Politics
  Oil vs. non-Oil
• Problem: an “Iraq” document contains other things too.

Multitopic Classification
• How do we identify the part of a document that gives it the “Iraq” topic?
• Easier problem: how do we model a multi-topic document?

Multitopic Classification
• If we ignore word order, each word is randomly generated from one of m topic models.
• The problem becomes: how do we learn a model for each topic?
• Ueda and Saito (2003) suggest modeling text as a multinomial and learning the models with an EM-like …
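The bag-of-words representation from the slides can be sketched in a few lines. This is a minimal illustration, not the lecture's exact preprocessing: the regex tokenizer and the tiny vocabulary here are assumptions for demonstration.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Lowercase, strip punctuation, split into tokens, and count
    how often each vocabulary word occurs in the document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Illustrative vocabulary and document (abridged from the slide's example):
vocab = ["food", "seizures", "diet", "catering", "religion"]
doc = ("My comments about the Feingold Diet have no relevance to your "
       "daughter's purported Frosted Flakes-related seizures.")
print(bag_of_words(doc, vocab))  # -> [0, 1, 1, 0, 0]
```

In a real system the vocabulary would be the full training-corpus lexicon (10,000–100,000 entries), so these vectors are very sparse and very high dimensional.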

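The SMART "ltc" transform above composes three steps: a log dampening of term frequency, an inverse-document-frequency weighting, and normalization to unit length. A direct sketch of those formulas (the example counts are made up):

```python
import math

def ltc_weights(tf, doc_freq, num_docs):
    """SMART 'ltc' weighting: log-tf, times idf, normalized to unit
    length, following the slide's three formulas."""
    # new-tf_i = log(tf_i + 1.0)
    new_tf = [math.log(t + 1.0) for t in tf]
    # new-wt_i = new-tf_i * log(num-docs / num-docs-with-term)
    wt = [t * math.log(num_docs / df) for t, df in zip(new_tf, doc_freq)]
    # norm-wt_i = new-wt_i / sqrt(sum_i new-wt_i^2)
    norm = math.sqrt(sum(w * w for w in wt))
    return [w / norm for w in wt] if norm > 0 else wt

# Hypothetical document: 4 terms with raw counts tf, corpus of 1000 docs.
w = ltc_weights(tf=[2, 1, 0, 5], doc_freq=[10, 100, 400, 50], num_docs=1000)
print(round(sum(x * x for x in w), 6))  # unit length -> 1.0
```

Note how the rare term (document frequency 10) ends up with far more weight than it would under raw counts, which is exactly the problem the slide's "rare words have the same weight as common words" bullet identifies.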

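Loss-based ECOC decoding (Equation 2) can be sketched with the 4-class code matrix from Equation 1 and the hinge loss as g. The margin values below are invented for illustration; in practice they would come from l trained binary classifiers.

```python
# Loss-based ECOC decoding (Eq. 2): choose the class c minimizing
# sum_i g(f_i(x) * M[c][i]), here with the hinge loss g(z) = max(0, 1 - z).
M = [
    [+1, +1, +1, +1, -1, -1, -1],  # class 1
    [+1, -1, -1, -1, +1, -1, -1],  # class 2
    [-1, +1, -1, -1, -1, +1, -1],  # class 3
    [-1, -1, +1, -1, -1, -1, +1],  # class 4
]

def hinge(z):
    return max(0.0, 1.0 - z)

def ecoc_decode(margins, code_matrix, loss=hinge):
    """Return the 1-based label whose code row incurs the least total loss."""
    losses = [sum(loss(f * m) for f, m in zip(margins, row))
              for row in code_matrix]
    return min(range(len(code_matrix)), key=losses.__getitem__) + 1

# Hypothetical margin outputs f_i(x) of the 7 binary classifiers:
f = [0.9, -0.4, -0.7, -0.2, 1.1, -0.5, -0.8]
print(ecoc_decode(f, M))  # sign pattern agrees with class 2's code -> 2
```

Swapping `hinge` for a Hamming-style 0/1 loss on the sign of the margin recovers plain Hamming decoding, which is the comparison made in the Allwein et al. result above.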
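The "basic approach" to multitopic classification, one independent binary classifier per topic, can be sketched as follows. The keyword scorers here are a stand-in assumption for real trained classifiers (the slides do not fix a base learner), and the topic keywords are invented.

```python
# One binary classifier per topic; a document receives every topic
# whose classifier fires, so labels form a bit vector of topic indicators.
def make_topic_classifier(keywords):
    """Stand-in for a trained binary classifier: fires if any
    topic keyword appears in the document (illustration only)."""
    def classify(tokens):
        return any(tok in keywords for tok in tokens)
    return classify

classifiers = {
    "Iraq": make_topic_classifier({"iraq", "baghdad"}),
    "Politics": make_topic_classifier({"election", "parliament"}),
    "Oil": make_topic_classifier({"oil", "barrel"}),
}

def predict_topics(text):
    tokens = text.lower().split()
    return [topic for topic, clf in classifiers.items() if clf(tokens)]

print(predict_topics("Oil exports from Iraq rose ahead of the election"))
# -> ['Iraq', 'Politics', 'Oil']
```

Because the classifiers are independent, a document can receive zero, one, or many topics, but nothing in this scheme says *which part* of the document carried each topic, which is the open problem the final slides raise.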