Stanford CS 276 - Text Classification


Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Lecture 10: Text Classification; The Naive Bayes algorithm

Standing queries (Ch. 13)
- The path from IR to text classification:
  - You have an information need to monitor, say: unrest in the Niger delta region
  - You want to rerun an appropriate query periodically to find new news items on this topic
  - You will be sent new documents that are found
  - I.e., it's not ranking but classification (relevant vs. not relevant)
- Such queries are called standing queries
  - Long used by "information professionals"
  - A modern mass instantiation is Google Alerts
- Standing queries are (hand-written) text classifiers

Spam filtering
Another text classification task

  From: "" <[email protected]>
  Subject: real estate is the only way... gem oalvgkay

  Anyone can buy real estate with no money down
  Stop paying rent TODAY !
  There is no need to spend hundreds or even thousands for similar courses
  I am 22 years old and I have already purchased 6 properties using the
  methods outlined in this truly INCREDIBLE ebook.
  Change your life NOW !
  =================================================
  Click Below to order:
  http://www.wholesaledaily.com/sales/nmd.htm

Text classification
- Today:
  - Introduction to Text Classification
    - Also widely known as "text categorization"
    - Same thing
  - Naïve Bayes text classification
    - Including a little on Probabilistic Language Models

Categorization/Classification (Sec. 13.1)
- Given:
  - A description of an instance, d ∈ X
    - X is the instance language or instance space.
    - Issue: how to represent text documents. Usually some type of high-dimensional space: bag of words
  - A fixed set of classes: C = {c1, c2, …, cJ}
- Determine:
  - The category of d: γ(d) ∈ C, where γ(d) is a classification function whose domain is X and whose range is C.
  - We want to know how to build classification functions

Machine Learning: Supervised Classification
- Given:
  - A description of an instance, d ∈ X
    - X is the instance language or instance space.
  - A fixed set of classes: C = {c1, c2, …, cJ}
  - A training set D of labeled documents, with each labeled document ⟨d, c⟩ ∈ X × C
- Determine:
  - A learning method or algorithm which will enable us to learn a classifier γ: X → C
  - For a test document d, we assign it the class γ(d) ∈ C

Document Classification
[Figure: a test document is assigned to one of six classes, each learned from training documents.]
- Test data: "planning language proof intelligence"
- Classes: (AI): ML, Planning; (Programming): Semantics, Garb.Coll.; (HCI): Multimedia, GUI
- Training data (sample terms per class):
  - ML: learning, intelligence, algorithm, reinforcement, network, ...
  - Planning: planning, temporal, reasoning, plan, language, ...
  - Semantics: programming, semantics, language, proof, ...
  - Garb.Coll.: garbage, collection, memory, optimization, region, ...
(Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garb. Coll.)

More Text Classification Examples
- Assigning labels to documents or web pages:
  - Labels are most often topics such as Yahoo categories
    - "finance", "sports", "news>world>asia>business"
  - Labels may be genres
    - "editorials", "movie-reviews", "news"
  - Labels may be opinion on a person/product
    - "like", "hate", "neutral"
  - Labels may be domain-specific
    - "interesting-to-me" : "not-interesting-to-me"
    - "contains adult language" : "doesn't"
    - language identification: English, French, Chinese, …
    - search vertical: about Linux versus not
    - "link spam" : "not link spam"

Classification Methods (1)
- Manual classification
  - Used by the original Yahoo! Directory
  - Looksmart, about.com, ODP, PubMed
  - Very accurate when the job is done by experts
  - Consistent when the problem size and team are small
  - Difficult and expensive to scale
    - Means we need automatic classification methods for big problems

Classification Methods (2)
- Hand-coded rule-based classifiers
  - One technique used by CS dept's spam filter, Reuters, CIA, etc.
  - It's what Google Alerts is doing
    - Widely deployed in government and enterprise
  - Companies provide an "IDE" for writing such rules
  - E.g., assign category if document contains a given boolean combination of words
  - Commercial systems have complex query languages (everything in IR query languages + score accumulators)
  - Accuracy is often very high if a rule has been carefully refined over time by a subject expert

A Verity topic
A complex classification rule
- Note:
  - maintenance issues (author, etc.)
  - Hand-weighting of terms
[Verity was bought by Autonomy.]

Classification Methods (3)
- Supervised learning of a document-label assignment function
  - Many systems partly or wholly rely on machine learning (Autonomy, Microsoft, Enkata, Yahoo!, …)
    - k-Nearest Neighbors (simple, powerful)
    - Naive Bayes (simple, common method)
    - Support-vector machines (new, generally more powerful)
    - … plus many other methods
  - No free lunch: requires hand-classified training data
  - But data can be built up (and refined) by …

Relevance feedback
- In relevance feedback, the user marks a few documents as relevant/nonrelevant
- The choices can be viewed as classes or categories
- The IR system then uses these judgments to build a better model of the information need
- So, relevance feedback can be viewed as a form of text classification (deciding between several classes)

Probabilistic relevance feedback (Sec. 9.1.2)
- Rather than reweighting in a vector space…
- If the user has told us some relevant and some nonrelevant documents, then we can proceed to build a probabilistic classifier such as the Naive Bayes model we will look at today:
  - P(tk|R) = |Drk| / |Dr|
  - P(tk|NR) = |Dnrk| / |Dnr|
  - tk is a term; Dr is the set of known relevant documents; Drk is the subset that contains tk; Dnr is the set of known nonrelevant documents; Dnrk is the subset that contains tk.

Bayesian Methods
Learning and classification methods
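
The probabilistic relevance feedback estimates given earlier, P(tk|R) = |Drk| / |Dr| and P(tk|NR) = |Dnrk| / |Dnr|, can be illustrated with a minimal Python sketch. The function name term_probabilities, the representation of each document as a set of terms, and the toy documents are all assumptions made for illustration; they do not come from the lecture.

```python
from typing import Iterable


def term_probabilities(docs: list[set[str]], vocabulary: Iterable[str]) -> dict[str, float]:
    """Estimate P(tk | class) as the fraction of the class's documents containing tk.

    `docs` plays the role of Dr (or Dnr); the documents containing a term
    play the role of Drk (or Dnrk). No smoothing is applied.
    """
    n_docs = len(docs)
    if n_docs == 0:
        raise ValueError("need at least one judged document to estimate probabilities")
    return {
        term: sum(1 for doc in docs if term in doc) / n_docs
        for term in vocabulary
    }


# Hypothetical user judgments: each document is reduced to its set of terms.
relevant = [
    {"niger", "delta", "unrest"},
    {"niger", "oil"},
    {"delta", "unrest", "protest"},
]
nonrelevant = [
    {"delta", "airlines"},
    {"oil", "prices"},
]

vocabulary = sorted(set().union(*relevant, *nonrelevant))
p_rel = term_probabilities(relevant, vocabulary)        # P(tk | R)
p_nonrel = term_probabilities(nonrelevant, vocabulary)  # P(tk | NR)

for term in vocabulary:
    print(f"{term:10s} P(t|R)={p_rel[term]:.2f}  P(t|NR)={p_nonrel[term]:.2f}")
```

Because the estimates are plain document fractions, a term that never occurs in one of the judged sets gets probability 0 there (here, "niger" among the nonrelevant documents).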

