Introduction to Information Retrieval Introduction toInformation RetrievalCS276: Information Retrieval and Web SearchLecture 10: Text Classification;The Naive Bayes algorithmIntroduction to Information Retrieval Standing queriesThe path from IR to text classification:You have an information need to monitor, say:Unrest in the Niger delta regionYou want to rerun an appropriate query periodically to find new news items on this topicYou will be sent new documents that are found I.e., it’s not ranking but classification (relevant vs. not relevant)Such queries are called standing queriesLong used by “information professionals”A modern mass instantiation is Google AlertsStanding queries are (hand-written) text Ch. 13Introduction to Information Retrieval 3Introduction to Information Retrieval Spam filteringAnother text classification taskFrom: "" <[email protected]>Subject: real estate is the only way... gem oalvgkayAnyone can buy real estate with no money downStop paying rent TODAY !There is no need to spend hundreds or even thousands for similar coursesI am 22 years old and I have already purchased 6 properties using themethods outlined in this truly INCREDIBLE ebook.Change your life NOW !=================================================Click Below to order:http://www.wholesaledaily.com/sales/nmd.htmCh. 13Introduction to Information Retrieval Text classificationToday:Introduction to Text ClassificationAlso widely known as “text categorization”Same thingNaïve Bayes text classificationIncluding a little on Probabilistic Language ModelsCh. 13Introduction to Information Retrieval Categorization/ClassificationGiven:A description of an instance, d ∈ XX is the instance language or instance space.Issue: how to represent text documents. Usually some type of high-dimensional space – bag of wordsA fixed set of classes:! C = {c1, c2,…, cJ}Determine:The category of d: γ(d) ∈ C, where γ(d) is a classification function whose domain is X and whose range is C.We want to know how to build classification functions Sec. 13.1Introduction to Information Retrieval Machine Learning:Supervised ClassificationGiven:A description of an instance, d ∈ XX is the instance language or instance space.A fixed set of classes:! C = {c1, c2,…, cJ}A training set D of labeled documents with each labeled document ⟨d,c⟩ ∈ X×CDetermine:A learning method or algorithm which will enable us to learn a classifier γ:X→CFor a test document d, we assign it the class γ(d) ∈ CSec. 13.1Introduction to Information Retrieval Multimedia GUIGarb.Coll.SemanticsMLPlanningplanningtemporalreasoningplanlanguage...programmingsemanticslanguageproof...learningintelligencealgorithmreinforcementnetwork...garbagecollectionmemoryoptimizationregion...“planning language proof intelligence”TrainingData:TestData:Classes:(AI)Document Classification(Programming) (HCI)......(Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garb. Coll.)Sec. 13.1Introduction to Information Retrieval More Text Classification ExamplesAssigning labels to documents or web-pages:Labels are most often topics such as Yahoo-categories"finance," "sports," "news>world>asia>business"Labels may be genres"editorials" "movie-reviews" "news”Labels may be opinion on a person/product“like”, “hate”, “neutral”Labels may be domain-specific"interesting-to-me" : "not-interesting-to-me”“contains adult language” : “doesn’t”language identification: English, French, Chinese, …search vertical: about Linux versus not“link spam” : “not link spam”Ch. 13Introduction to Information Retrieval Classification Methods (1)Manual classificationUsed by the original Yahoo! DirectoryLooksmart, about.com, ODP, PubMedVery accurate when job is done by expertsConsistent when the problem size and team is smallDifficult and expensive to scaleMeans we need automatic classification methods for big problemsCh. 13Introduction to Information Retrieval Classification Methods (2)Hand-coded rule-based classifiersOne technique used by CS dept’s spam filter, Reuters, CIA, etc.It’s what Google Alerts is doingWidely deployed in government and enterpriseCompanies provide “IDE” for writing such rulesE.g., assign category if document contains a given boolean combination of wordsCommercial systems have complex query languages (everything in IR query languages +score accumulators)Accuracy is often very high if a rule has been carefully refined over time by a subject expertCh. 13Introduction to Information Retrieval A Verity topic A complex classification ruleNote:maintenance issues (author, etc.)Hand-weighting of terms[Verity was bought by Autonomy.]Ch. 13Introduction to Information Retrieval Classification Methods (3)Supervised learning of a document-label assignment functionMany systems partly or wholly rely on machine learning (Autonomy, Microsoft, Enkata, Yahoo!, …)k-Nearest Neighbors (simple, powerful)Naive Bayes (simple, common method)Support-vector machines (new, generally more powerful)… plus many other methodsNo free lunch: requires hand-classified training dataBut data can be built up (and refined) by Ch. 13Introduction to Information Retrieval Relevance feedbackIn relevance feedback, the user marks a few documents as relevant/nonrelevantThe choices can be viewed as classes or categoriesThe IR system then uses these judgments to build a better model of the information needSo, relevance feedback can be viewed as a form of text classification (deciding between several classes)Introduction to Information Retrieval Probabilistic relevance feedbackRather than reweighting in a vector space…If user has told us some relevant and some nonrelevant documents, then we can proceed to build a probabilistic classifier such as the Naive Bayes model we will look at today:P(tk|R) = |Drk| / |Dr|P(tk|NR) = |Dnrk| / |Dnr|tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain tk; Dnr is the set of known nonrelevant documents; Dnrk is the subset that contain tk.Sec. 9.1.2Introduction to Information Retrieval Bayesian MethodsLearning and classification methods
View Full Document