CS276B: Text Information Retrieval, Mining, and Exploitation
Lecture 4: Text Categorization I (Introduction and Naive Bayes)
Jan 21, 2003

Is this spam?

From: "" <[email protected]>
Subject: real estate is the only way...
gem oalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

Categorization/Classification

Given:
- A description of an instance, x in X, where X is the instance language or instance space.
  Issue: how to represent text documents.
- A fixed set of categories: C = {c1, c2, ..., cn}
Determine:
- The category of x: c(x) in C, where c(x) is a categorization function whose domain is X and whose range is C.
- We want to know how to build categorization functions ("classifiers").

Document Classification

(Figure: training documents grouped under the classes Multimedia, GUI, Garb. Coll., Semantics, ML, Planning, each characterized by word lists such as "planning temporal reasoning plan language", "programming semantics language proof", "learning intelligence algorithm reinforcement network", and "garbage collection memory optimization region". A test document containing "planning language proof intelligence" is to be assigned to a class. Top-level groupings shown: (AI), (Programming), (HCI).)

(Note: in real life there is often a hierarchy, not present in the above problem statement; and you get papers on ML approaches to Garb. Coll.)

Text Categorization Examples

Assign labels to each document or web-page:
- Labels are most often topics such as Yahoo-categories
  e.g., "finance," "sports," "news>world>asia>business"
- Labels may be genres
  e.g., "editorials", "movie-reviews", "news"
- Labels may be opinion
  e.g., "like", "hate", "neutral"
- Labels may be domain-specific binary
  e.g., "interesting-to-me" : "not-interesting-to-me"
  e.g., "spam" : "not-spam"
  e.g., "is a toner cartridge ad" : "isn't"

Methods (1)

Manual classification
- Used by Yahoo!, Looksmart, about.com, ODP, Medline
- Very accurate when the job is done by experts
- Consistent when the problem size and team are small
- Difficult and expensive to scale

Automatic document classification
- Hand-coded rule-based systems
  - Used by the CS dept's spam filter, Reuters, CIA, Verity, ...
  - E.g., assign a category if the document contains a given boolean combination of words
  - Commercial systems have complex query languages (everything in IR query languages + accumulators)

Methods (2)

- Accuracy is often very high if a query has been carefully refined over time by a subject expert
- Building and maintaining these queries is expensive

Supervised learning of a document-label assignment function
- Many new systems rely on machine learning (Autonomy, Kana, MSN, Verity, ...)
  - k-Nearest Neighbors (simple, powerful)
  - Naive Bayes (simple, common method)
  - Support-vector machines (new, more powerful)
  - ... plus many other methods
- No free lunch: requires hand-classified training data
- But can be built (and refined) by amateurs

Text Categorization: attributes

- Representations of text are very high dimensional (one feature for each word).
- High-bias algorithms that prevent overfitting in high-dimensional space are best.
- For most text categorization tasks, there are many irrelevant and many relevant features.
- Methods that combine evidence from many or all features (e.g.
  naive Bayes, kNN, neural nets) tend to work better than ones that try to isolate just a few relevant features (standard decision-tree or rule induction)*
  *Although one can compensate by using many rules

Bayesian Methods

- Our focus today
- Learning and classification methods based on probability theory.
- Bayes' theorem plays a critical role in probabilistic learning and classification.
- Build a generative model that approximates how data is produced
- Uses prior probability of each category given no information about an item.
- Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Bayes' Rule

P(C, X) = P(C|X) P(X) = P(X|C) P(C)

P(C|X) = P(X|C) P(C) / P(X)

Maximum a posteriori Hypothesis

h_MAP = argmax_{h in H} P(h|D)
      = argmax_{h in H} P(D|h) P(h) / P(D)
      = argmax_{h in H} P(D|h) P(h)

Maximum likelihood Hypothesis

If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:

h_ML = argmax_{h in H} P(D|h)

Naive Bayes Classifiers

Task: Classify a new instance based on a tuple of attribute values <x1, x2, ..., xn>

c_MAP = argmax_{cj in C} P(cj | x1, x2, ..., xn)
      = argmax_{cj in C} P(x1, x2, ..., xn | cj) P(cj) / P(x1, x2, ..., xn)
      = argmax_{cj in C} P(x1, x2, ..., xn | cj) P(cj)

Naïve Bayes Classifier: Assumptions

- P(cj) can be estimated from the frequency of classes in the training examples.
- P(x1, x2, ..., xn | cj) has O(|X|^n · |C|) parameters and could only be estimated if a very, very large number of training examples was available.
- Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.

(Figure: Bayes net with class node Flu and conditionally independent attribute nodes X1..X5: fever, sinus, cough, runny nose, muscle-ache.)

The Naïve Bayes Classifier
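Under the conditional independence assumption, the MAP rule above reduces to choosing argmax_c P(c) · product_i P(x_i|c). As a minimal illustrative sketch (the class name, helper structure, and toy training data below are made up, not from the lecture), here is a multinomial Naive Bayes text classifier in Python that estimates P(c) and P(w|c) from word frequencies, uses add-one (Laplace) smoothing so unseen words do not zero out a class, and sums log probabilities to avoid floating-point underflow:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Minimal multinomial Naive Bayes text classifier (illustrative sketch)."""

    def train(self, docs):
        # docs: list of (list_of_words, class_label) pairs
        self.class_counts = Counter(label for _, label in docs)
        self.word_counts = defaultdict(Counter)   # per-class word frequencies
        vocab = set()
        for words, label in docs:
            self.word_counts[label].update(words)
            vocab.update(words)
        self.vocab_size = len(vocab)
        self.total_docs = len(docs)

    def classify(self, words):
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            # log prior: P(c) estimated as (docs in class) / (total docs)
            score = math.log(n_docs / self.total_docs)
            n_words = sum(self.word_counts[label].values())
            for w in words:
                # add-one smoothed estimate of P(w | c)
                p = (self.word_counts[label][w] + 1) / (n_words + self.vocab_size)
                score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# toy example (made-up data)
train = [
    ("buy cheap real estate now".split(), "spam"),
    ("click here to order now".split(), "spam"),
    ("lecture notes on text categorization".split(), "ok"),
    ("probability theory and classification".split(), "ok"),
]
nb = NaiveBayesText()
nb.train(train)
print(nb.classify("order cheap estate now".split()))   # -> spam
```

Working in log space matters in practice: a realistic document multiplies hundreds of probabilities, each well below 1, and the raw product underflows to 0.0 long before the argmax is taken.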