CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 11
9/29/2011

Slide titles:
- Today 9/29
- Where we are...
- Is this spam?
- Text Categorization Examples
- Categorization/Classification
- Text Classification Types
- Document Classification
- Bayesian Classifiers
- Naïve Bayes Classifiers
- The Naïve Bayes Classifier (Belief Net)
- Learning the Model
- Smoothing to Avoid Overfitting
- Stochastic Language Models
- Unigram and higher-order models
- Naïve Bayes via a class conditional language model = multinomial NB
- Using Multinomial Naive Bayes to Classify Text
- Naïve Bayes: Learning
- Multinomial Model
- Naïve Bayes: Classifying
- Apply Multinomial
- Naive Bayes: Time Complexity
- Underflow Prevention: log space
- Naïve Bayes example
- New example
- Evaluating Categorization
- Example: AutoYahoo!
- WebKB Experiment
- NB Model Comparison
- SpamAssassin
- Naïve Bayes on spam email
- Naive Bayes is Not So Naive
- Next couple of classes

01/13/19 CSCI 5417 - IR 2

Today 9/29
- Classification
- Naïve Bayes classification
- Unigram LM

Where we are...
- Basics of ad hoc retrieval
- Indexing
- Term weighting/scoring
- Cosine
- Evaluation
- Document classification
- Clustering
- Information extraction
- Sentiment/Opinion mining

Is this spam?

From: "" <[email protected]>
Subject: real estate is the only way...
gem oalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

Text Categorization Examples
Assign labels to each document or web-page:
- Labels are most often topics such as Yahoo-categories
  - finance, sports, news>world>asia>business
- Labels may be genres
  - editorials, movie-reviews, news
- Labels may be opinion
  - like, hate, neutral
- Labels may be domain-specific
  - "interesting-to-me" : "not-interesting-to-me"
  - "spam" : "not-spam"
  - "contains adult content" : "doesn't"
  - important to read now : not important

Categorization/Classification
Given:
- A description of an instance, $x \in X$, where $X$ is the instance language or instance space.
  - Issue for us is how to represent text documents
- And a fixed set of categories: $C = \{c_1, c_2, \ldots, c_n\}$
Determine:
- The category of $x$: $c(x) \in C$, where $c(x)$ is a categorization function whose domain is $X$ and whose range is $C$.
  - We want to know how to build categorization functions (i.e. "classifiers").

Text Classification Types
Those examples can be further classified by type
- Binary
  - Spam/not spam, contains adult content/doesn't
- Multiway
  - Business vs. sports vs. gossip
- Hierarchical
  - News > UK > Wales > Weather
- Mixture model
  - .8 basketball, .2 business

Document Classification
[Figure: document classification example]
- Test data: "planning language proof intelligence"
- Class groups: (AI), (Programming), (HCI)
- Classes: ML, Planning, Semantics, Garb.Coll., Multimedia, GUI
- Training data:
  - ML: learning intelligence algorithm reinforcement network ...
  - Planning: planning temporal reasoning plan language ...
  - Semantics: programming semantics language proof ...
  - Garb.Coll.: garbage collection memory optimization region ...

Bayesian Classifiers
Task: Classify a new instance $D$, based on a tuple of attribute values $D = \langle x_1, x_2, \ldots, x_n \rangle$, into one of the classes $c_j \in C$.

$$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)$$
$$= \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)}$$
$$= \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)$$

Naïve Bayes Classifiers
- $P(c_j)$
  - Can be estimated from the frequency of classes in the training examples.
- $P(x_1, x_2, \ldots, x_n \mid c_j)$
  - $O(|X|^n \cdot |C|)$ parameters
  - Could only be estimated if a very, very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption:
- Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities $P(x_i \mid c_j)$.

The Naïve Bayes Classifier (Belief Net)
[Figure: belief net with class node Flu and feature nodes X1 through X5, labeled fever, sinus, cough, runny nose, muscle-ache]
Conditional Independence Assumption: features detect term presence and are independent of each other given the class:

$$P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)$$

Learning the Model
[Figure: belief net with class node C and feature nodes X1 through X6]
First attempt: maximum likelihood estimates
- simply use the frequencies in the data

$$\hat{P}(c_j) = \frac{N(C = c_j)}{N}$$
$$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$$

Smoothing to Avoid Overfitting
Add-One smoothing:

$$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}$$

where $k$ is the # of values of $X_i$.

Stochastic Language Models
Models probability of generating strings (each word in turn) in the language (commonly all strings over $\Sigma$).
E.g., unigram model

Model M:
  0.2   the
  0.1   a
  0.01  man
  0.01  woman
  0.03  said
  0.02  likes
  …

  the    man    likes    the    woman
  0.2    0.01   0.02     0.2    0.01     (multiply)

P(s | M) = 0.00000008
(13.2.1)

Stochastic Language Models
Model probability of generating any string

  Model M1              Model M2
  0.2     the           0.2     the
  0.01    class         0.0001  class
  0.0001  sayst         0.03    sayst
  0.0001  pleaseth      0.02    pleaseth
  0.0001  yon           0.1     yon
  0.0005  maiden        0.01    maiden
  0.01    woman         0.0001  woman

        the    class    pleaseth    yon      maiden
  M1    0.2    0.01     0.0001      0.0001   0.0005
  M2    0.2    0.0001   0.02        0.1      0.01

P(s|M2) > P(s|M1)
(13.2.1)

Unigram and higher-order models
The word symbols on the original slide did not survive extraction; $w_1, \ldots, w_4$ stand in for them here.

$$P(w_1 w_2 w_3 w_4) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2)\, P(w_4 \mid w_1 w_2 w_3)$$

- Unigram Language Models: $P(w_1)\, P(w_2)\, P(w_3)\, P(w_4)$
  - Easy. Effective!
- Bigram (generally, n-gram) Language Models: $P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)\, P(w_4 \mid w_3)$
- Other Language Models
  - Grammar-based models (PCFGs), etc.
  - Probably not the first thing to try in IR
(13.2.1)

Naïve Bayes via a class conditional language model = multinomial NB
[Figure: belief net with class node Cat and word nodes w1 through w6]
Effectively, the probability of each class is done as a class-specific unigram language model

Using Multinomial Naive Bayes to Classify Text
- Attributes are text positions, values are words.
- Still too many possibilities
- Assume that classification is independent of the positions of the words
  - Use same parameters for each position
  - Result is bag of words model (over tokens not
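The add-one estimate and the unigram scoring from the slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`add_one_estimate`, `log_prob`, `classify`) are made up for this sketch, the uniform priors are an assumption rather than something given on the slides, and scores are summed in log space as the "Underflow Prevention: log space" slide title suggests.

```python
import math

# Word probabilities copied from the slides (only the words listed there).
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
M1 = {"the": 0.2, "class": 0.01, "sayst": 0.0001, "pleaseth": 0.0001,
      "yon": 0.0001, "maiden": 0.0005, "woman": 0.01}
M2 = {"the": 0.2, "class": 0.0001, "sayst": 0.03, "pleaseth": 0.02,
      "yon": 0.1, "maiden": 0.01, "woman": 0.0001}


def add_one_estimate(n_joint, n_class, k):
    """Add-one estimate: (N(Xi=xi, C=cj) + 1) / (N(C=cj) + k), k = number of values of Xi."""
    return (n_joint + 1) / (n_class + k)


def log_prob(tokens, model):
    """Sum of log P(w | model) over the tokens; a unigram model ignores word order."""
    return sum(math.log(model[w]) for w in tokens)


def classify(tokens, models, priors):
    """Return the label c that maximizes log P(c) + sum_i log P(w_i | c)."""
    return max(models, key=lambda c: math.log(priors[c]) + log_prob(tokens, models[c]))


# "the man likes the woman" under model M: 0.2 * 0.01 * 0.02 * 0.2 * 0.01
s = "the man likes the woman".split()
p_s = math.exp(log_prob(s, M))

# The same string scored under M1 and M2, as on the comparison slide.
t = "the class pleaseth yon maiden".split()
p_m1 = math.exp(log_prob(t, M1))
p_m2 = math.exp(log_prob(t, M2))

# Treat M1 and M2 as two class-conditional language models.
# Uniform priors are an assumption made for this illustration only.
best = classify(t, {"M1": M1, "M2": M2}, {"M1": 0.5, "M2": 0.5})

# A feature value never seen with a class still gets non-zero probability.
p_unseen = add_one_estimate(0, 10, 2)
```

A word missing from a model's table raises `KeyError` in this sketch; in a trained classifier that case is what smoothing is meant to handle.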