CU-Boulder CSCI 5417 - Lecture 11

School name University of Colorado at Boulder

Course Csci 5417- Information Retrieval Systems

Pages 37

CSCI 5417 Information Retrieval Systems Jim Martin
Today 9/29
Where we are...
Is this spam?
Text Categorization Examples
Categorization/Classification
Text Classification Types
Document Classification
Bayesian Classifiers
Naïve Bayes Classifiers
The Naïve Bayes Classifier (Belief Net)
Learning the Model
Smoothing to Avoid Overfitting
Stochastic Language Models
Slide 15
Unigram and higher-order models
Naïve Bayes via a class conditional language model = multinomial NB
Using Multinomial Naive Bayes to Classify Text
Naïve Bayes: Learning
Multinomial Model
Naïve Bayes: Classifying
Apply Multinomial
Naive Bayes: Time Complexity
Underflow Prevention: log space
Naïve Bayes example
Slide 26
Slide 27
New example
Evaluating Categorization
Example: AutoYahoo!
WebKB Experiment
NB Model Comparison
Slide 33
SpamAssassin
Naïve Bayes on spam email
Naive Bayes is Not So Naive
Next couple of classes

CSCI 5417
Information Retrieval Systems
Jim Martin
Lecture 11
9/29/2011

01/13/19 CSCI 5417 - IR 2

Today 9/29
- Classification
- Naïve Bayes classification
- Unigram LM

Where we are...
- Basics of ad hoc retrieval
  - Indexing
  - Term weighting/scoring
  - Cosine
  - Evaluation
- Document classification
- Clustering
- Information extraction
- Sentiment/Opinion mining

Is this spam?
From: "" <[email protected]>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

Text Categorization Examples
Assign labels to each document or web-page:
- Labels are most often topics such as Yahoo-categories
  - finance, sports, news>world>asia>business
- Labels may be genres
  - editorials, movie-reviews, news
- Labels may be opinion
  - like, hate, neutral
- Labels may be domain-specific
  - "interesting-to-me" : "not-interesting-to-me"
  - "spam" : "not-spam"
  - "contains adult content" : "doesn't"
  - important to read now : not important

Categorization/Classification
Given:
- A description of an instance, $x \in X$, where $X$ is the instance language or instance space.
  - The issue for us is how to represent text documents.
- And a fixed set of categories: $C = \{c_1, c_2, \ldots, c_n\}$
Determine:
- The category of $x$: $c(x) \in C$, where $c(x)$ is a categorization function whose domain is $X$ and whose range is $C$.
  - We want to know how to build categorization functions (i.e. "classifiers").

Text Classification Types
Those examples can be further classified by type:
- Binary
  - Spam/not spam, contains adult content/doesn't
- Multiway
  - Business vs. sports vs. gossip
- Hierarchical
  - News > UK > Wales > Weather
- Mixture model
  - .8 basketball, .2 business

Document Classification
[Figure: classes, training data, and test data for document classification]
Classes:
- (AI): ML, Planning
- (Programming): Semantics, Garb.Coll.
- (HCI): Multimedia, GUI
- ...
Training data:
- ML: learning intelligence algorithm reinforcement network ...
- Planning: planning temporal reasoning plan language ...
- Semantics: programming semantics language proof ...
- Garb.Coll.: garbage collection memory optimization region ...
Test data:
- "planning language proof intelligence"

Bayesian Classifiers
Task: Classify a new instance $D$, based on a tuple of attribute values $D = \langle x_1, x_2, \ldots, x_n \rangle$, into one of the classes $c_j \in C$.

$$c_{MAP} = \operatorname*{argmax}_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)$$

$$= \operatorname*{argmax}_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)}$$

$$= \operatorname*{argmax}_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)$$

Naïve Bayes Classifiers
- $P(c_j)$
  - Can be estimated from the frequency of classes in the training examples.
- $P(x_1, x_2, \ldots, x_n \mid c_j)$
  - $O(|X|^n \cdot |C|)$ parameters
  - Could only be estimated if a very, very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption:
- Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities $P(x_i \mid c_j)$.

The Naïve Bayes Classifier (Belief Net)
[Figure: belief net with class node Flu and feature nodes X1 ... X5: fever, sinus, cough, runny nose, muscle-ache]
Conditional Independence Assumption: features detect term presence and are independent of each other given the class:

$$P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)$$

Learning the Model
[Figure: belief net with class node C and feature nodes X1 ... X6]
First attempt: maximum likelihood estimates
- Simply use the frequencies in the data:

$$\hat{P}(c_j) = \frac{N(C = c_j)}{N}$$

$$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$$

Smoothing to Avoid Overfitting
Add-one smoothing:

$$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}$$

where $k$ is the number of values of $X_i$.

Stochastic Language Models
Models probability of generating strings (each word in turn) in the language (commonly all strings over ∑). E.g., unigram model:

Model M
  0.2   the
  0.1   a
  0.01  man
  0.01  woman
  0.03  said
  0.02  likes
  …

  the    man    likes   the    woman
  0.2    0.01   0.02    0.2    0.01     (multiply)

P(s | M) = 0.00000008
13.2.1

Stochastic Language Models
Model probability of generating any string.

  Model M1             Model M2
  0.2     the          0.2     the
  0.01    class        0.0001  class
  0.0001  sayst        0.03    sayst
  0.0001  pleaseth     0.02    pleaseth
  0.0001  yon          0.1     yon
  0.0005  maiden       0.01    maiden
  0.01    woman        0.0001  woman

  s:    the    class    pleaseth   yon      maiden
  M1:   0.2    0.01     0.0001     0.0001   0.0005
  M2:   0.2    0.0001   0.02       0.1      0.01

P(s|M2) > P(s|M1)
13.2.1

Unigram and higher-order models
By the chain rule, for a four-word string $w_1 w_2 w_3 w_4$:

$$P(w_1 w_2 w_3 w_4) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2)\, P(w_4 \mid w_1 w_2 w_3)$$

- Unigram Language Models: Easy. Effective!

$$P(w_1 w_2 w_3 w_4) = P(w_1)\, P(w_2)\, P(w_3)\, P(w_4)$$

- Bigram (generally, n-gram) Language Models

$$P(w_1 w_2 w_3 w_4) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)\, P(w_4 \mid w_3)$$

- Other Language Models
  - Grammar-based models (PCFGs), etc.
  - Probably not the first thing to try in IR
13.2.1

Naïve Bayes via a class conditional language model = multinomial NB
[Figure: belief net with class node Cat and word nodes w1 ... w6]
Effectively, the probability of each class is computed with a class-specific unigram language model.

Using Multinomial Naive Bayes to Classify Text
- Attributes are text positions, values are words.
- Still too many possibilities
- Assume that classification is independent of the positions of the words
  - Use same parameters for each position
  - Result is bag of words model (over tokens not
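
The unigram language model and the multinomial Naïve Bayes classifier described in these slides can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the course's implementation: the function names (`unigram_prob`, `train_multinomial_nb`, `classify`) and the toy training documents are hypothetical, the smoothing constant k is taken to be the vocabulary size, words unseen in training are simply skipped, and scores are summed in log space to avoid underflow. The model `M` uses the word probabilities from the unigram example slide.

```python
import math
from collections import Counter, defaultdict


def unigram_prob(words, model):
    """P(s | M) under a unigram model: the product of per-word probabilities."""
    p = 1.0
    for w in words:
        p *= model[w]
    return p


def train_multinomial_nb(labeled_docs):
    """Estimate class priors and per-class word counts from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_counts.values())
    # Maximum likelihood prior: N(C = c) / N
    priors = {c: n / n_docs for c, n in class_counts.items()}
    return priors, word_counts, vocab


def classify(tokens, priors, word_counts, vocab):
    """Return argmax_c [log P(c) + sum_i log P(w_i | c)] with add-one smoothing."""
    k = len(vocab)  # assumption: k is the vocabulary size
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for w in tokens:
            if w not in vocab:
                continue  # assumption: ignore words never seen in training
            score += math.log((word_counts[c][w] + 1) / (total + k))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Unigram model M from the slide's example.
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
print(unigram_prob("the man likes the woman".split(), M))

# Hypothetical toy training set, for illustration only.
toy_docs = [
    ("buy real estate now".split(), "spam"),
    ("order ebook now".split(), "spam"),
    ("lecture notes posted".split(), "not-spam"),
    ("homework due now".split(), "not-spam"),
]
priors, word_counts, vocab = train_multinomial_nb(toy_docs)
print(classify("real estate ebook".split(), priors, word_counts, vocab))
```

Because each class keeps its own word counts, `classify` is effectively scoring the document under one smoothed unigram language model per class and adding the log prior, which is the class-conditional language model view of multinomial NB given above.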