Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan

Lecture 11: Text Classification; Vector space classification
[Borrows slides from Ray Mooney]

Contents:
- Recap: Naïve Bayes classifiers
- The rest of text classification
- Recall: Vector Space Representation
- Classification Using Vector Spaces
- Documents in a Vector Space
- Test Document of what class?
- Test Document = Government
- Aside: 2D/3D graphs can be misleading
- Using Rocchio for text classification
- Illustration of Rocchio Text Categorization
- Definition of centroid
- Rocchio Properties
- Rocchio Anomaly
- Rocchio classification
- k Nearest Neighbor Classification
- Example: k=6 (6NN)
- Nearest-Neighbor Learning Algorithm
- kNN Is Close to Optimal
- k Nearest Neighbor
- kNN decision boundaries
- Similarity Metrics
- Illustration of 3 Nearest Neighbor for Text Vector Space
- 3 Nearest Neighbor vs. Rocchio
- Nearest Neighbor with Inverted Index
- kNN: Discussion
- kNN vs. Naive Bayes
- Bias vs. variance: Choosing the correct model capacity
- Linear classifiers and binary and multiclass classification
- Separation by Hyperplanes
- Linear programming / Perceptron
- Which Hyperplane?
- Linear classifier: Example
- Linear Classifiers
- Rocchio is a linear classifier
- Two-class Rocchio as a linear classifier
- Naive Bayes is a linear classifier
- A nonlinear problem
- High Dimensional Data
- More Than Two Classes
- Set of Binary Classifiers: Any of
- Set of Binary Classifiers: One of
- Summary: Representation of Text Categorization Attributes
- Which classifier do I use for a given text classification problem?
- Resources for today's lecture

Slide 2: Recap: Naïve Bayes classifiers
- Classify based on the prior weight of the class and the conditional parameter for what each word says:
  $c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in \text{positions}} \log P(x_i \mid c_j) \Big]$
- Training is done by counting and dividing:
  $P(c_j) \leftarrow \dfrac{N_{c_j}}{N}$
  $P(x_k \mid c_j) \leftarrow \dfrac{T_{c_j x_k} + \alpha}{\sum_{x_i \in V} \left( T_{c_j x_i} + \alpha \right)}$
- Don't forget to smooth

Slide 3: The rest of text classification
- Today:
  - Vector space methods for Text Classification
    - Vector space classification using centroids (Rocchio)
    - K Nearest Neighbors
  - Decision boundaries, linear and nonlinear classifiers
  - Dealing with more than 2 classes
- Later in the course:
  - More text classification
  - Support Vector Machines
  - Text-specific issues in classification

Slide 4: Recall: Vector Space Representation (Sec. 14.1)
- Each document is a vector, one component for each term (= word).
- Normally normalize vectors to unit length.
- High-dimensional vector space:
  - Terms are axes
  - 10,000+ dimensions, or even 100,000+
  - Docs are vectors in this space
- How can we do classification in this space?

Slide 5: Classification Using Vector Spaces (Sec. 14.1)
- As before, the training set is a set of documents, each labeled with its class (e.g., topic)
- In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space
- Premise 1: Documents in the same class form a contiguous region of space
- Premise 2: Documents from different classes don't overlap (much)
- We define surfaces to delineate classes in the space

Slide 6: Documents in a Vector Space (Sec. 14.1)
[Figure: documents plotted in the vector space, grouped into Government, Science, and Arts classes]

Slide 7: Test Document of what class? (Sec. 14.1)
[Figure: the same plot with an unlabeled test document added]

Slide 8: Test Document = Government (Sec. 14.1)
[Figure: the test document falls in the Government region and is assigned that class]
- Is this similarity hypothesis true in general?
- Our main topic today is how to find good separators

Slide 9: Aside: 2D/3D graphs can be misleading (Sec. 14.1)
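The Naïve Bayes recap above (count, divide, smooth, then take the argmax of summed log probabilities) can be sketched in code. This is a minimal illustration, not the lecture's reference implementation; the function names, the toy corpus, and the choice of α = 1 are assumptions for the example.

```python
from collections import Counter, defaultdict
from math import log

ALPHA = 1.0  # add-alpha smoothing (alpha = 1 is Laplace smoothing)

def train_nb(labeled_docs):
    """Estimate P(c) = N_c / N and term counts T_{c,t} by counting."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)   # term_counts[c][t] = T_{c,t}
    vocab = set()
    for doc, c in labeled_docs:
        class_counts[c] += 1
        for t in doc.split():
            term_counts[c][t] += 1
            vocab.add(t)
    n = sum(class_counts.values())
    prior = {c: cnt / n for c, cnt in class_counts.items()}
    return prior, term_counts, vocab

def classify_nb(doc, prior, term_counts, vocab):
    """c_NB = argmax_c [log P(c) + sum_i log P(x_i | c)], smoothed."""
    def score(c):
        denom = sum(term_counts[c].values()) + ALPHA * len(vocab)
        return log(prior[c]) + sum(
            log((term_counts[c][t] + ALPHA) / denom)
            for t in doc.split() if t in vocab)  # skip out-of-vocabulary terms
    return max(prior, key=score)

# Hypothetical toy training set
train = [("tax law government", "Government"),
         ("government election vote", "Government"),
         ("quantum physics energy", "Science")]
prior, tc, vocab = train_nb(train)
```

Working in log space avoids floating-point underflow from multiplying many small probabilities, which is why the argmax is over a sum of logs rather than a product.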
Slide 10: Using Rocchio for text classification (Sec. 14.2)
- Relevance feedback methods can be adapted for text categorization
  - As noted before, relevance feedback can be viewed as 2-class classification (relevant vs. nonrelevant documents)
- Use standard tf-idf weighted vectors to represent text documents
- For the training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category
  - Prototype = centroid of members of class
- Assign test documents to the category with the closest prototype vector based on cosine similarity

Slide 11: Illustration of Rocchio Text Categorization (Sec. 14.2)
[Figure: class regions induced by distance to each class centroid]

Slide 12: Definition of centroid (Sec. 14.2)
  $\vec{\mu}(c) = \dfrac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$
- Where $D_c$ is the set of all documents that belong to class $c$ and $\vec{v}(d)$ is the vector space representation of $d$.
- Note that the centroid will in general not be a unit vector even when the inputs are unit vectors.

Slide 13: Rocchio Properties (Sec. 14.2)
- Forms a simple generalization of the examples in each class (a prototype).
- The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
- Classification is based on similarity to class prototypes.
- Does not guarantee classifications are consistent with the given training data.
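The Rocchio procedure above (unit-normalize training vectors, average them into a per-class centroid, assign a test document to the class with the most cosine-similar centroid) can be sketched as follows. The corpus and class names are invented for illustration, and raw term counts stand in for the tf-idf weights the slides assume.

```python
from collections import Counter
from math import sqrt

def unit(vec):
    """Scale a sparse term-weight vector to unit Euclidean length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def centroid(docs):
    """Mean of the unit-normalized document vectors of one class."""
    total = Counter()
    for d in docs:
        for t, w in unit(Counter(d.split())).items():
            total[t] += w
    return {t: w / len(docs) for t, w in total.items()}

def cosine(u, v):
    # Centroids are generally not unit length, so renormalize here.
    u, v = unit(u), unit(v)
    return sum(u[t] * v.get(t, 0.0) for t in u)

# Hypothetical training documents for two classes
training = {
    "Government": ["tax law government policy", "government election law"],
    "Science":    ["physics quantum energy", "energy experiment physics"],
}
centroids = {c: centroid(docs) for c, docs in training.items()}

def rocchio_classify(doc):
    """Assign doc to the class whose centroid is most cosine-similar."""
    vec = Counter(doc.split())
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))
```

Note that training touches each document once and classification computes one similarity per class, which is why Rocchio is cheap to train and test.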
- Why not?

Slide 14: Rocchio Anomaly (Sec. 14.2)
- Prototype models have problems with polymorphic (disjunctive) categories.

Slide 15: Rocchio classification (Sec. 14.2)
- Rocchio forms a simple representation for each class: the centroid/prototype
- Classification is based on similarity to / distance from the prototype/centroid
- It does not guarantee that classifications are consistent with the given training data
- It is little used outside text classification
  - It has been used quite effectively for text classification
  - But it is in general worse than Naïve Bayes
- Again, it is cheap to train and to test documents
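The Rocchio anomaly with disjunctive categories can be made concrete with a tiny 2-D example (invented coordinates, Euclidean distance rather than cosine, for simplicity). Class A consists of two clusters far apart, so its centroid lands in the empty space between them; a point sitting inside one of A's own clusters then ends up closer to class B's centroid, even though its nearest neighbors are all in A.

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical 2-D training data: class A is disjunctive (two clusters,
# near x = 0 and x = 10); class B is one cluster near A's left cluster.
training = {
    "A": [(0, 0), (1, 0), (9, 0), (10, 0)],
    "B": [(0, 2), (2, 2)],
}

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

centroids = {c: centroid(pts) for c, pts in training.items()}
# A's centroid is (5, 0), between its two clusters; B's is (1, 2).

def rocchio(p):
    """Assign p to the class with the nearest centroid."""
    return min(centroids, key=lambda c: dist(p, centroids[c]))

def nn_1(p):
    """Assign p to the class of the nearest individual training point."""
    return min((dist(p, q), c)
               for c, pts in training.items()
               for q in pts if q != p)[1]
```

Here `rocchio((0, 0))` returns "B" even though (0, 0) is a class-A training point (distance 5 to A's centroid vs. about 2.24 to B's), while `nn_1((0, 0))` returns "A". This is exactly why Rocchio does not guarantee consistency with the training data, and why nearest-neighbor methods (next slides) handle polymorphic classes better.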