DOC PREVIEW
Stanford CS 276 - Introduction to Information Retrieval

This preview shows page 1-2-3-22-23-24-44-45-46 out of 46 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 46 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 46 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 46 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 46 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 46 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 46 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 46 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 46 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 46 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 46 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Slide 1Recap: Naïve Bayes classifiersThe rest of text classificationRecall: Vector Space RepresentationClassification Using Vector SpacesDocuments in a Vector SpaceTest Document of what class?Test Document = GovernmentAside: 2D/3D graphs can be misleadingUsing Rocchio for text classificationIllustration of Rocchio Text CategorizationDefinition of centroidRocchio PropertiesRocchio AnomalyRocchio classificationk Nearest Neighbor ClassificationExample: k=6 (6NN)Nearest-Neighbor Learning AlgorithmkNN Is Close to Optimalk Nearest NeighborkNN decision boundariesSimilarity MetricsIllustration of 3 Nearest Neighbor for Text Vector Space3 Nearest Neighbor vs. RocchioNearest Neighbor with Inverted IndexkNN: DiscussionkNN vs. Naive BayesBias vs. variance: Choosing the correct model capacityLinear classifiers and binary and multiclass classificationSeparation by HyperplanesLinear programming / PerceptronWhich Hyperplane?Slide 33Linear classifier: ExampleLinear ClassifiersRocchio is a linear classifierTwo-class Rocchio as a linear classifierNaive Bayes is a linear classifierA nonlinear problemHigh Dimensional DataMore Than Two ClassesSet of Binary Classifiers: Any ofSet of Binary Classifiers: One ofSummary: Representation of Text Categorization AttributesWhich classifier do I use for a given text classification problem?Resources for today’s lectureIntroduction to Information RetrievalIntroduction to Information Retrieval Introduction toInformation RetrievalCS276: Information Retrieval and Web SearchPandu Nayak and Prabhakar RaghavanLecture 11: Text Classification;Vector space classification[Borrows slides from Ray Mooney]Introduction to Information RetrievalIntroduction to Information Retrieval Recap: Naïve Bayes classifiersClassify based on prior weight of class and conditional parameter for what each word says:Training is done by counting and dividing:Don’t forget to smooth2€ cNB= argmaxcj∈Clog P(cj) + log P(xi| cj)i∈positions∑ ⎡ ⎣ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥€ P(cj) ←NcjN€ P(xk| cj) ←Tcjxk+ α[Tcjxi+ α ]xi∈V∑Introduction to Information RetrievalIntroduction to Information Retrieval 3The rest of text classificationToday:Vector space methods for Text ClassificationVector space classification using centroids (Rocchio)K Nearest NeighborsDecision boundaries, linear and nonlinear classifiersDealing with more than 2 classesLater in the courseMore text classificationSupport Vector MachinesText-specific issues in classificationIntroduction to Information RetrievalIntroduction to Information Retrieval 4Recall: Vector Space RepresentationEach document is a vector, one component for each term (= word).Normally normalize vectors to unit length.High-dimensional vector space:Terms are axes10,000+ dimensions, or even 100,000+Docs are vectors in this spaceHow can we do classification in this space?Sec.14.1Introduction to Information RetrievalIntroduction to Information Retrieval 5Classification Using Vector SpacesAs before, the training set is a set of documents, each labeled with its class (e.g., topic)In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector spacePremise 1: Documents in the same class form a contiguous region of spacePremise 2: Documents from different classes don’t overlap (much)We define surfaces to delineate classes in the spaceSec.14.1Introduction to Information RetrievalIntroduction to Information Retrieval 6Documents in a Vector SpaceGovernmentScienceArtsSec.14.1Introduction to Information RetrievalIntroduction to Information Retrieval 7Test Document of what class?GovernmentScienceArtsSec.14.1Introduction to Information RetrievalIntroduction to Information Retrieval 8Test Document = GovernmentGovernmentScienceArtsIs this similarityhypothesistrue ingeneral?Our main topic today is how to find good separatorsSec.14.1Introduction to Information RetrievalIntroduction to Information Retrieval Aside: 2D/3D graphs can be misleading9Sec.14.1Introduction to Information RetrievalIntroduction to Information Retrieval Using Rocchio for text classificationRelevance feedback methods can be adapted for text categorizationAs noted before, relevance feedback can be viewed as 2-class classificationRelevant vs. nonrelevant documentsUse standard tf-idf weighted vectors to represent text documents For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category.Prototype = centroid of members of classAssign test documents to the category with the closest prototype vector based on cosine similarity.10Sec.14.2Introduction to Information RetrievalIntroduction to Information Retrieval Illustration of Rocchio Text Categorization11Sec.14.2Introduction to Information RetrievalIntroduction to Information Retrieval Definition of centroidWhere Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d.Note that centroid will in general not be a unit vector even when the inputs are unit vectors.12 € r μ (c) =1| Dc|r v (d)d ∈Dc∑Sec.14.2Introduction to Information RetrievalIntroduction to Information Retrieval 13Rocchio Properties Forms a simple generalization of the examples in each class (a prototype).Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.Classification is based on similarity to class prototypes.Does not guarantee classifications are consistent with the given training data. Why not?Sec.14.2Introduction to Information RetrievalIntroduction to Information Retrieval 14Rocchio Anomaly Prototype models have problems with polymorphic (disjunctive) categories.Sec.14.2Introduction to Information RetrievalIntroduction to Information Retrieval Rocchio classificationRocchio forms a simple representation for each class: the centroid/prototype Classification is based on similarity to / distance from the prototype/centroidIt does not guarantee that classifications are consistent with the given training dataIt is little used outside text classificationIt has been used quite effectively for text classificationBut in general worse than Naïve BayesAgain, cheap to train and test documents15Sec.14.2Introduction to Information


View Full Document

Stanford CS 276 - Introduction to Information Retrieval

Documents in this Course
Load more
Download Introduction to Information Retrieval
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Introduction to Information Retrieval and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Introduction to Information Retrieval 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?