Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 11: Text Classification; Vector space classification
[Borrows slides from Ray Mooney]

Recap: Naive Bayes classifiers
- Classify based on the prior weight of the class and the conditional parameter for what each word says:
  c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ]
- Training is done by counting and dividing:
  P(c_j) ← N_{c_j} / N
  P(x_k | c_j) ← (T_{c_j, x_k} + α) / Σ_{x_i ∈ V} (T_{c_j, x_i} + α)
- Don't forget to smooth.

The rest of text classification
- Today:
  - Vector space methods for text classification
    - Vector space classification using centroids (Rocchio)
    - k Nearest Neighbors
  - Decision boundaries, linear and nonlinear classifiers
  - Dealing with more than 2 classes
- Later in the course:
  - More text classification
  - Support Vector Machines
  - Text-specific issues in classification

Recall: Vector Space Representation (Sec. 14.1)
- Each document is a vector, one component for each term (= word).
- Normally we normalize vectors to unit length.
- High-dimensional vector space:
  - Terms are axes
  - 10,000+ dimensions, or even 100,000+
  - Docs are vectors in this space
- How can we do classification in this space?

Classification Using Vector Spaces (Sec. 14.1)
- As before, the training set is a set of documents, each labeled with its class (e.g., topic).
- In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space.
- Premise 1: Documents in the same class form a contiguous region of space.
- Premise 2: Documents from different classes don't overlap (much).
- We define surfaces to delineate classes in the space.

Documents in a Vector Space (Sec. 14.1)
[Figure: documents plotted in the vector space, grouped into Government, Science, and Arts regions]
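The counting-and-dividing recipe from the Naive Bayes recap above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation; the names `train_nb`, `classify_nb`, and the add-α smoothing parameter are my own:

```python
import math
from collections import Counter

def train_nb(docs, alpha=1.0):
    """Multinomial Naive Bayes trained by counting and dividing.

    docs: list of (token_list, label) pairs.
    Returns log P(c) and log P(x|c) with add-alpha smoothing."""
    vocab = {t for tokens, _ in docs for t in tokens}
    classes = {label for _, label in docs}
    n = len(docs)
    log_prior, log_cond = {}, {}
    for c in classes:
        class_docs = [tokens for tokens, label in docs if label == c]
        log_prior[c] = math.log(len(class_docs) / n)        # P(c) = N_c / N
        counts = Counter(t for tokens in class_docs for t in tokens)
        total = sum(counts[t] + alpha for t in vocab)       # denominator with smoothing
        log_cond[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return log_prior, log_cond

def classify_nb(tokens, log_prior, log_cond):
    """argmax_c [ log P(c) + sum_i log P(x_i|c) ]; unseen terms are skipped."""
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_cond[c][t] for t in tokens if t in log_cond[c]))
```

Working in log space, as in the recap formula, avoids floating-point underflow when many conditional probabilities are multiplied.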
Test Document of what class? (Sec. 14.1)
[Figure: a new test document plotted among the Government, Science, and Arts regions]

Test Document = Government (Sec. 14.1)
- Is this similarity hypothesis true in general?
- Our main topic today is how to find good separators.

Aside: 2D/3D graphs can be misleading (Sec. 14.1)
[Figure: projections of high-dimensional data into 2D/3D can distort distances]

Using Rocchio for text classification (Sec. 14.2)
- Relevance feedback methods can be adapted for text categorization.
  - As noted before, relevance feedback can be viewed as 2-class classification: relevant vs. nonrelevant documents.
- Use standard tf-idf weighted vectors to represent text documents.
- For the training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category.
  - Prototype = centroid of the members of the class
- Assign test documents to the category with the closest prototype vector, based on cosine similarity.

Illustration of Rocchio Text Categorization (Sec. 14.2)
[Figure: class regions induced by the class centroids]

Definition of centroid (Sec. 14.2)
  μ(c) = (1 / |D_c|) Σ_{d ∈ D_c} v(d)
- where D_c is the set of all documents that belong to class c, and v(d) is the vector space representation of d.
- Note that the centroid will in general not be a unit vector, even when the inputs are unit vectors.

Rocchio Properties (Sec. 14.2)
- Forms a simple generalization of the examples in each class (a prototype).
- The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
- Classification is based on similarity to the class prototypes.
- Does not guarantee that classifications are consistent with the given training data. Why not?
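The Rocchio procedure above (unit-normalize, average into a centroid per class, assign by cosine similarity to the nearest centroid) can be sketched as follows; the function names and sparse dict representation are illustrative choices, not from the slides:

```python
import math
from collections import defaultdict

def unit(v):
    """Normalize a sparse term->weight vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
    return {t: x / norm for t, x in v.items()}

def train_rocchio(docs):
    """Compute one centroid per class: mu(c) = (1/|D_c|) sum_{d in D_c} v(d).

    docs: list of (term->weight dict, label); inputs are unit-normalized first.
    Note the centroid itself is generally NOT a unit vector."""
    by_class = defaultdict(list)
    for vec, label in docs:
        by_class[label].append(unit(vec))
    centroids = {}
    for c, vecs in by_class.items():
        cen = defaultdict(float)
        for v in vecs:
            for t, x in v.items():
                cen[t] += x / len(vecs)
        centroids[c] = dict(cen)
    return centroids

def classify_rocchio(vec, centroids):
    """Assign to the class whose centroid has the highest cosine similarity."""
    v = unit(vec)
    def cos(c):
        mu = unit(centroids[c])
        return sum(x * mu.get(t, 0.0) for t, x in v.items())
    return max(centroids, key=cos)
```

Because cosine similarity is length-insensitive, skipping the 1/|D_c| averaging would give identical classifications, as the Rocchio Properties slide notes.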
Rocchio Anomaly (Sec. 14.2)
- Prototype models have problems with polymorphic (disjunctive) categories.
[Figure: a class split into two separate clusters whose centroid falls between them]

Rocchio classification (Sec. 14.2)
- Rocchio forms a simple representation for each class: the centroid/prototype.
- Classification is based on similarity to / distance from the prototype/centroid.
- It does not guarantee that classifications are consistent with the given training data.
- It is little used outside text classification.
- It has been used quite effectively for text classification,
  - but is in general worse than Naive Bayes.
- Again, cheap to train and to test documents.

k Nearest Neighbor Classification (Sec. 14.3)
- kNN = k Nearest Neighbor
- To classify a document d into class c:
  - Define the k-neighborhood N as the k nearest neighbors of d
  - Count the number i of documents in N that belong to c
  - Estimate P(c|d) as i/k
  - Choose as class argmax_c P(c|d)  [= majority class]

Example: k=6 (6NN) (Sec. 14.3)
[Figure: a test document and its 6 nearest neighbors among Government, Science, and Arts documents. P(science | test doc)?]
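The four kNN steps above (find the k nearest neighbors, count class members, estimate P(c|d) = i/k, take the majority class) can be sketched directly; `knn_classify` and the sparse-vector cosine helper are illustrative, not from the slides:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse term->weight vectors."""
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(vec, train, k=3):
    """kNN: take the k most similar training docs, count members of each
    class (so P(c|d) is estimated as count/k), and return the majority class.

    train: list of (term->weight dict, label) pairs."""
    neighbors = sorted(train, key=lambda dv: cosine(vec, dv[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

This brute-force scan over all training documents is O(|D|) per query; real systems use an inverted index or other data structures to speed up the neighbor search.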
Nearest-Neighbor Learning Algorithm (Sec. 14.3)
- Learning is just storing the representations of the training examples in D.
- Testing instance x (under 1NN):
  - Compute the similarity between x and all examples in D.
  - Assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes.
- Also called:
  - Case-based learning
  - Memory-based learning
  - Lazy learning
- Rationale of kNN: the contiguity hypothesis.

kNN Is Close to Optimal (Sec. 14.3)
- Cover and Hart (1967): asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [the error rate of a classifier knowing the model that generated the data].
- In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
- Intuition: assume the query point coincides with a training point. Both the query point and the training point contribute error → 2 times the Bayes rate.

k Nearest Neighbor (Sec. 14.3)
- Using only the closest example (1NN) to determine the class is subject to errors due to:
  - A single atypical example.
  - Noise (i.e., an error) in the category label of a single training example.
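The noise sensitivity of 1NN can be seen in a tiny synthetic example: one mislabeled training point flips the 1NN decision, while k=3 outvotes it. The data and the Euclidean-distance `knn` helper here are invented purely for illustration:

```python
import math

def knn(point, train, k):
    """Majority vote over the k nearest (Euclidean) training points."""
    nearest = sorted(train, key=lambda pl: math.dist(point, pl[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# A tight cluster of "A" points containing one mislabeled "B" point,
# plus a genuine "B" cluster far away.
train = [((0.0, 0.0), "A"), ((0.1, 0.0), "A"), ((0.0, 0.1), "A"),
         ((0.05, 0.05), "B"),                     # noise: wrong label
         ((5.0, 5.0), "B"), ((5.1, 5.0), "B")]

# A query inside the "A" cluster: 1NN lands on the noisy point and
# answers "B"; 3NN outvotes it and answers "A".
```

Choosing a larger (odd, to avoid ties in 2-class problems) k trades this noise robustness against blurring genuine fine-grained class boundaries.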