Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)CS276!Informa*on!Retrieval!and!Web!Search!Pandu!Nayak!and!Prabhakar!Raghavan!Lecture!7:!Scoring!and!results!assembly!Introduc)on*to*Informa)on*Retrieval* !! !!Lecture!6!–!I!introduced!a!bug!§ In!my!anxiety!to!avoid!taking!the!log!of!zero,!I!rewrote!as!!!!2!⎩⎨⎧>+=otherwise 0,0 tfif, tflog 1 10 t,dt,dt,dw⎩⎨⎧>+=otherwise 0,0 tfif),tf(1 log 10 t,dt,dt,dwIn fact this was unnecessary, since the zero case is treated specially above; net the FIRST version above is right.Introduc)on*to*Informa)on*Retrieval* !! !!Recap:!IJidf!weigh*ng!§ The!IJidf!weight!of!a!term!is!the!product!of!its!I!weight!and!its!idf!weight.!!§ Best!known!weigh*ng!scheme!in!informa*on!retrieval!§ Increases!with!the!number!of!occurrences!with in !a!document!§ Increases!with!the!rarity!of!the!term!in!the!collec*on!)df/(log)tflog1(w10,10,tdtNdt×+=Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Recap:!Queries!as!vectors!§ Key!idea!1:!Do!the!same!fo r!queries:!represent!them!as!vectors!in!the!space!§ Key!idea!2:!Rank!documents!according!to!their!proximity!to!the!query!in!this!sp ace!§ proximity!=!similarity!of!vectors!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Recap:!cosine(query,document)!∑∑∑====•=•=ViiViiViiidqdqddqqdqdqdq12121),cos(Dot product Unit vectors cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d. Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!This!lecture!§ Speeding!up!vector!space!ranking!§ PuVng!together!a!complete!search !system!§ Will!require!learning!about!!a!number!of!miscellaneous!topics!and!heuris*cs!Ch. 7Introduc)on*to*Informa)on*Retrieval* !! !!Compu*ng!cosine!scores!Sec. 6.3.3Introduc)on*to*Informa)on*Retrieval* !! !!Efficient!cosine!ranking!§ Find!the!K!docs!in!the!collec*on!“nearest”! to!the!query!⇒!K*largest!queryJdoc!cosines.!§ Efficient!ranking:!§ Compu*ng!a!single!cosine!efficiently.!§ Choosing!the!K*largest!cosine!values!efficiently.!§ Can!we!do!this!without!compu*ng!all!N!cosines?*Sec. 7.1Introduc)on*to*Informa)on*Retrieval* !! !!Efficient!cosine!ranking!§ What!we’re!doing!in!effect:!solving!the!KJnearest!neighbor!problem!for!a!query!vector!§ In!general,!we!do!not!know!how!to!do!this!!efficientl y!for!highJdimensional!spaces!§ But!it!is!solvable!for!short!queries,!and!standard!indexes!support!this!well!Sec. 7.1Introduc)on*to*Informa)on*Retrieval* !! !!Special!case!–!unweighted!q ueries!§ No!weigh*ng!on!query!terms!§ Assume!each!query!term!occurs!on ly!on ce!§ Then!for!ranking,!donʼt!need!to!normalize!query!vector!§ Slight!simplifica*on!of!algorithm!from!Lecture!6!Sec. 7.1Introduc)on*to*Informa)on*Retrieval* !! !!Compu*ng!the!K!largest!cosines:!selec*on!vs.!sor*ng!§ Typically!we!want!to!retrieve!the!top!K!docs!(in!the!cosine!ranking!for!the!query)!§ not!to!totally!order!all!docs!in!the!collec*on!§ Can!we!pick!off!docs!with!K!highest!cosines?!§ Let!J!=!number!of!docs!with!nonzero!cosines!§ We!seek!the!K!best!of!these!J!Sec. 7.1Introduc)on*to*Informa)on*Retrieval* !! !!Use!heap!for!selec*ng!top!K!§ Binary!tree!in!which!each!node’s!value!>!the!values!of!children!§ Takes!2J!opera*ons!to!construct,!then!each!of!K*“winners”!read!off!in!2log!J!steps.!§ For!J=1M,!K=100,!this!is!about!10%!of!the!cost!of!sor*ng.!1 .9 .3 .8 .3 .1 .1 Sec. 7.1Introduc)on*to*Informa)on*Retrieval* !! !!Boflenecks!§ Primary!computa*onal!bofleneck!in!scoring:!cosine!computa*on!§ Can!we!avoid!all!this!computa*on?!§ Yes,!but!may!some*mes!get!it!wrong!§ a!doc!not!in!the!top!K!may!creep!into!the!li st!of!K!output!docs!§ Is!this!such!a!bad!thing?!Sec. 7.1.1Introduc)on*to*Informa)on*Retrieval* !! !!Cosine!similarity!is!only!a!proxy!§ User!has!a!task!and!a!query!formula*on!§ Cosine!matches!docs!to!query!§ Thus!cosine!is!anyway!a!proxy!for!user!happiness!§ If!we!get!a!list!of!K!docs!“close”!to!the!top!K!by!cosine!measure,!should!be!ok!Sec. 7.1.1Introduc)on*to*Informa)on*Retrieval* !! !!Generic!approach!§ Find!a!set!A*!of!contenders,!with!K*<*|A|*<<*N*§ A*does!not!necessarily!contain!the!top!K,*but!has!many!docs!from!among!the!top!K*§ Return!the!top!K*docs!in!A*§ Think!of!A!as!pruning!nonJcontenders!§ The!same!approach!is!also! used!for!other!(nonJcosine)!scoring!func*ons!§ Will!look!at!several!schemes!following!this!approach!Sec. 7.1.1Introduc)on*to*Informa)on*Retrieval* !! !!Index!elimina*on!§ Basic!algorithm!cosi ne!computa*on!algorithm!on ly!considers!docs! containing!at!least!one!query!term!§ Take!this!further:!§ Only!consider!highJidf!query!terms!§ Only!consider!docs!containing!many!query!terms!Sec. 7.1.2Introduc)on*to*Informa)on*Retrieval* !! !!HighJidf!quer y!terms!only!§ For!a!query!such!as!catcher*in*the*rye*§ Only!accumulate!scores!from!catcher*and!rye*§ Intui*on:!in*and!the*contribute!lifle!to!the!scores!and!so!donʼt!alter!rankJordering!much!§ Benefit:!§ Pos*ngs!of!lowJid f!terms!have!many!docs!→!these!(many)!docs!get!eliminated!from!set!A*of!contenders*Sec. 7.1.2Introduc)on*to*Informa)on*Retrieval* !! !!Docs!containing!many!quer y!terms!§ Any!doc!with!at!least!one!query!term!is!a!candidate!for!the!top!K!output!list!§ For!mul*Jterm!queries,!only!compute!scores!for!docs !containing!several!of!the!query!terms!§ Say,!at!least!3!out!o f!4!§ Imposes!a!“som!conjunc*on”!on!queries!seen!on!web!search!engines!(early!Google)!§ Easy!to!implement!in!pos*ngs!traversal!Sec. 7.1.2Introduc)on*to*Informa)on*Retrieval* !! !!3!of!4!query!terms!Brutus Caesar Calpurnia 1 2 3 5 8 13 21 34 2 4 8 16 32 64 128 13 16 Antony 3 4 8 16 32 64 128 32 Scores only computed for docs 8, 16 and 32. Sec. 7.1.2Introduc)on*to*Informa)on*Retrieval* !! !!Champion!lists!§ Precompute!for!each!dic*onary!term!t,!the!r!docs!of!highest!weight!in!tʼs!pos*ngs!§ Call!this!the!champion!list!for!t!§ (aka!fancy!list!or!top!docs!for!t)!§ Note!that!r!has!to!be!chosen!at!index!build!*me!§ Thus,!itʼs!possible!that!r!<!K*§
View Full Document