Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)CS276:!Informa*on!Retrieval!and! Web!Search!Pandu!Nayak!and!Prabhakar!Raghavan!Lecture!6:!Scoring,!Term!Weigh*ng!and!the!Vector!Space!Model!Introduc)on*to*Informa)on*Retrieval* !! !!Recap!of!lecture!5!§ Collec*on!and!vocab ul ary!sta*s*cs:!Heapsʼ!and!Zipfʼs!laws!§ Dic*onary!compression!for!Boolean!indexes!§ Dic*onary!string,!blocks,!front!coding!§ Pos*ngs!compression:!Gap!encoding,!prefixQunique!codes!§ VariableQByte!and!Gamma!codes!collection (text, xml markup etc) 3,600.0 collection (text) 960.0 Term-doc incidence matrix 40,000.0 postings, uncompressed (32-bit words) 400.0 postings, uncompressed (20 bits) 250.0 postings, variable byte encoded 116.0 postings, γ-encoded 101.0 MBIntroduc)on*to*Informa)on*Retrieval* !! !!This!lecture;!IIR!Sec*ons!6.2Q6.4.3!§ Ranked!retrieval!§ Scoring!documents!§ Term!frequency!§ Collec*on!sta*s*cs!§ Weigh*ng!schemes!§ Vector!space!scoring!Introduc)on*to*Informa)on*Retrieval* !! !!Ranked!retrieval!§ Thus!far,!our!queries!have!all!been!Boolean.!§ Documents!either!match!or!donʼt.!§ Good!for!expert!users!with!precise!understanding!of!their!needs!and!the!collec*on.!§ Also!good!for!applica*ons:!Applica*ons!can !easil y!consume!1000s!of!results.!§ Not!good!for!the!majority!of!users.!§ Most!users!incapable!of!wri*ng!Boolean!queries!(or!they!are,!but!they!think!itʼs!too!much!work).!§ Most!users!donʼt!want!to!wad e!through!1000s!of!results.!§ This!is!par*cularly!true!of!web!search.!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Prob lem!with !Bool ean!search:!feast!or!famine!§ Boolean!queries!o]en!result!in!either!too!few!(=0)!or!too!many!(1000s)!results.!§ Query!1:!“standard*user*dlink*650”!→!200,000!hits!§ Query!2:!“standard*user*dlink*650*no*card*found”:!0!hits!§ It!takes!a!lot!of!skill!to!come!up!with!a!query!that!produces!a!manageable!number!of!hits.!§ AND!gives!too!few;!OR!gives!too!many!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Ranked!retrieval!models!§ Rather!than!a!set!of!documents!sa*sfying!a!query!expression,!in!ranked!retrieval,!the!system!returns!an!ordering!over!the!(top)!documents!in!the!collec*on!for!a!query!§ Free!text!queries:!Rather!than!a!query!language!of!operators!and! expressions,!the!userʼs!query!is!just!one!or!more!words!in!a!hu man!l anguage!§ In!principle,!there!are!two!separate!choices!here,!but!in!prac*ce,!ranked!retrieval!h as!no rmally!been!associated!with!free!text!queries!and!vice!versa!!6!Introduc)on*to*Informa)on*Retrieval* !! !!Feast!or!famine:!not!a!problem!in!ranked!retrieval!§ When!a!system!produces!a!ranked!result!set,!large!result!sets!are!not!an!issue!§ Indeed,!the!size!of!the!result!set!is!not!an!issue!§ We!just!show!the!top!k*(!≈!10)!results!§ We!donʼt!overwhelm!the!user!§ Premise:!the!ranking!algorithm!works!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Scoring!as!the!basis!of!ranked!retrieval!§ We!wish!to!return!in!order!the!documents!most!likely!to!be!useful!to!the!searcher!§ How!can!we!rankQorder!the!documents!in!the!collec*on!with!respect!to!a!query?!§ Assign!a!score!–!say!in![0,!1]!–!to!each!document!§ This!score!measures!how!well !docu ment!and!query!“match”.!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!QueryQdocument!matchin g!scores!§ We!need!a!way!of! assigni ng!a!score!to!a!query/document!pair!§ Letʼs!start!with!a!oneQterm!query!§ If!the!query!term!does!not!occur!in!the!document:!score!should!be!0!§ The!more!frequent!the!query!term!in!the!document,!the!higher!the!score!(should!be)!§ We!will!look!at!a!number!of!alterna*ves!for!this.!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Take!1:!Jaccard!coefficient!§ Recall!from!Lecture!3:!A!commonly!used!measure!of!overlap!of!two!sets!A!and!B*§ jaccard(A,B)*=*|A*∩*B|!/!|A*∪!B|!§ jaccard(A,A)*=*1!§ jaccard(A,B)*=*0*if!A*∩*B*=*0!§ A!and!B!donʼt!have!to!b e!the!same!size.!§ Always!assigns!a!number!between!0!and!1.!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Jaccard!coefficient:!Scoring!example!§ What!is!the!queryQdocument!match!score!that!the!Jaccard!coefficient!computes!for!each!of!the!two!documents!below?!§ Query:!ides*of*march*§ Document!1:!caesar*died*in*march*§ Document!2:!the*long*march!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Issues!with!Jaccard!for!scoring!§ It!doesnʼt!consider!term*frequency*(how!many!*mes!a!term!occurs!in!a!document)!§ Rare!terms!in!a!collec*on!are!more!informa*ve!than!frequent!terms.!Jaccard!doesnʼt!consider!this!informa*on!§ We!need!a!more!sophis*cated!way!of!normalizing!for!length!§ Later!in!this!lecture,!weʼll!use!!§ .!.!.!instead!of!|A!∩!B|/|A!∪!B|!(Jaccard)!for!length!normaliza*on.!| B A|/| B A| Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Recall!(Lecture!1):!Binary!termQdocument!incidence!matrix!Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello MacbethAntony1 1 0 0 0 1Brutus1 1 0 1 0 0Caesar1 1 0 1 1 1Calpurnia0 1 0 0 0 0Cleopatra1 0 0 0 0 0mercy1 0 1 1 1 1worser1 0 1 1 1 0Each document is represented by a binary vector ∈ {0,1}|V| Sec. 6.2Introduc)on*to*Informa)on*Retrieval* !! !!TermQdocument!count!matrices!§ Consider!the!number!of!occurrences!of!a!term!in!a!document:!!§ Each!document!is!a!count!vector!in!ℕv:!a!column!below!!Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello MacbethAntony157 73 0 0 0 0Brutus4 157 0 1 0 0Caesar232 227 0 2 1 1Calpurnia0 10 0 0 0 0Cleopatra57 0 0 0 0 0mercy2 0 3 5 5 1worser2 0 1 1 1 0Sec. 6.2Introduc)on*to*Informa)on*Retrieval* !! !!Bag*of*words*model!§ Vector!representa*on!doesnʼt!consider!the!ordering!of!words!in!a!document!§ John*is*quicker*than*Mary*and!Mary*is*quicker*than*John!have!the!same!vectors!§ This!is!called!the!bag!of!words!model.!§ In!a!sense,!this!is!a!step!back:!The!posi*onal!index!was!able!to!dis*nguish!these!two!documents .!§ We!will!look!at!“recovering”!posi*onal!info rma*on!later!in!this!course.!§ For!now:!bag!of!words!model!Introduc)on*to*Informa)on*Retrieval* !! !!Term!frequency!o!§
View Full Document