Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)Introducing!ranked!retrieval!Introduc)on*to*Informa)on*Retrieval* !! !!Ranked!retrieval!§ Thus!far,!our!queries!have!all!been!Boolean.!§ Documents!either!match!or!don’t.!§ Good!for!expert!users!with!precise!understanding!of!their!needs!and!the!collec*on.!§ Also!good!for!applica*ons:!Applica*ons!can!easily!consume!1000s!of!results.!§ Not!good!for!the!majority!of!users.!§ Most!users!incapable!of!wri*ng!Boolean!queries!(or!they!are,!but!they!think!it’s!too!much!work).!§ Most!users!don’t!want!to!wade!through!1000s!of!results.!§ This!is!par*cularly!true!of!web!search.!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Problem!with!Boolean!search:!feast!or!famine!§ Boolean!queries!oNen!result!in!either!too!few!(≈0)!or!too!many!(1000s)!results.!§ Query!1:!“standard*user*dlink*650”!→!200,000!hits!§ Query!2:!“standard*user*dlink*650*no*card*found”!→!0!hits!§ It!takes!a!lot!of!skill!to!come!up!with!a!query!that!produces!a!manageable!number!of!hits.!§ AND!gives!too!few;!OR!gives!too!many!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Ranked!retrieval!models!§ Rather!than!a!set!of!documents!sa*sfying!a!query!expression,!i n !ranked!retrieval!models,!the!system!returns!an!ordering!over!the!(top)!documents!in!the!collec*on!with!respect!to!a!query!§ Free!text!queries:!Rather!than!a!query!language!of!operators!and!expressions,!the!user’s!query!is!just!one!or!more!words!in!a!human!language!§ In!principle,!there!are!two!separate!choices!here,!but!in!prac*ce,!ranked!retrieval!models!have!normally!been!associated!with!free!text!queries!and!vice!versa!!4!Introduc)on*to*Informa)on*Retrieval* !! !!Feast!or!famine:!not!a!problem!in!ranked!retrieval!§ When!a!system!produces!a!ranked!result!set,!large!result!sets!are!not!an!issue!§ Indeed,!the!size!of!the!result!set!is!not!an!issue!§ We!just!show!the!top!k*(!≈!10)!results!§ We!don’t!overwhelm!the!user!§ Premise:!the!ranking!algorithm!works!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Scoring!as!the!basis!of!ranked!retrieval!§ We!wish!to!return!in!order!the!documents!most!likely!to!be!useful!to!the!searcher!§ How!can!we!rank]order!the!documents!in!the!collec*on!with!respect!to!a!query?!§ Assign!a!score!–!say!in![0,!1]!–!to!each!document!§ This!score!measures!how!well!document!and!query!“match”.!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Query]document!matching!scores!§ We!need!a!way!of!assigning!a!score!to!a!query/document!pair!§ Let’s!start!with!a!one]term!query!§ If!the!query!term!does!not!occur!in!the!document:!score!should!be!0!§ The!more!frequent!the!query!term!in!the!document,!the!higher!the!score!(should!be)!§ We!will!look!at!a!number!of!alterna*ves!for!this!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)Introducing!ranked!retrieval!Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)Scoring!with!the!Jaccard!coefficient!Introduc)on*to*Informa)on*Retrieval* !! !!Take!1:!Jaccard!coefficient!§ A!commonly!used!measure!of!overlap!of!two!sets!A!and!B*is!the!Jaccard!coefficient*§ jaccard(A,B)*=*|A*∩*B|!/!|A*∪!B|!§ jaccard(A,A)*=*1!§ jaccard(A,B)*=*0*if!A*∩*B*=*0!§ A!and!B!don’t!have!to!be!the!same!size.!§ Always!assigns!a!number!between!0!and!1.!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Jaccard!coefficient:!Scoring!example!§ What!is!the!query]document!match!score!that!the!Jaccard!coefficient!computes!for!each!of!the!two!documents!below?!§ Query:!ides*of*march*§ Document!1:!caesar*died*in*march*§ Document!2:!the*long*march!Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Issues!with!Jaccard!for!scoring!§ It!doesn’t!consider!term*frequency*(how!many!*mes!a!term!occurs!in!a!document)!§ Rare!terms!in!a!collec*on!are!more!informa*ve!than!frequent!terms!!§ Jaccard!doesn’t!consider!this!informa*on!§ We!need!a!more!sophis*cated!way!of!normalizing!for!length!§ Later!in!this!lecture,!we’ll!use!!.!.!.!instead!of!|A!∩!B|/|A!∪!B|!(Jaccard)!for!length!normaliza*on.!| B A|/| B A| Ch. 6Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)Scoring!with!the!Jaccard!coefficient!Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)Term!frequency!weigh*ng!Introduc)on*to*Informa)on*Retrieval* !! !!Recall:!Binary!term]document!incidence!matrix!Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello MacbethAntony1 1 0 0 0 1Brutus1 1 0 1 0 0Caesar1 1 0 1 1 1Calpurnia0 1 0 0 0 0Cleopatra1 0 0 0 0 0mercy1 0 1 1 1 1worser1 0 1 1 1 0Each document is represented by a binary vector ∈ {0,1}|V| Sec. 6.2Introduc)on*to*Informa)on*Retrieval* !! !!Term]document!count!matrices!§ Consider!the!number!of!occurrences!of!a!term!in!a!document:!!§ Each!document!is!a!count!vector!in!ℕ|V|:!a!column!below!!Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello MacbethAntony157 73 0 0 0 0Brutus4 157 0 1 0 0Caesar232 227 0 2 1 1Calpurnia0 10 0 0 0 0Cleopatra57 0 0 0 0 0mercy2 0 3 5 5 1worser2 0 1 1 1 0Sec. 6.2Introduc)on*to*Informa)on*Retrieval* !! !!Term]document!count!matrices!§ Consider!the!number!of!occurrences!of!a!term!in!a!document:!!§ Each!document!is!a!count!vector!in!ℕ|V|:!a!column!below!!Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello MacbethAntony157 73 0 0 0 0Brutus4 157 0 1 0 0Caesar232 227 0 2 1 1Calpurnia0 10 0 0 0 0Cleopatra57 0 0 0 0 0mercy2 0 3 5 5 1worser2 0 1 1 1 0Sec. 6.2Introduc)on*to*Informa)on*Retrieval* !! !!Bag*of*words*model!§ Vector!representa*on!doesn’t!consider!the!ordering!of!words!in!a!document!§ John*is*quicker*than*Mary*and!Mary*is*quicker*than*John!have!the!same!vectors!§ This!is!called!the!bag)of)words)mod el . !§ In!a!sense,!this!is!a!step!back:!The!posi*onal!index!was!able!to!dis*nguish!these!two!documents!§ We!will!look!at!“recovering”!posi*onal!informa*on!later!on!§ For!now:!bag!of!words!model!Introduc)on*to*Informa)on*Retrieval* !! !!Term!frequency!k!§ The!term!frequency!kt,d!of!term!t!in!document!d!is!defined!as!the!number!of!*mes!that!t*occurs!in!d.!§
View Full Document