Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)Introducing!Informa*on!Retrieval!!and!Web!Search!Introduc)on*to*Informa)on*Retrieval* !! !!Informa*on!Retrieval!§ Informa*on!Retrieval!(IR)!is!finding!material!(usually!documents)!of!an!unstructured!nature!(usually!text)!that!sa*sfies!an!informa*on!need!from!within!large!collec*ons!(usually!stored!on!computers).!§ These!days!we!frequently!think!first!of!web!search,!but!there!are!many!other!cases:!§ EGmail!search!§ Searching!your!laptop!§ Corporate!knowledge!bases!§ Legal!informa*on!retrieval!2!Introduc)on*to*Informa)on*Retrieval* !! !!Unstructured!(text)!vs.!structured!(database)!data!in!the!midGnine*es!3!0 50 100 150 200 250 Data volume Market Cap Unstructured StructuredIntroduc)on*to*Informa)on*Retrieval* !! !!Unstructured!(text)!vs.!structured!(database)!data!today!4!0 50 100 150 200 250 Data volume Market Cap Unstructured StructuredIntroduc)on*to*Informa)on*Retrieval* !! !!Basic!assump*ons!of!Informa*on!Retrieval!§ Collec*on:!A!set!of!documents!§ Assume!it!is!a!sta*c!collec*on!for!the!moment!§ Goal:!Retrieve!documents!with!informa*on!that!is!relevant!to!the!user’s!informa*on!need!and!helps!the!user!complete!a!task!5!Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!how$trap$mice$alive$The!classic!search!model!Collection User task Info need Query Results Search engine Query refinement Get rid of mice in a politically correct way Info about removing mice without killing them Misconception? Misformulation? Search!Introduc)on*to*Informa)on*Retrieval* !! !!How!good!are!the!retrieved!docs?!§ Precision*:!Frac*on!of!retrieved!docs!that!are!relevant!to!the!user’s!informa*on!need!§ Recall!:!Frac*on!of!relevant!docs!in!collec*on!that!are!retrieved!§ More!precise!defini*ons!and!measurements!to!follow!later!7!Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)Introducing!Informa*on!Retrieval!!and!Web!Search!Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)TermGdocument!incidence!matrices!Introduc)on*to*Informa)on*Retrieval* !! !!Unstructured!data!in!1620!§ Which!plays!of!Shakespeare!contain!the!words!Brutus!AND!Caesar!!but!NOT!Calpurnia?!§ One!could!grep!all!of!Shakespeare’s!plays!for!Brutus!and!Caesar,!then!strip!out!lines!containing!Calpurnia?!§ Why!is!that!not!the!answer?!§ Slow!(for!large!corpora)!§ NOT!Calpurnia!is!nonGtrivial!§ Other!opera*ons!(e.g.,!find!the!word!Romans1near)countrymen)!not!feasible!§ Ranked!retrieval!(best!documents!to!return)!§ Later!lectures!10!Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!TermGdocument!incidence!matrices!Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello MacbethAntony 1 1 0 0 0 1Brutus 1 1 0 1 0 0Caesar 1 1 0 1 1 1Calpurnia 0 1 0 0 0 0Cleopatra 1 0 0 0 0 0mercy 1 0 1 1 1 1worser 1 0 1 1 1 01 if play contains word, 0 otherwise Brutus AND Caesar BUT NOT Calpurnia Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!Incidence!vectors!§ So!we!have!a!0/1!vector!for!each!term.!§ To!answer!query:!take!the!vectors!for!Brutus,1Caesar!and!Calpurnia!(complemented)!è!!bitwise!AND.!§ 110100!AND*§ 110111!AND*§ 101111!=!!§ 100100)12!Sec. 1.1 Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello MacbethAntony 1 1 0 0 0 1Brutus 1 1 0 1 0 0Caesar 1 1 0 1 1 1Calpurnia 0 1 0 0 0 0Cleopatra 1 0 0 0 0 0mercy 1 0 1 1 1 1worser 1 0 1 1 1 0Introduc)on*to*Informa)on*Retrieval* !! !!Answers!to!query!§ Antony and Cleopatra,!Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. § Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me. 13!Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!Bigger!collec*ons!§ Consider!N*=!1!million!documents,!each!with!about!1000!words.!§ Avg!6!bytes/word!including!spaces/punctua*on!!§ 6GB!of!data!in!the!documents.!§ Say!there!are!M*=!500K!dis)nct!terms!among!these.!14!Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!Can’t!build!the!matrix!§ 500K!x!1M!matrix!has!halfGaGtrillion!0’s!and!1’s.!§ But!it!has!no!more!than!one!billion!1’s.!§ matrix!is!extremely!sparse.!§ What’s!a!be_er!representa*on?!§ We!only!record!the!1!posi*ons.!15!Why? Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)TermGdocument!incidence!matrices!Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)The!Inverted!Index!The!key!data!structure!underlying!modern!IR!Introduc)on*to*Informa)on*Retrieval* !! !!Inverted!index!§ For!each!term!t,!we!must!store!a!list!of!all!documents!that!contain!t.!§ Iden*fy!each!doc!by!a!docID,!a!document!serial!number!§ Can!we!used!fixedGsize!arrays!for!this?!18!What!happens!if!the!word!Caesar!is!added!to!document!14?!!Sec. 1.2 Brutus1Calpurnia1Caesar1 1 2 4 5 6 16 57 132 1 2 4 11 31 45 173 2 31 174 54 101Introduc)on*to*Informa)on*Retrieval* !! !!Inverted!index!§ We!need!variableGsize!pos*ngs!lists!§ On!disk,!a!con*nuous!run!of!pos*ngs!is!normal!and!best!§ In!memory,!can!use!linked!lists!or!variable!length!arrays!§ Some!tradeoffs!in!size/ease!of!inser*on!19!Dictionary Postings Sorted by docID (more later on why). Pos)ng*Sec. 1.2 Brutus1Calpurnia1Caesar1 1 2 4 5 6 16 57 132 1 2 4 11 31 45 173 2 31 174 54 101Introduc)on*to*Informa)on*Retrieval* !! !!Tokenizer Token stream Friends Romans Countrymen Inverted!index!construc*on!Linguistic modules Modified tokens friend roman countryman Indexer Inverted index friend1roman1countryman12 4 2 13 16 1 Documents to be indexed Friends, Romans, countrymen. Sec. 1.2Introduc)on*to*Informa)on*Retrieval* !! !!Tokenizer Token stream Friends Romans Countrymen Inverted!index!construc*on!Linguistic modules Modified tokens friend roman countryman Indexer Inverted index friend1roman1countryman12 4 2 13 16 1 More on these later. Documents to be indexed Friends, Romans, countrymen. Sec. 1.2Introduc)on*to*Informa)on*Retrieval* !!
View Full Document