Unformatted text preview:

Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)Introducing!Informa*on!Retrieval!!and!Web!Search!Introduc)on*to*Informa)on*Retrieval* !! !!Informa*on!Retrieval!§ Informa*on!Retrieval!(IR)!is!finding!material!(usually!documents)!of!an!unstructured!nature!(usually!text)!that!sa*sfies!an!informa*on!need!from!within!large!collec*ons!(usually!stored!on!computers).!§ These!days!we!frequently!think!first!of!web!search,!but!there!are!many!other!cases:!§ EGmail!search!§ Searching!your!laptop!§ Corporate!knowledge!bases!§ Legal!informa*on!retrieval!2!Introduc)on*to*Informa)on*Retrieval* !! !!Unstructured!(text)!vs.!structured!(database)!data!in!the!midGnine*es!3!0 50 100 150 200 250 Data volume Market Cap Unstructured StructuredIntroduc)on*to*Informa)on*Retrieval* !! !!Unstructured!(text)!vs.!structured!(database)!data!today!4!0 50 100 150 200 250 Data volume Market Cap Unstructured StructuredIntroduc)on*to*Informa)on*Retrieval* !! !!Basic!assump*ons!of!Informa*on!Retrieval!§ Collec*on:!A!set!of!documents!§ Assume!it!is!a!sta*c!collec*on!for!the!moment!§ Goal:!Retrieve!documents!with!informa*on!that!is!relevant!to!the!user’s!informa*on!need!and!helps!the!user!complete!a!task!5!Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!how$trap$mice$alive$The!classic!search!model!Collection User task Info need Query Results Search engine Query refinement Get rid of mice in a politically correct way Info about removing mice without killing them Misconception? Misformulation? Search!Introduc)on*to*Informa)on*Retrieval* !! !!How!good!are!the!retrieved!docs?!§ Precision*:!Frac*on!of!retrieved!docs!that!are!relevant!to!the!user’s!informa*on!need!§ Recall!:!Frac*on!of!relevant!docs!in!collec*on!that!are!retrieved!§ More!precise!defini*ons!and!measurements!to!follow!later!7!Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)Introducing!Informa*on!Retrieval!!and!Web!Search!Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)TermGdocument!incidence!matrices!Introduc)on*to*Informa)on*Retrieval* !! !!Unstructured!data!in!1620!§ Which!plays!of!Shakespeare!contain!the!words!Brutus!AND!Caesar!!but!NOT!Calpurnia?!§ One!could!grep!all!of!Shakespeare’s!plays!for!Brutus!and!Caesar,!then!strip!out!lines!containing!Calpurnia?!§ Why!is!that!not!the!answer?!§ Slow!(for!large!corpora)!§ NOT!Calpurnia!is!nonGtrivial!§ Other!opera*ons!(e.g.,!find!the!word!Romans1near)countrymen)!not!feasible!§ Ranked!retrieval!(best!documents!to!return)!§ Later!lectures!10!Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!TermGdocument!incidence!matrices!Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello MacbethAntony 1 1 0 0 0 1Brutus 1 1 0 1 0 0Caesar 1 1 0 1 1 1Calpurnia 0 1 0 0 0 0Cleopatra 1 0 0 0 0 0mercy 1 0 1 1 1 1worser 1 0 1 1 1 01 if play contains word, 0 otherwise Brutus AND Caesar BUT NOT Calpurnia Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!Incidence!vectors!§ So!we!have!a!0/1!vector!for!each!term.!§ To!answer!query:!take!the!vectors!for!Brutus,1Caesar!and!Calpurnia!(complemented)!è!!bitwise!AND.!§ 110100!AND*§ 110111!AND*§ 101111!=!!§ 100100)12!Sec. 1.1 Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello MacbethAntony 1 1 0 0 0 1Brutus 1 1 0 1 0 0Caesar 1 1 0 1 1 1Calpurnia 0 1 0 0 0 0Cleopatra 1 0 0 0 0 0mercy 1 0 1 1 1 1worser 1 0 1 1 1 0Introduc)on*to*Informa)on*Retrieval* !! !!Answers!to!query!§ Antony and Cleopatra,!Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. § Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me. 13!Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!Bigger!collec*ons!§ Consider!N*=!1!million!documents,!each!with!about!1000!words.!§ Avg!6!bytes/word!including!spaces/punctua*on!!§ 6GB!of!data!in!the!documents.!§ Say!there!are!M*=!500K!dis)nct!terms!among!these.!14!Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!Can’t!build!the!matrix!§ 500K!x!1M!matrix!has!halfGaGtrillion!0’s!and!1’s.!§ But!it!has!no!more!than!one!billion!1’s.!§ matrix!is!extremely!sparse.!§ What’s!a!be_er!representa*on?!§ We!only!record!the!1!posi*ons.!15!Why? Sec. 1.1Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)TermGdocument!incidence!matrices!Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)The!Inverted!Index!The!key!data!structure!underlying!modern!IR!Introduc)on*to*Informa)on*Retrieval* !! !!Inverted!index!§ For!each!term!t,!we!must!store!a!list!of!all!documents!that!contain!t.!§ Iden*fy!each!doc!by!a!docID,!a!document!serial!number!§ Can!we!used!fixedGsize!arrays!for!this?!18!What!happens!if!the!word!Caesar!is!added!to!document!14?!!Sec. 1.2 Brutus1Calpurnia1Caesar1 1 2 4 5 6 16 57 132 1 2 4 11 31 45 173 2 31 174 54 101Introduc)on*to*Informa)on*Retrieval* !! !!Inverted!index!§ We!need!variableGsize!pos*ngs!lists!§ On!disk,!a!con*nuous!run!of!pos*ngs!is!normal!and!best!§ In!memory,!can!use!linked!lists!or!variable!length!arrays!§ Some!tradeoffs!in!size/ease!of!inser*on!19!Dictionary Postings Sorted by docID (more later on why). Pos)ng*Sec. 1.2 Brutus1Calpurnia1Caesar1 1 2 4 5 6 16 57 132 1 2 4 11 31 45 173 2 31 174 54 101Introduc)on*to*Informa)on*Retrieval* !! !!Tokenizer Token stream Friends Romans Countrymen Inverted!index!construc*on!Linguistic modules Modified tokens friend roman countryman Indexer Inverted index friend1roman1countryman12 4 2 13 16 1 Documents to be indexed Friends, Romans, countrymen. Sec. 1.2Introduc)on*to*Informa)on*Retrieval* !! !!Tokenizer Token stream Friends Romans Countrymen Inverted!index!construc*on!Linguistic modules Modified tokens friend roman countryman Indexer Inverted index friend1roman1countryman12 4 2 13 16 1 More on these later. Documents to be indexed Friends, Romans, countrymen. Sec. 1.2Introduc)on*to*Informa)on*Retrieval* !!


View Full Document

Stanford CS 124 - Lecture Notes

Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?