Unformatted text preview:

Introduc)on*to*Informa)on*Retrieval* !! !!Introduc*on!to!Informa(on)Retrieval)CS276:!Informa*on!Retrieval!and! Web!Search!Pandu!Nayak!and!Prabhakar!Raghavan!Lecture!4:!Index!Construc*on!Introduc)on*to*Informa)on*Retrieval* !! !!Plan!§ Last!lecture:!§ Dic*onary!data!structures!§ Tolerant!retrieval!§ Wildcards!§ Spell!correc*on!§ Soundex!§ This!*me:!§ Index !construc*on!a-hu hy-m n-z mo on among $m mace abandon amortize madden amongIntroduc)on*to*Informa)on*Retrieval* !! !!Index!construc*on!§ How!do!we!construct!an!index?!§ What!strategies!can!we!use!with!limited!main!memory?!Ch. 4Introduc)on*to*Informa)on*Retrieval* !! !!Hardware!basics!§ Many!design!decisions!in!informa*on!retrieval!are!based!on!the!characteris*cs!of!hardware!§ We!begin!by!reviewing!hardware!basics!Sec. 4.1Introduc)on*to*Informa)on*Retrieval* !! !!Hardware!basics!§ Access!to!data!in!memory!is!much! faster!than!access!to!data!on!disk.!§ Disk!seeks:!No!data!is!transferred!from!disk!while!the!disk!head!is!being!posi*oned.!§ Therefore:!Transferring!one!large!chunk!of!data!from!disk!to!memory!is!faster!than!transferring!many!small!chunks.!§ Disk!I/O!is!blockPbased:!Reading!and!wri*ng!of!en*re!blocks!(as!opposed!to!smaller!chunks).!§ Block!sizes:!8KB!to!256!KB.!Sec. 4.1Introduc)on*to*Informa)on*Retrieval* !! !!Hardware!basics!§ Ser vers!used!in!IR!systems!now!typically!have!several!GB!of!main!memory,!some*mes!tens!of!GB.!!§ Available!disk!space!is!several!(2–3)!orders!of!magnitude!larger.!§ Fault!tolerance!is!very!expensive:!Itʼs!much!cheaper!to!use!many!regular!machines!rather!than!one!fault!tolerant!machine.!Sec. 4.1Introduc)on*to*Informa)on*Retrieval* !! !!Hardware!assump*ons!for!this!lecture!§ symbol) )sta(s(c) ) ))))value)§ s ! !average!seek!*me!! !!5!ms!=!5!x!10−3!s!§ b! ! !transfer!*me!per!byte! !0.02!μs!=!2!x!10−8!s!§ !!!!!!! !processorʼs!clock!rate! !109!s−1!§ p ! !lowPlevel!opera*on! ! !0.01!μs!=!10−8!s!!!!!!!!!! !(e.g.,!compare!&!swap!a!word)!§ !!!!!!! !size!of!main!memory!! !several!GB!§ !!!!!!! !size!of!disk!space !! ! !1!TB!or!more!Sec. 4.1Introduc)on*to*Informa)on*Retrieval* !! !!RCV1:!Our!collec*on!for!this!lecture!§ Shakespeareʼs!collected!works!definitely!arenʼt!large!enough!for!demonstra*ng!many!of!the!po ints!in!this!course.!§ The!collec*on!weʼll!use!isnʼt!reall y!large!enough!either,!but!itʼs!publicly!available!and!is!at!l east!a!more!plau sib le!example.!§ As!an!example!for!applying!scalable!index!construc*on!algorithms,!we!will!use!the!Reuters!RCV1!collec*on.!§ This!is!one!year!of!Reuters!newswire!(part!of!1995!and!1996)!Sec. 4.2Introduc)on*to*Informa)on*Retrieval* !! !!A!Reuters!RC V1!do cument!Sec. 4.2Introduc)on*to*Informa)on*Retrieval* !! !!Reuters!RCV1!sta*s*cs!§ symbol )sta(s(c) ) )))))value)§ N! ! !!documents ! !!! ! !800,000!§ L! ! !!avg.!#!tokens!per!doc!! !200!§ M ! !!terms!(=!word!types)!! !400,000!§ !!!!!!!!!!!!!!!! !avg.!#!bytes!per!token! !6!!!!!!!!!!!!!!!!!!!! !(incl.!spaces/punct.)!§ !!!!!!!!!!!!!!!! !avg.!#!bytes!per!token! !4.5!!!!!!!!!!!!! !(without!spaces/punct.)!§ !!!!!!!!!!!!!!!! !avg.!#!bytes!per!term ! ! 7. 5!§ !!!!!!!!!!!!!!!! !nonPposi*onal!pos*ngs !100,000,000!4.5 bytes per word token vs. 7.5 bytes per word type: why? Sec. 4.2Introduc)on*to*Informa)on*Retrieval* !! !!§ Documents!are!parsed!to!extract!words!and!these!are!saved!with!the!Document!ID.!I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 1 So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious Doc 2 Recall!IIR!1!index!construc*on!Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2caesar 2was 2ambitious 2Sec. 4.2Introduc)on*to*Informa)on*Retrieval* !! !!Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2caesar 2was 2ambitious 2Term Doc #ambitious 2be 2brutus 1brutus 2capitol 1caesar 1caesar 2caesar 2did 1enact 1hath 1I 1I 1i' 1it 2julius 1killed 1killed 1let 2me 1noble 2so 2the 1the 2told 2you 2was 1was 2with 2!Key!step!§ Ager!all!documents!have!been!parsed,!the!inverted!file!is!sorted!by!terms.!!We focus on this sort step. We have 100M items to sort. Sec. 4.2Introduc)on*to*Informa)on*Retrieval* !! !!Scaling!index!construc*on!§ InPmemory!index!construc*on!does!not!scale!§ Canʼt!stuff!en*re!coll ec*on!i nto!memory,!sort,!then!write!back!§ How!can!we!construct!an!index!for!very!large!collec*ons?!§ Taking!i nto!account!the!hardware!constraints!we!just!learned!about!.!.!.!§ Memory,!di sk,!speed,!etc.!Sec. 4.2Introduc)on*to*Informa)on*Retrieval* !! !!SortPbased!index!construc*on!§ As!we!build!the!index,!we!parse!docs!one!at!a!*me.!§ While!building!the!index,!we!cannot!easily!exploit!compression!tricks!!!(you!can,!but!much!more!complex)!§ The!final!pos*ngs!for!any!term!are!incomplete!un*l!the!end.!§ At!12!bytes!per!nonPposi*onal!pos*ngs!entry!(term,*doc,*freq),!demands!a!lot!of!space!for!large!collec*ons.!§ T!=!100,000,000!in!the!case!of!RCV1!§ So!…!we!can!do!th is!i n!memory!in!2009,!but!typical!collec*ons!are!much!larger.!!E.g.,!the!New*York*Times*provides!an!index!of!>150!years!of!newswire!§ Thus:!We!need!to!store!intermediate!results!on!disk.!Sec. 4.2Introduc)on*to*Informa)on*Retrieval* !! !!Sort!using!disk!as!“memory”?!§ Can!we!use!the!same!index!construc*on!algorithm!for!larger!collec*ons,!but!by!using!disk!instead!of!memory?!§ No:!Sor*ng!T!=!100,000,000!records!on!disk!is!too!slow!–!to o!many!disk!seeks.!§ We!need!an!external!sor*ng!algorithm.!Sec. 4.2Introduc)on*to*Informa)on*Retrieval* !! !!Bomleneck!§ Parse!and!build!pos*ngs!entries!one!doc!at!a!*me!§ Now!sort!pos*ngs!entries!by!term!(then!by!doc!within!each!term)!§ Doing!this!with!random!disk!seeks!would !be!too!slow!–!must!sort!T=100M!records!If every comparison took 2 disk seeks, and N items could be sorted with N log2N comparisons, how long would this take? Sec. 4.2Introduc)on*to*Informa)on*Retrieval* !! !!BSBI:!Blocked!so


View Full Document

Stanford CS 276 - Index Construction  

Documents in this Course
Load more
Download Index Construction  
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Index Construction   and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Index Construction   2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?