1. Building the index

2. Review
• Have a dictionary of terms
• (Inverted) index referenced by terms of the dictionary
• An entry for a term in the index is the posting list for the term
• A posting list has the (general) form:
  ( [DocID of 1st doc containing term; summary attributes w.r.t. term;
      (position term 1st occurs: attributes, ..., position term last occurs: attributes)],
    ...
    [DocID of last doc containing term; summary attributes w.r.t. term;
      (position term 1st occurs: attributes, ..., position term last occurs: attributes)] )

3. Last time
• Given an inverted index, how to compute the results for a query
  – Merge-based algorithms
• What data structure to use for the inverted index?
  – Hash table
  – B+ tree

4. This time
• How to construct an inverted index from a "raw" document collection?
  – Don't worry about getting into the final index data structure

5. Preliminary decisions
• Define "document": what level of granularity?
  – Book versus chapter of a book
  – Individual HTML files versus combined files that compose one Web page
• Define "term"
  – Include phrases?
    • How to determine which adjacent words -- or all?
  – Stop words?

6. Pre-processing text documents
• Give each document a unique ID: docID
• Tokenize text
  – Distinguish terms from punctuation, etc.
• Normalize tokens
  – Stemming
    • Remove endings: plurals, possessives, "ing"
      – cats -> cat; accessible -> access
    • Porter's algorithm (1980)
  – Lemmatization
    • Use knowledge of language forms
      – am, are, is -> be
    • More sophisticated than stemming
(See Intro IR Chapter 2)

7. Construction of posting lists
• Overview
  – "Document" now means preprocessed document
  – One pass through the collection of documents
  – Gather postings for each document
  – Reorganize into the final set of lists: one for each term
• Look at algorithms for when everything can't fit in memory
  – Main cost: disk page reads and writes
• Terminology: disk block = disk page

8. Memory-disk management
• Have a buffer in main memory
  – Size = B disk pages
  – Read from disk to buffer, a page at a time
    • Disk cost = 1
  – Write from buffer to disk, a page at a time
    • Disk cost = 1

9. Algorithm: "Blocked Sort-based"
1. Repeat until the entire collection has been read:
  – Read documents, building (term, <attributes>, doc) tuples until the buffer is full
  – Sort the tuples in the buffer by term value as primary key, doc as secondary
    • Note tuples for one doc are already together -- use a sort algorithm that keeps appearance order for equal keys
  – Build posting lists for each unique term in the buffer
    • Re-writing of the sorted info
  – Write the partial index to disk pages
2. Merge the partial indexes on disk into the full index

10. Merging lists: general technique
• K sorted lists on disk to merge into one
• If K+1 <= B:
  – Dedicate one buffer page for output
  – Dedicate one buffer page for each list to merge
    • K input buffer pages
  – Algorithm:
      Fill 1 buffer page from each list on disk
      Repeat until merge complete:
        Merge buffer input pages to output buffer page
        When output buffer page full, write to disk
        When input buffer page empty, refill from its list

11. Merging lists: general technique (continued)
• If K+1 > B:
  – Dedicate one buffer page for output
  – B-1 buffer pages for input from different lists
  – Call the lists to merge level-0 lists
  – Algorithm:
      j = 0
      Repeat until one level-j list remains:
        Group level-j lists into groups of B-1 lists   // K/(B-1) groups for j=0
        For each group, merge into one level-(j+1) list by:
          { Fill 1 buffer page from each level-j list in the group
            Repeat until level-j merge complete:
              Merge buffer input pages to output buffer page
              When output buffer page full, write to group's level-(j+1) list on disk
              When input buffer page empty, refill from its list }
        j++

12. Application to "Blocked Sort-based"
• Have to merge the partial indexes
• Partial posting lists for one term must be merged
  – Concatenate
    • Keep documents sorted within the posting list
• If postings for one document are broken across partial lists, they must be merged

13. Aside: external sorting
• Divide the list into size-B blocks of contiguous entries
• Read each block into the buffer, sort it, write it out to disk
• Now have L/B sorted sub-lists, where L is the size of the list in disk pages
• Merge the sorted sub-lists into one list
• Number of disk page reads/writes?

14. Distributed algorithms
• Can easily assign different documents to different machines
• Efficient?
• Goals
  – Keep all machines busy
  – Be able to replace badly-behaved machines seamlessly!

15. How to build an inverted index à la Google
Start by building a forward index:
• Split up the collection into pieces to be worked on by different machines
• Split up the terms into ranges of terms
  – Designate a "barrel" for each range (64 in 1998)
  – "Barrels" conceptually organize data storage
  – Each machine working on a piece of the collection has its own barrels
  – Each barrel is too large for main memory
• On one machine:
    For each doc, put each (term, <attributes>, doc) tuple in the barrel designated for that term
• This forward index is still organized by doc ID
  – Reading doc by doc

16. Forward index to inverted index
• Each barrel is in r pieces if r processors worked on the forward index
• Process each barrel with one processor
  – Sort each barrel piece by terms
  – Construct partial posting lists for the barrel piece
  – Merge to make the posting lists for the terms in the range
• Results in an inverted index in pieces, by term ranges
• A barrel could be worked on by >1 processor

17. Distributed query processing
• Parallelize by distributing documents randomly to pieces of the index
  – Pool of machines for each piece -- choose one
  – Why random?

18. Load balancing and reliability
• Scheduler machines assign tasks to pools of machines and
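The "Blocked Sort-based" construction above (read tuples until the buffer fills, stable-sort by term, emit a partial index) can be sketched in Python. This is a minimal in-memory simulation, not the course's actual code: `docs`, the whitespace tokenizer, and `max_tuples` (standing in for the buffer of B pages) are illustrative assumptions.

```python
from collections import defaultdict

def build_partial_indexes(docs, max_tuples):
    """Read (doc_id, text) pairs, buffering (term, doc) tuples; whenever the
    buffer fills, sort it by term and emit a partial index for that block.
    Returns a list of partial indexes (term -> sorted posting list of docIDs)."""
    runs, buffer = [], []
    for doc_id, text in docs:
        for term in text.split():          # trivial tokenizer (illustrative)
            buffer.append((term, doc_id))
        if len(buffer) >= max_tuples:
            runs.append(_flush(buffer))
            buffer = []
    if buffer:
        runs.append(_flush(buffer))
    return runs

def _flush(buffer):
    # Python's sort is stable, so tuples for one doc keep their appearance
    # order when their terms compare equal -- as the algorithm requires.
    buffer.sort(key=lambda t: t[0])
    partial = defaultdict(list)
    for term, doc_id in buffer:
        if not partial[term] or partial[term][-1] != doc_id:
            partial[term].append(doc_id)   # collapse repeats within one doc
    return dict(partial)
```

Because documents are read in docID order, each partial index's posting lists come out sorted by docID, which is what makes the later merge a simple concatenation.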
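The general K-list merging technique above, applied to partial indexes (concatenate partial posting lists for each term, keeping documents sorted, and merging when one document's postings are split across lists), can be sketched as follows. `heapq.merge` streams the sorted runs lazily, playing the role of refilling input buffer pages; the disk-page mechanics are abstracted away.

```python
import heapq
from itertools import groupby

def merge_partial_indexes(runs):
    """Merge partial indexes (each mapping term -> sorted docID list) into
    one full inverted index. Runs were produced in docID order, so per-term
    lists concatenate in sorted order."""
    # Each run becomes a sorted stream of (term, postings) pairs.
    streams = [sorted(run.items()) for run in runs]
    index = {}
    for term, group in groupby(heapq.merge(*streams), key=lambda kv: kv[0]):
        postings = []
        for _, docs in group:
            # A document's postings may be broken across two partial lists:
            # drop the repeated boundary docID instead of appending it twice.
            if postings and docs and postings[-1] == docs[0]:
                docs = docs[1:]
            postings.extend(docs)
        index[term] = postings
    return index
```

With real on-disk runs, the same loop would read one buffer page per run and write the output page by page, as in the slides.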
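When K+1 > B, the multi-level merge above groups level-j lists into groups of B-1 until one list remains, so the number of merge levels is ceil(log base (B-1) of K). A small sketch that simulates the grouping to count levels (the function name and simulation are illustrative, not from the slides):

```python
import math

def merge_levels(k, b):
    """Count merge levels for k sorted lists with a b-page buffer:
    each level merges every B-1 level-j lists into one level-(j+1) list."""
    assert b >= 3, "need at least 2 input pages plus 1 output page"
    levels = 0
    while k > 1:
        k = math.ceil(k / (b - 1))   # K/(B-1) groups, each yielding one list
        levels += 1
    return levels
```

Each level reads and writes every disk page of the data once, so total merge cost scales with (pages of data) x (number of levels).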


Princeton COS 435 - Building the index