Princeton COS 435 - Using and building the index

Unformatted text preview:

11Using and buildingthe index2Review: Model• Document: sequence of {terms + attributes}• Query: sequence of terms– Can make more complicated: Advanced search• Satisfying: in current search engines,documents “containing” all terms– AND model– “containing” includes anchor text of pointers to thisdoc from other docs• Ranking: wide open function of documentand terms23Review: Posting list• For each document, keep list of termsappearing and attributes for each term– Actually list of positions at which each term occursand attributes for that occurrenceInvert:• For each term, keep list of documents inwhich it appears and attributes– For each document, list of positions at which termoccurs and attributes for each occurrence=> Inverted file/ Inverted index4Consider “advanced search” queriesTo know if satisfied need:Content• Phrases• OR• NOT• Numeric range• Where in pageMeta-data•Language•Geographic region•File format•Date published•From specific domain•Specific licensing rights•Filtered by “safe search”35Retrieval of satisfying documents• Inverted index will allow retrieval forcontent queries• Keep meta-data on docs for meta-dataqueries– Need length even for tf.idf• Issue of efficient retrieval6Basic retrieval algorithms• One term:– look up posting list in (inverted) index• AND of several terms:– Intersect posting lists of the terms: a list merge• OR of several terms:– Union posting lists of the terms• NOT term– If terms AND NOT(terms), take a difference– a list merge (similar to AND)• Proximity– a list merge (similar to AND)47Merging posting lists• Have two lists must coordinate– Find shared entries and do something– Use for intersection and related• Algorithms?– Unsorted lists: read 2nd list over and over- once for eachentry on 1st list– Unsorted lists: build hash table on entry values; insertentries of one list, then other, looking for collisions– Lists sorted by some entry ID: Read both lists in “parallel”• Classic list merge algorithm for sorted lists– If one list sorted, can do binary search of sorted list forentries of other list• Must be able to binary search!• For posting lists, entries are documents– More processing within document postings8Data structure for inverted index?• Sorted array:– binary search IF can keep in memory– High overhead for additions• Hashing– Fast look-up– Collisions• Search trees: B+-trees– Maintain balance - always log look-up time– Can insert and delete59B+- trees• All index entries are at leaves• Order m B+ tree has m to 2m children for eachinterior node• Look up: follow root to leaf by keys in interiornodes• Insert:– find leaf in which belongs– If leaf full, split– Split can propagate up tree• Delete:– Merge or redistribute from too-empty leaf– Merge can propagate up tree10List for “ace”adapted from slide for Database Management Systems by authors R. Ramakrishnan and J. GehrkeExample B+ Treeorder = 2: 2 to 4 search keys per interior nodeace adRootdogdyeeggcad cat dog…dye…… … ….……cabbillbitpigheart soapbat bee bill boy brie call cell…dune eel…List for “ad”List for “bat”… ……List for “eel”………leaves……611• B+ trees used for large data sets– Leaves are file pages on disk– Each interior node is file page on disk– Keep top of tree in buffer (RAM)– M is typically 100; average fanout 133• Height 4 gives ~ 300 Million entries• Save more space: prefix B+ trees for words– Each interior node key is shortest prefix of wordthat need to distinguish which child pointer tofollow– Allows more keys per interior node• higher fanout• Fanout determined by what can fit; keep at least 1/2 full12Another tree structure: tries• Strictly for character strings• Each edge out of node labeled with onecharacter• Follow path root to leaf to spell word• Leaf contain data for word– Usually


View Full Document

Princeton COS 435 - Using and building the index

Download Using and building the index
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Using and building the index and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Using and building the index 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?