Appears in NSDI 04 1 Hybrid Global Local Indexing for Efficient Peer to Peer Information Retrieval Chunqiang Tang and Sandhya Dwarkadas Computer Science Department University of Rochester sarrmor sandhya cs rochester edu Abstract Content based full text search still remains a particularly challenging problem in peer to peer P2P systems Traditionally there have been two index partitioning structures partitioning based on the document space or partitioning based on keywords The former requires search of every node in the system to answer a query whereas the latter transmits a large amount of data when processing multi term queries In this paper we propose eSearch a P2P keyword search system based on a novel hybrid indexing structure In eSearch each node is responsible for certain terms Given a document eSearch uses a modern information retrieval algorithm to select a small number of top important terms in the document and publishes the complete term list for the document to nodes responsible for those top terms This selective replication of term lists allows a multi term query to proceed local to the nodes responsible for query terms We also propose automatic query expansion to alleviate the degradation of quality of search results due to the selective replication overlay source multicast to reduce the cost of disseminating term lists and techniques to balance term list distribution across nodes eSearch is scalable and efficient and obtains search results as good as state of the art centralized systems Despite the use of replication eSearch actually consumes less bandwidth than systems based on keyword partitioning when publishing metadata for a document During a retrieval operation it searches only a small number of nodes and typically transmits a small amount of data 3 3KB that is independent of the size of the corpus and grows slowly logarithmically with the number of nodes in the system eSearch s efficiency comes at a modest storage cost 6 8 times that of systems based on keyword partitioning This cost can be further reduced by adopting index compression or pruning techniques 1 Introduction Peer to Peer P2P systems have gained tremendous interest from both the user and research community in the past several years First generation systems such as Gnutella and KaZaA are already prevalent and secondgeneration systems such as PAST 32 and CFS 14 based on Distributed Hash Tables DHTs are under serious development e g the IRIS project 19 With a gigantic amount of information in these systems it would be impossible for users to remember or even know the place or precise ID of the desired data The capability to retrieve documents using content based full text search would greatly improve the usability of these systems Although it is possible to build a dedicated search engine to index contents in P2P systems in a way similar to how Google indexes the Web a P2P search system built on top of the same nodes already used in the P2P storage system is particularly attractive because of its low cost ease of deployment availability and scalability In this paper we study the challenging problem of building P2P keyword search systems To facilitate the retrieval of documents a distributed not necessarily P2P search system places information regarding the occurrence of terms words or phrases in documents in the form of metadata at certain places in the system The metadata placement strategy in existing systems are based on either local or global indexing 42 In local indexing see Figure 1 i metadata are partitioned based on the document space The complete term list of a document is stored on a node A term list X a c means that document X contains terms a and c During a retrieval operation the query is broadcast to all nodes Since a node has the complete term list for documents that it is responsible for it can compute the relevance between the query and its documents without consulting others The drawback however is that every node is involved in processing every query rendering systems of this type unscalable Gnutella and search engines such as AllTheWeb www alltheweb com 30 are based on variants of local indexing In global indexing see Figure 1 ii metadata are distributed based on terms Each node stores the complete inverted list of some terms An inverted list a X Z indicates that term a appears in document X and Z To answer a query consisting of multiple terms the query Appears in NSDI 04 is sent to nodes responsible for those terms Their inverted lists are transmitted over the network so that a join to identify documents that contain multiple query terms can be performed The communication cost for a join grows proportionally with the length of the inverted lists i e the size of the corpus Most recent proposals for P2P keyword search 16 20 28 39 are based on global indexing but with enhancements to reduce the communication cost for instance using Bloom filters to summarize the inverted lists or incrementally transmitting the inverted lists and terminating early if sufficient results have already been obtained In the following we will simply refer to these systems as Global P2P systems Challenging conventional wisdom that uses either local or global indexing we propose a hybrid indexing structure to combine their benefits while avoiding their limitations The basic tenet of our approach is selective metadata replication Like global indexing hybrid indexing distributes metadata based on terms see Figure 1 iii Each node j is responsible for the inverted list of some term t In addition for each document D in the inverted list for term t node j also stores the complete term list for document D Given a multi term query the query is sent to nodes responsible for those terms Each of these nodes then does a local search without consulting others since it has the complete term list for documents in its inverted list Our system based on hybrid indexing is called eSearch It uses a Distributed Hash Table DHT to map a term to a node where the inverted list for the term is stored We chose Chord 38 for eSearch but other DHTs such as Pastry CAN and Tapestry can also be used without major changes to our design When naively implemented eSearch s search efficiency obviously comes at the expense of publishing more metadata requiring more communication and storage We propose several optimizations that reduce communication by up to 97 and storage by up to 90 compared with a naive hybrid indexing Below we outline one
View Full Document
Unlocking...