1Distributed IndexingArvind KrishnamurthyFall 2004Motivation Distributed Indexing generalizes in two ways: System stores objects with multiple attributes Objects can be retrieved by using one or many of the attributes Should be able to support complex operations such as: Nearest-neighbor queries Range queries Motivating applications: Large number of documents Networking probing information Sensor networks gathering dataApproaches Three current solutions: PIER system (no range queries or nearest-neighbor queries) pSearch system Mercury system SkipIndex system PIER: Uses a hybrid of “unstructured” and “structured” networks Uses the “unstructured” network without guarantees Uses the “structured” network to guarantee retrievalUnstructured Networks Ad-hoc topology Queries are flooded for bounded number of hops No guarantees on recall E.g. Gnutella and KazaaQuery: “xyz”xyzxyz Distributed Hash Tables (DHTs) Hash table interface: put(key,item), get(key) O(log n) hops Guarantees on recallStructured NetworksK IK IK IK IK IK IK IK IK Iput(K1,I1)(K1,I1)get (K1)I1Keyword Search using DHTsInverted Lists hashed by keyword (term) in the DHTK IK IK IK IK IK IK IK IQuery: “T1AND T2”{I1,I2}{I2}(h(T1), {I1,I2})(h(T2), {I2,I3})(h(T3), {I4,I5})K I2PIER ApproachFlood-based NetworkDHTFlood QueryVery Few Results?DHT QueryMore ResultsSimpler alternative to either optimizing unstructured or structured networks.(Partial Index on Rare Items)(All items)Unstructured Networks: Queries with Small Result SetsSmall result set → items have few copies in networkQuery LatencyMany resultsQueries that return few results have poor response timesFew ResultsDHT-based Search Advantages: Avoid flooding query in network Guarantee recall (critical for small result sets) Disadvantages: Hashing inverted lists into the DHT is costly So is intersecting inverted lists at query time Infeasible for Google-like datasets (IPTPS ’03) Feasible for querying rare items: Queries with ≤10 results ship 7x fewer inverted list entries compared to the average query Query optimization can reduce communication overhead: intersect rare terms firstMercury: Supporting Range Queries Start with single-attribute data objects Use DHT’s without hash functions Results in load-imbalance Address load-imbalance Results in longer routing distances Modify routing tables Generalize to multiple attributesUsing DHTs for Range QueriesNo cryptographic hashing for key No cryptographic hashing for key ÆÆidentifieridentifierQuery: 6 ≤ x ≤ 13key = 6 Î 0xabkey = 7 Î 0xd3…key = 13 Î 0x120x000x700xf00x300xb00x200xc00xd00x100xe00xa00x900x800x400x500x60Query: 6 ≤ x ≤ 13ARB1Slide 12ARB1 difference between cryptographic hashing and not hashing -- animate... hash(x) = blah, hash(y) = blah_2Ashwin, 8/25/20043Using DHTs for Range Queries Nodes in popular regions can be overloaded Load imbalance!DHTs with Load Balancing Mercury load balancing strategy Re-adjust responsibilities Range ownerships are skewed!skewed!DHTs with Load Balancing0x000x300xa00xb00xd0Finger pointersget skewed!PopularRegion0x900x800xf00xe0 Each routing hop may not reduce node-space by half! Î no log(n) hop guaranteeIdeal Link Structure0x000x300xa00xb00xd0PopularRegion0x900x800xf00xe0Mercury’s SolutionValuesNodes If we had the above information… For finger i Estimate value v for which 2ith node is responsible Need to establish links based on nodenode--distancedistance48v4v8MercuryValuesNodes Need to establish links based on nodenode--distancedistance48v4v8ValuesNode-densityPiece-wise linear approximation Histogram4Histogram Maintenance0xf00x000x700xa00xb00xd00x900x800xe00x30Request sample(Range, density) Measure node-density locally Gossip about it!ValuesNode-density(Range, density)(Range, density)ARB2Load Balancing Basic idea –leaveleave--join join “light” nodes leave Re-join near “heavy” nodes; split the range of the heavier node0 10 2025 35 45 606570 75 85Load histogramLoadLoad Balancing Basic idea: leave-rejoin Steps Find average, check if heavy or light Light nodes perform a leave and rejoin0 10 2025 35 45 606570 75 85AverageHeavyLight15 72.5Load histogramLoad[0, 80)[240, 320)[80, 160)[160, 240)[105, 210)[210, 320)[0, 105)Multi-attribute Range QueriesRxRy50 ≤ x ≤ 150150 ≤ y ≤ 250Queryx = 100y = 200Data item Send data to all rings Send query to only ringSlide 19ARB2 "push" samples to k_2 nodesAshwin,
View Full Document