Unformatted text preview:

1Distributed IndexingArvind KrishnamurthyFall 2004Motivation Distributed Indexing generalizes in two ways: System stores objects with multiple attributes Objects can be retrieved by using one or many of the attributes Should be able to support complex operations such as: Nearest-neighbor queries Range queries Motivating applications: Large number of documents Networking probing information Sensor networks gathering dataApproaches Three current solutions: PIER system (no range queries or nearest-neighbor queries) pSearch system Mercury system SkipIndex system PIER: Uses a hybrid of “unstructured” and “structured” networks Uses the “unstructured” network without guarantees Uses the “structured” network to guarantee retrievalUnstructured Networks Ad-hoc topology Queries are flooded for bounded number of hops No guarantees on recall E.g. Gnutella and KazaaQuery: “xyz”xyzxyz Distributed Hash Tables (DHTs) Hash table interface: put(key,item), get(key) O(log n) hops  Guarantees on recallStructured NetworksK IK IK IK IK IK IK IK IK Iput(K1,I1)(K1,I1)get (K1)I1Keyword Search using DHTsInverted Lists hashed by keyword (term) in the DHTK IK IK IK IK IK IK IK IQuery: “T1AND T2”{I1,I2}{I2}(h(T1), {I1,I2})(h(T2), {I2,I3})(h(T3), {I4,I5})K I2PIER ApproachFlood-based NetworkDHTFlood QueryVery Few Results?DHT QueryMore ResultsSimpler alternative to either optimizing unstructured or structured networks.(Partial Index on Rare Items)(All items)Unstructured Networks: Queries with Small Result SetsSmall result set → items have few copies in networkQuery LatencyMany resultsQueries that return few results have poor response timesFew ResultsDHT-based Search Advantages: Avoid flooding query in network Guarantee recall (critical for small result sets) Disadvantages: Hashing inverted lists into the DHT is costly So is intersecting inverted lists at query time Infeasible for Google-like datasets (IPTPS ’03) Feasible for querying rare items: Queries with ≤10 results ship 7x fewer inverted list entries compared to the average query Query optimization can reduce communication overhead: intersect rare terms firstMercury: Supporting Range Queries Start with single-attribute data objects Use DHT’s without hash functions Results in load-imbalance Address load-imbalance Results in longer routing distances Modify routing tables Generalize to multiple attributesUsing DHTs for Range QueriesNo cryptographic hashing for key No cryptographic hashing for key ÆÆidentifieridentifierQuery: 6 ≤ x ≤ 13key = 6 Î 0xabkey = 7 Î 0xd3…key = 13 Î 0x120x000x700xf00x300xb00x200xc00xd00x100xe00xa00x900x800x400x500x60Query: 6 ≤ x ≤ 13ARB1Slide 12ARB1 difference between cryptographic hashing and not hashing -- animate... hash(x) = blah, hash(y) = blah_2Ashwin, 8/25/20043Using DHTs for Range Queries Nodes in popular regions can be overloaded Load imbalance!DHTs with Load Balancing Mercury load balancing strategy Re-adjust responsibilities Range ownerships are skewed!skewed!DHTs with Load Balancing0x000x300xa00xb00xd0Finger pointersget skewed!PopularRegion0x900x800xf00xe0 Each routing hop may not reduce node-space by half! Î no log(n) hop guaranteeIdeal Link Structure0x000x300xa00xb00xd0PopularRegion0x900x800xf00xe0Mercury’s SolutionValuesNodes If we had the above information… For finger i Estimate value v for which 2ith node is responsible Need to establish links based on nodenode--distancedistance48v4v8MercuryValuesNodes Need to establish links based on nodenode--distancedistance48v4v8ValuesNode-densityPiece-wise linear approximation Histogram4Histogram Maintenance0xf00x000x700xa00xb00xd00x900x800xe00x30Request sample(Range, density) Measure node-density locally Gossip about it!ValuesNode-density(Range, density)(Range, density)ARB2Load Balancing Basic idea –leaveleave--join join  “light” nodes leave  Re-join near “heavy” nodes; split the range of the heavier node0 10 2025 35 45 606570 75 85Load histogramLoadLoad Balancing Basic idea: leave-rejoin Steps Find average, check if heavy or light Light nodes perform a leave and rejoin0 10 2025 35 45 606570 75 85AverageHeavyLight15 72.5Load histogramLoad[0, 80)[240, 320)[80, 160)[160, 240)[105, 210)[210, 320)[0, 105)Multi-attribute Range QueriesRxRy50 ≤ x ≤ 150150 ≤ y ≤ 250Queryx = 100y = 200Data item Send data to all rings Send query to only ringSlide 19ARB2 "push" samples to k_2 nodesAshwin,


View Full Document

Yale CPSC 424 - Distributed Indexing

Download Distributed Indexing
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Distributed Indexing and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Distributed Indexing 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?