Yale CPSC 424 - Distributed Indexing - D1395715

Home> Schools> Yale University> Computer Science (CPSC) > CPSC 424> Distributed Indexing

Yale CPSC 424 - Distributed Indexing

Course Cpsc 424- Parallel Programming Techniques

Pages 6

Download Save

Unformatted text preview:

1Distributed IndexingArvind KrishnamurthyFall 2004Motivation Distributed Indexing generalizes in two ways: System stores objects with multiple attributes Objects can be retrieved by using one or many of the attributes Should be able to support complex operations such as: Nearest-neighbor queries Range queries Motivating applications: Large number of documents Networking probing information Sensor networks gathering dataApproaches Three current solutions: PIER system (no range queries or nearest-neighbor queries) pSearch system Mercury system SkipIndex system PIER: Uses a hybrid of “unstructured” and “structured” networks Uses the “unstructured” network without guarantees Uses the “structured” network to guarantee retrievalUnstructured Networks Ad-hoc topology Queries are flooded for bounded number of hops No guarantees on recall E.g. Gnutella and KazaaQuery: “xyz”xyzxyz Distributed Hash Tables (DHTs) Hash table interface: put(key,item), get(key) O(log n) hops  Guarantees on recallStructured NetworksK IK IK IK IK IK IK IK IK Iput(K1,I1)(K1,I1)get (K1)I1Keyword Search using DHTsInverted Lists hashed by keyword (term) in the DHTK IK IK IK IK IK IK IK IQuery: “T1AND T2”{I1,I2}{I2}(h(T1), {I1,I2})(h(T2), {I2,I3})(h(T3), {I4,I5})K I2PIER ApproachFlood-based NetworkDHTFlood QueryVery Few Results?DHT QueryMore ResultsSimpler alternative to either optimizing unstructured or structured networks.(Partial Index on Rare Items)(All items)Unstructured Networks: Queries with Small Result SetsSmall result set → items have few copies in networkQuery LatencyMany resultsQueries that return few results have poor response timesFew ResultsDHT-based Search Advantages: Avoid flooding query in network Guarantee recall (critical for small result sets) Disadvantages: Hashing inverted lists into the DHT is costly So is intersecting inverted lists at query time Infeasible for Google-like datasets (IPTPS ’03) Feasible for querying rare items: Queries with ≤10 results ship 7x fewer inverted list entries compared to the average query Query optimization can reduce communication overhead: intersect rare terms firstMercury: Supporting Range Queries Start with single-attribute data objects Use DHT’s without hash functions Results in load-imbalance Address load-imbalance Results in longer routing distances Modify routing tables Generalize to multiple attributesUsing DHTs for Range QueriesNo cryptographic hashing for key No cryptographic hashing for key ÆÆidentifieridentifierQuery: 6 ≤ x ≤ 13key = 6 Î 0xabkey = 7 Î 0xd3…key = 13 Î 0x120x000x700xf00x300xb00x200xc00xd00x100xe00xa00x900x800x400x500x60Query: 6 ≤ x ≤ 13ARB1Slide 12ARB1 difference between cryptographic hashing and not hashing -- animate... hash(x) = blah, hash(y) = blah_2Ashwin, 8/25/20043Using DHTs for Range Queries Nodes in popular regions can be overloaded Load imbalance!DHTs with Load Balancing Mercury load balancing strategy Re-adjust responsibilities Range ownerships are skewed!skewed!DHTs with Load Balancing0x000x300xa00xb00xd0Finger pointersget skewed!PopularRegion0x900x800xf00xe0 Each routing hop may not reduce node-space by half! Î no log(n) hop guaranteeIdeal Link Structure0x000x300xa00xb00xd0PopularRegion0x900x800xf00xe0Mercury’s SolutionValuesNodes If we had the above information… For finger i Estimate value v for which 2ith node is responsible Need to establish links based on nodenode--distancedistance48v4v8MercuryValuesNodes Need to establish links based on nodenode--distancedistance48v4v8ValuesNode-densityPiece-wise linear approximation Histogram4Histogram Maintenance0xf00x000x700xa00xb00xd00x900x800xe00x30Request sample(Range, density) Measure node-density locally Gossip about it!ValuesNode-density(Range, density)(Range, density)ARB2Load Balancing Basic idea –leaveleave--join join  “light” nodes leave  Re-join near “heavy” nodes; split the range of the heavier node0 10 2025 35 45 606570 75 85Load histogramLoadLoad Balancing Basic idea: leave-rejoin Steps Find average, check if heavy or light Light nodes perform a leave and rejoin0 10 2025 35 45 606570 75 85AverageHeavyLight15 72.5Load histogramLoad[0, 80)[240, 320)[80, 160)[160, 240)[105, 210)[210, 320)[0, 105)Multi-attribute Range QueriesRxRy50 ≤ x ≤ 150150 ≤ y ≤ 250Queryx = 100y = 200Data item Send data to all rings Send query to only ringSlide 19ARB2 "push" samples to k_2 nodesAshwin,

View Full Document


School:
Email:
New Password:
Confirm Password:

Yale CPSC 424 - Distributed Indexing

Sign up for free to view:

Please select your school