Yale CPSC 424 - Structured and Unstructured Overlays for Distributed Data

Unformatted text preview:

1Structured and Unstructured Overlaysfor Distributed DataArvind KrishnamurthyYale UniversityJoint work: Chi Zhang, Anthony Young, Randy WangEvolution of Overlays in P2P SystemsAd-hoc Overlays(Freenet, Gnutella)Structured Overlays(Chord, Pastry, CAN)Unstructured Overlays(Narada, NICE)Implicit routingProbabilistic load-balanceSmall routing stateHighly scalable systemsExplicit routing protocolMedium-scale systemsComplex QueriesDistribute with locality?Performance-sensitivityDynamic changes in quality?2Motivating Factors 1) Complex data Massive, distributed data collections (web documents, multimediafiles) Dynamic data (sensor data, network monitoring, etc.) High-dimensional (or at least multi-dimensional)2) Complex queries Similarity searches, nearest-neighbor queries Range queries3) Locality-preserving mapping for better performanceApproach #1: Use DHTs Store attributes of a multi-dimensional object in a DHT Separate entry made for each attribute to allow for queries on individual attributes Can support only exact searches Related objects are hashed to arbitrary locations Exemplified by PIER; PIER supports arbitrary SQL commandsDHTDHTx = 100y = 200Data: Rx:100ÆRy:200ÆR3Approach #2: Hashless DHTs Exemplified by Mercury Maintain a multi-dimensional index as a set of one-dimensional indexes Each index stored in a Chord-like ring without hashingData skews require:1) processor redistibution2) adjustments to “fingers”Support for Multi-dimensional Objects Maintain a collection of rings for one-dimensional indexes Objects are published in all rings Query sent to only one ring Can support complex queries, but incurs storage overhead; not suitable for high-dimensional data[0, 80)[240, 320)[80, 160)[160, 240)[105, 190)[190, 320)[0, 105)RxRyx = 100y = 200Data item50 ≤ x ≤ 150150 ≤ y ≤ 250Query4Approach #3: Hashless CAN Content Addressable Networks (CAN) Manage objects in a multi-dimensional cartesian space Typically, object’s position is obtained by hashing its individual attributes Store high-dimensional objects in unhashed CAN Related objects are adjacent in the CAN space Each processor assigned a zone Exemplified by pSearchx = 10y = 20Data itemx = 11y = 19Data itemPitfalls of Hashless CAN Problem #1: Dimensionality mismatch Dimensionality of CAN is set to the dimensionality of object space Engineering considerations for CAN: Dimensionality should be typically log(N) (where N is number of processors); typically 10-20 Each CAN node has 2*d neighbors in average CAN node also needs to keep track of neighbor’s neighbors for fault-recovery High dimensionality datasets (such as 50 or more) would be impractical to implement using CAN5Pitfalls of Hashless CAN (contd.) Problem #2: Unbalanced load due to skewed data pSearch solution: balance load by having nodes join high-density regions after sampling Sampling when a node joins is not sufficient; need to constantlymonitor changes in load and repartition spacePitfalls of Hashless CAN (contd.) Problem #3: Imbalances in routing state/query processing Some zones have a large number of neighbors Become popular intermediate nodes for routing and processing queries6Summary of System Requirements System should be able to store complex objects Be capable of supporting similarity searches, range queries Should minimize: Storage overhead Routing state Routing costs Should balance: Data distribution Routing state Query/routing costsSkipIndex: Overview Use some hierarchical tree data structure which is used in centralized settings to manage high-dimensional data Distribute tree while retaining locality Nearby tree-nodes maintained by nearby processors Processors manage sub-trees Do not navigate the tree in a top-down manner Instead jump from one part of the tree to another part of the tree using “long” pointers Employ a load-balancing algorithm to ensure that data distribution is balanced7K-D Tree Geometrical partitioning of high-dimensional space based on current data distributionTree Expansion Overloaded regions get split; flexibility in choosing the splitting dimension and position Tree could be unbalanced in depth8Distributing a K-D Tree Take the nodes (or regions) of a K-D tree and distribute it across processors Distributed tree will notbe navigated top-down Consider the nodes of the tree arranged by pre-order traversalR1R2R3R4R5R6R1R2R3R4R5R6R4Skip-List Store the nodes of the K-D Tree in a distributed Skip List Skip List is a randomized balanced-depth data structure Elements are “ordered” at each level of the Skip List Can find any item by starting at the top-level skip list and navigating down; at each element a “comparison” is performedR1R2 R3R4R5R6R1R2R2Level 1Level 2Level 39Ordering of Regions Consider a region being split into two smaller regions Let R1 appear before R2 in one of the dimensions of the coordinate space Then any subregion of R1 is ordered to be “before” any subregion of R2 Two regions can be ordered if their history of region-splits are availableIncreasing coordinatesR1 R2R3R4R1R4Skip Graphs Generalization of a Skip List Adds redundancy for fault-tolerance and to avoid hot-spotsR1R2 R3R4R5R6R1R2R2Level 1Level 2Level 3R6R3R5R3R6R5R410Partial Tree Each skip graph element: corresponds to a tree node or region is maintained by a distinct processor keeps track of its split-history  maintains pointers to peer elements  A processor can construct a “partial tree” based on information regarding its peers Routing: Forward request to a node that is closest to the target region without overshooting it Each element needs to keep track of only log-n neighbors and can perform routing in log-n stepsDistributing K-D Tree Using Skip Graphs Each region is aware only of its peer regions in the skip graphR1R2R4R1R2 R411Routing a Query A region routes a query to its peer that is closest to the target query without overshooting the queryR1R2R4R1R2 R4Routing a QueryR1R2R4R1R2 R4 R2 is ordered to be “before” the query, while R4 appears “after” the query point in the ordering R1 decides to route the query to R2 and not to R412Routing a Query R2 peers with R3 which contains the query point Has information to route the query to its final destination


View Full Document

Yale CPSC 424 - Structured and Unstructured Overlays for Distributed Data

Download Structured and Unstructured Overlays for Distributed Data
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Structured and Unstructured Overlays for Distributed Data and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Structured and Unstructured Overlays for Distributed Data 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?