Yale CPSC 424 - Structured and Unstructured Overlays for Distributed Data - D2861687

Home> Schools> Yale University> Computer Science (CPSC) > CPSC 424> Structured and Unstructured Overlays for Distributed Data

Yale CPSC 424 - Structured and Unstructured Overlays for Distributed Data

School name Yale University

Course Cpsc 424- Parallel Programming Techniques

Pages 14

Download Save

Unformatted text preview:

1Structured and Unstructured Overlaysfor Distributed DataArvind KrishnamurthyYale UniversityJoint work: Chi Zhang, Anthony Young, Randy WangEvolution of Overlays in P2P SystemsAd-hoc Overlays(Freenet, Gnutella)Structured Overlays(Chord, Pastry, CAN)Unstructured Overlays(Narada, NICE)Implicit routingProbabilistic load-balanceSmall routing stateHighly scalable systemsExplicit routing protocolMedium-scale systemsComplex QueriesDistribute with locality?Performance-sensitivityDynamic changes in quality?2Motivating Factors 1) Complex data Massive, distributed data collections (web documents, multimediafiles) Dynamic data (sensor data, network monitoring, etc.) High-dimensional (or at least multi-dimensional)2) Complex queries Similarity searches, nearest-neighbor queries Range queries3) Locality-preserving mapping for better performanceApproach #1: Use DHTs Store attributes of a multi-dimensional object in a DHT Separate entry made for each attribute to allow for queries on individual attributes Can support only exact searches Related objects are hashed to arbitrary locations Exemplified by PIER; PIER supports arbitrary SQL commandsDHTDHTx = 100y = 200Data: Rx:100ÆRy:200ÆR3Approach #2: Hashless DHTs Exemplified by Mercury Maintain a multi-dimensional index as a set of one-dimensional indexes Each index stored in a Chord-like ring without hashingData skews require:1) processor redistibution2) adjustments to “fingers”Support for Multi-dimensional Objects Maintain a collection of rings for one-dimensional indexes Objects are published in all rings Query sent to only one ring Can support complex queries, but incurs storage overhead; not suitable for high-dimensional data[0, 80)[240, 320)[80, 160)[160, 240)[105, 190)[190, 320)[0, 105)RxRyx = 100y = 200Data item50 ≤ x ≤ 150150 ≤ y ≤ 250Query4Approach #3: Hashless CAN Content Addressable Networks (CAN) Manage objects in a multi-dimensional cartesian space Typically, object’s position is obtained by hashing its individual attributes Store high-dimensional objects in unhashed CAN Related objects are adjacent in the CAN space Each processor assigned a zone Exemplified by pSearchx = 10y = 20Data itemx = 11y = 19Data itemPitfalls of Hashless CAN Problem #1: Dimensionality mismatch Dimensionality of CAN is set to the dimensionality of object space Engineering considerations for CAN: Dimensionality should be typically log(N) (where N is number of processors); typically 10-20 Each CAN node has 2*d neighbors in average CAN node also needs to keep track of neighbor’s neighbors for fault-recovery High dimensionality datasets (such as 50 or more) would be impractical to implement using CAN5Pitfalls of Hashless CAN (contd.) Problem #2: Unbalanced load due to skewed data pSearch solution: balance load by having nodes join high-density regions after sampling Sampling when a node joins is not sufficient; need to constantlymonitor changes in load and repartition spacePitfalls of Hashless CAN (contd.) Problem #3: Imbalances in routing state/query processing Some zones have a large number of neighbors Become popular intermediate nodes for routing and processing queries6Summary of System Requirements System should be able to store complex objects Be capable of supporting similarity searches, range queries Should minimize: Storage overhead Routing state Routing costs Should balance: Data distribution Routing state Query/routing costsSkipIndex: Overview Use some hierarchical tree data structure which is used in centralized settings to manage high-dimensional data Distribute tree while retaining locality Nearby tree-nodes maintained by nearby processors Processors manage sub-trees Do not navigate the tree in a top-down manner Instead jump from one part of the tree to another part of the tree using “long” pointers Employ a load-balancing algorithm to ensure that data distribution is balanced7K-D Tree Geometrical partitioning of high-dimensional space based on current data distributionTree Expansion Overloaded regions get split; flexibility in choosing the splitting dimension and position Tree could be unbalanced in depth8Distributing a K-D Tree Take the nodes (or regions) of a K-D tree and distribute it across processors Distributed tree will notbe navigated top-down Consider the nodes of the tree arranged by pre-order traversalR1R2R3R4R5R6R1R2R3R4R5R6R4Skip-List Store the nodes of the K-D Tree in a distributed Skip List Skip List is a randomized balanced-depth data structure Elements are “ordered” at each level of the Skip List Can find any item by starting at the top-level skip list and navigating down; at each element a “comparison” is performedR1R2 R3R4R5R6R1R2R2Level 1Level 2Level 39Ordering of Regions Consider a region being split into two smaller regions Let R1 appear before R2 in one of the dimensions of the coordinate space Then any subregion of R1 is ordered to be “before” any subregion of R2 Two regions can be ordered if their history of region-splits are availableIncreasing coordinatesR1 R2R3R4R1R4Skip Graphs Generalization of a Skip List Adds redundancy for fault-tolerance and to avoid hot-spotsR1R2 R3R4R5R6R1R2R2Level 1Level 2Level 3R6R3R5R3R6R5R410Partial Tree Each skip graph element: corresponds to a tree node or region is maintained by a distinct processor keeps track of its split-history  maintains pointers to peer elements  A processor can construct a “partial tree” based on information regarding its peers Routing: Forward request to a node that is closest to the target region without overshooting it Each element needs to keep track of only log-n neighbors and can perform routing in log-n stepsDistributing K-D Tree Using Skip Graphs Each region is aware only of its peer regions in the skip graphR1R2R4R1R2 R411Routing a Query A region routes a query to its peer that is closest to the target query without overshooting the queryR1R2R4R1R2 R4Routing a QueryR1R2R4R1R2 R4 R2 is ordered to be “before” the query, while R4 appears “after” the query point in the ordering R1 decides to route the query to R2 and not to R412Routing a Query R2 peers with R3 which contains the query point Has information to route the query to its final destination

View Full Document


School:
Email:
New Password:
Confirm Password:

Yale CPSC 424 - Structured and Unstructured Overlays for Distributed Data

Sign up for free to view:

Please select your school