Taming Aggressive Replication in the Pangaea Wide-area File System
Y. Saito, C. Karamanolis, M. Karlsson, M. Mahalingam
Presented by Jason Waddle

Pangaea: Wide-area File System
- Support the daily storage needs of distributed users.
- Enable ad-hoc data sharing.

Pangaea Design Goals
I. Speed
   - Hide wide-area latency; file access time ~ local file system
II. Availability & autonomy
   - Avoid single points of failure
   - Adapt to churn
III. Network economy
   - Minimize use of the wide-area network
   - Exploit physical locality

Pangaea Assumptions (Non-goals)
- Servers are trusted
- Weak data consistency is sufficient (consistency in seconds)

Symbiotic Design
- Autonomous: each server operates even when disconnected from the network.
- Cooperative: when connected, servers cooperate to enhance overall performance and availability.

Pervasive Replication
- Replicate at the file/directory level
- Aggressively create replicas: whenever a file or directory is accessed
- No single "master" replica
- A replica may be read or written at any time
- Replicas exchange updates in a peer-to-peer fashion

Graph-based Replica Management
- Replicas are connected in a sparse, strongly connected, random graph
- Updates propagate along the edges
- Edges are also used for replica discovery and removal

Benefits of the Graph-based Approach
- Inexpensive: the graph is sparse, so adding or removing a replica is O(1)
- Available update distribution: as long as the graph is connected, updates reach every replica
- Network economy: high connectivity for close replicas; build a spanning tree along fast edges

Optimistic Replica Coordination
- Aim for maximum availability over strong data consistency
- Any node issues updates at any time
- Update transmission and conflict resolution happen in the background
- "Eventual consistency" (~5 s in tests)
- No strong consistency guarantees: no support for locks, lock-files, etc.

Pangaea Structure
- Servers (nodes) are grouped into regions (<5 ms RTT)

Server Structure
- [Diagram] An application's I/O requests enter through the kernel NFS client; the user-space Pangaea server (NFS protocol handler, replication engine, log, and membership modules) serves them and performs the inter-node communication.

Server Modules
- NFS protocol handler: receives requests from applications, updates local replicas, generates requests to the replication engine
- Replication engine: accepts local and remote requests, modifies replicas, forwards requests to other nodes
- Log module: transaction-like semantics for local updates
- Membership module maintains:
  - The list of regions, their members, and estimated RTTs between regions
  - The location of root directory replicas
  - This information is coordinated by gossiping; "landmark" nodes bootstrap newly joining nodes
  - Maintaining RTT information is the main scalability bottleneck

File System Structure
- Gold replicas
  - Listed in directory entries
  - Form a clique in the replica graph
  - Fixed number (e.g., 3)
- All replicas (gold and bronze)
  - Unidirectional edges to all gold replicas
  - Bidirectional peer-edges
  - Backpointer to the parent directory
- [Diagram: replica graphs for /joe and /joe/foo]

  struct Replica
      fid: FileID
      ts: TimeStamp
      vv: VersionVector
      goldPeers: Set(NodeID)
      peers: Set(NodeID)
      backptrs: Set(FileID, String)

  struct DirEntry
      fname: String
      fid: FileID
      downlinks: Set(NodeID)
      ts: TimeStamp

File Creation
- Select locations for g gold replicas (e.g., g = 3): one on the current server, the others on random servers in different regions
- Create an entry in the parent directory
- Flood updates: the update to the parent directory, the (empty) file contents to the gold replicas

Replica Creation
- Recursively get replicas for the ancestor directories
- Find a close replica (shortcutting): send the request to the closest gold replica, which forwards it to its neighbor closest to the requester, who then sends the contents
- Select m peer-edges (e.g., m = 4):
  - Include a gold replica
    (for future shortcutting)
  - Include the closest neighbor of a random gold replica
  - Get the remaining nodes from random walks starting at a random gold replica
  - Create m bidirectional peer-edges

Bronze Replica Removal
- To recover disk space: using the GD-Size algorithm, evict the largest, least-accessed replica
- Drop useless replicas: too many updates before an access (e.g., 4)
- Must notify peer-edges of the removal; peers use a random walk to choose a new edge

Replica Updates
- Flood the entire file to the replica-graph neighbors
- Updates reach all replicas as long as the graph is strongly connected
- Optional: the user can block on an update until all neighbors reply (red-button mode)
- But what about network economy?

Optimized Replica Updates
- Send only differences (deltas)
  - Include the old and new timestamps
  - Apply the delta only if the replica's timestamp matches the old one
  - Revert to full-content transfer if necessary
- Merge deltas when possible
- Don't send large (e.g., >1 KB) updates to each of the m neighbors; instead, use harbingers to dynamically build a spanning-tree update graph
  - Harbinger: a small message carrying the update's timestamps
  - Send updates along the spanning-tree edges
  - Happens in two phases

Optimized Replica Updates: Exploit Physical Topology
- Before pushing a harbinger to a neighbor, add a random delay ~ RTT (e.g., 10×RTT)
- Harbingers propagate down the fastest links first
- This dynamically builds an update spanning tree with fast edges

Update Example
- [Animation: phases 1 and 2 of harbinger propagation among nodes A–F]

Conflict Resolution
- Use a combination of version vectors and last-writer-wins to resolve conflicts
- If timestamps mismatch, the full content is transferred
- Missing update: just overwrite the replica

Regular File Conflict (Three Solutions)
1) Last-writer-wins, using update timestamps
   - Requires server clock synchronization
2) Concatenate both updates and make the user fix it
3) Possibly application-specific resolution

Directory Conflict
  alice$ mv /foo /alice/foo
  bob$ mv /foo /bob/foo
- [Diagram: the /alice and /bob replica sets each apply one of the conflicting moves]
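The harbinger mechanism above can be sketched as an event simulation: each node adopts the sender of the first harbinger it receives as its parent, so the spanning tree forms along the fastest links. This is a minimal illustration, not Pangaea's implementation; the slides describe a random forwarding delay on the order of 10×RTT, whereas this sketch uses the link delay itself, which yields the same fastest-edges-first effect, and all names (`build_update_tree`, `rtt`) are invented for the example.

```python
import heapq

def build_update_tree(origin, rtt):
    """Phase 1 sketch: flood a harbinger from `origin` over the replica graph.

    `rtt` maps each node to a dict {neighbor: link delay}. A node takes the
    sender of the *first* harbinger it receives as its parent, so the
    resulting spanning tree follows the fastest paths.
    """
    parent = {}   # node -> tree parent (origin maps to None)
    arrival = {}  # node -> time the first harbinger arrived
    heap = [(0.0, origin, None)]
    while heap:
        t, node, sender = heapq.heappop(heap)
        if node in arrival:
            continue  # a faster harbinger got here first; ignore this one
        arrival[node] = t
        parent[node] = sender
        for nbr, delay in rtt[node].items():
            if nbr not in arrival:
                heapq.heappush(heap, (t + delay, nbr, node))
    return parent

# Two fast edges (A-B, B-C) and one slow edge (A-C): the update to C is
# routed through B rather than over the slow direct link.
rtt = {
    "A": {"B": 1.0, "C": 10.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 10.0, "B": 1.0},
}
print(build_update_tree("A", rtt))  # {'A': None, 'B': 'A', 'C': 'B'}
```

In phase 2, actual update bodies would then flow only along these parent edges, so each large update crosses the slow wide-area link at most once.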
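The delta-update rule from the slides (apply a delta only when the replica's timestamp matches the delta's old timestamp, otherwise fall back to a full-content transfer) can be sketched as below. This is an illustrative model, not Pangaea's API; the class and method names are invented, and a callable stands in for an actual content diff.

```python
class Replica:
    """Toy replica holding a timestamp and content (names are illustrative)."""

    def __init__(self, ts, content):
        self.ts = ts
        self.content = content

    def apply_delta(self, old_ts, new_ts, patch):
        # `patch` is a callable content -> content, standing in for a diff.
        if self.ts != old_ts:
            return False  # base mismatch: caller reverts to full transfer
        self.content = patch(self.content)
        self.ts = new_ts
        return True

    def apply_full(self, new_ts, content):
        # Full-content transfer simply overwrites the replica.
        self.ts = new_ts
        self.content = content

r = Replica(1, "hello")
r.apply_delta(1, 2, lambda c: c + " world")   # applies: base matches
ok = r.apply_delta(1, 3, lambda c: c + "!")   # rejected: stale base (ts is 2)
if not ok:
    r.apply_full(3, "hello world!")           # sender falls back to full copy
```

The timestamp check is what keeps background, out-of-order delta delivery safe: a delta built against a version the receiver no longer has is simply refused rather than silently corrupting the replica.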