Berkeley COMPSCI 294 - Pangaea

Taming Aggressive Replication in the Pangaea Wide-area File System
Y. Saito, C. Karamanolis, M. Karlsson, M. Mahalingam
Presented by Jason Waddle

Pangaea: Wide-area File System
- Support the daily storage needs of distributed users.
- Enable ad-hoc data sharing.

Pangaea Design Goals
I. Speed
   - Hide wide-area latency; file access time ~ local file system.
II. Availability & autonomy
   - Avoid single points of failure.
   - Adapt to churn.
III. Network economy
   - Minimize use of the wide-area network.
   - Exploit physical locality.

Pangaea Assumptions (Non-goals)
- Servers are trusted.
- Weak data consistency is sufficient (consistency within seconds).

Symbiotic Design
- Autonomous: each server operates even when disconnected from the network.
- Cooperative: when connected, servers cooperate to enhance overall performance and availability.

Pervasive Replication
- Replicate at the file/directory level.
- Aggressively create replicas: whenever a file or directory is accessed.
- No single "master" replica; a replica may be read or written at any time.
- Replicas exchange updates in a peer-to-peer fashion.

Graph-based Replica Management
- Replicas are connected in a sparse, strongly connected, random graph.
- Updates propagate along graph edges.
- Edges are also used for replica discovery and removal.

Benefits of the Graph-based Approach
- Inexpensive: the graph is sparse, so adding or removing a replica is O(1).
- Available update distribution: as long as the graph is connected, updates reach every replica.
- Network economy: keep high connectivity among close replicas, and build a spanning tree along fast edges.

Optimistic Replica Coordination
- Aim for maximum availability over strong data consistency.
- Any node issues updates at any time.
- Update transmission and conflict resolution happen in the background.
- "Eventual consistency" (~5 s in tests).
- No strong consistency guarantees: no support for locks, lock-files, etc.

Pangaea Structure
[Diagram: servers (nodes) grouped into regions; members of a region are within <5 ms RTT of each other.]

Server Structure
[Diagram: an application's I/O requests pass through the kernel NFS client to the user-space Pangaea server, whose modules include the NFS protocol handler, replication engine, log, and membership; servers also handle inter-node communication.]

Server Modules
- NFS protocol handler: receives requests from applications, updates local replicas, and generates requests to the replication engine.
- Replication engine: accepts local and remote requests, modifies replicas, and forwards requests to other nodes.
- Log module: provides transaction-like semantics for local updates.
- Membership module maintains:
  - the list of regions, their members, and estimated RTTs between regions;
  - the location of root directory replicas.
- Membership information is coordinated by gossiping; "landmark" nodes bootstrap newly joining nodes.
- Maintaining RTT information is the main scalability bottleneck.

File System Structure
- Gold replicas
  - Listed in directory entries.
  - Form a clique in the replica graph.
  - Fixed number (e.g., 3).
- All replicas (gold and bronze)
  - Unidirectional edges to all gold replicas.
  - Bidirectional peer-edges.
  - Backpointer to the parent directory.

[Diagram: example directory tree with /joe containing foo, showing replica edges and backpointers.]

File System Structure (data structures)

    struct Replica
        fid:       FileID
        ts:        TimeStamp
        vv:        VersionVector
        goldPeers: Set(NodeID)
        peers:     Set(NodeID)
        backptrs:  Set(FileID, String)

    struct DirEntry
        fname:     String
        fid:       FileID
        downlinks: Set(NodeID)
        ts:        TimeStamp
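These records translate almost directly into a typed language. A minimal sketch in Go, assuming simple string/integer aliases for the ID and timestamp types and a map-based version vector (all representation choices here are illustrative, not from the paper):

    // Sketch of the slide's Replica and DirEntry records, translated to Go.
    // FileID, NodeID, and the version-vector representation are assumptions;
    // the paper's actual encodings may differ.
    package pangaea

    type FileID string
    type NodeID string
    type TimeStamp int64

    // VersionVector maps each node to the latest update it has issued.
    type VersionVector map[NodeID]uint64

    // Backptr names this file's entry in a parent directory.
    type Backptr struct {
        Parent FileID // fid of the parent directory
        Name   string // this file's name within that directory
    }

    // Replica is the per-node state for one replicated file or directory.
    type Replica struct {
        Fid       FileID
        Ts        TimeStamp
        Vv        VersionVector
        GoldPeers map[NodeID]bool // edges to the gold-replica clique
        Peers     map[NodeID]bool // bidirectional peer-edges
        Backptrs  []Backptr       // parent-directory backpointers
    }

    // DirEntry is one name in a directory, with hints (downlinks) about
    // which nodes hold gold replicas of the target file.
    type DirEntry struct {
        Fname     string
        Fid       FileID
        Downlinks map[NodeID]bool
        Ts        TimeStamp
    }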
g=3)One on current serverOthers on random servers from different regions Create entry in parent directory Flood updatesUpdate to parent directoryFile contents (empty) to gold replicasReplica Creation Recursively get replicas for ancestor directories Find a close replica (shortcutting) Send request to the closest gold replica Gold replica forwards request to its neighbor closest to requester, who then sendsReplica Creation Select m peer-edges (e.g., m=4) Include a gold replica (for future shortcutting) Include closest neighbor from a random gold replica Get remaining nodes from random walks starting at a random gold replica Create m bidirectional peer-edgesBronze Replica Removal To recover disk spaceUsing GD-Size algorithm, throw out largest, least-accessed replica Drop useless replicasToo many updates before an access (e.g., 4) Must notify peer-edges of removal; peers use random walk to choose new edgeReplica Updates Flood entire file to replica graph neighbors Updates reach all replicas as long as the graph is strongly connected Optional: user can block on update until all neighbors reply (red-button mode) Network economy???Optimized Replica Updates Send only differences (deltas)Include old timestamp, new timestampOnly apply delta to replica if old timestamp matchesRevert to full-content transfer if necessary Merge deltas when possible Optimized Replica Updates Don’t send large (e.g., > 1KB) updates to each of m neighbors  Instead, use harbingers to dynamically build a spanning-tree update graphHarbinger: small message with update’s timestamps Send updates along spanning-tree edges Happens in two phasesOptimized Replica Updates Exploit Physical Topology Before pushing a harbinger to a neighbor, add a random delay ~ RTT (e.g., 10*RTT)Harbingers propagate down fastest links firstDynamically builds an update spanning-tree with fast edgesUpdate Example (Phase 1)ABCD EFUpdate Example (Phase 1)ABCD EFUpdate Example (Phase 1)ABCD EFUpdate Example (Phase 1)ABCD EFUpdate Example (Phase 1)ABCD EFUpdate Example (Phase 1)ABCD EFUpdate Example (Phase 2)ABCD EFUpdate Example (Phase 2)ABCD EFUpdate Example (Phase 2)ABCD EFConflict Resolution Use a combination of version vectors and last-writer wins to resolve If timestamps mismatch, full-content is transferredMissing update: just overwrite replicaRegular File Conflict(Three Solutions)1) Last-writer-wins, using update timestamps• Requires server clock synchronization2) Concatenate both updates • Make the user fix it3) Possibly application-specific resolutionDirectory Conflictalice$ mv /foo /alice/foo bob$ mv /foo /bob/fooDirectory Conflictalice$ mv /foo /alice/foo bob$ mv /foo /bob/foo/alice replica set/bob

