UH COSC 6360 - CEPH: A SCALABLE, HIGH-PERFORMANCE DISTRIBUTED FILE SYSTEM

CEPH: A SCALABLE, HIGH-PERFORMANCE DISTRIBUTED FILE SYSTEM
S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, C. Maltzahn
U. C. Santa Cruz
OSDI 2006

Contents
•Paper highlights; System objectives; Characteristics of very large systems
•System overview: System architecture (I, II); Key ideas; Decoupling data and metadata; Metadata management; Autonomic distributed object storage
•The client: Performing an I/O; Client synchronization (I, II); Namespace operations
•The MDS cluster: Storing metadata; Dynamic subtree partitioning; Mapping subdirectories to MDSs
•Distributed object storage: Data distribution with CRUSH (I, II); How files are striped; Replication; Data safety; Committing writes; Recovery and cluster updates; Low-level storage management
•Performance and scalability: Impact of replication (I, II); File system performance; Scalability; Impact of MDS cluster size on latency
•Conclusion

Paper highlights
•Yet another distributed file system using object storage devices
•Designed for scalability
•Main contributions
1. Uses hashing to achieve distributed dynamic metadata management
2. A pseudo-random data distribution function replaces object lists

System objectives
•Excellent performance and reliability
•Unparalleled scalability, thanks to
–Distribution of the metadata workload inside the metadata cluster
–Use of object storage devices (OSDs)
•Designed for very large systems
–Petabyte scale (10^6 gigabytes)

Characteristics of very large systems
•Built incrementally
•Node failures are the norm
•Quality and character of the workload change over time

SYSTEM OVERVIEW
•System architecture
•Key ideas
•Decoupling data and metadata
•Metadata management
•Autonomic distributed object storage

System Architecture (I)
[architecture diagram; not included in the text preview]

System Architecture (II)
•Clients
–Export a near-POSIX file system interface
•Cluster of OSDs
–Stores all data and metadata
–Communicates directly with clients
•Metadata server cluster
–Manages the namespace (files + directories)
–Handles security, consistency, and coherence

Key ideas
•Separate data and metadata management tasks
–The metadata cluster does not keep object lists
•Dynamically partition the metadata workload inside the metadata cluster
–Avoids hot spots
•Let the OSDs handle file migration and replication tasks

Decoupling data and metadata
•The metadata cluster handles metadata operations
•Clients interact directly with OSDs for all file I/O
•Low-level block allocation is delegated to the OSDs
•Other object-based systems still require the metadata cluster to hold object lists
–Ceph instead uses a special pseudo-random data distribution function (CRUSH)

Metadata management
•Dynamic subtree partitioning
–Lets Ceph dynamically share the metadata workload among tens or hundreds of metadata servers (MDSs)
–Sharing is dynamic and based on current access patterns
•Results in near-linear performance scaling in the number of MDSs

Autonomic distributed object storage
•The distributed storage layer handles data migration and data replication tasks
•Leverages the computational resources of the OSDs
•Achieves reliable, highly available, scalable object storage
–Reliable: no data losses
–Highly available: accessible almost all the time

THE CLIENT
•Performing an I/O
•Client synchronization
•Namespace operations

Performing an I/O
•When a client opens a file, it
–Sends a request to the MDS cluster
–Receives an i-node number, information about the file size and striping strategy, and a capability
•The capability specifies the operations authorized on the file (capabilities are not yet encrypted)
•The client uses CRUSH to locate the object replicas
•The client releases the capability at close time
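
The open/read path above lends itself to a short illustration. The following Python sketch is only illustrative and is not Ceph client code: the stripe-unit size, the PG count, and the crush_map_pg_to_osds stand-in for CRUSH are assumptions. It shows how a client can turn an (i-node, offset) pair into an object name, a placement group, and a list of OSDs without ever asking the MDS cluster where the data lives.

import hashlib
import random

STRIPE_UNIT = 4 * 1024 * 1024   # assumed stripe-unit size (4 MB), not Ceph's default
PG_COUNT = 1024                 # assumed number of placement groups

def object_for(inode, offset, stripe_unit=STRIPE_UNIT):
    """Striping: the object index for a byte offset is offset // stripe_unit."""
    return f"{inode:x}.{offset // stripe_unit:08x}"

def placement_group(obj_name, pg_count=PG_COUNT):
    """Hash the object name into a placement group (PG)."""
    digest = int(hashlib.sha1(obj_name.encode()).hexdigest(), 16)
    return digest % pg_count

def crush_map_pg_to_osds(pg, cluster_map, replicas=2):
    """Stand-in for CRUSH: a deterministic pseudo-random choice of OSDs for a PG.
    Real CRUSH also honors placement rules and failure domains; this only shows
    the key property that any client holding the cluster map computes the same
    OSD list, so the MDS cluster never has to store object addresses."""
    rng = random.Random(pg)                     # same PG -> same OSD list everywhere
    return rng.sample(cluster_map["osds"], replicas)

# Usage: after the MDS open reply gives the i-node and striping strategy,
# the client can read offset 9 MB by contacting the primary OSD directly.
cluster_map = {"osds": [f"osd.{i}" for i in range(16)]}
obj = object_for(inode=0x1234, offset=9 * 1024 * 1024)
pg = placement_group(obj)
osds = crush_map_pg_to_osds(pg, cluster_map)
print(obj, pg, osds)                            # osds[0] acts as the primary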
Client synchronization (I)
•POSIX requires
–One-copy serializability
–Atomicity of writes
•When the MDS detects conflicting accesses by different clients to the same file, it
–Revokes all caching and buffering permissions
–Requires synchronous I/O to that file

Client synchronization (II)
•Synchronization is handled by the OSDs
–Locks can be used for writes spanning object boundaries
•Synchronous I/O operations have huge latencies
•Many scientific workloads do a significant amount of read-write sharing
–A POSIX extension lets applications synchronize their concurrent accesses to a file

Namespace operations
•Managed by the MDSs
–Read and update operations are all synchronously applied to the metadata
•Optimized for the common case
–readdir returns the contents of the whole directory (as NFS readdirplus does)
•Guarantees serializability of all operations
–Can be relaxed by the application

THE MDS CLUSTER
•Storing metadata
•Dynamic subtree partitioning
•Mapping subdirectories to MDSs

Storing metadata
•Most requests are likely to be satisfied from the MDS in-memory cache
•Each MDS logs its update operations in a lazily flushed journal
–Facilitates recovery
•Directories
–Include i-nodes
–Are stored on an OSD cluster

Dynamic subtree partitioning
•Ceph uses a primary-copy approach to cached metadata management
•Ceph adaptively distributes cached metadata across the MDS nodes
–Each MDS measures the popularity of the data within a directory
–Ceph migrates and/or replicates hot spots

Mapping subdirectories to MDSs
[diagram; not included in the text preview]

DISTRIBUTED OBJECT STORAGE
•Data distribution with CRUSH
•Replication
•Data safety
•Recovery and cluster updates
•EBOFS

Data distribution with CRUSH (I)
•Goal: avoid storing object addresses in the MDS cluster
•Ceph first maps objects into placement groups (PGs) using a hash function
•Placement groups are then assigned to OSDs using a pseudo-random function (CRUSH)
–Clients know that function

Data distribution with CRUSH (II)
•To access an object, a client needs to know
–Its placement group
–The OSD cluster map
–The object placement rules used by CRUSH (replication level, placement constraints)

How files are striped
[diagram; not included in the text preview]

Replication
•Ceph's Reliable Autonomic Distributed Object Store (RADOS) autonomously manages object replication
•The first non-failed OSD in an object's replication list acts as the primary copy
–Applies each update locally
–Increments the object's version number
–Propagates the update

Data safety
•Achieved by the update process (see the sketch below)
1. The primary forwards updates to the other replicas
2. It sends an ACK to the client once all replicas have received the update
3. Replicas send a final commit once they have committed the update to disk
•Slower but safer

Committing writes
[diagram; not included in the text preview]

Recovery and cluster updates
•RADOS monitors OSDs to detect failures
•Recovery is handled by the same mechanism as the deployment of new …
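
The update path on the Replication, Data safety, and Committing writes slides can be summarized in code. The sketch below is an assumption-laden illustration, not RADOS source: the Primary, Replica, and Client classes and their method names are invented, and real OSDs do this asynchronously over the network. It only captures the ordering that matters: the client gets an ACK as soon as every replica holds the update in memory, and a final commit only after every replica has written it to disk.

class Replica:
    """One OSD holding a copy of the object (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.memory = {}   # updates received but not yet durable
        self.disk = {}     # updates committed to disk

    def receive(self, obj, data, version):
        self.memory[obj] = (data, version)   # in memory: enough for the early ACK

    def commit(self, obj):
        self.disk[obj] = self.memory[obj]    # durable: enables the final commit


class Primary(Replica):
    """First non-failed OSD in the object's replication list."""
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas
        self.versions = {}

    def update(self, obj, data, client):
        # Apply the update locally and increment the object's version number.
        version = self.versions.get(obj, 0) + 1
        self.versions[obj] = version
        self.receive(obj, data, version)
        # 1. Forward the update to the other replicas.
        for r in self.replicas:
            r.receive(obj, data, version)
        # 2. ACK to the client once all replicas have received the update.
        client.on_ack(obj, version)
        # 3. Final commit only after every copy has been written to disk
        #    (slower, but safe against a crash of any single node).
        self.commit(obj)
        for r in self.replicas:
            r.commit(obj)
        client.on_commit(obj, version)


class Client:
    def on_ack(self, obj, version):
        print(f"ack: {obj} v{version} received by all replicas")

    def on_commit(self, obj, version):
        print(f"commit: {obj} v{version} durable on all replicas")


primary = Primary("osd.0", [Replica("osd.1"), Replica("osd.2")])
primary.update("1234.00000002", b"new bytes", Client())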

