Duke CPS 214 - The Google File System


The Google File System
By Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (presented at SOSP 2003)

Introduction
- Google is a search engine whose applications process lots of data.
- They need a good file system.
- Solution: the Google File System (GFS).

Motivational Facts
- More than 15,000 commodity-class PCs.
- Multiple clusters distributed worldwide.
- Thousands of queries served per second.
- One query reads hundreds of MB of data.
- One query consumes tens of billions of CPU cycles.
- Google stores dozens of copies of the entire Web!
- Conclusion: Google needs a large, distributed, highly fault-tolerant file system.

Topics
- Design Motivations
- Architecture
- Read/Write/Record Append
- Fault-Tolerance
- Performance Results

Design Motivations
1. Fault tolerance and auto-recovery need to be built into the system.
2. Standard I/O assumptions (e.g. block size) have to be re-examined.
3. Record appends are the prevalent form of writing.
4. Google applications and GFS should be co-designed.

GFS Architecture (Analogy)
- On a single-machine FS: an upper layer maintains the metadata, and a lower layer (i.e. the disk) stores the data in units called "blocks".
- In GFS: a master process maintains the metadata, and a lower layer (i.e. a set of chunkservers) stores the data in units called "chunks".

GFS Architecture (Diagram)
- The client sends requests for metadata to the master and receives metadata responses.
- Read/write requests and responses flow directly between the client and the chunkservers.
- Each chunkserver stores its chunks on a local Linux file system.

GFS Architecture: What is a chunk?
- Analogous to a block, except larger. Size: 64 MB!
- Stored on a chunkserver as a file.
- A chunk handle (~ chunk file name) is used to reference the chunk.
- Each chunk is replicated across multiple chunkservers.
- Note: there are hundreds of chunkservers in a GFS cluster, distributed over multiple racks.

GFS Architecture: What is the master?
- A single process running on a separate machine.
- Stores all metadata:
  - File namespace
  - File-to-chunk mappings
  - Chunk location information
  - Access control information
  - Chunk version numbers
  - Etc.

GFS Architecture: Master <-> Chunkserver Communication
- Master and chunkservers communicate regularly so the master can obtain state:
  - Is the chunkserver down?
  - Are there disk failures on the chunkserver?
  - Are any replicas corrupted?
  - Which chunk replicas does the chunkserver store?
- Master sends instructions to chunkservers:
  - Delete an existing chunk.
  - Create a new chunk.

GFS Architecture: Serving Requests
- A client retrieves the metadata for an operation from the master.
- Read/write data then flows directly between the client and the chunkservers.
- The single master is not a bottleneck, because its involvement in read/write operations is minimized.

Overview
- Design Motivations
- Architecture (Master, Chunkservers, Clients)
- Read/Write/Record Append
- Fault-Tolerance
- Performance Results

And now for the Meat…

Read Algorithm
1. Application originates the read request.
2. GFS client translates the request from (file name, byte range) to (file name, chunk index) and sends it to the master.
3. Master responds with the chunk handle and replica locations (i.e. the chunkservers where the replicas are stored).
4. Client picks a location and sends a (chunk handle, byte range) request to that chunkserver.
5. Chunkserver sends the requested data to the client.
6. Client forwards the data to the application.
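To make these six steps concrete, here is a minimal client-side sketch in Python. It only illustrates the flow: lookup_chunk and read_chunk are hypothetical RPC stubs, not the real GFS client API.

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

    def gfs_read(master, filename, offset, length):
        """Sketch of the GFS client read path (steps 2-6)."""
        # Step 2: translate (file name, byte range) -> (file name, chunk index).
        chunk_index = offset // CHUNK_SIZE
        chunk_offset = offset % CHUNK_SIZE

        # Step 3: the master returns the chunk handle and replica locations.
        handle, replicas = master.lookup_chunk(filename, chunk_index)  # hypothetical RPC

        # Step 4: pick one replica location and send it the request.
        chunkserver = replicas[0]

        # Step 5: that chunkserver returns the requested bytes.
        data = chunkserver.read_chunk(handle, chunk_offset, length)  # hypothetical RPC

        # Step 6: hand the data back to the application.
        return data

For the worked example in the next section, 201_359_161 // 67_108_864 == 3, matching chunk index 3. A byte range spanning a chunk boundary would need one master lookup per chunk; the sketch ignores that case.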
Read Algorithm (Example)
Steps 1-3: An Indexer application requests 2048 bytes from the file crawl_99. The GFS client translates this to (crawl_99, chunk index 3) and sends it to the master. The master consults its metadata for crawl_99:
- ch_1001 -> chunkservers {3, 8, 12}
- ch_1002 -> chunkservers {1, 8, 14}
- ch_1003 -> chunkservers {4, 7, 9}
and responds with (ch_1003, {chunkservers: 4, 7, 9}).

Calculating the chunk index from the byte range (assumption: the file position is 201,359,161 bytes):
- Chunk size = 64 MB = 1024 * 1024 * 64 bytes = 67,108,864 bytes.
- 201,359,161 bytes = 67,108,864 * 3 + 32,569 bytes.
- So the client translates the 2048-byte range at this position into chunk index 3.

Steps 4-6: The client sends (ch_1003, byte range) to one of chunkservers 4, 7, and 9; that chunkserver returns the 2048 bytes of data, which the client forwards to the application.

Write Algorithm
1. Application originates the write request.
2. GFS client translates the request from (file name, data) to (file name, chunk index) and sends it to the master.
3. Master responds with the chunk handle and the (primary + secondary) replica locations.
4. Client pushes the write data to all locations. The data is stored in the chunkservers' internal buffers.
5. Client sends the write command to the primary.
6. Primary determines the serial order for the data instances stored in its buffer (e.g. D1, D2, D3, D4) and writes the instances in that order to the chunk.
7. Primary sends the serial order to the secondaries and tells them to perform the write.
8. Secondaries respond to the primary.
9. Primary responds back to the client.
Note: if the write fails at one of the chunkservers, the client is informed and retries the write. (A sketch of the primary's role appears after the Observations below.)

Record Append Algorithm
An important operation at Google: merging results from multiple machines into one file, and using a file as a producer-consumer queue.
1. Application originates the record append request.
2. GFS client translates the request and sends it to the master.
3. Master responds with the chunk handle and the (primary + secondary) replica locations.
4. Client pushes the write data to all locations.
5. Primary checks whether the record fits in the specified chunk.
6. If the record does not fit, the primary:
   - pads the chunk,
   - tells the secondaries to do the same,
   - and informs the client; the client then retries the append with the next chunk.
7. If the record fits, the primary:
   - appends the record,
   - tells the secondaries to do the same,
   - receives the secondaries' responses,
   - and sends the final response to the client.
(A pad-or-append sketch follows below as well.)

Observations
- Clients can read in parallel.
- Clients can write in parallel.
- Clients can append records in parallel.
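As promised after the write algorithm, here is a minimal Python sketch of the primary's part of steps 6-9. It is an illustration under assumed names (primary_write, chunk.apply, apply_in_order), not the real chunkserver interface.

    def primary_write(buffered_data, chunk, secondaries):
        """Sketch of write steps 6-9, as seen by the primary replica."""
        # Step 6: choose a serial order for the buffered data instances
        # (e.g. D1, D2, D3, D4) and apply them to the local chunk in that order.
        serial_order = list(buffered_data)
        for record in serial_order:
            chunk.apply(record)  # hypothetical local write

        # Step 7: send the chosen serial order to every secondary.
        # Step 8: collect their responses.
        acks = [s.apply_in_order(serial_order) for s in secondaries]  # hypothetical RPC

        # Step 9: respond to the client; on failure the client retries the write.
        return all(acks)

The point the sketch preserves is the design choice: the primary alone picks the serial order, so every replica applies concurrent writes identically.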

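In the same spirit, here is a sketch of the primary's pad-or-append decision in steps 5-7 of record append. Every helper used here (chunk.used, pad_to_end, append, append_at) is a hypothetical stand-in.

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as above

    RETRY_WITH_NEXT_CHUNK = "retry"  # client re-issues the append on the next chunk
    APPEND_OK = "ok"

    def primary_record_append(chunk, record, secondaries):
        """Sketch of record append steps 5-7, as seen by the primary replica."""
        # Step 5: does the record fit in the remaining space of this chunk?
        if chunk.used + len(record) > CHUNK_SIZE:
            # Step 6: pad this chunk, tell the secondaries to do the same,
            # and inform the client, which retries with the next chunk.
            chunk.pad_to_end()                         # hypothetical
            for s in secondaries:
                s.pad_to_end(chunk.handle)             # hypothetical RPC
            return RETRY_WITH_NEXT_CHUNK

        # Step 7: append at an offset chosen by the primary, mirror the append
        # on the secondaries, and answer the client once they all respond.
        offset = chunk.append(record)                  # hypothetical local append
        for s in secondaries:
            s.append_at(chunk.handle, offset, record)  # hypothetical RPC
        return APPEND_OK

Padding trades a little space at the end of a chunk for atomicity: a record never spans two chunks.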

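Finally, stepping back to the Master <-> Chunkserver communication from the architecture section: the GFS paper calls these periodic exchanges HeartBeat messages. Below is a toy sketch of how a master might track chunkserver health and chunk locations from such reports; the class, timeout value, and method names are all assumptions for illustration.

    import time

    HEARTBEAT_TIMEOUT = 60.0  # seconds; an assumed value, not from the slides

    class MasterState:
        """Toy model of the master's view of chunkserver health and chunk locations."""

        def __init__(self):
            self.last_seen = {}        # chunkserver id -> time of last report
            self.chunk_locations = {}  # chunk handle -> set of chunkserver ids

        def handle_report(self, server_id, chunk_handles):
            # A chunkserver periodically reports which chunk replicas it stores.
            self.last_seen[server_id] = time.time()
            for handle in chunk_handles:
                self.chunk_locations.setdefault(handle, set()).add(server_id)

        def is_down(self, server_id):
            # "Is the chunkserver down?" -- no report within the timeout.
            last = self.last_seen.get(server_id)
            return last is None or time.time() - last > HEARTBEAT_TIMEOUT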