The Google File System SOSP 2003 Alan Sussman CMSC 818S April 12 2007 Notes Project interim report due Wed 4 18 Final dates Next sets of papers Timur Bloom filters Overview Lots of interesting things here Usual distributed file system concerns performance scalability reliability availability But note the close ties to their applications web crawling web search and more Google Earth Google Maps Froogle distributed file system design is driven by performance needs and special characteristics of the apps Assumptions Application workload and technology environment component failures are normal occurrences so need to monitor detect errors tolerate faults and automatically recover basically large PC clusters each with lots of disk app bugs OS bugs human errors and hardware component failures disks memory connectors networking power supplies files are very large multi GB common each containing many app objects e g web documents and total dataset sizes many TB with billions of objects so standard file system parameter choices may not work e g block sizes Assumptions cont most files mutate by appending new data not overwriting and random writes are rare once written files only read and often only sequentially so optimize performance and guarantee atomicity for append writes and don t bother with client caching no temporal locality co design apps and file system API relaxed consistency model atomic append operation to allow multiple concurrent appends by different clients w o synchronization want high sustained bandwidth low latency not as important for bulk data processing Interface API File system interface but not standard POSIX API most usual file operations create delete open close read write snapshot copies a file or directory tree cheaply using copy on write record append allows multiple concurrent appends to the same file with guaranteed atomicity of each append useful for multi way merging of results and producerconsumer queries multiple producers 1 or more consumers Architecture Architecture cont Files divided into fixed size chunks and each chunk gets an immutable globally unique 64 bit chunk handle from the master at chunk creation time One master maintains all file system metadata namespace access control info mapping from files to chunks current chunk locations chunk lease management garbage collection of orphaned chunks chunk migration between chunkservers talks to each chunkserver periodically in heartbeat messages to send instructions and collect chunkserver state Multiple chunkservers store chunks on local disks as Linux files read write data specified by chunk handle and a byte range or append for write chunk replicated on multiple chunkservers for reliability 3 by default no caching of file data since Linux file system buffer cache does it Multiple clients code for client API linked into apps talks to master for metadata operations and to chunkservers to read write data for app no client caching of file data since apps either stream data or have too large working sets for a cache to help The master One master not really shadow masters later to use global info for chunk placement and replication only accessed for metadata which chunkserver s to access this is cached for a limited time in clients Chunk size is 64MB chunk replica is a Linux file on a chunkserver extended only as needed to minimize internal fragmentation large size is good because client requires fewer metadata accesses to master for chunk location info and location info can be cached in client reduces network overhead since client likely to perform multiple operations on a chunk so can keep persistent TCP connection to chunkserver reduces size of metadata on master so can keep metadata in memory disadvantage is that small files can cause hot spots for many clients accessing the same small file solution is more replication of small files and re working apps Master metadata File and chunk namespaces mappings from files to chunks locations of each chunk s replicas Namespaces and mappings must be persistent for chunk garbage collection replication for data in failed chunkservers chunk migration to balance load and disk space use Size of metadata not a problem acquired from chunkservers at master startup or when a chunkserver joins the cluster so the chunkserver is completely responsible for its own chunk data Periodically scans entire state in background done by logging mutations to an operation log stored on master s disk and replicated on other machines for reliability Chunk location information not stored persistently all stored in memory so fast access less than 64 bytes per 64MB chunk and most chunks are full and same for file namespace Operation log contains record of critical metadata changes also provides a time line to define the order of concurrent operations from multiple clients stored reliably and changes not made visible until changes are persistent like a database transaction log master recovers by replaying log from last checkpoint which is done periodically to keep the log small checkpoint creation is done in the background and when completed written to disk locally and remotely recovery only needs a checkpoint file and subsequent log files Consistency File namespace mutations e g file creation are atomic and handled in master with locking operation log defines a total ordering of the mutations Let s not get into the details but the model essentially provides well defined relaxed consistency guarantees that work well for the target applications And failure of a chunkserver after detection from lost heartbeats and or checksumming is handled by making more replicas so a chunk is lost only if all replicas lost before further replication Apps deal with the relaxed consistency model by using appends instead of overwrites checkpointing and writing self validating self identifying records so can detect inconsistencies if they occur Write Control and Data Flow Mutation performed at all chunk replicas with leases to maintain consistent order across replicas 1 2 3 4 5 6 7 Client asks master for primary replica Master replies with primary and secondaries cached in client Client pushes data to all replicas in any order stored in buffer cache Once all replicas ack client sends write request to primary which assigns serial number and applies mutation Primary forwards write request to all secondaries secondaries apply mutations in serial number order Secondaries reply to primary on completion of mutation
View Full Document
Unlocking...