Hadoop File System
B. Ramamurthy

Outline: Reference; Basic Features: HDFS; Fault tolerance; Data Characteristics; MapReduce; Architecture; Namenode and Datanodes; HDFS Architecture; File system Namespace; Data Replication; Replica Placement; Replica Selection; Safemode Startup; Filesystem Metadata; Namenode; Datanode; Protocol; The Communication Protocol; Robustness; Objectives; DataNode failure and heartbeat; Re-replication; Cluster Rebalancing; Data Integrity; Metadata Disk Failure; Data Organization; Data Blocks; Staging; Staging (contd.); Replication Pipelining; API (Accessibility); Application Programming Interface; FS Shell, Admin and Browser Interface; Space Reclamation; Summary

Reference

- The Hadoop Distributed File System: Architecture and Design, The Apache Software Foundation.

Basic Features: HDFS

- Highly fault-tolerant
- High throughput
- Suitable for applications with large data sets
- Streaming access to file system data
- Can be built out of commodity hardware

Fault tolerance

- Failure is the norm rather than the exception.
- An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
- With such a huge number of components, each having a non-trivial probability of failure, some component is always non-functional.
- Detection of faults and quick, automatic recovery from them is therefore a core architectural goal of HDFS.

Data Characteristics

- Streaming data access: applications need streaming access to data.
- Batch processing rather than interactive user access.
- Large data sets and files: gigabytes to terabytes in size.
- High aggregate data bandwidth.
- Scales to hundreds of nodes in a cluster.
- Tens of millions of files in a single instance.
- Write-once-read-many: a file, once created, written, and closed, need not be changed; this assumption simplifies coherency.
- A MapReduce application or a web-crawler application fits perfectly with this model.

MapReduce

[Figure: word-count-style MapReduce dataflow over a terabyte-sized input of words (Cat, Bat, Dog, other words): the input is split, each split is processed by a map task, map outputs pass through local combine steps, and reduce tasks produce the output partitions part0, part1, and part2.]

Architecture

Namenode and Datanodes

- Master/slave architecture.
- An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.
- There are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on.
- HDFS exposes a file system namespace and allows user data to be stored in files.
- A file is split into one or more blocks, and the set of blocks is stored on DataNodes.
- DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.

HDFS Architecture

[Figure: HDFS architecture: clients send metadata operations to the Namenode, which holds the metadata (name, replicas, ...; e.g. /home/foo/data, 6, ...) and sends block operations to the DataNodes; clients read and write block data directly from and to DataNodes, which sit in racks (Rack 1, Rack 2) and replicate blocks among themselves.]

File system Namespace

- Hierarchical file system with directories and files: create, remove, move, rename, etc.
- The Namenode maintains the file system namespace; any change to the meta-information of the file system is recorded by the Namenode.
- An application can specify the number of replicas of a file it needs: the replication factor of the file. This information is stored by the Namenode.
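To make the namespace operations and the per-file replication factor concrete, here is a minimal sketch using HDFS's Java FileSystem API. The paths, replication values, block size, and buffer size below are illustrative choices, not values from the slides, and the snippet assumes a reachable cluster configured through the usual core-site.xml/hdfs-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the configured file system

        Path dir = new Path("/home/foo");           // hypothetical paths for illustration
        Path file = new Path("/home/foo/data");

        fs.mkdirs(dir);                             // namespace op: create a directory

        // Create a file with a per-file replication factor of 3 and a 128 MB block size.
        short replication = 3;
        long blockSize = 128L * 1024 * 1024;
        FSDataOutputStream out =
            fs.create(file, true /* overwrite */, 4096 /* buffer */, replication, blockSize);
        out.writeBytes("hello hdfs\n");
        out.close();

        fs.setReplication(file, (short) 2);             // change the replication factor later
        fs.rename(file, new Path("/home/foo/data2"));   // namespace op: rename
        fs.close();
    }
}
```

Both the create-time replication factor and the later setReplication call end up as metadata recorded by the Namenode, which is what the slide above means by "this information is stored by the Namenode."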
Data Replication

- HDFS is designed to store very large files across machines in a large cluster.
- Each file is a sequence of blocks; all blocks in a file except the last are the same size.
- Blocks are replicated for fault tolerance; block size and replication factor are configurable per file.
- The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster; a BlockReport lists all the blocks on that DataNode.

Replica Placement

- The placement of replicas is critical to HDFS reliability and performance; optimizing replica placement distinguishes HDFS from other distributed file systems.
- Rack-aware replica placement aims to improve reliability, availability, and network bandwidth utilization; it remains a research topic.
- A cluster spans many racks, and communication between racks goes through switches; network bandwidth between machines on the same rack is greater than between machines on different racks.
- The Namenode determines the rack id of each DataNode.
- Placing each replica on a unique rack is simple but non-optimal, because writes become expensive.
- For the default replication factor of 3 (another research topic?), replicas are placed: one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack.
- With this policy, one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are distributed evenly across the other racks.

Replica Selection

- For a READ operation, HDFS tries to minimize bandwidth consumption and latency.
- If there is a replica on the reader's own node, that replica is preferred.
- An HDFS cluster may span multiple data centers: a replica in the local data center is preferred over a remote one.

Safemode Startup

- On startup the Namenode enters Safemode; replication of data blocks does not occur in Safemode.
- Each DataNode checks in with a Heartbeat and a BlockReport, and the Namenode verifies that each block has an acceptable number of replicas.
- After a configurable percentage of safely replicated blocks has checked in, the Namenode exits Safemode.
- It then builds the list of blocks that still need replication and proceeds to replicate those blocks on other DataNodes.

Filesystem Metadata

- The HDFS namespace is stored by the Namenode.
- The Namenode uses a transaction log called the EditLog to record every change to the filesystem metadata: for example, creating a new file or changing the replication factor of a file.
- The EditLog is stored in the Namenode's local filesystem.
- The entire filesystem namespace, including the mapping of blocks to files and the filesystem properties, is stored in a file called the FsImage, also kept in the Namenode's local filesystem.

Namenode

- Keeps an image of the entire file system namespace and the file-to-block map (Blockmap) in memory; 4 GB of local RAM is sufficient to support these data structures, which represent a huge number of files and directories.
- When the Namenode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the FsImage, and stores a copy of the updated FsImage back to the filesystem as a checkpoint.
- Checkpointing is done periodically, so that the system can recover to the last checkpointed state in case of a crash.
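Since the section closes on the FsImage/EditLog checkpoint cycle, a toy model may help fix the idea. This is not Hadoop's actual code: the Edit record, the operation names, and the path-to-replication-factor map are invented simplifications of the real metadata structures.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of Namenode startup: load FsImage, replay EditLog, write a checkpoint. */
public class CheckpointSketch {
    // In-memory namespace: path -> replication factor (real metadata is far richer).
    private final Map<String, Short> namespace = new LinkedHashMap<>();

    // One EditLog record: an operation plus its arguments (invented for illustration).
    record Edit(String op, String path, short replication) {}

    void loadFsImage(Map<String, Short> image) {
        namespace.putAll(image);                  // start from the last checkpointed state
    }

    void replay(List<Edit> editLog) {
        for (Edit e : editLog) {                  // apply every logged transaction in order
            switch (e.op()) {
                case "CREATE" -> namespace.put(e.path(), e.replication());
                case "SET_REPLICATION" -> namespace.replace(e.path(), e.replication());
                case "DELETE" -> namespace.remove(e.path());
            }
        }
    }

    Map<String, Short> checkpoint() {
        // Persisting this merged map as the new FsImage would let the EditLog be truncated.
        return new LinkedHashMap<>(namespace);
    }

    public static void main(String[] args) {
        CheckpointSketch nn = new CheckpointSketch();
        nn.loadFsImage(Map.of("/home/foo/data", (short) 3));
        nn.replay(List.of(
            new Edit("CREATE", "/home/foo/log", (short) 2),
            new Edit("SET_REPLICATION", "/home/foo/data", (short) 6)));
        System.out.println(nn.checkpoint());      // merged state = the new checkpoint
    }
}
```

The point of periodic checkpointing is exactly the merge shown in main: once the replayed state is persisted as a new FsImage, only the edits made after that point must be replayed on recovery, bounding restart time after a crash.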