UT Dallas CS 6350 - BigDataHadoop_PPT_Lesson03

Copyright 2014, Simplilearn, All rights reserved

Lesson 3: Hadoop Architecture
Big Data and Hadoop Developer

Objectives
By the end of this lesson, you will be able to:
● Describe the use of commodity hardware in Hadoop
● Explain the various configurations and services of Hadoop
● Differentiate between a regular file system and the Hadoop Distributed File System (HDFS)
● Explain the HDFS architecture

Hadoop Cluster Using Commodity Hardware: Key Terms
Some key terms used while discussing Hadoop architecture:
● Commodity hardware: inexpensive, off-the-shelf PCs or servers that can be used to build a cluster
● Cluster: an interconnection of systems in a network
● Node: a commodity server interconnected with others through a network device

Hadoop Cluster Using Commodity Hardware
Hadoop supports a distributed architecture. The diagram represents nodes connected in racks, each installed with Hadoop.
● The number of nodes in a rack depends on the network speed.
● The uplink from a node to its rack switch is typically 1 Gb/s.
● The uplink from rack to rack is typically 3 to 4 Gb/s.

Hadoop Configuration
Standalone, pseudo-distributed, and fully distributed are the three modes of Hadoop configuration:
● Standalone mode: all Hadoop services run in a single JVM on a single machine.
● Pseudo-distributed mode: each Hadoop service runs in its own JVM, but on a single machine.
● Fully distributed mode: Hadoop services run in different JVMs on different machines that belong to one cluster.
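As a concrete illustration of pseudo-distributed mode, a minimal configuration might look like the sketch below. The property names (fs.defaultFS, dfs.replication) are standard Hadoop configuration keys; the host, port, and replication value are example choices for a single-machine setup, not prescribed by this lesson.

```xml
<!-- core-site.xml: point the file system at a NameNode on this machine -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single machine can hold only one copy of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

With no configuration at all, Hadoop falls back to standalone mode; a fully distributed cluster would instead point fs.defaultFS at the NameNode's hostname and keep the default replication of 3.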
Hadoop Core Services
The core services of Hadoop are:
● NameNode
● DataNode
● JobTracker
● TaskTracker
● Secondary NameNode

Apache Hadoop Core Components
Hadoop HDFS and Hadoop MapReduce are the core components of Hadoop.

Hadoop Core Components: HDFS
The key features of Hadoop HDFS are as follows:
● provides high-throughput access to data blocks;
● provides a limited interface for managing the file system, which allows it to scale; and
● creates multiple replicas of each data block and distributes them on computers throughout the cluster to enable reliable and rapid data access.

Hadoop Core Components: MapReduce
The key features of Hadoop MapReduce are as follows:
● performs distributed data processing using the MapReduce programming paradigm;
● lets users define a map phase, which is a parallel, share-nothing processing of the input; and
● aggregates the output of the map phase in a user-defined reduce phase that runs after the map process.
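The map and reduce phases described above can be sketched in plain Python. This is a toy, single-process simulation of the classic word-count example, not the Hadoop API; the function names are chosen here for illustration.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values for one key
    return key, sum(values)

lines = ["big data big cluster", "big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real cluster, the map calls run in parallel on many nodes (share-nothing), and the shuffle moves each key's values to the node running its reduce task.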
Regular File System vs. HDFS
A simple comparison between a regular file system and HDFS:

Regular file system:
● Each block of data is small, typically a few kilobytes (e.g., 4 KB).
● Large data access suffers from disk I/O problems, mainly because of multiple seek operations.

HDFS:
● Each block of data is very large: 64 MB by default.
● Reads huge data sequentially after a single seek.

HDFS: Characteristics
The basic characteristics of HDFS that make it popular are:
● High fault tolerance
● High throughput
● Suitable for applications with large data sets
● Suitable for applications with streaming access to file system data
● Can be built on commodity hardware and heterogeneous platforms

HDFS: Key Features
Some key features of HDFS:
● HDFS creates multiple replicas of each data block and distributes them on computers throughout a cluster to enable reliable and rapid access.
● HDFS is the storage system for both the input and output of MapReduce jobs. A Hadoop file URL is specified in the form hdfs://filename.
● Block storage metadata records the physical location of each block and its replicas within the cluster.
● Each block is replicated to a small number of physically separate machines.

HDFS Architecture
HDFS architecture can be summarized as follows:
● The NameNode and Secondary NameNode services constitute the master service; the DataNode service is the slave service.
● The master service is responsible for accepting a job from clients and ensuring that the data required for the operation is loaded and segregated into chunks of data blocks.
● HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks that are stored and replicated in DataNodes.
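The effect of block size on seek overhead can be illustrated with a short Python sketch. It simply counts how many blocks (and hence, roughly, how many seeks) a file of a given size requires; the 4 KB figure is a typical regular file system block size, and 64 MB is the HDFS default from the comparison above.

```python
def blocks_needed(file_size, block_size):
    # Number of fixed-size blocks (and, roughly, seeks) needed
    # to read the whole file; ceiling division
    return -(-file_size // block_size)

ONE_GB = 1024 ** 3
regular_fs = blocks_needed(ONE_GB, 4 * 1024)       # 4 KB blocks
hdfs = blocks_needed(ONE_GB, 64 * 1024 * 1024)     # 64 MB HDFS default

print(regular_fs)  # 262144 blocks
print(hdfs)        # 16 blocks
```

Reading 1 GB sequentially, HDFS touches 16 large blocks where a 4 KB file system would touch over a quarter of a million, which is why large sequential reads favor HDFS while small random reads do not.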
● The data blocks are then distributed to the DataNode systems within the cluster, ensuring that replicas of the data are maintained.

HDFS Operation Principle
The HDFS components comprise different servers: the NameNode, DataNodes, and the Secondary NameNode.

NameNode server (single instance):
● Maintains the file system namespace
● Manages the files and directories in the file system tree
● Stores information in the namespace image and the edit log
● Knows the DataNodes on which all the blocks for a given file exist
● Is a critical single point of failure

DataNode server (multiple instances):
● Associated with the data storage locations in the file system
● Reports to the NameNode periodically with lists of the blocks it stores
● Stores and retrieves blocks when requested by clients or the NameNode
● Serves read and write requests, and performs block creation, deletion, and replication on instruction from the NameNode

Secondary NameNode server (single instance):
● Not exactly a hot backup of the actual NameNode server
● Used for recovery of the NameNode in case of NameNode failure
● Periodically merges the edit log into a checkpoint of the namespace image
● The checkpointed image lags behind the live state, so total recovery is impossible

HDFS 1.0
HDFS is the place where data
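The division of labor above can be sketched as a toy in-memory model: the NameNode holds only metadata (which blocks make up a file and where their replicas live), while actual block storage belongs to the DataNodes. This is pure Python for illustration, not the Hadoop API; the block size and replication factor mirror the HDFS 1.x defaults of 64 MB and 3, and the round-robin placement is a simplification of HDFS's rack-aware policy.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # HDFS 1.x default block size
REPLICATION = 3                # HDFS default replication factor

class ToyNameNode:
    """Holds only metadata: file -> blocks, and block -> replica locations."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.namespace = {}         # filename -> list of block ids
        self.block_locations = {}   # block id -> list of DataNode names
        self._ids = itertools.count()

    def create_file(self, name, size):
        # Split the file into fixed-size blocks and place each block's
        # replicas on REPLICATION distinct DataNodes (round-robin here;
        # real HDFS placement is rack-aware).
        n_blocks = -(-size // BLOCK_SIZE)  # ceiling division
        blocks = []
        for i in range(n_blocks):
            block_id = next(self._ids)
            start = (i * REPLICATION) % len(self.datanodes)
            replicas = [self.datanodes[(start + r) % len(self.datanodes)]
                        for r in range(REPLICATION)]
            self.block_locations[block_id] = replicas
            blocks.append(block_id)
        self.namespace[name] = blocks
        return blocks

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
blocks = nn.create_file("/logs/day1", 200 * 1024 * 1024)  # 200 MB -> 4 blocks
print(len(blocks))                    # 4
print(nn.block_locations[blocks[0]])  # ['dn1', 'dn2', 'dn3']
```

Note that this object is exactly what makes the NameNode a single point of failure: if the namespace and block_locations maps are lost, the blocks on the DataNodes are still on disk but can no longer be assembled into files.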

