UT Dallas CS 6350 - BigDataHadoop_PPT_Lesson13

Big Data and Hadoop Developer
Lesson 13: Hadoop Administration, Troubleshooting, and Security
Copyright 2014 Simplilearn. All rights reserved.

Objectives

By the end of this lesson, you will be able to:
- Explain the different configurations of a Hadoop cluster
- Identify the parameters for performance monitoring and performance tuning
- Explain the configuration of security parameters in Hadoop

Typical Hadoop Core Cluster

A typical Hadoop Core cluster is made up of machines that run a set of cooperating server processes. The machines in the cluster are not required to be homogeneous, but if they have similar processing power, memory, and disk bandwidth, cluster administration becomes much easier:
o Only one set of configuration files and runtime environments needs to be maintained and distributed.

Load Balancer

The balancer is a tool that balances the data load across the cluster; it runs when a user or an application requests it. (A usage sketch for these scripts appears after the cluster configuration slide below.)
- start-balancer.sh: starts the balancer
- stop-balancer.sh: stops the balancer

Commands Used in Hadoop Programming

Different commands are used in Hadoop administration. The JobTracker is expected to run on the machine on which the scripts are executed, and the Hadoop Core servers load their configuration from the files in the configuration directory of the Hadoop Core installation.
- slaves.sh: runs its arguments as a command on each of the hosts listed in the conf/slaves file
- start-mapred.sh and stop-mapred.sh: start and stop the MapReduce servers; they start or stop only the JobTracker and the TaskTracker nodes

Different Configuration Files of the Hadoop Cluster

Configuration files configure the system for a specific task:
- hadoop-env.sh: sets the Hadoop environment settings, such as the Java path, security settings, etc.
- core-site.xml: defines the NameNode and the HDFS temporary directory
- mapred-site.xml: defines the number of reducers and mappers and other settings related to MapReduce operations
- masters: specifies the Secondary NameNode in a clustered environment
- slaves: specifies the DataNodes in a clustered environment

Properties of hadoop-default.xml

hadoop-default.xml sets the parameters that keep the Hadoop cluster consistent with respect to distributed computing. Properties defined through hadoop-default.xml:
- Global properties: settings that must be maintained throughout the cluster
- Logging properties: settings related to log generation and maintenance
- I/O properties: settings related to input and output operations to and from the HDFS cluster
- File system properties: settings related to the input and output files during job execution
- MapReduce properties: settings related to proper job execution, such as the number of mappers
- IPC properties: settings related to inter-process communication

Different Configurations for the Hadoop Cluster

Three critical parameters that must be configured for any Hadoop cluster are:
- hadoop.tmp.dir: used as a temporary directory for both the local file system and HDFS
- fs.default.name: specifies the NameNode machine's hostname and port number
- mapred.job.tracker: defines the host and port on which the MapReduce JobTracker runs
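To make the three parameters above concrete, here is a minimal sketch of how they might be set in core-site.xml and mapred-site.xml. The hostnames, ports, and the temporary path are placeholder assumptions for a Hadoop 1.x cluster, not values from the lesson.

    <!-- core-site.xml: placeholder host and path values -->
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <!-- temporary directory for both the local file system and HDFS -->
        <value>/var/hadoop/tmp</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <!-- NameNode hostname and port -->
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>

    <!-- mapred-site.xml: placeholder host and port -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <!-- host and port on which the MapReduce JobTracker runs -->
        <value>jobtracker.example.com:8021</value>
      </property>
    </configuration>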
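The administration scripts from the Load Balancer and Commands slides above might be used as follows. This is a minimal sketch assuming a Hadoop 1.x installation with $HADOOP_HOME/bin on the PATH; the -threshold value and the uptime command are illustrative.

    # Rebalance HDFS blocks across DataNodes; -threshold is the allowed
    # deviation (in percent) from the average disk utilization.
    start-balancer.sh -threshold 10
    stop-balancer.sh

    # Run a command on every host listed in conf/slaves,
    # e.g., check the uptime of all worker nodes.
    slaves.sh uptime

    # Start and stop only the MapReduce daemons: the JobTracker on this
    # machine and the TaskTrackers on the hosts in conf/slaves.
    start-mapred.sh
    stop-mapred.sh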
Different Configurations for the Hadoop Cluster (contd.)

Three critical parameters that must be configured for any Hadoop DFS are as follows (a configuration sketch for these three directory parameters appears after the tuning parameters below):
- dfs.name.dir: determines where on the local file system the NameNode stores its metadata. It may be a comma- or space-separated list of directories; all the listed directories are used for redundant storage.
- dfs.data.dir: determines where on the local file system a DataNode stores its blocks. It may be a comma- or space-separated list of directories, and data is distributed among them. HDFS replicates storage blocks to multiple DataNodes, and this directory experiences bulk I/O transactions.
- mapred.local.dir: a local directory where the TaskTracker stores intermediate output. It may be a comma-separated list of directories, preferably on different devices, so that I/O is spread among them for increased performance. This directory experiences bulk I/O with a short life.

Port Numbers for Individual Hadoop Services

Port numbers for the individual Hadoop services are as follows:

Port    Parameter                          Description
50030   mapred.job.tracker.http.address    JobTracker administrative web GUI
50070   dfs.http.address                   NameNode administrative web GUI
50010   dfs.datanode.address               DataNode data (block) transfer port
50020   dfs.datanode.ipc.address           DataNode IPC server port
50060   mapred.task.tracker.http.address   Per-TaskTracker web interface
50075   dfs.datanode.http.address          Per-DataNode web interface
50090   dfs.secondary.http.address         Secondary NameNode web interface
50470   dfs.https.address                  NameNode web GUI over HTTPS
50475   dfs.datanode.https.address         Per-DataNode web GUI over HTTPS

Performance Monitoring

The performance of the cluster needs to be monitored to ensure that resources are properly allocated and de-allocated for optimum utilization; this ensures that resources are not left idle. The Hadoop framework provides several APIs that allow external agents to provide monitoring services to the Hadoop Core services. A few such agents are listed below (a sketch of exposing a daemon over JMX appears at the end of this section):
- JMX
- Nagios
- Ganglia
- Chukwa
- FailMon

Performance Tuning

Performance tuning is the practice of making resources participate in a specific job so that the job completes faster and better. Factors considered during performance tuning:
- Network bandwidth
- Disk throughput
- CPU overhead
- Memory

Parameters of Performance Tuning

Performance tuning is carried out using the following parameters:

Parameter                        Function
dfs.datanode.handler.count       Sets the number of server threads for the DataNode
dfs.datanode.du.reserved         Reserves space, in bytes per volume, that HDFS will not use
dfs.replication                  Sets the replication factor
fs.checkpoint.dir                Directory on the Secondary NameNode's local file system where temporary images are stored and merged
mapred.local.dir.minspacestart   Stops the TaskTracker from accepting new tasks when the free space in mapred.local.dir falls below this value
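To make the tuning parameters concrete, the sketch below shows how a few of them might be set. The values are illustrative starting points, not recommendations from the slides; in a Hadoop 1.x layout the HDFS properties would typically live in hdfs-site.xml (a file the slides do not list) and the MapReduce property in mapred-site.xml.

    <!-- hdfs-site.xml: illustrative tuning values only -->
    <configuration>
      <property>
        <name>dfs.datanode.handler.count</name>
        <value>10</value>           <!-- server threads per DataNode -->
      </property>
      <property>
        <name>dfs.datanode.du.reserved</name>
        <value>10737418240</value>  <!-- reserve 10 GB per volume, in bytes -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value>            <!-- block replication factor -->
      </property>
    </configuration>

    <!-- mapred-site.xml: illustrative value only -->
    <configuration>
      <property>
        <name>mapred.local.dir.minspacestart</name>
        <value>1073741824</value>   <!-- stop accepting tasks under 1 GB free -->
      </property>
    </configuration>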

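The three directory parameters from the "Different Configurations for the Hadoop Cluster (contd.)" slide might be laid out as below. This is a minimal sketch: the disk paths are placeholders, and the assumption that dfs.name.dir and dfs.data.dir go in hdfs-site.xml while mapred.local.dir goes in mapred-site.xml reflects common Hadoop 1.x practice rather than the slides themselves.

    <!-- hdfs-site.xml: placeholder paths -->
    <configuration>
      <property>
        <name>dfs.name.dir</name>
        <!-- each listed directory holds a redundant copy of the
             NameNode metadata -->
        <value>/disk1/dfs/name,/disk2/dfs/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <!-- blocks are distributed among the listed directories -->
        <value>/disk1/dfs/data,/disk2/dfs/data</value>
      </property>
    </configuration>

    <!-- mapred-site.xml: placeholder paths -->
    <configuration>
      <property>
        <name>mapred.local.dir</name>
        <!-- short-lived intermediate output; separate devices spread I/O -->
        <value>/disk1/mapred/local,/disk2/mapred/local</value>
      </property>
    </configuration>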

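As one concrete monitoring hookup, the snippet below shows a common way to expose the NameNode JVM over JMX from hadoop-env.sh so that agents such as Nagios or Ganglia can poll it. The port number and the disabled authentication/SSL are illustrative assumptions, not settings from the slides.

    # hadoop-env.sh: expose the NameNode JVM over JMX.
    # Port 8004 and the disabled auth/SSL are illustrative only;
    # a production cluster would enable authentication.
    export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
      -Dcom.sun.management.jmxremote \
      -Dcom.sun.management.jmxremote.port=8004 \
      -Dcom.sun.management.jmxremote.authenticate=false \
      -Dcom.sun.management.jmxremote.ssl=false"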