Big Data and Hadoop Developer
Lesson 6: Advanced HDFS and MapReduce

Copyright 2014 Simplilearn. All rights reserved.

Objectives

By the end of this lesson, you will be able to:
- Explain advanced HDFS and related concepts
- Identify the steps to decommission a DataNode
- Explain advanced MapReduce concepts
- Describe the various joins in MapReduce

Advanced HDFS: Introduction

The Hadoop Distributed File System (HDFS) is a block-structured distributed file system. It is designed to run on small commodity machines so that jobs perform better than they would on a single standalone dedicated server.

Some of the settings in HDFS:
- HDFS benchmarking
- Setting up the HDFS block size
- Decommissioning a DataNode

HDFS Benchmarking

HDFS benchmarking verifies that the HDFS cluster is properly set up and performs as the administrator expects. You will use DFSIO (the TestDFSIO benchmark) to test the input/output (I/O) performance of the HDFS cluster.

(Image on the original slide: command for the write operation.)

HDFS Benchmarking (contd.)

(Image on the original slide: command for the read operation.)

This benchmark reads the file written by the write command executed earlier.

Setting Up HDFS Block Size

HDFS stores files across a cluster by breaking data into fixed-size blocks. The default block size is 64 MB (the Hadoop 1.x default; Hadoop 2.x and later default to 128 MB). In the next two steps, the block size is increased to 128 MB.

Follow these steps to set up the HDFS block size:
1. Open the hdfs-site.xml file.
2. Add the block size property to hdfs-site.xml and save the file.

Setting Up HDFS Block Size: Step 1

Open hdfs-site.xml using the command:
    sudo vi /usr/hadoop/conf/hdfs-site.xml

Setting Up HDFS Block Size: Step 2

Add the block size property to hdfs-site.xml and save the file. This ensures that any file uploaded to HDFS from now on uses a block size of 128 MB.

(Image on the original slide: block size property in hdfs-site.xml.)

Decommissioning a DataNode

Decommissioning refers to disconnecting DataNode servers from the cluster's network. Perform the following steps to decommission a DataNode:
1. Create a file named exclude in the /usr/local/hadoop/conf location.
2. Type the IP addresses of the nodes.
3. Save the file.
4. Run the command to decommission the IP addresses.

Decommissioning a DataNode: Step 1

Create a file named exclude in the /usr/local/hadoop/conf location.

(Image on the original slide: command that creates the exclude file.)

Decommissioning a DataNode: Step 2

Type the IP addresses of the nodes that have to be decommissioned.

(Image on the original slide: contents of the exclude file.)

Decommissioning a DataNode: Steps 3 and 4

Save the file, then run the command that decommissions the IP addresses specified in the exclude file.

(Image on the original slide: decommission command.)

Business Scenario

Olivia is the EVP, IT Operations at Nutri Worldwide, Inc. Her team is setting up the Hadoop infrastructure for the organization. After performing the setup steps, Olivia and her team decide to test the effectiveness of the HDFS infrastructure. The demos in this lesson illustrate how to set up HDFS and write simple MapReduce programs.

Demo 1: HDFS demo

Advanced MapReduce

Hadoop MapReduce uses data types when it works with user-given mappers and reducers. The data is read from files into mappers and emitted by mappers to reducers. The reducers process the data and emit their results, which go into output files. At every step,
data is stored in Java objects.

Writable Data Types

In the Hadoop environment, objects that can be written to or read from files and sent across the network must implement a particular interface called Writable. The Writable interface allows Hadoop to read and write data in a serialized form for transmission.

Interfaces

The interfaces in Hadoop are as follows:

Writable: allows Hadoop to read and write data in a serialized form for transmission. Its methods are:
    public void readFields(DataInput in)
    public void write(DataOutput out)

WritableComparable: extends the Writable interface so that the data can be used as a key, not only as a value. It adds:
    int compareTo(Object what)
    int hashCode()

Data Types in Hadoop

The table lists a few important data types and their functions.

Data type          Function
Text               Stores String data
IntWritable        Stores Integer data
LongWritable       Stores Long data
FloatWritable      Stores Float data
DoubleWritable     Stores Double data
BooleanWritable    Stores Boolean data
ByteWritable       Stores Byte data
NullWritable       Placeholder when a value is not needed

Data Types in Hadoop (contd.)

Sample data types related to the Writable interface:

Writable: defines a de/serialization protocol. Every data type in Hadoop is a Writable.
WritableComparable: defines a sort order. All keys must be of this type, but values need not be.
IntWritable, LongWritable, Text: concrete classes for different data types.
SequenceFiles: a binary encoding of a sequence of key/value pairs.

InputFormats in MapReduce

A MapReduce job can specify how its input is to be read by naming the InputFormat to use. The table below lists some of the key
classes of InputFormats provided by the Hadoop framework.

InputFormat class          Description
KeyValueTextInputFormat    One key-value pair per line
TextInputFormat            Key is the byte offset of the line within the file; value is the line
NLineInputFormat           Similar to TextInputFormat, but each input split contains N lines
MultiFileInputFormat       Input format implemented by a user that aggregates multiple files into one split
SequenceFileInputFormat    The input file is a Hadoop sequence file which contains a
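
To make the difference between the first two InputFormats in the table concrete, here is a minimal plain-Java sketch of the records each one would hand to a mapper. It does not use the Hadoop classes themselves: InputRecordSketch, textRecords, and keyValueRecords are hypothetical names chosen for illustration. It assumes UTF-8 text, newline-terminated lines, and a tab separator (the KeyValueTextInputFormat default), and it models the TextInputFormat key as the byte offset at which each line starts.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; hypothetical class, not part of Hadoop.
class InputRecordSketch {

    // TextInputFormat-style records: key = byte offset of the line, value = the line.
    static List<String[]> textRecords(String fileContents) {
        List<String[]> records = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.add(new String[] {Long.toString(offset), line});
            // +1 accounts for the newline that terminated this line.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // KeyValueTextInputFormat-style records: split each line at the first tab.
    static List<String[]> keyValueRecords(String fileContents) {
        List<String[]> records = new ArrayList<>();
        for (String line : fileContents.split("\n")) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                // No separator: the whole line becomes the key, the value is empty.
                records.add(new String[] {line, ""});
            } else {
                records.add(new String[] {line.substring(0, tab), line.substring(tab + 1)});
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "apple\t3\nbanana\t7\n";
        for (String[] r : textRecords(data)) {
            System.out.println("text: " + r[0] + " -> " + r[1]);
        }
        for (String[] r : keyValueRecords(data)) {
            System.out.println("key-value: " + r[0] + " -> " + r[1]);
        }
    }
}
```

In a real job the choice is made on the job configuration rather than by parsing lines by hand; the sketch only shows what the mapper's input key and value would look like in each case.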
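
The Writable and WritableComparable contracts described in the Interfaces section can be sketched with the standard java.io types alone. The class below is a hypothetical key type (YearTempKey, with made-up year and temperature fields) that follows the same method shapes: write(DataOutput), readFields(DataInput), compareTo, and hashCode. It deliberately does not import Hadoop so that it runs on its own; in an actual job the class would be declared as implementing org.apache.hadoop.io.WritableComparable instead of Comparable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key type for illustration; not taken from the lesson.
class YearTempKey implements Comparable<YearTempKey> {
    int year;
    int temperature;

    // A no-argument constructor lets the framework create an empty
    // instance first and then fill it through readFields.
    YearTempKey() {}

    YearTempKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    // Writable-style serialization: write the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(temperature);
    }

    // Writable-style deserialization: read the fields back in the same order.
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        temperature = in.readInt();
    }

    // Sort order used for keys: by year, then by temperature.
    @Override
    public int compareTo(YearTempKey other) {
        int byYear = Integer.compare(year, other.year);
        return byYear != 0 ? byYear : Integer.compare(temperature, other.temperature);
    }

    @Override
    public int hashCode() {
        return 31 * year + temperature;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearTempKey)) {
            return false;
        }
        YearTempKey k = (YearTempKey) o;
        return year == k.year && temperature == k.temperature;
    }

    // Serializes a key to bytes and reads it back into a fresh instance.
    static YearTempKey roundTrip(YearTempKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            YearTempKey copy = new YearTempKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        YearTempKey copy = roundTrip(new YearTempKey(2014, 31));
        System.out.println(copy.year + " " + copy.temperature);
    }
}
```

The point of the round trip is that readFields must consume exactly what write produced, in the same order; compareTo and hashCode matter only when the type is used as a key, which is why values need Writable alone.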