UT Dallas CS 6350 - BigDataHadoop_PPT_Lesson06

Lesson 6—Advanced HDFS and MapReduce
Big Data and Hadoop Developer
Copyright 2014, Simplilearn. All rights reserved.

Objectives

By the end of this lesson, you will be able to:
● Explain advanced HDFS and related concepts
● Identify the steps to decommission a DataNode
● Explain advanced MapReduce concepts
● Describe the various joins in MapReduce

Advanced HDFS—Introduction

The Hadoop Distributed File System (HDFS) is a block-structured, distributed file system. It is designed to run on clusters of small commodity machines, which together can run jobs faster than a single standalone dedicated server. Some of the HDFS administration tasks covered in this lesson are:
● Decommissioning a DataNode
● Setting up the HDFS block size
● HDFS benchmarking

HDFS Benchmarking

HDFS benchmarking refers to verifying that the HDFS cluster is set up properly and performs as the administrator expects. You will use DFSIO to test the Input/Output (I/O) performance of the HDFS cluster. A command you can use for the write operation is sketched at the end of this section.

HDFS Benchmarking (contd.)

A command for the read operation is also sketched at the end of this section. This benchmark uses the files written by the write command executed earlier.

Setting Up HDFS Block Size

HDFS stores files across a cluster by breaking the data into fixed-size blocks. The default size of a block is 64 MB. In the next two steps, the block size is increased to 128 MB. Here are the steps you need to follow to set up the HDFS block size:
1. Open the hdfs-site.xml file.
2. Add the block-size property to hdfs-site.xml and save the file.

Setting Up HDFS Block Size—Step 1

Open hdfs-site.xml using the command:

sudo vi /usr/hadoop/conf/hdfs-site.xml

Setting Up HDFS Block Size—Step 2

Add the block-size property to hdfs-site.xml and save the file (a sketch of the property appears below). This ensures that the block size of any file uploaded to HDFS from now on will be 128 MB.

Decommissioning a DataNode

Decommissioning refers to gracefully disconnecting DataNode servers from the cluster's network. You will need to perform the following steps to decommission a DataNode:
1. Create a file named 'exclude' in the /usr/local/hadoop/conf location.
2. Type the IP addresses of the nodes to be decommissioned.
3. Save the file.
4. Run the command to decommission the listed IP addresses.

Decommissioning a DataNode—Step 1

Create a file named 'exclude' in the /usr/local/hadoop/conf location (a sketch of the command appears below).

Decommissioning a DataNode—Step 2

Type the IP addresses of the nodes that have to be decommissioned, one per line.

Decommissioning a DataNode—Steps 3 and 4

Save the file and run the refresh command (sketched below) so the NameNode decommissions the IP addresses specified in the 'exclude' file.
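
A minimal sketch of the DFSIO write benchmark described in the HDFS Benchmarking section above. The test jar's exact name and location vary by Hadoop version (the wildcard path and $HADOOP_HOME here are assumptions), and the file count and size are illustrative.

# Write 10 files of 1,000 MB each and measure HDFS write throughput
hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

TestDFSIO reports throughput and average I/O rate, appending the figures to a local results log (TestDFSIO_results.log by default).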
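
A matching sketch for the read benchmark. It reads back the files created by the write run, so run it with the same -nrFiles and -fileSize values.

# Read the same 10 files back and measure HDFS read throughput
hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

Running TestDFSIO with -clean afterwards removes the benchmark files from HDFS.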
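
A sketch of the block-size property from Setting Up HDFS Block Size—Step 2, assuming the Hadoop 1.x property name dfs.block.size (Hadoop 2.x and later use dfs.blocksize). 134217728 bytes is 128 MB.

<!-- hdfs-site.xml: default block size for files written from now on -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB, in bytes -->
</property>

Note that existing files keep the block size they were written with; only files uploaded after the change use 128 MB blocks.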
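
A sketch of the decommissioning steps. It assumes the NameNode's hdfs-site.xml already points dfs.hosts.exclude at the exclude file; that wiring and the IP addresses shown are illustrative assumptions.

<!-- hdfs-site.xml on the NameNode -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/usr/local/hadoop/conf/exclude</value>
</property>

# Step 1: create the exclude file
sudo vi /usr/local/hadoop/conf/exclude

# Step 2: list one IP address (or hostname) per line, for example:
#   192.168.1.101
#   192.168.1.102

# Steps 3 and 4: save the file, then make the NameNode re-read it
hadoop dfsadmin -refreshNodes   # 'hdfs dfsadmin -refreshNodes' on Hadoop 2.x and later

The NameNode then re-replicates the node's blocks elsewhere and marks the node Decommissioned once re-replication completes.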
Business Scenario

Olivia is the EVP of IT Operations at Nutri Worldwide Inc. Her team is involved in setting up the Hadoop infrastructure for the organization. After performing the steps to set up the Hadoop infrastructure, Olivia and her team decide to test the effectiveness of the HDFS infrastructure. The demos in this lesson illustrate how to set up HDFS and write simple MapReduce programs.

Demo 1: HDFS Demo

Advanced MapReduce

Hadoop MapReduce uses specific data types when it works with user-supplied mappers and reducers. Data is read from files into mappers and emitted by mappers to reducers; the processed data is sent back by the reducers, and data emitted by reducers goes into output files. At every step, the data is stored in Java objects.

Writable data types: In the Hadoop environment, objects that are written to or received from files and across the network must obey a particular interface called Writable, which allows Hadoop to read and write the data in a serialized form for transmission.

Interfaces

The interfaces in Hadoop are as follows:

Writable: allows Hadoop to read and write data in a serialized form for transmission.

interface Writable {
    void readFields(DataInput in) throws IOException;
    void write(DataOutput out) throws IOException;
}

WritableComparable: extends the Writable interface so that the data can be used as a key; values only need to be Writable. Key types additionally provide:

int compareTo(Object what)   // defines the sort order of keys
int hashCode()               // used to partition keys across reducers

Data Types in Hadoop

The table lists a few important data types and their functions.

Data type         Function
Text              Stores String data
IntWritable       Stores Integer data
LongWritable      Stores Long data
FloatWritable     Stores Float data
DoubleWritable    Stores Double data
BooleanWritable   Stores Boolean data
ByteWritable      Stores Byte data
NullWritable      Placeholder when a value is not needed

Data Types in Hadoop (contd.)

How these types relate to the Writable interface:
● Writable: defines a de/serialization protocol. Every data type in Hadoop is a Writable.
● WritableComparable: defines a sort order. All keys must be of this type (but not values).
● IntWritable, LongWritable, Text, and so on: concrete classes for the different data types.
● SequenceFile: a binary encoding of a sequence of key-value pairs.

InputFormats in MapReduce

A MapReduce job may specify how its input is to be read by specifying an InputFormat. The table below lists some of the key InputFormat classes provided by the Hadoop framework.

InputFormat class         Description
KeyValueTextInputFormat   One key-value pair per line, split at the first tab
TextInputFormat           Key is the byte offset of the line; value is the line itself
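
To make the Writable and WritableComparable contracts concrete, here is a hypothetical composite key class; the name YearTempKey and its fields are invented for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a (year, temperature) pair.
public class YearTempKey implements WritableComparable<YearTempKey> {
    private int year;
    private int temp;

    public YearTempKey() {}  // Hadoop requires a no-arg constructor for deserialization

    public YearTempKey(int year, int temp) { this.year = year; this.temp = temp; }

    @Override
    public void write(DataOutput out) throws IOException {  // serialize fields in order
        out.writeInt(year);
        out.writeInt(temp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {  // deserialize in the same order
        year = in.readInt();
        temp = in.readInt();
    }

    @Override
    public int compareTo(YearTempKey other) {  // sort order: by year, then temperature
        int cmp = Integer.compare(year, other.year);
        return (cmp != 0) ? cmp : Integer.compare(temp, other.temp);
    }

    @Override
    public int hashCode() {  // used by the default HashPartitioner to route keys
        return 31 * year + temp;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearTempKey)) return false;
        YearTempKey k = (YearTempKey) o;
        return year == k.year && temp == k.temp;
    }
}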
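
A hypothetical driver fragment showing how a job selects one of the InputFormat classes from the table above. The class name, input path, and use of the newer org.apache.hadoop.mapreduce API (Job.getInstance, mapreduce.lib.input package) are assumptions; older Hadoop 1.x code uses the org.apache.hadoop.mapred API instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        job.setJarByClass(InputFormatDemo.class);

        // The default is TextInputFormat: key = byte offset (LongWritable),
        // value = the line (Text). To read tab-separated key/value lines
        // instead, switch to KeyValueTextInputFormat (key and value both Text):
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... set mapper, reducer, and output classes here, then submit:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}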