Big Data and Hadoop Developer
Lesson 5: Introduction to MapReduce
Copyright 2014 Simplilearn. All rights reserved.

Objectives

By the end of this lesson, you will be able to:
- Explain the concepts of MapReduce
- List the steps to install Hadoop on an Ubuntu machine
- Explain the roles of the user and the system

MapReduce Introduction

MapReduce is a programming model and an associated implementation for processing and generating large data sets with parallel, distributed algorithms on a cluster. A MapReduce operation includes:
- Specifying the computation in terms of a map and a reduce function
- Parallelizing the computation across large-scale clusters of machines
- Handling machine failures and performance issues
- Ensuring efficient communication between the nodes

Running on clusters of commodity servers, MapReduce can be applied to datasets significantly larger than any single server could handle.

MapReduce Analogy

Using manual vote counting after an election as an analogy, the MapReduce steps are:
1. Polling booth ballots: the ballots collected in each booth are the input.
2. Tellers: each teller produces a poll count in each booth (the parallel "map" step).
3. Total count for each candidate: the per-booth counts are combined into the election results (the "reduce" step).

MapReduce Analogy (contd.)

Conclusion: the key reason to perform mapping and reducing is to speed up the execution of a specific process by splitting it into a number of tasks, thus enabling parallel work rather than individual work.

MapReduce Example

Count the number of occurrences of each word in the sentences:
- "This quick brown fox jumps over the lazy dog."
- "A dog is a man's best friend."
(A Java sketch of this example appears at the end of this section.)

Map Execution

Map execution consists of the following phases:
- Map phase:
  o Reads the assigned input split from HDFS
  o Parses the input into records (key/value pairs)
  o Applies the map function to each record
  o Informs the master node of its completion
- Partition phase:
  o Each mapper determines which reducer will receive each of its outputs
  o For any key, the destination partition is the same
  o Number of partitions = number of reducers
  (See the partitioner sketch at the end of this section.)
- Shuffle phase:
  o Fetches the input data from all map tasks for the portion corresponding to the reduce task's bucket
- Sort phase:
  o Merge-sorts all map outputs into a single run
- Reduce phase:
  o Applies the user-defined reduce function to the merged run
  o Arguments: a key and the corresponding list of values
  o Writes the output to a file in HDFS

Map Execution: Distributed Two-Node Environment

MapReduce execution in a distributed two-node environment works as follows:
- The mappers on each of the nodes are assigned an input split of blocks.
- Based on the input format, the RecordReader reads the split as key/value pairs.
- The map function is applied to each record to return zero or more new records.
- The intermediate outputs are stored in the local file system as a file, and a partitioner assigns the records to a reducer.
- The intermediate key/value pairs are exchanged by all nodes.
- The key/value pairs are sorted by key, and the reduce function is applied to them.
- The output is stored in HDFS based on the specified output format.

MapReduce Essentials

The essentials of each MapReduce phase are as follows:
1. Input is specified as key/value pairs.
2. A user-defined map function is applied to each input record to produce a list of intermediate key/value pairs.
3. A user-defined reduce function is called once for each distinct key in the map output; the list of intermediate values associated with that key is passed to it.
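To make the word-count example and the essentials above concrete, the following is a minimal sketch of a mapper and reducer written against the Hadoop Java API. It is an illustration, not code from the original lesson; the class and variable names are assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: parse each input line into words and emit the
        // intermediate key/value pair (word, 1) for every word found.
        public static class WordCountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().toLowerCase().split("\\W+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: called once per distinct word with the list of
        // its counts; summing them gives the total occurrences.
        public static class WordCountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }

For the two sentences on the example slide, the reducer would emit, for instance, ("dog", 2) and ("a", 2) after case-folding.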
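The partition phase guarantees that every record with a given key reaches the same reducer. The sketch below mirrors the behavior of Hadoop's default HashPartitioner; the class name WordPartitioner is illustrative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // hashCode() is deterministic, so every record with a given key
            // maps to the same partition; numReduceTasks equals the number
            // of reducers, giving one partition per reducer. The sign bit
            // is masked so the modulo result is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }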
MapReduce Jobs

A job is a full MapReduce program, which typically causes multiple map and reduce functions to be run in parallel over the life of the program. A task is a single map or reduce function executed on a subset of the data. The JobTracker and TaskTracker functions are as follows:
- JobTracker:
  o Runs on a separate node and accepts job requests from clients
  o Divides job requests into tasks, and schedules and assigns the tasks to slave nodes
  o Is a single point of failure
- TaskTracker:
  o Manages each active task
  o Communicates regularly with the JobTracker

MapReduce Engine

The JobTracker and the TaskTracker are responsible for performing MapReduce operations. Generally, the JobTracker is present on the master node of the Hadoop cluster, and the TaskTracker service is present on the slave nodes. The JobTracker service is responsible for assigning the jobs to the DataNodes. Each DataNode runs a TaskTracker, which performs the tasks submitted by the JobTracker and provides the results back to the JobTracker.

MapReduce and Associated Tasks

Some MapReduce and associated tasks are:
- The map process is an initial ingestion and transformation step in which individual input records can be processed in parallel.
- The reduce process is an aggregation or summarization step in which all associated records must be processed together as a group.
- TaskTrackers keep track of individual map tasks, which can run in parallel; a map task runs on a particular TaskTracker slave node.
- The JobTracker keeps track of a MapReduce job:
  o It communicates with the TaskTrackers to get the status of job completion.
  o It allocates a map task to a particular TaskTracker node.

Hadoop Job Flow: MapReduce Association with HDFS

The Hadoop job flow schematic is represented through one possible configuration. (Figure: schematic of the Hadoop job flow.)

Hadoop Job Work Interaction

The Hadoop job work interaction consists of the following phases (a driver sketch that performs the job submission follows at the end of this section):
1. Job submission: a MapReduce job is submitted by a client in the form of an input file, or a number of input files, containing data.
2. Distribution of input splits: the Hadoop JobTracker distributes the input splits to separate TaskTrackers.
3. Coordination with TaskTrackers: the JobTracker coordinates with the TaskTrackers.
4. Resubmission of tasks: the JobTracker resubmits the task to an alternate TaskTracker if a DataNode fails.
5. Success or failure status: the JobTracker gathers the final output and informs the client with a success or failure status.

Characteristics of MapReduce

Some MapReduce characteristics are as follows:
- It handles very large-scale data (petabytes, exabytes, and so on).
- It works well on write once, read many (WORM) data.
- It allows parallelism without mutexes.
- The map and reduce operations are typically performed by the same physical processor.
- The operations are …
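To connect the job work interaction above to code, here is a minimal sketch of a driver that configures and submits a word-count job through the Hadoop Java API. It assumes the illustrative WordCount classes sketched earlier in this lesson; the job name and the input/output paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");  // job name is arbitrary
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.WordCountMapper.class);
            job.setReducerClass(WordCount.WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
            // Submit the job and block until it finishes; the boolean result
            // carries the success or failure status back to the client.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, such a driver would typically be launched with a command along the lines of "hadoop jar wordcount.jar WordCountDriver /input /output" (jar name and paths are placeholders).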