Copyright 2014, Simplilearn, All rights reserved

Lesson 5—Introduction to MapReduce
Big Data and Hadoop Developer

Objectives
By the end of this lesson, you will be able to:
● Explain the concepts of MapReduce
● List the steps to install Hadoop on an Ubuntu machine
● Explain the roles of user and system

MapReduce—Introduction
MapReduce is a programming model, and an associated implementation, for processing and generating large data sets with parallel, distributed algorithms on a cluster. It allows very large datasets to be processed on clusters of commodity servers. A MapReduce implementation must:
● Specify computation in terms of a map and a reduce function
● Parallelize computation across large-scale clusters of machines
● Ensure efficient communication between the nodes
● Handle machine failures and performance issues

MapReduce—Analogy
Using manual vote counting after an election as an analogy, the MapReduce steps are:
1. Polling booth ballots are counted by tellers in each booth (the poll count per booth)
2. The booth-level counts are combined into a total count for each candidate
3. The totals give the election results
Conclusion: the key reason to perform mapping and reducing is to speed up the execution of a process by splitting it into a number of tasks, thus enabling parallel work rather than individual work.

MapReduce—Example
Count the number of words in:
This quick brown fox jumps over the lazy dog. A dog is a man’s best friend.
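The word-count example above can be sketched locally in Python. This is a minimal stand-in for Hadoop's map and reduce functions (not Hadoop's real API, which is Java); the function names here are illustrative only.

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit a (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield word.strip(".,"), 1

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return word, sum(counts)

def word_count(lines):
    groups = defaultdict(list)  # local stand-in for the shuffle/sort step
    for line in lines:
        for word, one in map_fn(line):
            groups[word].append(one)
    return dict(reduce_fn(w, c) for w, c in groups.items())

text = ["This quick brown fox jumps over the lazy dog.",
        "A dog is a man's best friend."]
counts = word_count(text)
# "dog" appears twice across the two sentences.
```

In real Hadoop, each mapper would process one input split in parallel, and the grouping of values by key would happen in the shuffle and sort phases between the map and reduce stages.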
Map Execution
Map execution consists of the following phases:
● Map phase: reads the assigned input split from HDFS, parses the input into records (key/value pairs), applies the map function to each record, and informs the master node of its completion.
● Partition phase: each mapper determines which reducer will receive each of its outputs. For any key, the destination partition is the same; the number of partitions equals the number of reducers.
● Shuffle phase: fetches input data from all map tasks for the portion corresponding to the reduce task’s bucket.
● Sort phase: merge-sorts all map outputs into a single run.
● Reduce phase: applies the user-defined reduce function to the merged run, with a key and the corresponding list of values as arguments, and writes the output to a file in HDFS.

Map Execution—Distributed Two-Node Environment
MapReduce execution in a distributed two-node environment:
● The mappers on each node are assigned an input split of blocks. Based on the input format, the RecordReader reads the split as key/value pairs.
● The map function is applied to each record to return zero or more new records. The intermediate outputs are stored in the local file system as a file, and a partitioner assigns the records to reducers.
● The intermediate key/value pairs are exchanged by all nodes. The key/value pairs are sorted by key, and the reduce function is applied to them.
● The output is stored in HDFS based on the specified output format.

MapReduce Essentials
The essentials of each MapReduce phase are as follows:
1. Input is specified as key-value pairs.
2. A user-defined map function is applied to each input record to produce a list of intermediate key-value pairs.
3. A user-defined reduce function is called once for each distinct key in the map output, and is passed the list of intermediate values associated with that key.
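The partition, shuffle, and sort phases above can be sketched as follows. This assumes Python's built-in `hash()` as a stand-in for Hadoop's default HashPartitioner; the function names are illustrative, not Hadoop's real API.

```python
from collections import defaultdict

NUM_REDUCERS = 2

def partition(key):
    # For any key the destination partition is the same, and the
    # number of partitions equals the number of reducers.
    return hash(key) % NUM_REDUCERS

def shuffle_and_sort(map_outputs):
    # Shuffle: route every (key, value) pair to its reducer's bucket.
    # Sort: order each bucket by key, as the sort phase would.
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for key, value in map_outputs:
        buckets[partition(key)][key].append(value)
    return [sorted(b.items()) for b in buckets]

pairs = [("dog", 1), ("fox", 1), ("dog", 1), ("lazy", 1)]
buckets = shuffle_and_sort(pairs)
# Both ("dog", 1) pairs land in the same bucket, so a single reducer
# receives the complete value list [1, 1] for "dog".
```

Because the partition function is deterministic, every occurrence of a key reaches the same reducer, which is what lets the reduce phase see the complete list of values for that key.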
MapReduce Jobs
A job is a full MapReduce program, which typically causes multiple map and reduce functions to run in parallel over the life of the program. A task is a map or reduce function executed on a subset of the data. The JobTracker and TaskTracker functions are as follows:
● JobTracker: runs on a separate node and accepts job requests from clients. It divides job requests into tasks, schedules them, and assigns the tasks to slave nodes. It is a single point of failure.
● TaskTracker: manages each active task and communicates regularly with the JobTracker.

MapReduce Engine
The JobTracker and the TaskTracker are responsible for performing MapReduce operations.
● Generally, the JobTracker runs on the master node of the Hadoop cluster, and the TaskTracker service runs on the slave nodes.
● The JobTracker service is responsible for assigning jobs to the DataNodes. Each DataNode hosts a TaskTracker, which performs the tasks submitted by the JobTracker and returns the results to the JobTracker.

MapReduce and Associated Tasks
Some MapReduce concepts and associated tasks are:
● The map process is an initial ingestion and transformation step, in which individual input records can be processed in parallel.
● The reduce process is an aggregation or summarization step, in which all associated records must be processed together as a group.
● TaskTrackers keep track of individual map tasks, which can run in parallel. A map task runs on a particular TaskTracker slave node.
● The JobTracker keeps track of a MapReduce job.
  o It communicates with the TaskTrackers to get the status of job completion.
  o It allocates each map task to a particular TaskTracker node.
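The division of a job into tasks described above can be sketched as a toy scheduler. This is illustrative only and not Hadoop's real API; the round-robin assignment and the node names are assumptions for the sketch.

```python
def assign_tasks(splits, trackers):
    # A JobTracker-like step: create one map task per input split and
    # assign tasks round-robin across the TaskTracker slave nodes.
    assignments = {t: [] for t in trackers}
    for i, split in enumerate(splits):
        assignments[trackers[i % len(trackers)]].append((f"map-{i}", split))
    return assignments

plan = assign_tasks(["split-0", "split-1", "split-2"],
                    ["tt-node-1", "tt-node-2"])
# tt-node-1 receives map-0 and map-2; tt-node-2 receives map-1.
```

The real JobTracker also considers data locality (preferring a TaskTracker on the node that holds the split's blocks), which this sketch omits.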
Hadoop Job Flow
The Hadoop job flow schematic is represented through a possible configuration showing MapReduce’s association with HDFS.

Hadoop Job Work Interaction
The Hadoop job work interaction consists of the following phases:
● Job submission: a client submits a Hadoop MapReduce job in the form of an input file, or a number of input splits of files containing data.
● Distribution of input splits: the JobTracker distributes the input splits to separate TaskTrackers.
● Coordination with TaskTrackers: the JobTracker coordinates with the TaskTrackers.
● Resubmission of tasks: the JobTracker resubmits a task to an alternate TaskTracker if a DataNode fails.
● Success or failure status: the JobTracker gathers the final output and informs the client of the success or failure status.
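The resubmission step above can be sketched as a simple retry loop. This is a hedged illustration under assumed names (`run_with_retry`, `flaky_execute`, the node labels), not Hadoop's real API.

```python
def run_with_retry(task, trackers, execute):
    # JobTracker-like behavior: if a task fails on one TaskTracker,
    # resubmit it to an alternate tracker instead of failing the job.
    for tracker in trackers:
        try:
            return tracker, execute(tracker, task)
        except RuntimeError:
            continue  # this node failed; try the next tracker
    raise RuntimeError(f"task {task!r} failed on all trackers")

def flaky_execute(tracker, task):
    # Simulate a DataNode/TaskTracker failure on the first node.
    if tracker == "tt-node-1":
        raise RuntimeError("node down")
    return "success"

node, status = run_with_retry("map-0", ["tt-node-1", "tt-node-2"],
                              flaky_execute)
# The task succeeds on the alternate tracker, tt-node-2.
```

Only when a task has failed on all candidate trackers does the job as a whole report failure back to the client, matching the success-or-failure status step above.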