MapReduce The Story of Sam Saliya Ekanayake SALSA HPC Group http salsahpc indiana edu Pervasive Technology Institute Indiana University Bloomington Sam s Mother Believed an apple a day keeps a doctor away Mother Sam An Apple One day Sam thought of drinking the apple He used a the to cut and a to make juice Next Day Sam applied his invention to all the fruits he could find in the fruit basket map reduce A list of values mapped into another list of values which gets reduced into a single value Classical Classical Notion Notion of of MapReduce MapReduce in in Functional Functional Programming Programming 18 Years Later Sam got his first job in JuiceRUs for his talent in making juice Wa i t Now it s not just one basket but a whole container of fruits Fruit s Large data and list of values for output Also they produce a list of juice types separately But Sam had just ONE and ONE NOT ENOUGH Brave Sam Implemented a parallel version of his innovation Each input to a map is a list of key value pairs a p pairs mapped A list of o key value into another list of key value pairs key and Each which outputgets of a grouped map is a by listthe of key value pairs reduced into a list of values a o p Grouped by key The MapReduce in The idea idea of MapReduce in Data Data Each input to a of reduce is a key valueIntensive Computing Intensive Computing list possibly a list of these depending on the grouping hashing mechanism e g a Reduced into a list of values Afterwards Sam realized To create his favorite mix fruit juice he can use a combiner after the reducers If several key value list fall into the same group based on the grouping hashing algorithm then use the blender reducer separately on each of them The knife mapper and blender reducer should not contain residue after use Side Effect Free In general reducer should be associative and commutative That s All Folks We think Sam was you Big Data Mining and Analytics Hadoop Nuts and Bolts Dr Latifur Khan Department of Computer Science University of Texas at Dallas Source http developer yahoo com hadoop tutorial module4 html Commodity Clusters MapReduce is designed to efficiently process large volumes of data by connecting many commodity computers together to work in parallel A theoretical 1000 CPU machine would cost a very large amount of money far more than 1000 singleCPU or 250 quad core machines MapReduce ties smaller and more reasonably priced machines together into a single cost effective commodity cluster 10 Isolated Tasks MapReduce divides the workload into multiple independent tasks and schedule them across cluster nodes A work performed by each task is done in isolation from one another The amount of communication which can be performed by tasks is mainly limited for scalability reasons 11 Data Distribution In a MapReduce cluster data is distributed to all the nodes of the cluster as it is being loaded in An underlying distributed file systems e g GFS splits large data files into chunks which are managed by different nodes in the cluster Input data A large file Node 1 Node 2 Node 3 Chunk of input data Chunk of input data Chunk of input data Even though the file chunks are distributed across several machines they form a single namesapce 12 MapReduce A Bird s Eye View C0 C1 C2 C3 The outputs from the mappers are denoted asmappers M0 intermediate outputs IOs and are brought into a second set of tasks called Reducers IO0 M 1 M 2 M 3 IO1 IO2 IO3 R0 R1 FO 0 FO 1 chunks Map Phase In MapReduce chunks are processed in isolation by tasks called Mappers Reduce Phase The process of bringing together IOs into a setShuffling Data of Reducers is known as shuffling process Reducers The Reducers produce the final outputs FOs Overall MapReduce breaks the data flow into two phases map phase and reduce phase Keys and Values The programmer in MapReduce has to specify two functions the map function and the reduce function that implement the Mapper and the Reducer in a MapReduce program In MapReduce data elements are always structured as key value i e K V pairs The map and reduce functions receive and emit K V pairs Input Splits Intermediate Outputs Final Outputs K V Pairs Map Functio n K V Pairs Reduce Functio n K V Pairs Partitions In MapReduce intermediate output values are not usually reduced together All values with the same key are presented to a single Reducer together More specifically a different subset of intermediate key space is assigned to each Reducer These subsets are known as partitions Different colors represent different keys potentially from different Mappers Partitions are the input to Reducers MapReduce In this part the following concepts of MapReduce will be described Basics A close look at MapReduce data flow Additional functionality Scheduling and fault tolerance in MapReduce Comparison with existing techniques and models 16 Hadoop Since its debut on the computing stage MapReduce has frequently been associated with Hadoop Hadoop is an open source implementation of MapReduce and is currently enjoying wide popularity Hadoop presents MapReduce as an analytics engine and under the hood uses a distributed storage layer referred to as Hadoop Distributed File System HDFS HDFS mimics Google File System GFS 17 Distributed File Systems Highly scalable distributed file system for large data intensive applications E g 10K nodes 100 million files 10 PB Provides redundant storage of massive amounts of data on cheap and unreliable computers Files are replicated to handle hardware failure Detect failures and recovers from them Provides a platform over which other systems like MapReduce BigTable operate Distributed File System Single Namespace for entire cluster Data Coherency Write once read many access model Client can only append to existing files Files are broken up into blocks Typically 128 MB block size Each block replicated on multiple DataNodes Intelligent Client Client can find location of blocks Client accesses data directly from DataNode HDFS Architecture m a n e s e dNameNode e l o fi aN 1 t a D d I Secondary ck l B NameNode Client 2 o 3 Read data DataNode s NameNode Maps a file to a file id and list of MapNodes DataNode Maps a block id to a physical location on disk Hadoop MapReduce A Closer LookNode 1 Node 2 Files loaded from local HDFS store Files loaded from local HDFS store InputFormat file Split Split InputFormat Split Split Split Split file RecordReaders file RR RR RR RR RR RR Input K V pairs RecordReaders Input K V pairs Map Map Map Map
View Full Document