UT Dallas CS 6350 - 03. MapReduceFunnySlideBigData#2 - D3094404

School name University of Texas at Dallas

Course CS 6350 - Big Data Management and Analytics

Pages 30

This preview shows page 1-2-14-15-29-30 out of 30 pages.


Slide 1
Sam’s Mother
One day
Next Day
18 Years Later
Brave Sam
Afterwards
That’s All Folks!
Slide 9
Commodity Clusters
Isolated Tasks
Data Distribution
MapReduce: A Bird’s-Eye View
Keys and Values
Partitions
MapReduce
Hadoop
Distributed File Systems
Distributed File System
Slide 20
Hadoop MapReduce: A Closer Look
Input Files
InputFormat
InputFormat Types
Input Splits
RecordReader
Mapper and Reducer
Partitioner
Sort
OutputFormat

MapReduce
The Story of Sam
Saliya Ekanayake
SALSA HPC Group http://salsahpc.indiana.edu
Pervasive Technology Institute, Indiana University, Bloomington

Sam’s Mother
Believed “an apple a day keeps a doctor away”
[Figure labels: Mother, Sam, An Apple]

One day
Sam thought of “drinking” the apple
He used a [knife] to cut the [apple] and a [blender] to make juice.

Next Day
Sam applied his invention to all the fruits he could find in the fruit basket
(map ‘( )) ( )
(reduce ‘( ))
Classical Notion of MapReduce in Functional Programming
A list of values mapped into another list of values, which gets reduced into a single value

18 Years Later
Sam got his first job in JuiceRUs for his talent in making juice
Now, it’s not just one basket but a whole container of fruits
Also, they produce a list of juice types separately
[Figure label: Fruits]
Large data and list of values for output
But, Sam had just ONE [knife] and ONE [blender]
NOT ENOUGH !!
Wait!

Brave Sam
Implemented a parallel version of his innovation
Each input to a map is a list of <key, value> pairs
(<a, > , <o, > , <p, > , …)
Each output of a map is a list of <key, value> pairs
(<a’, > , <o’, > , <p’, > , …)
Grouped by key
Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism)
e.g. <a’, ( …)>
Reduced into a list of values
The idea of MapReduce in Data Intensive Computing
A list of <key, value> pairs mapped into another list of <key, value> pairs, which gets grouped by the key and reduced into a list of values

Afterwards
Sam realized,
◦To create his favorite mixed fruit juice he can use a combiner after the reducers
◦If several <key, value-list> fall into the same group (based on the grouping/hashing algorithm) then use the blender (reducer) separately on each of them
◦The knife (mapper) and blender (reducer) should not contain residue after use: Side Effect Free
◦In general, the reducer should be associative and commutative

That’s All Folks!
We think Sam was you

Hadoop: Nuts and Bolts
Big Data Mining and Analytics
Dr. Latifur Khan
Department of Computer Science
University of Texas at Dallas
Source: http://developer.yahoo.com/hadoop/tutorial/module4.html

Commodity Clusters
MapReduce is designed to efficiently process large volumes of data by connecting many commodity computers together to work in parallel
A theoretical 1000-CPU machine would cost a very large amount of money, far more than 1000 single-CPU or 250 quad-core machines
MapReduce ties smaller and more reasonably priced machines together into a single cost-effective commodity cluster

Isolated Tasks
MapReduce divides the workload into multiple independent tasks and schedules them across cluster nodes
The work performed by each task is done in isolation from the other tasks
The amount of communication that tasks can perform is limited, mainly for scalability reasons

Data Distribution
In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded in
An underlying distributed file system (e.g., GFS) splits large data files into chunks which are managed by different nodes in the cluster
Even though the file chunks are distributed across several machines, they form a single namespace
[Figure: Input data (a large file) split into chunks of input data held on Node 1, Node 2, and Node 3]

MapReduce: A Bird’s-Eye View
In MapReduce, chunks are processed in isolation by tasks called Mappers
The outputs from the mappers are denoted as intermediate outputs (IOs) and are brought into a second set of tasks called Reducers
The process of bringing together IOs into a set of Reducers is known as the shuffling process
The Reducers produce the final outputs (FOs)
Overall, MapReduce breaks the data flow into two phases, the map phase and the reduce phase
[Figure: chunks C0 to C3 are read by mappers M0 to M3 (Map Phase), which emit IO0 to IO3; Shuffling Data brings the IOs to Reducers R0 and R1 (Reduce Phase), which produce FO0 and FO1]

Keys and Values
The programmer in MapReduce has to specify two functions, the map function and the reduce function, that implement the Mapper and the Reducer in a MapReduce program
In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs
The map and reduce functions receive and emit (K, V) pairs
[Figure: Input Splits as (K, V) pairs, then Map Function, then Intermediate Outputs as (K’, V’) pairs, then Reduce Function, then Final Outputs as (K’’, V’’) pairs]

Partitions
In MapReduce, intermediate output values are not usually reduced together
All values with the same key are presented to a single Reducer together
More specifically, a different subset of the intermediate key space is assigned to each Reducer
These subsets are known as partitions
Partitions are the input to Reducers
[Figure caption: Different colors represent different keys (potentially) from different Mappers]

MapReduce
In this part, the following concepts of MapReduce will be described:
◦Basics
◦A close look at MapReduce data flow
◦Additional functionality
◦Scheduling and fault-tolerance in MapReduce
◦Comparison with existing techniques and models

Hadoop
Since its debut on the computing stage, MapReduce has frequently been associated with Hadoop
Hadoop is an open source implementation of MapReduce and is currently enjoying wide popularity
Hadoop presents MapReduce as an analytics engine and under the hood uses a distributed storage layer referred to as Hadoop Distributed File System (HDFS)
HDFS mimics Google File System (GFS)

Distributed File Systems
Highly scalable distributed file system for large data-intensive applications.
◦E.g. 10K nodes, 100 million files, 10 PB
Provides redundant storage of massive amounts of data on cheap and unreliable computers
◦Files are replicated to handle hardware failure
◦Detects failures and recovers from them
Provides a platform over which other systems like
