FSU CIS 5930 - I/O Efficient Implementation of MapReduce - D373354

Home> Schools> Florida State University> Computer Science (CIS) > CIS 5930> I/O Efficient Implementation of MapReduce

FSU CIS 5930 - I/O Efficient Implementation of MapReduce

Course Cis 5930- Advanced Topics in Data Management

Pages 4

Download Save

Unformatted text preview:

I/O Efficient Implementation of MapReduceCIS5930, Computer Science Department, FSUBackgroundMapReduce is a programming model and an associated implementation used by Google for processingtheir massive data sets. It has a simple yet powerful interface that is amenable to a broad varietyof problems. Since 2003, when the MapReduce framework was first created, more than ten thousanddistinct programs have been i mplemented under this model. A large number of MapReduce tasksare now running on Googles clust ers at any minute, processing huge amounts of data and g atheringlots of useful information. For instance, in the single month of September 2007, more than 2 millionMapReduce jobs have bee n completed, proces sing over 400,000 TB of input data [2]. The succe s s ofMapReduce stems from the fact that it is very easy for the programmer to write a simple program andrun it effi-ciently on a thousand machines, greatly improving the engineers productivity.In this project, (si nc e we dont have a thousand machines,) we will study the problem of howto implement the MapReduce interface effi c iently on a single machine. Since the data size could bemuch larger than memory, I/O-efficient techniques that you have learned from class wil l be useful (andrequired!).The MapReduce interfaceWe are going to use a simplified interface of MapReduce, which in fact allows for a m ore e fficientimplementation and also makes the problem more interesting. The input is simply a sequence ofelements, stored in a file on disk. The user sp e c ifies the computation through two functions: map andreduce.Map takes an input element and produces a set of intermediate (key, value) pairs. The MapReducelibrary groups together all int e rm e diate (key, value) pairs with the same key and passes them to thereduce function.Reduce accepts an interm e diate key and a set of values for that key. It combines these values toform a possibly smaller set of values and write them to an output file.In this project, we make the following simpl ifying assumptions: (a) the map function accepts oneinput el e ment and produces only o ne intermediate (key, value) pair; (b) the reduce function combinesall values associat e d with any particular key into one value using the same operator ♦; and (c) ♦ iscommutative and associative. Thus, the reduce function simply accepts two input values x and y, andthen returns x♦y. Furthermore, assumption (c) implies that the reduce function can be called i n anyorder to combine all the values associated with the same key.Finally, we assume that the ty pes (speci fie d by the user) for the elements, keys, and val ue s all havefixed sizes, e.g., double, int, or class consisting of these fixed-size types.Example: Counting frequencies of keywordsSuppose we have a large web query log storing the keyword for each query ever received. Assumefor simplicity that each query only has one keyword, and all keywords are of the same length, say 8charac-ters. One basic task in query log analysis is to count the number of occurrences of each keyword.The user could then write MapReduce functions as follows. The map function takes in a keyword andoutput (keyword, 1) as the (key, value ) pair; t he reduce function then simply defines the operator tobe addi-tion.Task 1: Design the MapReduce interfaceYou should design your MapReduce interface in a way that allows flexible user defined types for theinput eleme nts, intermediate keys, and values. (Hint: Use C++ templates).You will use the TPIE library for your implementation. TPIE defines the type “stream”, which shallbe used fo r all the input, output, and intermediate data. See the course web pa ge (unde r ProgrammingNotes) for more information about TPIE.Your library should have three entry points: (1) The user can do only map, in which case the userwill provide a stream of input eleme nts a nd his/her map function. The user also passes the po interto an empty stream to receive the intermediate (key, value) pairs. (2) The user can do only reduce, inwhich case the user will provide a stream of (key, value) pairs and his/her reduce function. The us e ralso passes the pointer to an empty stream to receive the output (key, value) pairs. This will also beuseful in Task 5. (3) The user can supply bot h map and reduce and do the entire computation via onecall. In this case the user only gives the input stream and a pointer to an empty stream to recei ve thefinal re-sults. Intermediate data should be disc a rded.Clearly specif y your user interface in C++. Any user should be able to use your library by readingonly the .h file that you provide.Hint: There are a number of ways for the user to provide userdefined functions (the map and reducefunctions in our case). You can use function pointers (the C-style), or ask the use r to pass an object ofa class that has map and reduce as its member functions (the C++-style). The latter is m o re efficientbecause it allows the functions to be inlined.Task 2: A simple implementationThe easiest way to implem e nt the MapReduce framework is scan + sort + scan: You first scan throughthe input stream, calling the map function for each input element, and then write the (key, value)pairs to an intermediate stream. Then so rt this stream by the keys, grouping all pairs with the samekey toget he r. Finally, scan through the sorted stream a ga in. Now all pairs sharing the same key areconsecutive, so you can simply call the reduce function to merge their values together. The final outputis a sequence of (key, value) pairs, one for each disti nc t key.Your t ask is to implement this simple solution, and then use it to do the keyword-counting job. LetK be the total number of distinct keywords. Try generating a few data sets with various K and differentdistributions (from uniform to highly skewed) . What do you observe from these experiments? Doesthe performance of your implementation fluctuate? A good model to generate skewed distributions ofkeywords is to follow the Zipfs Law [1]. Note that the size of your data set should be larger than thememory size (TPIE allows you to specify the amount of memory avail able to your program).You will use the TPIE library for your implementation. TPIE already provides sc an and s o rtprimitives, so you dont need to reimpl e ment them. See the course web page for more informationabout TPIE.Task 3: Improved implementationReal world data sets are typically not uniform : They could be highly skewed (a few keywords

View Full Document


School:
Email:
New Password:
Confirm Password:

FSU CIS 5930 - I/O Efficient Implementation of MapReduce

Sign up for free to view:

Please select your school