UMD CMSC 433 - Building Map Reduce

CMSC 433: Building Map Reduce
November 27th, 2007

Rest of the semester
• Open Source contribution project (due Dec 4th)
• Homework 5: a simple Hadoop project (due Dec 6th)
• Project 4: implementing Map Reduce from scratch (due Dec 11th)

Building Map Reduce
• We didn't want to assign a big Hadoop project
  • we don't have a handy Hadoop cluster available
  • too much "learning Hadoop", not enough learning the basic concepts
• So we are going to build Map Reduce from scratch
• Hadoop programs pretty much work as-is, after importing classes from a different package

Final project
• Bigger than the other projects (about 1000 lines of code)
• You are not going to be spoon-fed
  • many details are left up to you; much of the suggested implementation strategy is just that, a suggestion
• Not many unit tests; mostly end-to-end tests

Simplifying assumptions
• We have a distributed, reliable file system
  • Google's MapReduce is built on top of the Google File System
• We don't worry about file system failure
• We don't worry about locality: all files are equidistant from all processors
• Common/shared code base: no need to ship map/reduce code to cluster nodes

Technologies
• Multicast, for JobRunner discovery
• Server sockets, sockets, and object serialization
  • for Master/JobRunner communication and result storage
• Random access files, for splitting input files

Master / JobRunner
• One master node: the job is submitted here
• JobRunners: available to run any MapReduce tasks; they announce their availability by broadcasting a ServerLocation every second
• The master listens for announcements and waits until it has enough job runners or until 10 seconds elapse

Input/output formats
• We will use text files as our input source
  • all files in a directory whose names end in .txt
  • handle all line-ending encodings
• Write output files into an output directory using object serialization
  • a set of files, each of which consists of
  • key1, value1, key2, value2, ...

Master inputs
• # of file-range tasks (a lower bound)
• # of partitions
• # of job runner nodes
• Mapper
• Reducer
• Partitioner

JobRunner
• Listens for connections on a server socket
• Accepts a connection, uses an ObjectInputStream to read a task, and executes it
• Sends back a heartbeat while the task runs
  • false, false, false, ..., false, true
  • one value per second

Mapper input
• In theory, the input to the mapper could be anything
• In practice, we are only going to handle splits of text files
• InputKey is a FileIndex (contains a File and a position)
• InputValue is a String (which starts at that file index)

File Range
• Describes a portion of a file
• Each Map task is responsible for one file range
• Consists of
  • File
  • index: start position, inclusive
  • end: end position, exclusive
• Never spans multiple files

File splitting
• We have a set of files, each of variable length
• We have a desired # of ranges to split them into
• Consider the following situation:
  • 10 splits requested
  • one file containing 1 gigabyte of data
  • 9 files containing 100 bytes of data

Uneven splits
• If asked for m splits of n files
  • generate m full-sized splits, each no bigger than ceil(totalSize/m)
  • plus up to n small splits
• Exact details are up to you

Splitting text files
• For each range, include any lines of text that start in that range
• [Diagram: lines of text overlaid on file ranges]
• How are we going to do this efficiently?
  • Hint: having each Map task read each line of text, starting at the beginning of the file, until it gets to the right position is not efficient

Tasks
• The units of work sent to JobRunners are Tasks
• Each has an id, a unique number
  • we use the numbers to ensure that each task writes to distinct files, and to let us clean up or ignore the output of failed tasks

Partitioner
• Given a key, returns the partition to which it should be sent
• One partition per desired reduce task

Phase 1 files
• A Map task writes all generated key/value pairs, in sequence, to phase1/{partition#}/{taskID}
• Separate file for each partition
  • each key/value pair is written to the file for its partition

Reduce task
• Assigned a partition
• Reads all files in phase1/{partition#}
  • excluding any files from failed tasks
• The data might be too big to fit into memory all at once
  • we are thinking "big"

Sorting pass of the reduce task
• Re-sort phase 1 files into phase 2 files
  • for any key, all instances of that key end up in one phase 2 file
• Do not store all phase 1 data in memory

Phase 2 files
• Create a directory phase2/{partition#}.{taskID}
• For each key/value read from a phase 1 file, write to phase2/{partition#}.{taskID}/H{key.hashCode()}

Reduce phase of the reduce task
• For each phase 2 file for my task
  • read all key/value pairs from the file
  • for each key, create a collection of the associated values
  • invoke the reducer
  • write the generated key/value pair to the output file for this task

Output files
• Each Reduce task writes all of its output to output/{taskID}

How does the master keep tabs?
• Create one thread per job runner
• Keep a circular queue of tasks that need running
• Keep a set of completed tasks
• Keep a set of potentially failed tasks
  • a task is added to the set when created
  • removed if it is the first to complete the task

Task work queue
• When removing a task from the work queue
  • if the task is already completed, just skip it
  • otherwise, add it back to the end of the work queue and ask the task runner to complete it
• When the work queue is empty, the thread can stop

High-level picture
• Master looks at the lengths of the input files
• Master computes file ranges
• Master looks for job runners
• Master assigns Map tasks to job runners
• Once a map task for each file range is complete
  • Master assigns reduce tasks to job runners
• Once one reduce task for each partition is complete
  • delete the output of failed reduce tasks
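The heartbeat protocol on the JobRunner slide (a stream of false values, one per second, ending with a single true) can be sketched with object serialization. The class and method names here are illustrative, not part of the project's required API, and the one-second sleep is elided so the sketch runs instantly over an in-memory stream instead of a socket:

```java
import java.io.*;

public class Heartbeat {
    // Writer side: the JobRunner streams one boolean per tick while the
    // task runs, then a final `true` when the task completes.
    static void writeHeartbeats(ObjectOutputStream out, int ticksUntilDone)
            throws IOException {
        for (int i = 0; i < ticksUntilDone; i++) {
            out.writeObject(Boolean.FALSE);   // "still working"
            out.flush();
            // in the real JobRunner: Thread.sleep(1000);
        }
        out.writeObject(Boolean.TRUE);        // "task finished"
        out.flush();
    }

    // Reader side: the master blocks on the stream; a long gap between
    // booleans would indicate a dead JobRunner.
    static int readUntilDone(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        int ticks = 0;
        while (!((Boolean) in.readObject())) {
            ticks++;
        }
        return ticks;                         // number of `false` heartbeats seen
    }

    public static void main(String[] args) throws Exception {
        // Simulate the socket with an in-memory byte stream.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeHeartbeats(new ObjectOutputStream(buf), 3);
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(readUntilDone(in));   // prints 3
    }
}
```

Over a real socket the same code would wrap `socket.getOutputStream()` and `socket.getInputStream()`; the in-memory buffer just keeps the sketch self-contained.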
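Since the "Uneven splits" slide leaves the exact details up to you, here is one possible reading of the suggested scheme: cap every range at ceil(totalSize/m) and carve each file independently, so no range ever spans a file boundary and files smaller than the cap become one small range each. The `FileRange` record stores a file index instead of a `java.io.File` to keep the sketch self-contained (Java 16+ for records):

```java
import java.util.*;

public class FileSplitter {
    // A file range: [start, end) within a single file; never spans files.
    // `file` is an index here; the project would store a java.io.File.
    record FileRange(int file, long start, long end) {}

    // Carve each file into chunks of at most ceil(total/m) bytes.
    static List<FileRange> split(long[] fileSizes, int m) {
        long total = 0;
        for (long s : fileSizes) total += s;
        long cap = (total + m - 1) / m;            // ceil(total / m)
        List<FileRange> ranges = new ArrayList<>();
        for (int f = 0; f < fileSizes.length; f++) {
            for (long start = 0; start < fileSizes[f]; start += cap) {
                ranges.add(new FileRange(f, start,
                        Math.min(start + cap, fileSizes[f])));
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        // The lecture's example: one 1 GB file plus nine 100-byte files,
        // ten splits requested.
        long[] sizes = new long[10];
        sizes[0] = 1_000_000_000L;
        Arrays.fill(sizes, 1, 10, 100L);
        List<FileRange> ranges = split(sizes, 10);
        System.out.println(ranges.size());  // 10 big ranges + 9 small ones = 19
    }
}
```

Note how this matches the slide's bound: m (or slightly more) full-sized splits, each no bigger than ceil(totalSize/m), plus up to n small splits for the tiny files.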
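The Partitioner slide only specifies the contract (key in, partition number out). A conventional default, and what Hadoop's own HashPartitioner does, is to hash the key and take the remainder; whether your project uses this exact scheme is up to you. Masking off the sign bit keeps negative hashCodes from producing a negative partition:

```java
public class HashPartitioner {
    // Map a key to a partition in [0, numPartitions). The sign-bit mask
    // handles keys whose hashCode() is negative.
    static int partition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int p = partition("the", 4);
        System.out.println(p >= 0 && p < 4);   // always in range: prints true
        // Equal keys always land in the same partition, which is what lets
        // the sorting pass group all values for a key into one reduce task.
        System.out.println(partition("the", 4) == partition("the", 4)); // prints true
    }
}
```

A Map task would then route each emitted pair to the file phase1/{partition#}/{taskID} for the partition this function returns.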