CMU CS 15319 - Lecture - D419958

Home> Schools> Carnegie Mellon University> Computer Science (CS) > CS 15319> Lecture

CMU CS 15319 - Lecture

Pages 20

Download Save

Unformatted text preview:

MapReduce Algorithms 15-319, spring 2010 18th Lecture, March 18th Lecture GoalsLecture OutlineThinking in MapReduceSorting in MapReduceSorting AlgorithmSorting with a global orderProgram Structure in HadoopSorting : ConclusionsSearching in MapReduceSearching in MapReduce : AlgorithmSearching Program StructureSearch Algorithm : ObservationsIndexingInputs and AlgorithmIndexing Data FlowInverted Index Program StructureThe MapReduce LibraryProgram Design Strategies IProgram Design Strategies IICarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingIntroduction to Cloud ComputingMajd F. SakrMapReduce Algorithms 15‐319, spring 2010 18th Lecture, March 18thCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud Computing2Lecture Goals Look At Examples of Algorithms Implemented in MapReduce Understand methods of designing algorithms in MapReduceCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingLecture Outline Algorithms Sorting Searching Indexing Design StrategiesCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingThinking in MapReduce As with functional programming, a change in mindset is required when developing algorithms for MapReduce We focus on data and not instructions Express your computation in terms of maps and reduces Hadoop programs represent Data flow rather than a procedure. Take advantage of implicit operations in MapReduce to do difficult tasksCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSorting in MapReduce Remember that MapReduce has a sort/shuffle stage after the map operation and before the reduce operation. The Shuffle stage sorts map outputs by key and sends it to the reducers Preparing the Input Set of Files with one value per line Mapper key is file name, line number Mapper value is the contents of the line Write pseudocode (including the mapper and reducer code) to sort n values stored in a text file, one value per line. Try It!5Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSorting Algorithm Take advantage of the reducer property – (key,value) pairs are sorted by key before sending to the individual reducers. Mapper: Use a mapper that transforms each value to a key Map(k,v) ‐> (v,_) Reducer: Identity Reducer Reduce (k’,_) ‐> (k’,””) You may use the reducer to handle duplicates/erronous input pairs as well. 6Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSorting with a global order (key,value) pairs inside a reducer are sorted by key for a particular reducer. There is only a local order For a globally ordered sort either: Use one reducer (slow) Or choose a correct hash function in the partitioner such thatK1 < K2 => hash(K1) < hash (K2)7Figure © Cloudera, Inc.Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingProgram Structure in HadoopSet input files, key,value typesSet output files, key,value typesSet Mapper -> map(k,v) -> (v,_)Set Reducer -> reduce(k’,_) -> (k’, “”)Set HashPartitioner to appropriate hash functionRun Job!8Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSorting : Conclusions One of the easiest things to do in Hadoop! The framework takes care of everything for you.  You specify how the input looks like, and how you want the output format to be. Also one of the fastest distributed sorting algorithms –a 910 node Yahoo! cluster won the TeraSort challenge by sorting a terabyte of data in 209 seconds. 1800 maps and 1800 reduce operations Extra optimizations to reduce the intermediate writes to disk Packaged in Hadoop as the TeraSort example. Bottleneck in Distributed Sorting: I/O How fast can you move data around?9Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSearching in MapReduce Pop quiz You care given a file with text and you would like to sea r ch for a particular pattern. Given the file and pattern, write pseudocode to a MapReduce algorithm that outputs each instance of pattern in file10Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSearching in MapReduce : Algorithm Input A set of files containing lines of text A search pattern to find Mapper Input Mapper key is file name, line number Mapper value is the contents of the line Search pattern is sent as special parameter to the mapper Algorithm Mapper Given (filename, some text) and “pattern”, if “text” matches “pattern” output (filename, _) Reducer: Identity functionCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSearching Program StructureSet input files, key,value typesSet output files, key,value typesSend pattern as a special argument to the mappterSet Mapper -> map(k,v)=if(v matches pattern)output(filename,_);Set Reducer -> Identity ReducerRun Job!12Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSearch Algorithm : Observations We effectively implement parallel search using the MapReduce Framework We use the Identity Reducer Optimization: If we find a file to be interesting, we need to emit it only once. Use a combiner to fold multiple hits in a file to a single (key, value) pair You can reduce network bandwidth and get a much faster search engineCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingIndexing Indexing is a common operation in web search engines Build in index of words from a set of documents and the files they belong to A sample inde xing project is given to you to build the index of Shakespeare’s work Makes it much easier to locate a particular item/document based on its content Inverted Indexing: Map each word to the name of the file it appears inCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingInputs and Algorithm Inputs: Set of files containing lines of text Mapper Key: Filename, line number Mapper Value: Contents of the Line Algorithm: Mapper: For each word in (file,words) map to (word,file) Reducer: Identity ReducerCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingIndexing Data FlowFigure © Cloudera, Inc.Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingInverted Index Program StructureSet input files, key,value typesSet output files, key,value typesSet map(pageName,pageText)=foreach word w in

View Full Document


School:
Email:
New Password:
Confirm Password:

CMU CS 15319 - Lecture

Sign up for free to view:

Please select your school