Unformatted text preview:

MapReduce Algorithms 15-319, spring 2010 18th Lecture, March 18th Lecture GoalsLecture OutlineThinking in MapReduceSorting in MapReduceSorting AlgorithmSorting with a global orderProgram Structure in HadoopSorting : ConclusionsSearching in MapReduceSearching in MapReduce : AlgorithmSearching Program StructureSearch Algorithm : ObservationsIndexingInputs and AlgorithmIndexing Data FlowInverted Index Program StructureThe MapReduce LibraryProgram Design Strategies IProgram Design Strategies IICarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingIntroduction to Cloud ComputingMajd F. SakrMapReduce Algorithms 15‐319, spring 2010 18th Lecture, March 18thCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud Computing2Lecture Goals Look At Examples of Algorithms Implemented in MapReduce Understand methods of designing algorithms in MapReduceCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingLecture Outline Algorithms Sorting Searching Indexing Design StrategiesCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingThinking in MapReduce As with functional programming, a change in mindset is required when developing algorithms for MapReduce We focus on data and not instructions Express your computation in terms of maps and reduces Hadoop programs represent Data flow rather than a procedure. Take advantage of implicit operations in MapReduce to do difficult tasksCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSorting in MapReduce Remember that MapReduce has a sort/shuffle stage after the map operation and before the reduce operation. The Shuffle stage sorts map outputs by key and sends it to the reducers Preparing the Input Set of Files with one value per line Mapper key is file name, line number Mapper value is the contents of the line Write pseudocode (including the mapper and reducer code) to sort n values stored in a text file, one value per line. Try It!5Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSorting Algorithm Take advantage of the reducer property – (key,value) pairs are sorted by key before sending to the individual reducers. Mapper: Use a mapper that transforms each value to a key Map(k,v) ‐> (v,_) Reducer: Identity Reducer Reduce (k’,_) ‐> (k’,””) You may use the reducer to handle duplicates/erronous input pairs as well. 6Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSorting with a global order (key,value) pairs inside a reducer are sorted by key for a particular reducer. There is only a local order For a globally ordered sort either: Use one reducer (slow) Or choose a correct hash function in the partitioner such thatK1 < K2 => hash(K1) < hash (K2)7Figure © Cloudera, Inc.Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingProgram Structure in HadoopSet input files, key,value typesSet output files, key,value typesSet Mapper -> map(k,v) -> (v,_)Set Reducer -> reduce(k’,_) -> (k’, “”)Set HashPartitioner to appropriate hash functionRun Job!8Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSorting : Conclusions One of the easiest things to do in Hadoop! The framework takes care of everything for you.  You specify how the input looks like, and how you want the output format to be. Also one of the fastest distributed sorting algorithms –a 910 node Yahoo! cluster won the TeraSort challenge by sorting a terabyte of data in 209 seconds. 1800 maps and 1800 reduce operations Extra optimizations to reduce the intermediate writes to disk Packaged in Hadoop as the TeraSort example. Bottleneck in Distributed Sorting: I/O How fast can you move data around?9Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSearching in MapReduce Pop quiz You care given a file with text and you would like to sea r ch for a particular pattern. Given the file and pattern, write pseudocode to a MapReduce algorithm that outputs each instance of pattern in file10Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSearching in MapReduce : Algorithm Input A set of files containing lines of text A search pattern to find Mapper Input Mapper key is file name, line number Mapper value is the contents of the line Search pattern is sent as special parameter to the mapper Algorithm Mapper Given (filename, some text) and “pattern”, if “text” matches “pattern” output (filename, _) Reducer: Identity functionCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSearching Program StructureSet input files, key,value typesSet output files, key,value typesSend pattern as a special argument to the mappterSet Mapper -> map(k,v)=if(v matches pattern)output(filename,_);Set Reducer -> Identity ReducerRun Job!12Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingSearch Algorithm : Observations We effectively implement parallel search using the MapReduce Framework We use the Identity Reducer Optimization: If we find a file to be interesting, we need to emit it only once. Use a combiner to fold multiple hits in a file to a single (key, value) pair You can reduce network bandwidth and get a much faster search engineCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingIndexing Indexing is a common operation in web search engines Build in index of words from a set of documents and the files they belong to A sample inde xing project is given to you to build the index of Shakespeare’s work Makes it much easier to locate a particular item/document based on its content Inverted Indexing: Map each word to the name of the file it appears inCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingInputs and Algorithm Inputs: Set of files containing lines of text Mapper Key: Filename, line number Mapper Value: Contents of the Line Algorithm: Mapper: For each word in (file,words) map to (word,file) Reducer: Identity ReducerCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingIndexing Data FlowFigure © Cloudera, Inc.Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingInverted Index Program StructureSet input files, key,value typesSet output files, key,value typesSet map(pageName,pageText)=foreach word w in


View Full Document

CMU CS 15319 - Lecture

Download Lecture
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?