New version page

CMU CS 15319 - lecture

Upgrade to remove ads

This preview shows page 1-2-3-24-25-26-27-49-50-51 out of 51 pages.

Save
View Full Document
Premium Document
Do you want full access? Go Premium and unlock all 51 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 51 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 51 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 51 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 51 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 51 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 51 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 51 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 51 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 51 pages.
Access to all documents
Download any document
Ad free experience

Upgrade to remove ads
Unformatted text preview:

MapReduce and Hadoop 15-319, spring 2010 17th Lecture, Mar 16th Lecture GoalsLecture OutlineFunctional Programming ReviewFunctional ProgrammingImplicit Parallelism in Functional ProgrammingEnter MapReduceMapReduce HistoryMotivating ExampleMapReduce FeaturesTypical Problem Solved by MapReduceProgramming ModelSlide Number 13A Typical MapReduce Program - IIParallelismData LocalityFault ToleranceOptimizationsTime to Dive into Hadoop!What is Hadoop?Quick Look at Hadoop ComponentsHadoop MapReduceMapReduce Programming ModelThe Map FunctionExample – UpperCase MapperExample – Filter Mapper The Reduce FunctionExample – Sum ReducerHadoop MapReduce TerminologyExample of TerminologyMapReduce Java Package StructureThe Mapper Java ClassInput/Output DatatypesThe Reducer Java ClassThe Reducer Java Class Continued…The OutputCollector and Reporter ClassesPutting it all together in the Driver ClassSlide Number 38A Simple Program (Hands-On)How does a Job Execute in Hadoop?Job Configuration in HadoopData Flow from Input to MappersHandling InputHandling InputData Flow from Mappers to ReducersReducer in DetailThe Partitioner ClassHandling OutputOptimizations for a MapReduce JobConclusionsReferencesCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingIntroduction to Cloud ComputingMajd F. SakrMapReduce and Hadoop 15‐319, spring 2010 17th Lecture, Mar 16thCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud Computing2Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of MapReduce Explore the Hadoop API and Interface One hands‐on program to dive into Hadoop!Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingLecture Outline Functional Programming Review and MapReduce Hadoop Components MapReduce Programming Model Map and Reduce Functions Terminology –Jobs and Tasks Java Implementations and Classes Mapper and Reducer Classes Writing a MapReduce Driver Input, Output and Other handler classes Simple Program (Hands‐On) Details of Hadoop Handling Input and Output Job Execution and Data Flow Hadoop OptimizationsCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingFunctional Programming Review Functional operations do not modify data structures: They always create new ones  Original data still exists in unmodified form Data flows are implicit in program design Order of operations does not matterCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingFunctional Programming Process lists of data very frequently Iterators: operators over lists Map Fold Map operation can be assigned to each element of a list independently If the operation performed in a fold operation is commutative, it can be parallelized5Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingImplicit Parallelism in Functional Programming In a purely functional setting, elements of a list being computed by map cannot see the effects of the computations on other elements If order of application of f to elements in list is commutative (and associative), we can reorder or parallelize execution of f This is the “secret” that MapReduce exploitsCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingEnter MapReduce Functional programming on top of distributed processing Motivated by need to process large amounts of data using hundreds (or thousands) of processor nodes Data is central to the computation, place the program closest to the data it will be working on.  Provide a clean abstraction to programmers similar to functional programming The Interface deals with all the messy detailsCarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingMapReduce History Developed by Google to simplify their data processing jobs on large data Details emerged from two published papers: James Dean, Sanjay Ghemawat, MapReduce : Simplified Data Processing on Large Clusters, Proceedings of OSDI ‘04, 2004 Sanjay Ghemawat, Howard Gobioff, and Shun‐Tak Leung, Google File System, Proceedings of Symposium of Operating Systems Principles, ACM SIGOPS, 2004 Since Google’s implementation is proprietary and not available to public, an Apache Project called Hadoop emerged as an open source implementation Primary Contributors: Yahoo! , Facebook8Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingMotivating Example Processing Web Data on a Single Machine 20+ billion web pages x 20KB = 400+ terabytes One computer can read 30‐35 MB/sec from disk ~ four months to read the web ~1,000 hard drives just to store the web Even more to do something with the data Takes too long on a single machine, but with 1000 machines? < 3 hours to perform on 1000 machines But how long to program? What about the overheads? Communication, coordination, recovery from machine failure Status Reporting, Debugging, Optimization, Locality Reinventing the Wheel: This has to be done for every program!9Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingMapReduce Features Automatic Parallelization and Distribution of Work Fault‐Tolerant Status and Monitoring Tools Clean Abstraction for Programmers10Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingTypical Problem Solved by MapReduce1. Read a lot of Data2. MAP: extract something you need from each record3. Shuffle and Sort4. REDUCE: aggregate, summarize, filter or transform5. Write the results11Outline stays the same,Map and Reduce change to fit the ProblemModel seems restrictive but it is Turing Complete. Multiple maps and reduces needed to solve a complex problem.Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingProgramming Model Fundamentally similar to Functional Programming Users implement interfaces to following two functions: map (in_key, in_value) -> (out_key, intermediate_value) list reduce (out_key, intermediate_value list) ->out_value list12Carnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingMAPK2K1K2reduce reduceK1Shuffle and Sort (barrier)K1 K1 K1K1 K1K1K1K2 K2 K2K2 K2 K2 K2Input DataA Typical MapReduce Program ‐ ICarnegie MellonSpring 2010 ©15-319 Introduction to Cloud ComputingA Typical MapReduce Program ‐ II14Data store 1 Data store nmap(key 1, values...)(key 2, values...)(key 3,


View Full Document
Download lecture
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view lecture and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view lecture 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?