MapReduce Programming and Cost-based Optimization: Crossing this Chasm with Starfish

Herodotos Herodotou, Duke University, [email protected]
Fei Dong, Duke University, [email protected]
Shivnath Babu*, Duke University, [email protected]

ABSTRACT

MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. A major challenge here is that, to the MapReduce system, a program consists of black-box map and reduce functions written in some programming language like C++, Java, Python, or Ruby.

Starfish is a self-tuning system for big data analytics that includes, to our knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs. Starfish also includes a Profiler to collect detailed statistical information from unmodified MapReduce programs, and a What-if Engine for fine-grained cost estimation. This demonstration will present the profiling, what-if analysis, and cost-based optimization of MapReduce programs in Starfish. We will show how (nonexpert) users can employ the Starfish Visualizer to (a) get a deep understanding of a MapReduce program's behavior during execution, (b) ask hypothetical questions on how the program's behavior will change when parameter settings, cluster resources, or input data properties change, and (c) ultimately optimize the program.

* Supported by NSF grant 0964560 and an AWS research grant.

Articles from this volume were invited to present their results at the 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol. 4, No. 12. Copyright 2011 VLDB Endowment 2150-8097/11/08... $10.00.

1. INTRODUCTION

MapReduce is a relatively young framework, both a programming model and an associated run-time system, for large-scale data processing [4]. Hadoop [5] is a popular open-source implementation of MapReduce that many academic, government, and industrial organizations use in production deployments. Hadoop is used for applications such as Web indexing, data mining, report generation, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Cloud platforms make MapReduce an attractive proposition for small organizations that need to process large datasets but lack the computing and human resources of a Google or Yahoo! to throw at the problem. Elastic MapReduce, for example, is a hosted platform on the Amazon cloud where users can instantly provision Hadoop clusters to perform data-intensive tasks, paying only for the resources used.

A MapReduce program p is run on input data d and cluster resources r as a MapReduce job j = ⟨p, d, r, c⟩. Here, c represents a set of configuration parameter settings needed to fully specify how the job should execute on the cluster. Choices for settings in c include (but are not limited to):

1. Degree of parallelism. The execution of j consists of running parallel map and reduce tasks. These tasks may run in multiple waves depending on the number of execution slots in r.
2. The amount of memory to allocate to each map (reduce) task of j to buffer its outputs (inputs).
3. The settings for the multiphase external sorting used to group map-output values by key.
4. Whether the output data from the map (reduce) tasks should be compressed before being written to disk.
5. Whether a given combine function should be used to preaggregate map outputs before their transfer to reduce tasks.
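As a concrete illustration (not taken from the paper) of how the choices above surface as settings in c, the sketch below configures a word-count job through Hadoop's org.apache.hadoop.mapreduce API. The property names follow Hadoop 2.x conventions and differ in older releases, and the specific values (sort buffer size, merge factor, number of reduce tasks) are arbitrary placeholders rather than recommended settings.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ConfiguredWordCount {

      // Placeholder (black-box) map function: tokenize each line, emit (word, 1).
      public static class WordCountMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              ctx.write(word, ONE);
            }
          }
        }
      }

      // Placeholder (black-box) reduce function: sum the counts for each word.
      public static class WordCountReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // (2, 3) Memory for buffering map outputs and the merge factor of the
        // multiphase external sort (Hadoop 2.x names; older releases use
        // io.sort.mb and io.sort.factor).
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        conf.setInt("mapreduce.task.io.sort.factor", 50);
        // (4) Compress map output before it is written to disk and shuffled.
        conf.setBoolean("mapreduce.map.output.compress", true);

        Job job = Job.getInstance(conf, "configured-wordcount");
        job.setJarByClass(ConfiguredWordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // (5) Use a combine function to preaggregate map outputs.
        job.setCombinerClass(WordCountReducer.class);
        // (1) Degree of parallelism on the reduce side; together with the
        // slots in r, this determines how many waves of reduce tasks run.
        job.setNumReduceTasks(20);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }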
Hadoop has more than 190 configuration parameters, out of which 10-20 parameters can have a significant impact on job performance. Today, the burden falls on the user who submits the MapReduce job to specify settings for all configuration parameters. For any parameter whose value is not specified explicitly during job submission, default values (either shipped with the system or specified by the system administrator) are used. Higher-level languages for MapReduce like HiveQL and Pig Latin have developed their own hinting syntax for setting parameters.

The impact of various parameters as well as their best settings vary depending on the MapReduce program, input data, and cluster resource properties. Personal communication, our own experience [1, 7], and plenty of anecdotal evidence on the Web indicate that finding good configuration settings for MapReduce jobs is time consuming and requires extensive knowledge of system internals.

To automate this process, we developed Starfish [7], a self-tuning system for big data analytics. Starfish, which is built on Hadoop, aims to enable Hadoop users and applications to get good performance automatically, without any need on their part to understand and manipulate the many tuning knobs available. Starfish includes a Cost-based Optimizer to find good configuration settings automatically for arbitrary MapReduce programs. The Optimizer requires the use of two other components: a Profiler that instruments unmodified MapReduce programs dynamically to generate concise statistical summaries of MapReduce job execution, and a What-if Engine to reason about the impact of parameter configuration settings, as well as data and cluster resource properties, on the performance of MapReduce jobs.

In this demonstration, we will present the uses and contributions of each component in optimizing MapReduce program execution:

• The information collected by the Profiler helps in understanding the job behavior as well as in diagnosing bottlenecks during job execution.
• The What-if Engine can predict the performance of a MapReduce job j, allowing the user to study the effects of configuration parameters, cluster resources, and input data on the performance of j, without actually running j.
• The Cost-based Optimizer can find the optimal configuration settings to use for executing j.
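To make the what-if idea more tangible, here is a deliberately simplified, hypothetical sketch of what-if style estimation: it combines statistics that a profiler might have measured on one actual run with a hypothetical configuration and resource scenario to predict a running time. The class names, fields, and constants below are assumptions made for illustration only; they are not Starfish's profiles, cost model, or API.

    // Hypothetical illustration only; not Starfish's actual model or interfaces.
    public class WhatIfSketch {

      // Statistics a profiler might collect from one real job execution.
      static class JobProfile {
        long inputBytes;             // total bytes read by the map tasks
        double mapSecondsPerByte;    // observed map-side work per input byte
        double reduceSecondsPerByte; // observed reduce-side work per shuffled byte
        double mapSelectivity;       // map-output bytes divided by map-input bytes
      }

      // A hypothetical scenario to ask "what if" about: configuration
      // settings (c) together with cluster resources (r).
      static class Scenario {
        int mapSlots;
        int reduceSlots;
        int numReduceTasks;
        boolean compressMapOutput;
      }

      // Very rough prediction of job running time (in seconds) under a
      // scenario, assuming work divides evenly over the available slots.
      static double estimateRunningTime(JobProfile p, Scenario s) {
        double mapWork = p.inputBytes * p.mapSecondsPerByte;
        double shuffledBytes = p.inputBytes * p.mapSelectivity;
        if (s.compressMapOutput) {
          shuffledBytes *= 0.4; // assumed compression ratio, purely illustrative
        }
        double reduceWork = shuffledBytes * p.reduceSecondsPerByte;
        int reduceParallelism = Math.min(s.reduceSlots, s.numReduceTasks);
        return mapWork / s.mapSlots + reduceWork / reduceParallelism;
      }
    }

Under this caricature, a cost-based optimizer is simply a search: enumerate or sample many candidate scenarios, call the estimator on each, and return the configuration with the lowest predicted running time, without executing the job even once.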

