Berkeley COMPSCI 268 - Improving MapReduce Performance in Heterogeneous Environments

This preview shows page 1-2-3-4-5 out of 14 pages.



Improving MapReduce Performance in Heterogeneous Environments

Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica
University of California, Berkeley
{matei,andyk,adj,randy,stoica}@cs.berkeley.edu

Abstract

MapReduce is emerging as an important programming model for large-scale data-parallel applications such as web indexing, data mining, and scientific simulation. Hadoop is an open-source implementation of MapReduce enjoying wide adoption and is often used for short jobs where low response time is critical. Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers. In practice, the homogeneity assumptions do not always hold. An especially compelling setting where this occurs is a virtualized data center, such as Amazon's Elastic Compute Cloud (EC2). We show that Hadoop's scheduler can cause severe performance degradation in heterogeneous environments. We design a new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity. LATE can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.

1 Introduction

Today's most popular computer applications are Internet services with millions of users. The sheer volume of data that these services work with has led to interest in parallel processing on commodity clusters. The leading example is Google, which uses its MapReduce framework to process 20 petabytes of data per day [1]. Other Internet services, such as e-commerce websites and social networks, also cope with enormous volumes of data. These services generate clickstream data from millions of users every day, which is a potential gold mine for understanding access patterns and increasing ad revenue. Furthermore, for each user action, a web application generates one or two orders of magnitude more data in system logs, which are the main resource that developers and operators have for diagnosing problems in production.

The MapReduce model popularized by Google is very attractive for ad-hoc parallel processing of arbitrary data. MapReduce breaks a computation into small tasks that run in parallel on multiple machines, and scales easily to very large clusters of inexpensive commodity computers. Its popular open-source implementation, Hadoop [2], was developed primarily by Yahoo, where it runs jobs that produce hundreds of terabytes of data on at least 10,000 cores [4]. Hadoop is also used at Facebook, Amazon, and Last.fm [5]. In addition, researchers at Cornell, Carnegie Mellon, University of Maryland and PARC are starting to use Hadoop for seismic simulation, natural language processing, and mining web data [5, 6].

A key benefit of MapReduce is that it automatically handles failures, hiding the complexity of fault-tolerance from the programmer. If a node crashes, MapReduce reruns its tasks on a different machine. Equally importantly, if a node is available but is performing poorly, a condition that we call a straggler, MapReduce runs a speculative copy of its task (also called a "backup task") on another machine to finish the computation faster. Without this mechanism of speculative execution¹, a job would be as slow as the misbehaving task. Stragglers can arise for many reasons, including faulty hardware and misconfiguration. Google has noted that speculative execution can improve job response times by 44% [1].

¹ Not to be confused with speculative execution at the OS or hardware level for branch prediction, as in Speculator [11].

In this work, we address the problem of how to robustly perform speculative execution to maximize performance. Hadoop's scheduler starts speculative tasks based on a simple heuristic comparing each task's progress to the average progress. Although this heuristic works well in homogeneous environments where stragglers are obvious, we show that it can lead to severe performance degradation when its underlying assumptions are broken. We design an improved scheduling algorithm that reduces Hadoop's response time by a factor of 2.

An especially compelling environment where Hadoop's scheduler is inadequate is a virtualized data center. Virtualized "utility computing" environments, such as Amazon's Elastic Compute Cloud (EC2) [3], are becoming an important tool for organizations that must process large amounts of data, because large numbers of virtual machines can be rented by the hour at lower costs than operating a data center year-round (EC2's current cost is $0.10 per CPU hour). For example, the New York Times rented 100 virtual machines for a day to convert 11 million scanned articles to PDFs [7]. Utility computing environments provide an economic advantage (paying by the hour), but they come with the caveat of having to run on virtualized resources with uncontrollable variations in performance. We also expect heterogeneous environments to become common in private data centers, as organizations often own multiple generations of hardware, and data centers are starting to use virtualization to simplify management and consolidate servers. We observed that Hadoop's homogeneity assumptions lead to incorrect and often excessive speculative execution in heterogeneous environments, and can even degrade performance below that obtained with speculation disabled. In some experiments, as many as 80% of tasks were speculatively executed.

Naïvely, one might expect speculative execution to be a simple matter of duplicating tasks that are sufficiently slow. In reality, it is a complex issue for several reasons. First, speculative tasks are not free – they compete for certain resources, such as the network, with other running tasks. Second, choosing the node to run a speculative task on is as important as choosing the task. Third, in a heterogeneous environment, it may be difficult to distinguish between nodes that are slightly slower than the mean and stragglers. Finally, stragglers should be identified as early as possible to reduce response times.

Starting from first principles, we design a simple algorithm for speculative execution that is robust to heterogeneity and highly effective in practice. We call our algorithm LATE for Longest Approximate Time to End. LATE is based on three principles: prioritizing tasks to speculate, selecting fast nodes to run on, and capping speculative tasks to prevent
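
The three principles named above can be illustrated with a small Python sketch. This is not the authors' implementation or Hadoop's API: the `Task` class, the `pick_speculative_task` function, and all parameter names are hypothetical. The time-left estimate assumes each task progresses at a constant rate (remaining work divided by observed rate), and the cap value is an arbitrary illustrative choice.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    task_id: str
    progress: float       # fraction of work completed, between 0 and 1
    elapsed_s: float      # seconds the task has been running
    speculated: bool = False

    def time_left(self) -> float:
        """Estimated seconds to finish, assuming a constant progress rate."""
        if self.progress <= 0.0 or self.elapsed_s <= 0.0:
            return float("inf")  # no rate information yet
        rate = self.progress / self.elapsed_s
        return (1.0 - self.progress) / rate


def pick_speculative_task(tasks: List[Task],
                          node_is_fast: bool,
                          running_speculative: int,
                          speculative_cap: int) -> Optional[Task]:
    """Return the task to duplicate on the requesting node, or None."""
    # Principle: select fast nodes to run on.
    if not node_is_fast:
        return None
    # Principle: cap the number of speculative tasks.
    if running_speculative >= speculative_cap:
        return None
    # Principle: prioritize the task with the longest estimated time to end.
    candidates = [t for t in tasks if not t.speculated and t.progress < 1.0]
    if not candidates:
        return None
    return max(candidates, key=lambda t: t.time_left())


# Illustrative data (made up): "d" has the lowest progress only because it
# started recently, while "b" has been running slowly for a long time.
tasks = [
    Task("a", progress=0.9, elapsed_s=90.0),
    Task("b", progress=0.2, elapsed_s=100.0),
    Task("c", progress=0.5, elapsed_s=50.0),
    Task("d", progress=0.1, elapsed_s=5.0),
]
choice = pick_speculative_task(tasks, node_is_fast=True,
                               running_speculative=0, speculative_cap=2)
```

With this data, ranking by raw progress alone would single out task "d", whereas ranking by estimated time to end selects task "b", which is the one most likely to delay the job.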