
Exploiting Heterogeneity in the Public Cloud for Cost-Effective Data Analytics

Gunho Lee, Byung-Gon Chun, Randy H. Katz
University of California, Berkeley; Intel Labs Berkeley

Abstract

Data analytics are key applications running in the clouds. The clouds provide dynamic provisioning that enables users to scale their clusters in response to demand. These clouds are composed of heterogeneous server platforms, but current data analytics applications do not consider this heterogeneity. In this paper, we rethink resource allocation and scheduling for data analytics in the cloud to take advantage of the heterogeneity of the platforms.

1 Introduction

As cloud computing services become more popular, more services are run in a public cloud. For example, a number of social game developers use a cloud to serve their game applications, as cloud computing provides fast deployment and easy scaling [4]. Such applications generate a huge amount of data about in-game activities and events, and analyzing the data is important to monitor user behavior and improve game play through data-driven iterative development processes [3]. Hence, it is desirable to store data and perform data analytics in the same cloud where the service runs, rather than moving the data to a separate data analytics cluster, to minimize the response time and cost of queries.

When we have a data analytics cluster in the cloud, our primary interest might be to maintain the cluster at minimum cost while meeting the performance requirements. To do so, there are two key questions we need to address.

One is how to allocate resources to the data analytics cluster. As cloud computing provides fast provisioning, we can add more machines to the cluster in minutes when demand surges. Similarly, we can remove and terminate machines when the cluster is idle to reduce the cost of maintaining the cluster. As in other applications such as web services and storage systems [6], a data analytics cluster in the cloud should also take advantage of the dynamic nature of cloud computing to be maintained in a cost-effective manner. The cost associativity [5] of cloud computing (e.g., using 1,000 EC2 machines for 1 hour costs the same as using 1 machine for 1,000 hours) is the key to realizing a cost-effective data analytics cluster in the cloud.

While allocating resources, heterogeneity should be considered to exploit the full potential of cloud services. Typically, cloud services provide various machine types. For example, Amazon EC2 offers 11 instance types with different specs and prices. One of them even allows access to a GPU that can be used to accelerate a particular application significantly [9]. Hence, we should allocate the type of resources that fits the workload characteristics to achieve high performance per cost. This will lead to a heterogeneous cluster, which raises another interesting question.

Given the heterogeneous nature of the data analytics cluster in the cloud, the other question we need to address is how to schedule jobs and tasks in the cluster. Especially if there are multiple jobs running concurrently, the data analytics system should be aware of the heterogeneity of resources and jobs and make scheduling decisions so that each job is assigned to its preferred resources for performance. It is also important to provide fairness among jobs so that each job receives a fair amount of resources and does not starve in the heterogeneous environment.

In this paper, we present the architecture and design considerations for building a cost-effective data analytics system in the cloud. We first present background on Hadoop, Amazon EC2, and the Data Analytic Cloud in Section 2. We present resource allocation and scheduling issues in heterogeneous environments, and our system architecture that addresses these issues, in Section 3. We illustrate the potential benefits of our approach with case studies in Section 4 and conclude in Section 5.

2 Background

This paper focuses on a data analytic cloud, which operates as a MapReduce cluster that relies on providers to host large data sets on a distributed storage system and performs data processing on an analytic engine in public clouds, as described in [7]. In particular, we consider the Hadoop Distributed File System (HDFS) and Hadoop MapReduce as the storage layer and the analytic engine in this paper. We also use Amazon EC2 as an example of a public cloud service.

HDFS comprises multiple DataNodes, which store blocks, and a NameNode, which maintains metadata such as filenames and block locations. A file in HDFS is divided into blocks. Replicas of each block are stored on nodes that run DataNodes in the cluster. By default, the block size is 64 MB and each block has three replicas.

On top of HDFS, Hadoop MapReduce is used to process data. Hadoop MapReduce has multiple TaskTrackers, which perform the actual computations on participating nodes, and a JobTracker, which manages jobs and TaskTrackers. Users submit jobs that consist of a map function and a reduce function to the JobTracker, which then launches map and reduce tasks on TaskTrackers. Each map task reads and processes one block from HDFS to generate intermediate results, which are in turn fed into reduce tasks to produce the final result. Each TaskTracker holds a fixed number of slots to host map tasks and reduce tasks.

Amazon EC2 is a public cloud service that enables users to lease virtual machines. Amazon charges per hour and per machine for this service without requiring any long-term commitment. Users can choose the spec of their virtual machines from 11 instance types with different prices. Users can also launch virtual machines in multiple locations across the world. EC2 charges for data transfer in and out of machines, but all transfers between machines in the same location are free. Hence, there is a strong economic incentive for users to keep network communication within the same location. There are many other cloud providers with different pricing schemes, but their basic structures are similar.

3

[Figure 1: Data Analytic Cloud Architecture. Labels: Job, Cloud Driver, Analytics Engine (MapReduce), Storage (HDFS), Core Node (x3), Accelerator Node]

[...] queries in a cost-effective manner. To that end, we propose an architecture for building a data analytics system in the cloud, as seen in Figure 1. In this architecture, participating nodes are grouped into one of two pools: long-lived core nodes that host both data and computations, and accelerator nodes that are added to the cluster temporarily when additional
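
The arithmetic behind the Background section and the cost-associativity example can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the system described in the paper. The function names are hypothetical, the price of 10 cents per machine-hour is a made-up figure, and rounding partial hours up to a full billed hour is an assumption; the text states only that Amazon charges per hour and per machine. The 64 MB block size, the three replicas, and the one-block-per-map-task rule are taken from the text.

```python
import math

BLOCK_SIZE_MB = 64   # HDFS default block size, as stated in the text
REPLICATION = 3      # HDFS default number of replicas, as stated in the text


def num_map_tasks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Each map task reads one block, so the task count equals the block count."""
    return math.ceil(file_size_mb / block_size_mb)


def stored_mb(file_size_mb, replication=REPLICATION):
    """Raw storage consumed across DataNodes once every block is replicated."""
    return file_size_mb * replication


def map_waves(tasks, nodes, slots_per_node):
    """Rounds of map tasks needed when every node offers a fixed number of slots."""
    return math.ceil(tasks / (nodes * slots_per_node))


def lease_cost_cents(machines, hours, cents_per_machine_hour):
    """Per-machine, per-hour billing.

    Assumption: a partially used hour is billed as a full hour.
    """
    return machines * math.ceil(hours) * cents_per_machine_hour


if __name__ == "__main__":
    tasks = num_map_tasks(1024)  # a 1 GB input file
    print("map tasks:", tasks)
    print("stored MB with replicas:", stored_mb(1024))
    print("waves on 4 nodes x 2 slots:", map_waves(tasks, nodes=4, slots_per_node=2))
    # Cost associativity: both leases buy the same 1,000 machine-hours.
    print(lease_cost_cents(1000, 1, 10), lease_cost_cents(1, 1000, 10))
```

Under these assumptions the two leases in the last line cost the same, which is the cost-associativity property. The equality holds only for whole hours: with round-up billing, 1,000 machines used for half an hour each would still be billed for 1,000 machine-hours.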



