MapReduce: Distributed Computing for Machine Learning

Dan Gillick, Arlo Faria, John DeNero

December 18, 2006

Abstract

We use Hadoop, an open-source implementation of Google's distributed file system and the MapReduce framework for distributed data processing, on modestly sized compute clusters to evaluate its efficacy for standard machine learning tasks. We show benchmark performance on searching and sorting tasks to investigate the effects of various system configurations. We also distinguish classes of machine-learning problems that are reasonable to address within the MapReduce framework, and offer improvements to the Hadoop implementation. We conclude that MapReduce is a good choice for basic operations on large datasets, although there are complications to be addressed for more complex machine learning tasks.

1 Introduction

The Internet provides a resource for compiling enormous amounts of data, often beyond the capacity of individual disks and too large for processing with a single CPU. Google's MapReduce [3], built on top of the distributed Google File System [4], provides a parallelization framework which has garnered considerable acclaim for its ease of use, scalability, and fault tolerance. As evidence of its utility, Google points out that thousands of MapReduce applications have already been written by its employees in the short time since the system was deployed [3]. Recent conference papers have investigated MapReduce applications for machine learning algorithms run on multicore machines [1], and MapReduce is discussed by systems luminary Jim Gray in a recent position paper [6]. Here at Berkeley, there is even discussion of incorporating MapReduce programming into undergraduate Computer Science classes as an introduction to parallel computing.

The success at Google prompted the development of the Hadoop project [2], an open-source attempt to reproduce Google's implementation, hosted as a sub-project of the Apache Software Foundation's Lucene search engine library. Hadoop is still in an early stage of development (latest version: 0.9.2) and has not been tested on clusters of more than 1000 machines (Google, by contrast, often runs jobs on many thousands of machines). The discussion that follows addresses the practical application of the MapReduce framework to our research interests.

Section 2 introduces the compute environments at our disposal. Section 3 provides benchmark performance on standard MapReduce tasks. Section 4 outlines applications to a few relevant machine learning problems, discussing the virtues and limitations of MapReduce. Section 5 discusses our improvements to the Hadoop implementation.

2 Compute Resources

While large companies may have thousands of networked computers, most academic researchers have access to more modest resources. We installed Hadoop on the NLP Millennium cluster in Soda Hall and at the International Computer Science Institute (ICSI), which are representative of typical university research environments.

Table 1 gives the details of the two cluster configurations used for the experiments in this paper. The NLP cluster is a rack of high-performance computing servers, whereas the ICSI cluster comprises a much larger number of personal desktop machines. It should be noted that ICSI also has a compute cluster of dedicated servers: over 20 machines, each with up to 8 multicore processors and 32 GB of RAM, and a 1 Gbps link to terabyte RAID arrays. Compute jobs at ICSI are typically exported to these compute servers using pmake-customs software.¹ For this paper, we demonstrate that the desktop machines at ICSI can also be used for effective parallel programming, providing an especially useful alternative for I/O-intensive applications which would otherwise be crippled by the pmake-customs network bottleneck.

¹ This was Adam de Boor's CS 262 final project from 1989, and was updated by Andreas Stolcke for use at ICSI and SRI.

                      NLP Cluster      ICSI Desktop Cluster
Machines              9                80
CPUs per machine      2x 3.4 GHz       1x 2.4 GHz
RAM per machine       4 GB             1 GB
Storage per machine   300 GB           30 GB
Network bandwidth     1 Gbit/s         100 Mbit/s

Table 1: Two clusters used in this paper's experiments.

3 Benchmarks

We evaluate the performance of the 80-node ICSI desktop cluster to establish benchmark results for given tasks and configurations. Following the presentation in [3], we consider two computations that typify the majority of MapReduce algorithms: a grep that searches through a large amount of data, and a sort that transforms it from one representation to another. These two tasks represent extreme cases for the reduce phase, in which either none or all of the input data is shuffled across the network and written as output.

The data in these experiments are derived from the TeraSort benchmark [5], in which 100-byte records comprise a random 10-byte sorting key and 90 bytes of "payload" data. In the comparable experiments in [3], a cluster of 1800 machines was used to process a terabyte of data; due to our limited resources, we scaled the experiments down to 10 gigabytes of data, i.e., our input was 10^8 100-byte records.
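For concreteness, records in this format can be produced with a few lines of code. The sketch below is illustrative only (the printable key alphabet, the constant payload filler, and the default record count are assumptions for the example, not part of the TeraSort specification); it simply makes the 10-byte key and 90-byte payload layout explicit:

#!/usr/bin/perl
# Illustrative generator for TeraSort-style records (not the actual TeraSort
# data generator): each output line is a random 10-byte sorting key followed
# by 90 bytes of "payload" filler.
use strict;
use warnings;

my $num_records = shift(@ARGV) || 1000;   # arbitrary default record count
my @key_chars   = ('A' .. 'Z', 'a' .. 'z', '0' .. '9');

for (1 .. $num_records) {
    my $key     = join '', map { $key_chars[int rand @key_chars] } 1 .. 10;
    my $payload = 'x' x 90;
    print "$key$payload\n";
}

Writing 10^8 such records (roughly 10 GB, ignoring the newline delimiters) and copying them into the distributed file system would yield an input of the size used in these experiments.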
We display our results in the same format as [3], showing the data transfer rates over the time of the computation, distinguishing the map phase (reading input from local disks) from the reduce phase (shuffling intermediate data over the network and writing the output to the distributed file system).

[Figure 1: Effect of varying the size of map tasks. Left: with 600 map tasks (16 MB input splits), performance suffers from repeated initialization costs. Right: with 75 map tasks (128 MB input splits), a high sustained read rate is attainable, but the coarse granularity does not allow good load balancing.]

3.1 Distributed Grep

The distributed grep program searches for a character pattern and writes intermediate data only when a match is encountered. To simplify this further, we can consider an empty map function, which reads the input and never writes intermediate records. Using the Hadoop streaming protocol [2], which reads records from STDIN and writes to STDOUT, the map function is trivial:

#!/usr/bin/perl
while (<STDIN>) {
    # do nothing
}

The reduce function is not needed since there is no intermediate data. Thus, this contrived program can be used to measure the maximal input data read rate for the map phase.

Figure 1 shows the effect of varying the number of map tasks (equivalent to varying the size of an input split). When the number of map tasks is high, reasonable load balancing is possible because faster machines [...]

