MapReduce Distributed Computing for Machine Learning Dan Gillick Arlo Faria John DeNero December 18 2006 Abstract We use Hadoop an open source implementation of Google s distributed file system and the MapReduce framework for distributed data processing on modestly sized compute clusters to evaluate its efficacy for standard machine learning tasks We show benchmark performance on searching and sorting tasks to investigate the effects of various system configurations We also distinguish classes of machine learning problems that are reasonable to address within the MapReduce framework and offer improvements to the Hadoop implementation We conclude that MapReduce is a good choice for basic operations on large datasets although there are complications to be addressed for more complex machine learning tasks 1 Introduction The Internet provides a resource for compiling enormous amounts of data often beyond the capacity of individual disks and too large for processing with a single CPU Google s MapReduce 3 built on top of the distributed Google File System 4 provides a parallelization framework which has garnered considerable acclaim for its ease of use scalability and fault tolerance As evidence of its utility Google points out that thousands of MapReduce applications have already been written by its employees in the short time since the system was deployed 3 Recent conference papers have investigated MapReduce applications for machine learning algorithms run on multicore machines 1 and MapReduce is discussed by systems luminary Jim Gray in a recent position paper 6 Here at Berkeley there is even discussion of incorporating MapReduce programming into undergraduate Computer Science classes as an introduction to parallel computing The success at Google prompted the development of the Hadoop project 2 an open source attempt to reproduce Google s implementation hosted as a sub project of the Apache Software Foundation s Lucene search engine library Hadoop is still in early stages of development latest version 0 92 and has not been tested on clusters of more than 1000 machines Google by contrast often runs jobs on many thousands of machines The discussion that follows addresses the practical application of the MapReduce framework to our research interests Section 2 introduces the compute environments at our disposal Section 3 provides benchmark performance on standard MapReduce tasks Section 4 outlines applications to a few relevant machine learning problems discussing the virtues and limitations of MapReduce Section 5 discusses our improvements to the Hadoop implementation 1 2 Compute Resources While large companies may have thousands of networked computers most academic researchers have access to more modest resources We installed Hadoop on the NLP Millennium cluster in Soda Hall and at the International Computer Science Institute ICSI which are representative of typical university research environments Table 1 gives the details of the two cluster configurations used for the experiments in this paper The NLP cluster is a rack of high performance computing servers whereas the ICSI cluster comprises a much larger number of personal desktop machines It should be noted that ICSI also has a compute cluster of dedicated servers over 20 machines each with up to 8 multicore processors and 32 GB RAM and a 1 Gbps link to terabyte RAID arrays Compute jobs at ICSI are typically exported to these compute servers using pmake customs software 1 For this paper we demonstrate that the desktop machines at ICSI can also be used for effective parallel programming providing an especially useful alternative for I O intensive applications which would otherwise be crippled by the pmake customs network bottleneck Machines CPUs per machine RAM per machine Storage per machine Network Bandwidth NLP Cluster 9 2x 3 4 GHz 4 GB 300 GB 1 GBit s ICSI Desktop Cluster 80 1x 2 4 GHz 1 GB 30 GB 100 MBit s Table 1 Two clusters used in this paper s experiments 3 Benchmarks We evaluate the performance of the 80 node ICSI desktop cluster to establish benchmark results for given tasks and configurations Following the presentation in 3 we consider two computations that typify the majority of MapReduce algorithms a grep that searches through a large amount of data and a sort that transforms it from one representation to another These two tasks represent extreme cases for the reduce phase in which either none or all of the input data is shuffled across the network and written as output The data in these experiments are derived from the TeraSort benchmark 5 in which 100 byte records comprise a random 10 byte sorting key and 90 bytes of payload data In the comparable experiments in 3 a cluster of 1800 machines was used to process a terabyte of data due to our limited resources we scaled the experiments down to 10 gigabytes of data i e our input was 108 100 byte records We display our results in the same format as 3 showing the data transfer rates over the time of the computation distinguishing the map phase reading input from local disks from the reduce phase shuffling intermediate data over the network and writing the output to the distributed file system 1 This was Adam de Boor s CS 262 final project from 1989 and was updated by Andreas Stolcke for use at ICSI and SRI 2 Figure 1 Effect of varying the size of map tasks Left with 600 map tasks 16MB input splits performance suffers from repeated initialization costs Right with 75 map tasks 128MB input splits a high sustained read rate is attainable but the coarse granularity does not allow good load balancing 3 1 Distributed Grep The distributed grep program searches for a character pattern and writes intermediate data only when a match is encountered To simplify this further we can consider an empty map function which reads the input and never writes intermediate records Using the Hadoop streaming protocol 2 which reads records from STDIN and writes to STDOUT the map function is trivial usr bin perl while STDIN do nothing The reduce function is not needed since there is no intermediate data Thus this contrived program can be used to measure the maximal input data read rate for the map phase Figure 1 shows the effect of varying the number of map tasks equivalent to varying the size of an input split When the number of map tasks is high reasonable load balancing is possible because faster machines will have a chance to run more tasks than slower machines Also because the tasks finish
View Full Document
Unlocking...