Princeton COS 592 - Parallel Analysis with Sawzall


Interpreting the Data: Parallel Analysis with Sawzall

Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan
Google, Inc.
(Draft submitted to Scientific Programming Journal)

Abstract

Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on.

We present a system for automating such analyses. A filtering phase, in which a query is expressed using a new programming language, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The results are then collated and saved to a file. The design – including the separation into two phases, the form of the programming language, and the properties of the aggregators – exploits the parallelism inherent in having data and computation distributed across many machines.

1 Introduction

Many data sets are too large, too dynamic, or just too unwieldy to be housed productively in a relational database. One common scenario is a set of many plain files – sometimes amounting to petabytes of data – distributed across many disks on many computers (Figure 1). The files in turn comprise many records, organized along some obvious axes such as time or geography. Examples might include a web page repository used to construct the index for an internet search engine, the system health records from thousands of on-line server machines, telephone call records or other business transaction logs, network packet traces, web server query logs, or even higher-level data such as satellite imagery.

[Figure 1: Five racks of 50-55 working computers each, with four disks per machine. Such a configuration might have 250 TB of data to be interpreted. Tremendous parallelism can be achieved by running a filtering phase independently on all 250+ machines and aggregating their emitted results over a network between them (the arcs).]

Quite often the analyses applied to these data sets can be expressed simply, using operations much less sophisticated than a general SQL query. For instance, we might wish to count records that satisfy a certain property, or extract them, or search for anomalous records, or construct frequency histograms of the values of certain fields in the records.
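As a foretaste of the language this paper develops, the last kind of analysis above (a frequency histogram over one field) can be sketched in a few lines. In the sketch below, the schema file request.proto, the record type Request, and its status field are hypothetical stand-ins for a real protocol buffer definition; only the table declaration and emit statement follow the language as it is described later in the full paper:

    proto "request.proto"   # hypothetical schema defining a Request record

    # A table declaration names a predefined aggregator; a sum table
    # indexed by status code accumulates a frequency histogram across
    # every machine that runs the filter.
    hits_by_status: table sum[status: int] of int;

    # Each record is presented to the program in isolation as `input`.
    r: Request = input;
    emit hits_by_status[r.status] <- 1;

Counting records that satisfy a property, or extracting them, has the same shape: a test on the current record followed by an emit to the appropriate table.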
In other cases the analyses might be more intricate but still be expressible as a series of simpler steps, operations that can be mapped easily onto a set of records in files.

Since the data records live on many machines, it would be fruitful to exploit the computing power of those machines to perform these analyses. In particular, if the individual steps can be expressed as operations that can be evaluated one record at a time, we can distribute the calculation across all the machines and achieve very high throughput. (All the examples mentioned above have this form.) The results of these simple operations will then require an aggregation phase. For example, if we are counting records, we need to gather the counts from the individual machines before we can report the total count.

We therefore break our calculations into two phases. The first phase evaluates the analysis on each record individually, while the second phase aggregates the results (Figure 2). The system described in this paper goes even further, however. The analysis in the first phase is expressed in a new programming language that executes one record at a time, in isolation, while the second phase is restricted to a set of predefined aggregators that process the intermediate results generated by the first phase. By restricting the calculations to this model, we can achieve very high throughput. Although not all calculations fit this model well, the ability to harness a thousand or more machines with a few lines of code provides some compensation.

[Figure 2 (stages: raw data, filters, emitted data, aggregator, sorted results): The overall flow of filtering, aggregating, and collating. Each stage typically involves less data than the previous.]

Of course, there are still many subproblems that remain to be solved. The calculation must be divided into pieces and distributed across the machines holding the data, keeping the computation as near the data as possible to avoid network bottlenecks. And when there are many machines there is a high probability of some of them failing during the analysis, so the system must be fault tolerant. These are difficult and interesting problems, but they should be handled without the involvement of the user of the system. Google has several pieces of infrastructure, including GFS [9] and MapReduce [8], that cope with fault tolerance and reliability and provide a powerful framework upon which to implement a large, parallel system for distributed analysis. We therefore hide the details from the user and concentrate on the job at hand: expressing the analysis cleanly and executing it quickly.

2 Overview

To summarize, our system takes a user query written in a special-purpose programming language, applies it in isolation to each record in the data set on many machines in parallel, gathers the results, and aggregates them using a rich set of high-performance aggregators. These two stages run separately, often on different sets of computers.

A typical job will process anywhere from a gigabyte to many terabytes of data on hundreds or even thousands of machines in parallel. An analysis may consume months of CPU time, but with a thousand machines in parallel that will take only a few hours of real time.
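The division into two phases is easiest to see in a complete program. The sketch below is the canonical small example developed later in the full paper: the three table declarations name predefined sum aggregators (the second phase), while the statements after them run once per record, in isolation (the first phase). Each input record is assumed to hold a single floating-point value:

    # Second phase: three predefined sum aggregators collect the results.
    count: table sum of int;
    total: table sum of float;
    sum_of_squares: table sum of float;

    # First phase: executed independently, once per record.
    x: float = input;   # interpret the record as a single float
    emit count <- 1;
    emit total <- x;
    emit sum_of_squares <- x * x;

From the three aggregated totals, the count, mean, and standard deviation of the data follow directly, yet both the per-record statements and the aggregators run in parallel.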

