Unformatted text preview:

DryadLINQ A System for General Purpose Distributed Data Parallel Computing Using a High Level Language Yuan Yu Michael Isard Dennis Fetterly Mihai Budiu lfar Erlingsson1 Pradeep Kumar Gunda Jon Currey Microsoft Research Silicon Valley 1 joint affiliation Reykjav k University Iceland Abstract DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing It generalizes previous execution environments such as SQL MapReduce and Dryad in two ways by adopting an expressive data model of strongly typed NET objects and by supporting general purpose imperative and declarative operations on datasets within a traditional high level programming language A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary sideeffect free transformations on datasets and can be written and debugged using standard NET development tools The DryadLINQ system automatically and transparently translates the data parallel portions of the program into a distributed execution plan which is passed to the Dryad execution platform Dryad which has been in continuous operation for several years on production clusters made up of thousands of computers ensures efficient reliable execution of this plan We describe the implementation of the DryadLINQ compiler and runtime We evaluate DryadLINQ on a varied set of programs drawn from domains such as web graph analysis large scale log mining and machine learning We show that excellent absolute performance can be attained a general purpose sort of 1012 Bytes of data executes in 319 seconds on a 240 computer 960disk cluster as well as demonstrating near linear scaling of execution time on representative applications as we vary the number of computers used for a job 1 Introduction The DryadLINQ system is designed to make it easy for a wide variety of developers to compute effectively on large amounts of data DryadLINQ programs are written as imperative or declarative operations on datasets within a traditional high level programming language using an USENIX Association expressive data model of strongly typed NET objects The main contribution of this paper is a set of language extensions and a corresponding system that can automatically and transparently compile imperative programs in a general purpose language into distributed computations that execute efficiently on large computing clusters Our goal is to give the programmer the illusion of writing for a single computer and to have the system deal with the complexities that arise from scheduling distribution and fault tolerance Achieving this goal requires a wide variety of components to interact including cluster management software distributedexecution middleware language constructs and development tools Traditional parallel databases which we survey in Section 6 1 as well as more recent data processing systems such as MapReduce 15 and Dryad 26 demonstrate that it is possible to implement high performance large scale execution engines at modest financial cost and clusters running such platforms are proliferating Even so their programming interfaces all leave room for improvement We therefore believe that the language issues addressed in this paper are currently among the most pressing research areas for dataintensive computing and our work on the DryadLINQ system stems from this belief DryadLINQ exploits LINQ Language INtegrated Query 2 a set of NET constructs for programming with datasets to provide a powerful hybrid of declarative and imperative programming The system is designed to provide flexible and efficient distributed computation in any LINQ enabled programming language including C VB and F Objects in DryadLINQ datasets can be of any NET type making it easy to compute with data such as image patches vectors and matrices DryadLINQ programs can use traditional structuring constructs such as functions modules and libraries and express iteration using standard loops Crucially the distributed execution layer employs a fully functional declarative description of the data parallel component of the computation 8th USENIX Symposium on Operating Systems Design and Implementation 1 which enables sophisticated rewritings and optimizations like those traditionally employed by parallel databases In contrast parallel databases implement only declarative variants of SQL queries There is by now a widespread belief that SQL is too limited for many applications 15 26 31 34 35 One problem is that in order to support database requirements such as in place updates and efficient transactions SQL adopts a very restrictive type system In addition the declarative queryoriented nature of SQL makes it difficult to express common programming patterns such as iteration 14 Together these make SQL unsuitable for tasks such as machine learning content parsing and web graph analysis that increasingly must be run on very large datasets The MapReduce system 15 adopted a radically simplified programming abstraction however even common operations like database Join are tricky to implement in this model Moreover it is necessary to embed MapReduce computations in a scripting language in order to execute programs that require more than one reduction or sorting stage Each MapReduce instantiation is selfcontained and no automatic optimizations take place across their boundaries In addition the lack of any typesystem support or integration between the MapReduce stages requires programmers to explicitly keep track of objects passed between these stages and may complicate long term maintenance and re use of software components Several domain specific languages have appeared on top of the MapReduce abstraction to hide some of this complexity from the programmer including Sawzall 32 Pig 31 and other unpublished systems such as Facebook s HIVE These offer a limited hybridization of declarative and imperative programs and generalize SQL s stored procedure model Some wholequery optimizations are automatically applied by these systems across MapReduce computation boundaries However these approaches inherit many of SQL s disadvantages adopting simple custom type systems and providing limited support for iterative computations Their support for optimizations is less advanced than that in DryadLINQ partly because the underlying MapReduce execution platform is much less flexible than Dryad DryadLINQ and systems such as MapReduce are also distinguished from traditional


View Full Document

MIT 6 824 - DryadLINQ- A System for General-Purpose Distributed Data-Parallel Computing

Documents in this Course
Logging

Logging

4 pages

Load more
Loading Unlocking...
Login

Join to view DryadLINQ- A System for General-Purpose Distributed Data-Parallel Computing and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view DryadLINQ- A System for General-Purpose Distributed Data-Parallel Computing and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?