MIT 6.830 - Lecture Notes


DryadLINQ
11/24/09 -- Lecture 20

What is DryadLINQ?
(Programming language and distributed execution framework for manipulating very large datasets spread across many computers.)
LINQ -- programming language
Dryad -- execution framework

What are the goals of Dryad?
Provide parallel database-style scalability in a more general-purpose language. Users write a single program and don't worry about how it is partitioned across many machines.

Why not SQL?
Want to run arbitrary programs over data. Databases do allow users to write user-defined functions and create data types, but this can be awkward at times. Want better integration with the development framework and main (host) language.

Why not MapReduce?
It has a very restricted communication pattern that often requires multiple map/reduce phases to express complex tasks (e.g., joins). No cross-phase optimization.

What types of programs are they trying to run?
They are not trying to provide transactions; they are focused more on chewing on a bunch of big data files than on applying a few small updates, etc.

What is the LINQ programming model?
SQL-like, except that it is embedded in C# and can make use of C# functions. It is compiled and type-checked using C#.
Example:

Ex 1:

using System;
using System.Linq;
using System.Collections.Generic;

class app {
    static void Main() {
        string[] names = { "Burke", "Connor", "Frank", "Everett",
                           "Albert", "George", "Harris", "David" };
        IEnumerable<string> query = from s in names
                                    where s.Length == 5
                                    orderby s
                                    select s.ToUpper();
        foreach (string item in query)
            Console.WriteLine(item);
    }
}

Ex 2:

string[] names = { "Albert", "Burke", "Connor", "David",
                   "Everett", "Frank", "George", "Harris" };

// group by length
var groups = names.GroupBy(s => s.Length, s => s[0]);

foreach (IGrouping<int, char> group in groups) {
    Console.WriteLine("Strings of length {0}", group.Key);
    foreach (char value in group)
        Console.WriteLine("  {0}", value);
}

This variation prints the following (groups appear in first-occurrence order):

Strings of length 6
  A
  C
  G
  H
Strings of length 5
  B
  D
  F
Strings of length 7
  E

Ex 3: (In general, you can combine a sequence of operations together.)

var adjustedScoreTriples =
    from d in scoreTriples
    join r in staticRank on d.docID equals r.key
    select new QueryScoreDocIDTriple(d, r);

var rankedQueries =
    from s in adjustedScoreTriples
    group s by s.query into g
    select TakeTopQueryResults(g);

Can think of this as a graph of operators, just like in SQL (except that SQL plans are mostly trees, whereas here they are DAGs):

    scoreTriples --\
                    join --> adjustedScoreTriples --> group by --> takeTop... --> rankedQueries
    staticRank ----/

Operators: select, project, order by, join, group by

Why is that good? What are the advantages?
Avoids the "impedance mismatch" where you pack a SQL query into a string, send it to the SQL engine (which typechecks the inputs), then retrieve and typecheck the results, etc.

Any disadvantages?
Binds you to a specific language; complicates the host language; may not support all of SQL.

How does this get distributed in DryadLINQ?
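The grouping semantics in Ex 2 can be sketched outside C# too. This is a hypothetical Python translation (the variable names and dict-based approach are mine, not from the notes); like LINQ's GroupBy, a plain dict preserves first-occurrence key order and input order within each group:

```python
# Hypothetical Python sketch of Ex 2: group first letters by name length.
names = ["Albert", "Burke", "Connor", "David",
         "Everett", "Frank", "George", "Harris"]

# Mirrors GroupBy(s => s.Length, s => s[0]): keys in first-occurrence
# order, element values in input order within each group.
groups = {}
for s in names:
    groups.setdefault(len(s), []).append(s[0])

for length, initials in groups.items():
    print("Strings of length", length)
    for ch in initials:
        print(" ", ch)
```

This prints the same three groups as the C# version: lengths 6, 5, 7 in that order.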
(Show architecture diagram.) Label pieces -- note similarity to BigTable/MapReduce. Note that the output of each operator goes to disk.

Simple example: sort.

[Figure: distributed sort -- workers W1-W3 sample their input partitions (e.g., "a,s,q,m", "b,d,b,a"); a histogram of the samples determines range boundaries (1: a-c, 2: d-z); tuples are repartitioned by range, interleaved, and sorted locally ("a,b,b,a" -> "a,a,b,b"; "s,q,m,d" -> "d,m,q,s").]

Input data partitioned across nodes, just like in an RDBMS.

[Figure: architecture -- Client, manager (MGR), name server (NS), workers W1-W3, storage system. Image by MIT OpenCourseWare.]

Example 2 -- Aggregate, e.g., Q2:

var names = GetTable<LineRecord>("file://names.txt");
// group by length
var groups = names.GroupBy(s => s.Length, s => s[0]);
var output = ToDryadTable(groups, "file://out.txt");

Very similar to how a DDBMS would do this.

How many partitions to create?
Just say it "depends on the size of the input data."

Are there restrictions on the programs that can be run?
Says that they must be "side effect free." What does that mean? No modifications to any shared state.

How restrictive is that?
Can make it tough to do some operations -- for example, you can't write a group by in the language without special group-by support. Ex:

curgroup = 0
saved = NULL

void aggGbySorted(t, groupf, aggf)
    if (groupf(t) != curgroup && saved != NULL)
        emit aggf(saved)
        saved = new list().append(t)
        curgroup = groupf(t)
    else
        saved.append(t)

(Can only auto-parallelize user-defined funcs that are stateless.)

Can I read shared state?
Yes. How is that handled? Shipped to each processing node.

What if I can't figure out how to write my program in the above constructs?
Apply(f, list) --> can perform arbitrary computation on the list, but it will be run on a single node.

What compiler optimizations are supported?

Static (e.g., at compile time):
- Pipeline operators on the same node. Ex (somewhat unclear how they choose what to combine): pipeline a sort with a partial aggregation on each node.
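The aggGbySorted pseudocode above can be made runnable. This is a hypothetical Python translation (function and variable names follow the pseudocode; the generator structure is mine): a streaming group-by over input already sorted on the grouping key, holding only the current group in memory.

```python
# Hypothetical Python version of aggGbySorted: streaming group-by
# aggregation over tuples pre-sorted on the grouping key.
def agg_gby_sorted(tuples, groupf, aggf):
    curgroup = None
    saved = []
    for t in tuples:
        if saved and groupf(t) != curgroup:
            yield aggf(saved)   # emit the finished group
            saved = [t]         # start the next group
        else:
            saved.append(t)
        curgroup = groupf(t)
    if saved:                   # flush the last group
        yield aggf(saved)

# Usage: sum values per key over key-sorted (key, value) pairs.
rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)]
sums = list(agg_gby_sorted(rows,
                           lambda t: t[0],
                           lambda g: (g[0][0], sum(v for _, v in g))))
# sums == [("a", 3), ("b", 5), ("c", 7)]
```

The point of the original remark holds here too: because this function carries state (curgroup, saved) across tuples, it is not a stateless UDF and cannot be auto-parallelized without special group-by support.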
- Eager aggregation -- see aggregate above; perform partial aggregation before repartitioning.
- I/O reduction -- don't write intermediate data to files; instead, store it in a buffer or send it over a TCP pipe when possible.

Dynamic (e.g., at run time):
- Dynamic aggregation -- perform eager aggregation at edges between data centers.
- Dynamic partitioning -- choose the number of partitions to create based on the amount of data (unclear how they do this).
- Sort optimizations -- specific plan for order by: sample data to determine ranges, compute a histogram, split into ranges with equal numbers of tuples, redistribute, and sort on each node.

Optimizations are very specialized to specific programs -- e.g., they seem to have specific optimizations to make orderby, groupby, etc. run fast, rather than true general-purpose opts.

Lots of talk about how they can "automatically apply" many different optimizations and infer properties like commutativity, etc., but it is a bit hard to figure out how these actually work. Their other papers have more detail.

In general, this means that programs with lots of custom ops may do many repartitionings -- e.g., after any selection that modifies tuples, repartitioning is needed.

Fault tolerance
Don't really mention it -- though they claim the ability to restart slow nodes as a contribution (kind of an oversight). Presumably they do something like what MapReduce does -- if they notice that a particular part of a job fails, they can simply restart/rerun it. Less clear that this works as well as in MapReduce, where there is a well-defined notion of a "job" assigned to one or more nodes.

Debugging,
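The order-by plan sketched above (sample, histogram, range repartition, local sort) can be illustrated in miniature. This is a hypothetical single-process Python sketch, not DryadLINQ's implementation; the function name and parameters are mine:

```python
# Hypothetical sketch of the order-by plan: sample the data, pick range
# boundaries with roughly equal sample counts, route each tuple to its
# range "partition", then sort each partition locally.
import bisect
import random

def range_partitioned_sort(data, num_parts, sample_size=100):
    # 1. Sample to estimate the key distribution (the "histogram").
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    # 2. Boundaries splitting the sample into equal-count ranges.
    boundaries = [sample[i * len(sample) // num_parts]
                  for i in range(1, num_parts)]
    # 3. "Redistribute": route each tuple to its range partition.
    parts = [[] for _ in range(num_parts)]
    for x in data:
        parts[bisect.bisect_right(boundaries, x)].append(x)
    # 4. Local sort per partition; concatenation is globally sorted.
    return [sorted(p) for p in parts]

parts = range_partitioned_sort(list(range(1000, 0, -1)), num_parts=3)
merged = [x for p in parts for x in p]
# merged is the fully sorted sequence 1..1000
```

Because the partitions cover disjoint, ordered key ranges, no merge step is needed across nodes; sampling is what keeps the per-partition tuple counts roughly equal.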

