DOC PREVIEW
Duke CPS 296.1 - SCOPE: Easy and Efficient Parallel

This preview shows page 1 out of 4 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 4 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 4 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

2/18/20101SCOPE: Easy and Efficient Parallel Processing of Massive DatasetsgAppeared in VLDB 2008Spring’10, CPS 296.1Vamsidhar ThummalaSlides adapted from author’s VLDB presentationDistributed Computing ParadigmsGoogle Yahoo! MicrosoftStorage GFS/BigTable (Files: Chunk)HDFS(Files: Chunk)Cosmos(Files: Extent)Computation MR Hadoop/Pig Cosmos/DryadInterface Sawzall/MR PigLatin SCOPE/DryadLINQThe distinction is not entirely clearSCOPE Introduction (1/2)• Used in Live Search (informal)– Web data analysis, user log analysis, relevance studies• Infrastructure– Large shared nothing commodity hardware• Programming goals similar to DyradLINQ– Sequential, single machine programming abstraction– SQL emphasis• MR is too rigid• Writing MR programs is like writing physical execution plans in DBMSSCOPE Introduction (2/2)• Structured Computations Optimized for Parallel Execution– A declarative scripting language– Easy to use: SQL-like syntax plus MapRecuce-like extensionsModular: provides a rich class of runtime operators–Modular: provides a rich class of runtime operators– Highly extensible: • Fully integrated with .NET framework • Provides interfaces for customized operations– Flexible programming style: nested expressions or a series of simple transformationsArchitecture• Cosmos Storage system– Append-only distributed file system for storing petabytes of data–Optimized for sequential pqI/O– Data is compressed and replicated• Cosmos Execution Environment– DryadSCOPE – An example• Compute the popular queries that has been requested at least 1000 times2/18/20102Data Model, Input, Output• Data model– Relation row set with typed columns• Input, Output–Relational non-relational sourcesRelational, nonrelational sources– EXTRACT, OUTPUT commands are provided– USING clause allows customization (C#)Select and Join• Supports basic aggregation functions• No subqueriesNo subqueries - ExampleSELECT Ra, RbFROM RWHERE Rb < 100AND (Ra > 5 OR EXISTS (SELECT * FROM SWHERE Sa < 20AND Sc = Rc));• Equivalent query in SCOPESQ = SELECT DISTINCT Sc FROM S WHERE Sa < 20;M1 = SELECT Ra, Rb, Rc FROM R WHERE Rb < 100;M2 = SELECT Ra, Rb, Rc, Sc FROM M1 LEFT OUTER JOIN SQ ON Rc == Sc;Q = SELECT Ra, Rb FROM M2WHERE Ra > 5 or Rc != Sc;Deep integration with C#• SCOPE supports C# expressions and built-in .NET functions/library– User-defined scalar expressionsUserdefined aggregation functions–User-defined aggregation functionsUser Defined Operators• SCOPE supports three extensible commands: PROCESS, REDUCE, COMBINE– Complements SELECT for complicated analysis–Easy to customize by extending built-in C# asy to custo e by e te d g bu tcomponents– Easy to reuse code in other SCOPE scripts• Any resemblance with already seen operators?– Apply, Fork (Dryad)– FILTER, FLATTEN, COGROUP (PigLatin)PROCESS• PROCESS command takes a rowset as input, processes each row, and outputs a sequence of rows• Yield in C#2/18/20103REDUCE• REDUCE command takes a grouped rowset, processes each group, and outputs zero, one, or multiple rows per group• Why do we need REDUCE when you have GROUP BY?COMBINE• COMBINE command takes two matching input rowsets, combines them in some way, and outputs a sequence of rows• Example: MultiSetDifferenceImporting ScriptsIMPORT <script_file> [PARAMS<par_name> = <value> [,…]]Enables modularity and information hidingSCOPE Execution• SCOPE Compiler– Generates query plan using default plan for each command– Combines adjacent operators into a single vertex when possible• SCOPE Optimizer– Based on Cascades frameworkCbd–Cost-based– Not completely implemented • Some tricks– Not enough details in paper• SCOPE Runtime (Probably, Dryad)– Composable physical operators– Operators are implemented in iterator model– Executes series of operators in pipelined fashionExample Query Plan SCOPE vs. Other languagesSCOPE DryadLINQ PigLatin HiveLanguage SQL-like Embedded SQL Scripting/Some resemblance to SQLSQL-likeCompiler Default Plan DAG Default Plan Default PlanOptimizerNo details ??RulebasedRulebasedOptimizerNo details ??Rule-based (basic)Rule-based (basic)Runtime Cosmos (Pipelined)Dryad (EPG) Hadoop (Pipelined)Hadoop (Pipelined)2/18/20104Discussion• Performance– Not enough details• Operators can be improvedECbi–Extractors, Combiners• SQL flavor language?– DryadLINQ vs.


View Full Document

Duke CPS 296.1 - SCOPE: Easy and Efficient Parallel

Documents in this Course
Lecture

Lecture

18 pages

Lecture

Lecture

6 pages

Lecture

Lecture

13 pages

Lecture

Lecture

5 pages

Load more
Download SCOPE: Easy and Efficient Parallel
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view SCOPE: Easy and Efficient Parallel and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view SCOPE: Easy and Efficient Parallel 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?