2/18/20101SCOPE: Easy and Efficient Parallel Processing of Massive DatasetsgAppeared in VLDB 2008Spring’10, CPS 296.1Vamsidhar ThummalaSlides adapted from author’s VLDB presentationDistributed Computing ParadigmsGoogle Yahoo! MicrosoftStorage GFS/BigTable (Files: Chunk)HDFS(Files: Chunk)Cosmos(Files: Extent)Computation MR Hadoop/Pig Cosmos/DryadInterface Sawzall/MR PigLatin SCOPE/DryadLINQThe distinction is not entirely clearSCOPE Introduction (1/2)• Used in Live Search (informal)– Web data analysis, user log analysis, relevance studies• Infrastructure– Large shared nothing commodity hardware• Programming goals similar to DyradLINQ– Sequential, single machine programming abstraction– SQL emphasis• MR is too rigid• Writing MR programs is like writing physical execution plans in DBMSSCOPE Introduction (2/2)• Structured Computations Optimized for Parallel Execution– A declarative scripting language– Easy to use: SQL-like syntax plus MapRecuce-like extensionsModular: provides a rich class of runtime operators–Modular: provides a rich class of runtime operators– Highly extensible: • Fully integrated with .NET framework • Provides interfaces for customized operations– Flexible programming style: nested expressions or a series of simple transformationsArchitecture• Cosmos Storage system– Append-only distributed file system for storing petabytes of data–Optimized for sequential pqI/O– Data is compressed and replicated• Cosmos Execution Environment– DryadSCOPE – An example• Compute the popular queries that has been requested at least 1000 times2/18/20102Data Model, Input, Output• Data model– Relation row set with typed columns• Input, Output–Relational non-relational sourcesRelational, nonrelational sources– EXTRACT, OUTPUT commands are provided– USING clause allows customization (C#)Select and Join• Supports basic aggregation functions• No subqueriesNo subqueries - ExampleSELECT Ra, RbFROM RWHERE Rb < 100AND (Ra > 5 OR EXISTS (SELECT * FROM SWHERE Sa < 20AND Sc = Rc));• Equivalent query in SCOPESQ = SELECT DISTINCT Sc FROM S WHERE Sa < 20;M1 = SELECT Ra, Rb, Rc FROM R WHERE Rb < 100;M2 = SELECT Ra, Rb, Rc, Sc FROM M1 LEFT OUTER JOIN SQ ON Rc == Sc;Q = SELECT Ra, Rb FROM M2WHERE Ra > 5 or Rc != Sc;Deep integration with C#• SCOPE supports C# expressions and built-in .NET functions/library– User-defined scalar expressionsUserdefined aggregation functions–User-defined aggregation functionsUser Defined Operators• SCOPE supports three extensible commands: PROCESS, REDUCE, COMBINE– Complements SELECT for complicated analysis–Easy to customize by extending built-in C# asy to custo e by e te d g bu tcomponents– Easy to reuse code in other SCOPE scripts• Any resemblance with already seen operators?– Apply, Fork (Dryad)– FILTER, FLATTEN, COGROUP (PigLatin)PROCESS• PROCESS command takes a rowset as input, processes each row, and outputs a sequence of rows• Yield in C#2/18/20103REDUCE• REDUCE command takes a grouped rowset, processes each group, and outputs zero, one, or multiple rows per group• Why do we need REDUCE when you have GROUP BY?COMBINE• COMBINE command takes two matching input rowsets, combines them in some way, and outputs a sequence of rows• Example: MultiSetDifferenceImporting ScriptsIMPORT <script_file> [PARAMS<par_name> = <value> [,…]]Enables modularity and information hidingSCOPE Execution• SCOPE Compiler– Generates query plan using default plan for each command– Combines adjacent operators into a single vertex when possible• SCOPE Optimizer– Based on Cascades frameworkCbd–Cost-based– Not completely implemented • Some tricks– Not enough details in paper• SCOPE Runtime (Probably, Dryad)– Composable physical operators– Operators are implemented in iterator model– Executes series of operators in pipelined fashionExample Query Plan SCOPE vs. Other languagesSCOPE DryadLINQ PigLatin HiveLanguage SQL-like Embedded SQL Scripting/Some resemblance to SQLSQL-likeCompiler Default Plan DAG Default Plan Default PlanOptimizerNo details ??RulebasedRulebasedOptimizerNo details ??Rule-based (basic)Rule-based (basic)Runtime Cosmos (Pipelined)Dryad (EPG) Hadoop (Pipelined)Hadoop (Pipelined)2/18/20104Discussion• Performance– Not enough details• Operators can be improvedECbi–Extractors, Combiners• SQL flavor language?– DryadLINQ vs.
View Full Document