UT Dallas CS 6350 - 12.PigNew2014New


Slide titles
1. Slide 1
2. Outline
3. Data Processing Renaissance
4. Data Warehousing …?
5. Map-Reduce
6. Slide 6
7. Slide 7
8. The Map-Reduce Appeal
9. Disadvantages
10. Motivation
11. Enter Pig Latin
12. Outline
13. Pig Latin: Data Types
14. Example: Data Model
15. Pig Latin: Input/Output Data
16. Pig Latin: General Syntax
17. Pig Latin: Expression Table
18. Pig Latin: FOREACH with Flatten
19. Pig Latin: COGROUP
20. Slide 20
21. Pig Latin: Map-Reduce
22. Pig Latin: Other Commands
23. Pig Latin: Nested Operations
24. Pig Pen: Screen Shot
25. Pig Latin: Example 1
26. Data Flow
27. Equivalent Pig Latin
28. Example 2: Data Analysis Task
29. Data Flow
30. Equivalent Pig Latin
31. Quick Start and Interoperability
32. Quick Start and Interoperability
33. User-Code as a First-Class Citizen
34. Nested Data Model
35. Nested Data Model
36. CoGroup
37. Pig Features
38. Outline
39. Compilation
40. Parsing
41. Logical Plan
42. Physical Plan
43. Logic Plan
44. Physical Plan
45. Logical to Physical Plan for Group operator
46. Map Reduce Plan
47. Slide 47
48. MapReduce Plan
49. Slide 49
50. Pig Latin to Physical Plan
51. Logical Plan to Physical Plan
52. Physical Plan to Map-Reduce Plan
53. Implementation
54. Compilation into Map-Reduce
55. Performance
56. Strong & Weak Points
57. Summary

The Pig Experience: Building High-Level Data Flows on top of Map-Reduce
DISTRIBUTED INFORMATION SYSTEMS
VLDB paper
Source: Javeria Iqbal, Martin Theobald
Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava, Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience, VLDB 2009.

Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin
• Compilation into Map-Reduce
• Optimization
• Future Work

Data Processing Renaissance
Internet companies are swimming in data
• TBs/day for Yahoo! or Google
• PBs/day for Facebook
Data analysis is the “inner loop” of product innovation

Data Warehousing …?
Scale: Often not scalable enough
Price: Prohibitively expensive at web scale
• Up to $200K/TB
SQL:
• High-level declarative approach
• Little control over execution method

Map-Reduce
• Map: Performs filtering
• Reduce: Performs the aggregation
• These are two high-level declarative primitives to enable parallel processing
• BUT no complex database operations, e.g. joins

Execution Overview of Map-Reduce
• Split the program
• Master and worker threads
• Worker reads and parses key/value pairs and passes the pairs to the user-defined Map function
• Buffered pairs are written to local disk partitions; the locations of the buffered pairs are sent to the reduce workers
• Reduce worker sorts the data by the intermediate keys
• Unique keys and their values are passed to the user's Reduce function; the output is appended to the output file for this reduce partition

The Map-Reduce Appeal
Scale:
• Scalable due to simpler design
• Explicit programming model
• Only parallelizable operations
Price:
• Runs on cheap commodity hardware
• Less administration
SQL:
• Procedural control: a processing “pipe”

Disadvantages
1. Extremely rigid data flow
• Other flows hacked in: Join, Union, Split, Chains
[Figure: map (M) and reduce (R) stages]
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
4. No combined processing of multiple datasets
• Joins and other data processing operations

Motivation
Need a high-level, general data flow language

Enter Pig Latin
Pig Latin
Need a high-level, general data flow language

Pig Latin: Data Types
• Rich and simple data model
Simple Types: int, long, double, chararray, bytearray
Complex Types:
• Atom: String or number, e.g. ('apple')
• Tuple: Collection of fields, e.g. ('apple', 'mango')
• Bag: Collection of tuples, e.g. { ('apple', 'mango') ('apple', ('red', 'yellow')) }
• Map: Key, value pair

Example: Data Model
• Atom: contains a single atomic value, e.g. 'alice', 'lanker', 'ipod'
• Tuple: sequence of fields
• Bag: collection of tuples with possible duplicates
[Figure labels: Atom, Tuple]

Pig Latin: Input/Output Data
Input:
    queries = LOAD 'query_log.txt'
              USING myLoad()
              AS (userId, queryString, timestamp);
Output:
    STORE query_revenues INTO 'myoutput'
          USING myStore();

Pig Latin: General Syntax
• Discarding unwanted data: FILTER
• Comparison operators such as ==, eq, !=, neq
• Logical connectors AND, OR, NOT

Pig Latin: Expression Table

Pig Latin: FOREACH with Flatten
    expanded_queries = FOREACH queries
        GENERATE userId, expandQuery(queryString);
    -----------------
    expanded_queries = FOREACH queries
        GENERATE userId, FLATTEN(expandQuery(queryString));

Pig Latin: COGROUP
• Getting related data together: COGROUP
Suppose we have two data sets
    result: (queryString, url, position)
    revenue: (queryString, adSlot, amount)
    grouped_data = COGROUP result BY queryString,
                           revenue BY queryString;

Pig Latin: COGROUP vs. JOIN

Pig Latin: Map-Reduce
• Map-Reduce in Pig Latin
    map_result = FOREACH input GENERATE FLATTEN(map(*));
    key_group = GROUP map_result BY $0;
    output = FOREACH key_group GENERATE reduce(*);

Pig Latin: Other Commands
• UNION: Returns the union of two or more bags
• CROSS: Returns the cross product
• ORDER: Orders a bag by the specified field(s)
• DISTINCT: Eliminates duplicate tuples in a bag

Pig Latin: Nested Operations
    grouped_revenue = GROUP revenue BY queryString;
    query_revenues = FOREACH grouped_revenue {
        top_slot = FILTER revenue BY adSlot eq 'top';
        GENERATE queryString,
                 SUM(top_slot.amount),
                 SUM(revenue.amount);
    };

Pig Pen: Screen Shot

Pig Latin: Example 1
Suppose we have a table
    urls: (url, category, pagerank)
A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:
    SELECT category, AVG(pagerank)
    FROM urls WHERE pagerank > 0.2
    GROUP BY category HAVING COUNT(*) > 10^6

Data Flow
1. Filter good_urls by pagerank > 0.2
2. Group by category
3. Filter category by count > 10^6
4. Foreach category, generate avg. pagerank
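
The four data-flow steps above (filter, group, filter by count, average) can be sketched in plain Python to make the semantics concrete. This is only an illustration under stated assumptions, not the Pig Latin version from the slides: the function name avg_pagerank_by_category, its parameters, and the sample rows are all hypothetical, and the count threshold is lowered so that a tiny input still produces a result.

```python
from collections import defaultdict


def avg_pagerank_by_category(urls, min_pagerank=0.2, min_count=10**6):
    """Emulate the Example 1 data flow on an in-memory list.

    urls: iterable of (url, category, pagerank) tuples.
    All names here are illustrative assumptions, not part of Pig.
    """
    # Step 1: keep only high-pagerank urls.
    good_urls = [row for row in urls if row[2] > min_pagerank]

    # Step 2: group the surviving pageranks by category.
    groups = defaultdict(list)
    for _url, category, pagerank in good_urls:
        groups[category].append(pagerank)

    # Step 3: keep only categories with more than min_count urls.
    big_groups = {c: ranks for c, ranks in groups.items() if len(ranks) > min_count}

    # Step 4: for each remaining category, compute the average pagerank.
    return {c: sum(ranks) / len(ranks) for c, ranks in big_groups.items()}


# Hypothetical sample data; pageranks are chosen to be exact in binary.
sample = [
    ("a.example", "news", 0.75),
    ("b.example", "news", 0.5),
    ("c.example", "news", 0.125),   # dropped by step 1
    ("d.example", "sports", 0.875),
]

# Threshold lowered from 10**6 to 1 so the toy input yields output.
result = avg_pagerank_by_category(sample, min_count=1)
print(result)  # {'news': 0.625}
```

Each intermediate variable corresponds to one step of the flow, which is the same step-by-step style a data flow language encourages, in contrast to the single nested SQL statement.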

