UT Dallas CS 6350 - PigNew2014New - D3102935

School: University of Texas at Dallas

Course: CS 6350 - Big Data Management and Analytics

Pages: 57

This preview shows page 1-2-3-4-26-27-28-54-55-56-57 out of 57 pages.

DISTRIBUTED INFORMATION SYSTEMS

The Pig Experience: Building High-Level Dataflows on top of Map-Reduce

Javeria Iqbal, Martin Theobald

Source (VLDB paper): Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava. "Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience." VLDB 2009.

Outline
    Map-Reduce and the need for Pig Latin
    Pig Latin
    Compilation into Map-Reduce
    Optimization
    Future Work

Data Processing Renaissance
    Internet companies are swimming in data
        TBs/day for Yahoo! or Google
        PBs/day for Facebook
    Data analysis is the "inner loop" of product innovation

Data Warehousing
    Scale: often not scalable enough
    Price: prohibitively expensive at web scale (up to $200K/TB)
    SQL: high-level declarative approach, little control over execution method

Map-Reduce
    Map: performs filtering
    Reduce: performs the aggregation
    These are two high-level declarative primitives that enable parallel processing,
    BUT there are no complex database operations, e.g. joins.

Execution Overview of Map-Reduce (map side)
    1. Split the program: master and worker threads.
    2. A worker reads and parses key/value pairs and passes the pairs to the user-defined Map function.
    3. Buffered pairs are written to local disk partitions.
    4. The locations of the buffered pairs are sent to the reduce workers.

Execution Overview of Map-Reduce (reduce side)
    1. The reduce worker sorts the data by the intermediate keys.
    2. Unique keys and their values are passed to the user's Reduce function.
    3. The output is appended to the output file for this reduce partition.

The Map-Reduce Appeal
    Scale: scalable due to simpler design; explicit programming model; only parallelizable operations
    Price: runs on cheap commodity hardware; less administration
    Instead of SQL: procedural control, a processing "pipe"

Disadvantages
    1. Extremely rigid data flow (M -> R)
       Other flows are hacked in: Join, Union, Split, M-R chains
    2. Common operations must be coded by hand
       Join, filter, projection, aggregates, sorting, distinct
    3. Semantics are hidden inside the map and reduce functions
       Difficult to maintain, extend, and optimize
    4. No combined processing of multiple datasets
       Joins and other data processing operations

Motivation
    Need a high-level, general data flow language
    Enter Pig Latin

Pig Latin: Data Types
    Rich and simple data model
    Simple types: int, long, double, chararray, bytearray
    Complex types:
        Atom: string or number, e.g. 'apple'
        Tuple: collection of fields, e.g. ('apple', 'mango')
        Bag: collection of tuples, e.g. {('apple', 'mango'), ('apple', ('red', 'yellow'))}
        Map: key/value pair

Example Data Model
    Atom: contains a single atomic value
    Tuple: sequence of fields
    Bag: collection of tuples, with possible duplicates
    [Figure: atom and tuple examples built from the values lanker, alice, ipod]

Pig Latin: Input/Output Data
    Input:
        queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
    Output:
        STORE query_revenues INTO 'myoutput' USING myStore();

Pig Latin: General Syntax
    Discarding unwanted data: FILTER
    Comparison operators such as ==, eq, !=, neq
    Logical connectors: AND, OR, NOT

Pig Latin: Expression Table
    [Table]

Pig Latin: FOREACH with Flatten
    Without FLATTEN:
        expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
    With FLATTEN:
        expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));

Pig Latin: COGROUP
    Getting related data together: COGROUP
    Suppose we have two data sets:
        result:  (queryString, url, position)
        revenue: (queryString, adSlot, amount)
    grouped_data = COGROUP result BY queryString, revenue BY queryString;

Pig Latin: COGROUP vs. JOIN
    [Figure]

Pig Latin: Map-Reduce
    Map-Reduce expressed in Pig Latin:
        map_result = FOREACH input GENERATE FLATTEN(map(*));
        key_group  = GROUP map_result BY $0;
        output     = FOREACH key_group GENERATE reduce(*);

Pig Latin: Other Commands
    UNION: returns the union of two or more bags
    CROSS: returns the cross product
    ORDER: orders a bag by the specified field(s)
    DISTINCT: eliminates duplicate tuples in a bag

Pig Latin: Nested Operations
    grouped_revenue = GROUP revenue BY queryString;
    query_revenues = FOREACH grouped_revenue {
        top_slot = FILTER revenue BY adSlot eq 'top';
        GENERATE queryString, SUM(top_slot.amount), SUM(revenue.amount);
    };

Pig Pen Screen Shot
    [Figure: screenshot]

Pig Latin: Example 1
    Suppose we have a table urls: (url, category, pagerank).
    A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:
        SELECT category, AVG(pagerank)
        FROM urls WHERE pagerank > 0.2
        GROUP BY category HAVING COUNT(*) > 10^6

    Data flow:
        Filter good_urls by pagerank > 0.2
        Group by category
        Filter category by count > 10^6
        Foreach category generate avg. pagerank

    Equivalent Pig Latin:
        good_urls  = FILTER urls BY pagerank > 0.2;
        groups     = GROUP good_urls BY category;
        big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
        output     = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

Example 2: Data Analysis Task
    Find the top 10 most visited pages in each category.

    Visits
        User   Url          Time
        Amy    cnn.com      8:00
        Amy    bbc.com      10:00
        Amy    flickr.com   10:05
        Fred   cnn.com      12:00

    Url Info
        Url          Category   PageRank
        cnn.com      News       0.9
        bbc.com      News       0.8
        flickr.com   Photos     0.7
        espn.com     Sports     0.9

    Data flow:
        Load Visits
        Group by url
        Foreach url generate count
        Load Url Info
        Join on url
        Group by category
        Foreach category generate top10 urls

    Equivalent Pig Latin:
        visits      = load '/data/visits' as (user, url, time);
        gVisits     = group visits by url;
        visitCounts = foreach gVisits generate url, count(visits);
        urlInfo     = load '/data/urlInfo' as (url, category, pRank);
        visitCounts = join visitCounts by url, urlInfo by url;
        gCategories = group visitCounts by category;
        topUrls     = foreach gCategories generate top(visitCounts,10);
        store topUrls into '/data/topUrls';

Quick Start and Interoperability
    The next two slides repeat the Example 2 script with callouts:
        Operates directly over files (the load and store statements).
        ... optional (callout truncated in the preview; the preview text ends partway through the script).
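The three-statement "Map-Reduce in Pig Latin" pattern from the slides (FLATTEN the map output, GROUP BY the first field, apply reduce per group) can be sketched in plain Python. This is an illustrative analogue, not Pig code: the helper run_map_reduce and the word-count functions are hypothetical names introduced here, and everything runs in memory on one machine.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """In-memory analogue of the three Pig Latin statements (hypothetical helper)."""
    # FOREACH input GENERATE FLATTEN(map(*)):
    # each record yields zero or more (key, value) pairs, flattened into one list.
    map_result = [pair for record in records for pair in map_fn(record)]

    # GROUP map_result BY $0: collect the values that share the same key.
    key_groups = defaultdict(list)
    for key, value in map_result:
        key_groups[key].append(value)

    # FOREACH key_group GENERATE reduce(*): one reduce call per distinct key.
    return {key: reduce_fn(key, values) for key, values in key_groups.items()}


def word_count_map(line):
    # Emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]


def word_count_reduce(key, values):
    # Sum the ones emitted for this word.
    return sum(values)


counts = run_map_reduce(["pig latin", "pig pen"], word_count_map, word_count_reduce)
print(counts)
```

The grouping step is where a real Map-Reduce system would shuffle and sort across machines; the dictionary here only stands in for that step.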
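The Example 1 data flow (filter by pagerank, group by category, keep large groups, average) can also be traced step by step in plain Python. This is a sketch under stated assumptions: the function name, the parameter names, and the sample rows are hypothetical, and the size threshold is lowered from the slide's 10^6 so the tiny sample produces a result.

```python
from collections import defaultdict


def avg_pagerank_of_big_categories(urls, min_pagerank=0.2, min_count=10**6):
    """Hypothetical helper mirroring the four Pig Latin statements of Example 1."""
    # good_urls = FILTER urls BY pagerank > 0.2
    good_urls = [row for row in urls if row[2] > min_pagerank]

    # groups = GROUP good_urls BY category
    groups = defaultdict(list)
    for _url, category, pagerank in good_urls:
        groups[category].append(pagerank)

    # big_groups = FILTER groups BY COUNT(good_urls) > 10^6
    big_groups = {c: ranks for c, ranks in groups.items() if len(ranks) > min_count}

    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
    return {c: sum(ranks) / len(ranks) for c, ranks in big_groups.items()}


# Made-up sample rows (url, category, pagerank); not taken from the slides.
sample_urls = [
    ("a.example", "News", 0.75),
    ("b.example", "News", 0.25),
    ("c.example", "News", 0.125),   # dropped by the pagerank filter
    ("d.example", "Sports", 0.5),   # its category is too small to survive
]

averages = avg_pagerank_of_big_categories(sample_urls, min_count=1)
print(averages)
```

Each intermediate variable corresponds to one named relation in the Pig Latin version, which is the point the slides make about Pig Latin reading as a sequence of data flow steps rather than one nested query.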
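For comparison, the Example 2 task (top visited pages per category) can be sketched in plain Python using the Visits and Url Info rows from the slide. The function name top_urls_per_category is a hypothetical helper, and the join is assumed to be an inner join, so urls with no visits drop out.

```python
from collections import Counter, defaultdict

# Rows from the Visits and Url Info tables on the slide.
visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]


def top_urls_per_category(visits, url_info, n=10):
    """Hypothetical helper following the slide's data flow step by step."""
    # Group visits by url; foreach url generate count.
    visit_counts = Counter(url for _user, url, _time in visits)

    # Join on url (assumed inner join: unvisited urls are dropped).
    joined = [
        (url, category, visit_counts[url])
        for url, category, _rank in url_info
        if url in visit_counts
    ]

    # Group by category.
    by_category = defaultdict(list)
    for url, category, count in joined:
        by_category[category].append((url, count))

    # Foreach category generate the top-n urls by visit count.
    return {
        category: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for category, rows in by_category.items()
    }


top_urls = top_urls_per_category(visits, url_info)
print(top_urls)
```

With the slide's data, espn.com has no visits, so the Sports category does not appear in the result under the inner-join assumption.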
