BIG DATA ANALYTICS/MANAGEMENT
CS 6350
Pig Latin

Outline

  What is Pig?
  Various Modes
  Execution Modes using the pig command
  Interactive Mode: Local Mode
  Interactive Mode: MapReduce Mode
  Batch Mode
  Pig Scripts
  Pig Latin Statements
  Debugging (dump, describe, explain, illustrate)
  Pig WordCount: Batch mode
  Pig WordCount: Interactive mode
  Pig UDFs (sample UDF function; compile and run the jar; create and run the Pig script)
  Pig Example
  References

What is Pig?

Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. At Yahoo!, 40% of all Hadoop jobs are run with Pig.

Various Modes

You can run Pig (execute Pig Latin statements and Pig commands) in various modes:

                     Local Mode   MapReduce Mode
  Interactive Mode      yes           yes
  Batch Mode            yes           yes

Execution Modes using the pig command

Local Mode: runs against the local host and file system.

  /* local mode */
  $ pig -x local ...

MapReduce Mode: requires access to a Hadoop cluster and an HDFS installation.

  /* mapreduce mode */
  $ pig ...
  or
  $ pig -x mapreduce ...

Interactive Mode, Local Mode

Uses the Grunt shell. Can be invoked with:

  $ pig -x local ...
  - Connecting to ...
  grunt>

  grunt> A = load '/etc/passwd' using PigStorage(':');
  grunt> B = foreach A generate $0 as id;
  grunt> dump B;

Interactive Mode, MapReduce Mode

Uses the Grunt shell. Can be invoked with:

  $ pig -x mapreduce ...
  - Connecting to ...
  grunt>

  or

  $ pig ...
  - Connecting to ...
  grunt>

Batch Mode

  /* id.pig */
  A = load '/etc/passwd' using PigStorage(':');  -- load the passwd file
  B = foreach A generate $0 as id;               -- extract the user IDs
  store B into 'id.out';                         -- write the results to a file named id.out

Note: 'id.out' is an HDFS path in MapReduce mode.

Batch Mode

  Local Mode:
    $ pig -x local id.pig
  MapReduce Mode:
    $ pig id.pig
    or
    $ pig -x mapreduce id.pig

Pig Scripts

Pig scripts let you place Pig Latin statements and Pig commands in a single file. Using the *.pig extension is good practice (please do it for the homework too).

Pig Latin Statements

  OutputRelation = InputRelation

A relation is a bag. A bag is a collection of tuples. A tuple is an ordered set of fields. A field is a piece of data.

Debugging

Use the DUMP operator to display results on your terminal screen.
Use the DESCRIBE operator to review the schema of a relation.
Use the EXPLAIN operator to view the logical, physical, or map-reduce execution plans used to compute a relation.
Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.

Debug: dump

  A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
  DUMP A;
  (John,18,4.0F)
  (Mary,19,3.7F)
  (Bill,20,3.9F)
  (Joe,22,3.8F)
  (Jill,20,4.0F)

Debug: describe

  grunt> A = load '/home/kma041000/pig/input' as (line:chararray);
  grunt> describe A;
  A: {line: chararray}
  grunt>

Debug: explain

  grunt> explain A;
  #-----------------------------------------------
  # New Logical Plan:
  #-----------------------------------------------
  A: (Name: LOStore Schema: line#3:chararray)
  |
  |---A: (Name: LOForEach Schema: line#3:chararray)
      |   |
      |   (Name: LOGenerate[false] Schema: line#3:chararray)ColumnPrune:InputUids=[3]ColumnPrune:OutputUids=[3]
      |   |   |
      |   |   (Name: Cast Type: chararray Uid: 3)
      |   |   |
      |   |   |---line:(Name: Project Type: bytearray Uid: 3 Input: 0 Column: (*))
      |   |
      |   |---(Name: LOInnerLoad[0] Schema: line#3:bytearray)
      |
      |---A: (Name: LOLoad Schema: line#3:bytearray)RequiredFields:null
  #-----------------------------------------------
  # Physical Plan:
  #-----------------------------------------------
  A: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-5
  |
  |---A: New For Each(false)[bag] - scope-4
      |   |
      |   Cast[chararray] - scope-2
      |   |
      |   |---Project[bytearray][0] - scope-1
      |
      |---A: Load(/home/kma041000/pig/input:org.apache.pig.builtin.PigStorage) - scope-0
  2013-03-19 18:19:40,210 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
  2013-03-19 18:19:40,255 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
  2013-03-19 18:19:40,255 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
  #--------------------------------------------------
  # Map Reduce Plan
  #--------------------------------------------------
  MapReduce node scope-6
  Map Plan
  A: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-5
  |
  |---A: New For Each(false)[bag] - scope-4
      |   |
      |   Cast[chararray] - scope-2
      |   |
      |   |---Project[bytearray][0] - scope-1
      |
      |---A: Load(/home/kma041000/pig/input:org.apache.pig.builtin.PigStorage) - scope-0
  --------
  Global sort: false
  ----------------
  grunt>

Debug: illustrate

  grunt> illustrate A;
  --------------------------------------
  | A   | line:chararray               |
  --------------------------------------
  |     | word count in pig tutorial   |
  --------------------------------------

Pig WordCount – Batch mode

Script file name: wordcount.pig, containing:

  A = load '/home/kma041000/pig/input';
  B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
  C = group B by word;
  D = foreach C generate COUNT(B), group;
  /* rm '/home/kma041000/pig/output'; */
  store D into '/home/kma041000/pig/output';

Pig WordCount – Batch mode

To run wordcount.pig in batch mode:

  {cs6360:~/BigData/Pig} pig -x mapreduce wordcount.pig
  ...
  Hadoop map/reduce is running ...
  {cs6360:~/BigData/Pig} hadoop fs -cat /home/kma041000/pig/output/part-r-00000
  4 in
  2 for
  4 pig
  4 2012
  2 word
  2 count
  4 school
  4 summer
  2 indiana
  4 tutorial

Pig WordCount – Interactive mode

  {cs6360:~/BigData/Pig} pig
  grunt> A = load '/home/kma041000/pig/input';
  grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
  grunt> C = group B by word;
  grunt> D = foreach C generate COUNT(B), group;
  grunt> rm '/home/kma041000/pig/output';
  grunt> dump D;                                     /* to see output in terminal */
  grunt> store D into '/home/kma041000/pig/output';  /* to part-r-00000 file */

Pig WordCount – Interactive mode

  grunt> cat
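The interactive WordCount session above stores D unsorted. As a hypothetical continuation (not in the original slides; it assumes the relation D defined in that session is still in scope in the same Grunt shell), the counts can be ordered before storing:

```pig
-- Hypothetical continuation of the WordCount session: sort by count.
-- $0 is the COUNT(B) field generated in D above.
grunt> E = order D by $0 desc;
grunt> dump E;   /* highest-frequency words first */
```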
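To tie the relation/bag/tuple model to a runnable script, the following sketch reuses the colon-delimited /etc/passwd input from the earlier slides and counts users per login shell. Treating field $6 as the shell is an assumption about the passwd layout, not something the slides state.

```pig
-- Sketch: count users per login shell in /etc/passwd.
A = load '/etc/passwd' using PigStorage(':');  -- each line becomes a tuple of fields
B = foreach A generate $6 as shell;            -- project the shell field
C = group B by shell;                          -- C holds a bag of B-tuples per shell
D = foreach C generate group, COUNT(B);        -- one (shell, count) tuple per group
dump D;
```

Each group statement produces tuples whose second field is a bag, which is why COUNT(B) is applied once per group.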