BIG DATA ANALYTICS/MANAGEMENT
CS 6350
Pig Latin

Outline

  What is Pig?
  Various Modes
  Execution Modes using the pig command
  Interactive Mode: Local Mode
  Interactive Mode: MapReduce Mode
  Batch Mode
  Pig Scripts
  Pig Latin Statements
  Debugging (dump, describe, explain, illustrate)
  Pig WordCount: Batch mode
  Pig WordCount: Interactive mode
  Pig UDFs (sample UDF function; compile and run the jar; create and run the Pig script)
  Pig Example
  References

What is Pig?

Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. At Yahoo!, 40% of all Hadoop jobs are run with Pig.

Various Modes

You can run Pig (execute Pig Latin statements and Pig commands) in various modes:

                     Local Mode   MapReduce Mode
  Interactive Mode      yes           yes
  Batch Mode            yes           yes

Execution Modes using the pig command

Local Mode: runs against the local host and file system.

  /* local mode */
  $ pig -x local ...

MapReduce Mode: requires access to a Hadoop cluster and an HDFS installation.

  /* mapreduce mode */
  $ pig ...
  or
  $ pig -x mapreduce ...

Interactive Mode, Local Mode

Uses the Grunt shell. Can be invoked with:

  $ pig -x local ...
  - Connecting to ...
  grunt>

  grunt> A = load '/etc/passwd' using PigStorage(':');
  grunt> B = foreach A generate $0 as id;
  grunt> dump B;

Interactive Mode, MapReduce Mode

Uses the Grunt shell. Can be invoked with:

  $ pig -x mapreduce ...
  - Connecting to ...
  grunt>

  or

  $ pig ...
  - Connecting to ...
  grunt>

Batch Mode

  /* id.pig */
  A = load '/etc/passwd' using PigStorage(':');  -- load the passwd file
  B = foreach A generate $0 as id;               -- extract the user IDs
  store B into 'id.out';                         -- write the results to a file named id.out

Note: 'id.out' is an HDFS path in MapReduce mode.

Batch Mode

  Local Mode:
    $ pig -x local id.pig
  MapReduce Mode:
    $ pig id.pig
    or
    $ pig -x mapreduce id.pig

Pig Scripts

Pig scripts let you place Pig Latin statements and Pig commands in a single file. Using the *.pig extension is good practice (please do it for the homework too).

Pig Latin Statements

  OutputRelation = InputRelation

A relation is a bag. A bag is a collection of tuples. A tuple is an ordered set of fields. A field is a piece of data.

Debugging

Use the DUMP operator to display results on your terminal screen.
Use the DESCRIBE operator to review the schema of a relation.
Use the EXPLAIN operator to view the logical, physical, or map-reduce execution plans used to compute a relation.
Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.

Debug: dump

  A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
  DUMP A;
  (John,18,4.0F)
  (Mary,19,3.7F)
  (Bill,20,3.9F)
  (Joe,22,3.8F)
  (Jill,20,4.0F)

Debug: describe

  grunt> A = load '/home/kma041000/pig/input' as (line:chararray);
  grunt> describe A;
  A: {line: chararray}
  grunt>

Debug: explain

  grunt> explain A;
  #-----------------------------------------------
  # New Logical Plan:
  #-----------------------------------------------
  A: (Name: LOStore Schema: line#3:chararray)
  |
  |---A: (Name: LOForEach Schema: line#3:chararray)
      |   |
      |   (Name: LOGenerate[false] Schema: line#3:chararray)ColumnPrune:InputUids=[3]ColumnPrune:OutputUids=[3]
      |   |   |
      |   |   (Name: Cast Type: chararray Uid: 3)
      |   |   |
      |   |   |---line:(Name: Project Type: bytearray Uid: 3 Input: 0 Column: (*))
      |   |
      |   |---(Name: LOInnerLoad[0] Schema: line#3:bytearray)
      |
      |---A: (Name: LOLoad Schema: line#3:bytearray)RequiredFields:null
  #-----------------------------------------------
  # Physical Plan:
  #-----------------------------------------------
  A: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-5
  |
  |---A: New For Each(false)[bag] - scope-4
      |   |
      |   Cast[chararray] - scope-2
      |   |
      |   |---Project[bytearray][0] - scope-1
      |
      |---A: Load(/home/kma041000/pig/input:org.apache.pig.builtin.PigStorage) - scope-0
  2013-03-19 18:19:40,210 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
  2013-03-19 18:19:40,255 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
  2013-03-19 18:19:40,255 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
  #--------------------------------------------------
  # Map Reduce Plan
  #--------------------------------------------------
  MapReduce node scope-6
  Map Plan
  A: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-5
  |
  |---A: New For Each(false)[bag] - scope-4
      |   |
      |   Cast[chararray] - scope-2
      |   |
      |   |---Project[bytearray][0] - scope-1
      |
      |---A: Load(/home/kma041000/pig/input:org.apache.pig.builtin.PigStorage) - scope-0
  --------
  Global sort: false
  ----------------
  grunt>

Debug: illustrate

  grunt> illustrate A;
  --------------------------------------
  | A   | line:chararray               |
  --------------------------------------
  |     | word count in pig tutorial   |
  --------------------------------------

Pig WordCount – Batch mode

Script file name: wordcount.pig, containing:

  A = load '/home/kma041000/pig/input';
  B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
  C = group B by word;
  D = foreach C generate COUNT(B), group;
  /* rm '/home/kma041000/pig/output'; */
  store D into '/home/kma041000/pig/output';

Pig WordCount – Batch mode

To run wordcount.pig in batch mode:

  {cs6360:~/BigData/Pig} pig -x mapreduce wordcount.pig
  ...
  Hadoop map/reduce is running ...
  {cs6360:~/BigData/Pig} hadoop fs -cat /home/kma041000/pig/output/part-r-00000
  4 in
  2 for
  4 pig
  4 2012
  2 word
  2 count
  4 school
  4 summer
  2 indiana
  4 tutorial

Pig WordCount – Interactive mode

  {cs6360:~/BigData/Pig} pig
  grunt> A = load '/home/kma041000/pig/input';
  grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
  grunt> C = group B by word;
  grunt> D = foreach C generate COUNT(B), group;
  grunt> rm '/home/kma041000/pig/output';
  grunt> dump D;                                     /* to see output in terminal */
  grunt> store D into '/home/kma041000/pig/output';  /* to part-r-00000 file */

Pig WordCount – Interactive mode

  grunt> cat
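The interactive WordCount session above stores D unsorted. As a hypothetical continuation (not in the original slides; it assumes the relation D defined in that session is still in scope in the same Grunt shell), the counts can be ordered before storing:

```pig
-- Hypothetical continuation of the WordCount session: sort by count.
-- $0 is the COUNT(B) field generated in D above.
grunt> E = order D by $0 desc;
grunt> dump E;   /* highest-frequency words first */
```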
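To tie the relation/bag/tuple model to a runnable script, the following sketch reuses the colon-delimited /etc/passwd input from the earlier slides and counts users per login shell. Treating field $6 as the shell is an assumption about the passwd layout, not something the slides state.

```pig
-- Sketch: count users per login shell in /etc/passwd.
A = load '/etc/passwd' using PigStorage(':');  -- each line becomes a tuple of fields
B = foreach A generate $6 as shell;            -- project the shell field
C = group B by shell;                          -- C holds a bag of B-tuples per shell
D = foreach C generate group, COUNT(B);        -- one (shell, count) tuple per group
dump D;
```

Each group statement produces tuples whose second field is a bag, which is why COUNT(B) is applied once per group.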