UT Dallas CS 6350 - 09.BigDataToolsUDF_PigLatin

Contents:
- What is Pig?
- Various Modes
- Execution Modes using the pig command
- Interactive Mode, Local Mode
- Interactive Mode, Mapreduce Mode
- Batch Mode
- Pig Scripts
- Pig Latin Statements
- Debugging (dump, describe, explain, illustrate)
- Pig WordCount – Batch mode
- Pig WordCount – Interactive mode
- Pig UDF (Sample UDF Function, Compile & Run Jar, Create Pig Script, Run Pig Script)
- Pig Example
- References
- Thank You

BIG DATA ANALYTICS/MANAGEMENT
CS 6350
Pig Latin

What is Pig?
Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. At Yahoo!, 40% of all Hadoop jobs are run with Pig.

Various Modes
You can run Pig (execute Pig Latin statements and Pig commands) in various modes:

                       Local Mode    Mapreduce Mode
    Interactive Mode   yes           yes
    Batch Mode         yes           yes

Execution Modes using the pig command
Local Mode: uses the local host and local file system.
    /* local mode */
    $ pig -x local ...
Mapreduce Mode: requires access to a Hadoop cluster and an HDFS installation.
    /* mapreduce mode */
    $ pig ...
    or
    $ pig -x mapreduce ...

Interactive Mode, Local Mode
Uses the Grunt shell. Can be invoked with:
    $ pig -x local ...
    - Connecting to ...
    grunt>
    grunt> A = load '/etc/passwd' using PigStorage(':');
    grunt> B = foreach A generate $0 as id;
    grunt> dump B;

Interactive Mode, Mapreduce Mode
Uses the Grunt shell. Can be invoked with:
    $ pig -x mapreduce ...
    - Connecting to ...
    grunt>
or
    $ pig ...
    - Connecting to ...
    grunt>
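The grunt session above splits each line of /etc/passwd on ':' (that is what PigStorage(':') does) and projects the first field, $0, as the user id. A minimal Python sketch of the same dataflow, using made-up passwd-style sample lines:

```python
# Conceptual sketch of the Pig pipeline:
#   A = load '/etc/passwd' using PigStorage(':');
#   B = foreach A generate $0 as id;
# PigStorage(':') splits each line on ':' into fields; $0 projects field 0.

def load_with_delimiter(lines, delim=":"):
    """Mimic PigStorage(delim): one tuple of fields per input line."""
    return [tuple(line.rstrip("\n").split(delim)) for line in lines]

def project_field_0(relation):
    """Mimic 'foreach A generate $0 as id': keep only field 0 of each tuple."""
    return [t[0] for t in relation]

# Hypothetical passwd-style sample lines (not real accounts):
sample = [
    "root:x:0:0:root:/root:/bin/bash",
    "alice:x:1001:1001::/home/alice:/bin/sh",
]

A = load_with_delimiter(sample)
B = project_field_0(A)
print(B)  # ['root', 'alice'] -- the ids that 'dump B' would print, one per tuple
```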
- Connecting to ...grunt> 6Batch Mode/* id.pig */ A = load '/etc/passwd ' using PigStorage(':'); -- load the passwd file B = foreach A generate $0 as id; -- extract the user IDsstore B into ‘id.out’; -- write the results to a file name id.out7HDFS path in case of map-reduce modeBatch ModeLocal Mode $ pig -x local id.pig Mapreduce Mode $ pig id.pig or$ pig -x mapreduce id.pig 8Pig ScriptsTo place Pig Latin statements and Pig commands in a single file.Using the *.pig extension is good (please do it for HW too).9Pig Latin StatementsOutputRelation = InputRelationA relation is a bag.A bag is a collection of tuples. A tuple is an ordered set of fields.A field is a piece of data.10DebuggingUse the DUMP operator to display results to your terminal screen. Use the DESCRIBE operator to review the schema of a relation.Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to compute a relation.Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.11Debug: dumpA = LOAD 'student' AS (name:chararray, age:int, gpa:float);DUMP A;(John,18,4.0F)(Mary,19,3.7F)(Bill,20,3.9F)(Joe,22,3.8F)(Jill,20,4.0F)12Debug: describegrunt> A = load '/home/kma041000/pig/input' as (line:chararray);grunt> describe A; A: {line: chararray}grunt>13Debug: explaingrunt> explain A; #-----------------------------------------------# New Logical Plan:#-----------------------------------------------A: (Name: LOStore Schema: line#3:chararray)||---A: (Name: LOForEach Schema: line#3:chararray) | | | (Name: LOGenerate[false] Schema: line#3:chararray)ColumnPrune:InputUids=[3]ColumnPrune:OutputUids=[3] | | | | | (Name: Cast Type: chararray Uid: 3) | | | | | |---line:(Name: Project Type: bytearray Uid: 3 Input: 0 Column: (*)) | | | |---(Name: LOInnerLoad[0] Schema: line#3:bytearray) | |---A: (Name: LOLoad Schema: line#3:bytearray)RequiredFields:null#-----------------------------------------------# Physical 
Plan:#-----------------------------------------------A: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-5||---A: New For Each(false)[bag] - scope-4 | | | Cast[chararray] - scope-2 | | | |---Project[bytearray][0] - scope-1 | |---A: Load(/home/kma041000/pig/input:org.apache.pig.builtin.PigStorage) - scope-02013-03-19 18:19:40,210 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false2013-03-19 18:19:40,255 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 12013-03-19 18:19:40,255 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1#--------------------------------------------------# Map Reduce Plan#--------------------------------------------------MapReduce node scope-6Map PlanA: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-5||---A: New For Each(false)[bag] - scope-4 | | | Cast[chararray] - scope-2 | | | |---Project[bytearray][0] - scope-1 | |---A: Load(/home/kma041000/pig/input:org.apache.pig.builtin.PigStorage) - scope-0--------Global sort: false----------------grunt>14Debug: illustrategrunt> illustrate A;--------------------------------------| A | line:chararray |--------------------------------------| | word count in pig tutorial |--------------------------------------15Pig WordCount – Batch modeScript file name: wordcount.pigContains:------------A = load '/home/kma041000/pig/input';B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;C = group B by word;D = foreach C generate COUNT(B), group;/* rm '/home/kma041000/pig/output'; */store D into '/home/kma041000/pig/output';--------------16Pig WordCount – Batch modeTo run wordcount.pigBatch mode:{cs6360:~/BigData/Pig} pig -x mapreduce wordcount.pig.. 
Hadoop map/reduce is running….{cs6360:~/BigData/Pig} hadoop fs -cat /home/kma041000/pig/output/part-r-000004 in2 for4 pig4 20122 word2 count4 school4 summer2 indiana4 tutorial17Pig WordCount – Interactive mode{cs6360:~/BigData/Pig} piggrunt> A = load '/home/kma041000/pig/input';grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;grunt> C = group B by word;grunt> D = foreach C generate COUNT(B), group;grunt> rm '/home/kma041000/pig/output';grunt> dump D; /* to see output in terminal */grunt> store D into '/home/kma041000/pig/output'; /* to part-r-00000 file */ 18Pig WordCount – Interactive modegrunt> cat
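The wordcount dataflow used in both batch and interactive mode above (TOKENIZE + FLATTEN, GROUP BY, COUNT) can be sketched in plain Python to show what each relation A through D holds. The two input lines here are made up for illustration:

```python
from collections import defaultdict

# A = load '...';  -- one input line per tuple (sample lines are hypothetical)
A = ["word count in pig", "pig tutorial"]

# B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
# TOKENIZE splits a line into a bag of words; FLATTEN turns the bag into rows.
B = [word for line in A for word in line.split()]

# C = group B by word;  -- each group pairs a key with the bag of matching tuples
C = defaultdict(list)
for word in B:
    C[word].append(word)

# D = foreach C generate COUNT(B), group;  -- (count, word) pairs,
# the same shape as the part-r-00000 output shown above
D = [(len(bag), word) for word, bag in C.items()]

print(sorted(D))  # 'pig' appears twice; every other word appears once
```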

