Big Data and Hadoop Developer
Lesson 7: Pig
Copyright 2014 Simplilearn. All rights reserved.

Objectives

By the end of this lesson, you will be able to:
- Explain the concepts of Pig
- Demonstrate the installation of a Pig engine
- Explain the prerequisites for preparing the environment for Pig Latin

Challenges of MapReduce Development Using Java

Following are some of the common challenges faced while performing MapReduce programming using Java:
- The developer needs to think in terms of map, split, and reduce fundamentals, which may increase production time.
- The developer has to write custom code in Java, which also adds to production time.
- The data flow is extremely rigid.
- The code is difficult to maintain, optimize, and extend.

Introduction to Pig

Pig is one of the components of the Hadoop ecosystem.
- Pig is a high-level data flow scripting language.
- Pig runs on Hadoop clusters.
- Pig is an Apache open source project.
- Pig uses HDFS for storing and retrieving data and Hadoop MapReduce for processing Big Data.

Components of Pig

Following are the major components of Pig:

Pig Latin script language
- Procedural data flow language
- Contains syntax and commands that can be applied to implement business logic
- Example: LOAD, STORE, and so on

Runtime engine
- Compiler that produces sequences of MapReduce programs
- Uses HDFS for storing and retrieving the data
- Used to interact with the Hadoop system
- Parses, validates, and compiles the script operations into a sequence of MapReduce jobs

How Pig Works

Pig's operation can be explained in 3 stages.

Stage 1: Load data and write the Pig script

    A = LOAD 'myfile' AS (x, y, z);
    B = FILTER A BY x > 0;
    C = GROUP B BY x;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO 'output';

Stage 2: Pig operations
- Parses the script
- Checks the script
- Optimizes the script
- Plans execution
- Submits the job to Hadoop
- Monitors job progress

Stage 3: Execution of the plan
- Map: Filter
- Reduce: Count

Data Model

As part of its data model, Pig supports four basic types:
- Atom: a simple atomic value. Example: 'Mike'
- Tuple: a sequence of fields that can be any of the data types. Example: ('Mike', 43)
- Bag: a collection of tuples of potentially varying structures; it can contain duplicates. Example: {('Mike'), ('Doug', (43, 45))}
- Map: an associative array; the key must be a chararray, but the value can be any type. Example: [name#Mike, phone#5551212]

Data Model (contd.)

By default, Pig treats undeclared fields as bytearrays. Pig can infer a field's type based on:
- the use of operators that expect a certain type of field;
- UDFs (User Defined Functions) with a known or explicitly set return type; and
- schema information provided by a LOAD function or explicitly declared using an AS clause.

Type conversion is lazy, which means the data type is enforced only at execution.

Nested Data Model

Pig Latin has a fully nestable data model with atomic values, tuples, bags, and maps.

Advantages:
- It is more natural to programmers than flat tuples.
- It avoids expensive joins.

Pig Execution Modes

Pig works in two execution modes:
- Local mode: Pig depends on the OS file system.
- MapReduce mode: Pig relies on HDFS.

Pig Interactive Modes

A Pig Latin program can be written in two modes:
- Interactive mode: code is written and executed line by line.
- Batch mode: a file containing Pig scripts is created and executed in a batch.

Salient Features

Some of the salient features of Pig are as follows:
- Step-by-step procedural control
- Operates directly over files
- Schemas are optional
- Supports UDFs
- Supports various data types

Pig vs. SQL

The differences between Pig and SQL are given below.

Definition
    Pig: Scripting language used to interact with HDFS
    SQL: Query language used to interact with databases
Query style
    Pig: Step by step
    SQL: Single block
Evaluation
    Pig: Lazy evaluation
    SQL: Immediate evaluation
Pipeline splits
    Pig: Pipeline splits are supported
    SQL: Requires the join to be run twice or materialized as an intermediate result

Pig vs. SQL: Example

Track customers in Texas who spend more than 2000 USD.

SQL:

    SELECT c.c_id, SUM(amount) AS CTotal
    FROM customers c
    JOIN sales s ON c.c_id = s.c_id
    WHERE c.city = 'Texas'
    GROUP BY c.c_id
    HAVING SUM(amount) > 2000
    ORDER BY CTotal DESC

Pig:

    customer = LOAD 'data/customer.dat' AS (c_id, name, city);
    sales = LOAD 'data/sales.dat' AS (s_id, c_id, date, amount);
    customerTX = FILTER customer BY city == 'Texas';
    joined = JOIN customerTX BY c_id, sales BY c_id;
    grouped = GROUP joined BY customerTX::c_id;
    summed = FOREACH grouped GENERATE group, SUM(joined.sales::amount);
    spenders = FILTER summed BY $1 > 2000;
    sorted = ORDER spenders BY $1 DESC;
    DUMP sorted;

Installing Pig Engine

To install the Pig engine, you need to get the correct mirror web link from the website http://pig.apache.org.

Steps to Install the Pig Engine

You need to perform the following steps to install the Pig engine:
1. Unzip the downloaded file.
2. Move the extracted folder to the following location: /usr/local/Pig
3. Export the JAVA_HOME path.
4. Export the PIG_PREFIX path.
5. Set the path for the bin folder of the Pig setup.

Run a Sample Program to Test Pig

You need to perform the following steps to run a sample program to test Pig:
- Run Pig in local mode using the pig command.
- Start writing the program once the testdata.txt data file is stored in HDFS.
- Check whether the file is accessible through the Grunt shell:

    grunt> ls datadir
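
The steps above stop short of the sample program itself. As a minimal sketch of what such a test program could look like, the script below follows the same LOAD, FILTER, GROUP, and COUNT flow described under How Pig Works. It rests on assumptions that do not come from the lesson: the file location datadir/testdata.txt, a tab-separated layout, and the field names name and age are all hypothetical and should be adjusted to the actual test data.

```pig
-- Assumption: testdata.txt sits under datadir and holds tab-separated
-- lines with two hypothetical fields, name and age.
records = LOAD 'datadir/testdata.txt' AS (name:chararray, age:int);

-- Keep only rows with a positive age (map-side filter).
valid = FILTER records BY age > 0;

-- Group the remaining rows by name (triggers the reduce phase).
by_name = GROUP valid BY name;

-- Emit each name with the number of rows that carry it.
counts = FOREACH by_name GENERATE group AS name, COUNT(valid) AS total;

-- Print the result to the console instead of storing it.
DUMP counts;
```

In interactive mode, these statements can be entered one at a time at the grunt> prompt; nothing runs until DUMP is reached, which reflects Pig's lazy evaluation. In batch mode, the same statements can be saved to a file and passed to the pig command, for example with pig -x local for local mode.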