Introduction to Database SystemsCSE 444Lecture 22-23: Pig Latin1CSE 444 - Summer 2010Outline• Based entirely on Pig Latin: A not-so-foreign language for data processing, by Olston, Reed, SrivastavaKumar and Tomkins 2008Srivastava, Kumar, and Tomkins, 20082CSE 444 - Summer 2010Why Pig Latin?• Map-reduce is a low-level programming environment• In most applications need more complex queriesPi thih l l i itt iPiLti•Pig accepts higher level queries written in Pig Latin, translates them into ensembles of MapReduce jobs– Pig is the system– Pig Latin is the language3CSE 444 - Summer 2010Pig Engine Overview•Data model = loosely typednested relations•Data model = loosely typed nested relations• Query model = a sql-like, dataflow language• Execution model:Option 1: run locally on your machine–Option 1: run locally on your machine– Option 2: compile into sequence of map/reduce, run on a cluster supporting Hadoop(e.g., AWS)pp gp(g, )• Main idea: use Opt1 to debug, Opt2 to execute4CSE 444 - Summer 2010Pig Engine OverviewPi L tiPig Latin program5Example• Input: a table of urls: (url, category, pagerank)• Compute the average pagerank of all sufficiently high pageranks, for each category• Return the answers only for categories with sufficiently many such pages6CSE 444 - Summer 2010First in SQL…SELECT category, AVG(pagerank)gy,(p g)FROM urlsWHEREpagerank>0.2WHERE pagerank 0.2GROUP By categoryHAVING COUNT(*)>106HAVING COUNT( ) > 107CSE 444 - Summer 2010…then in Pig-Latingood_urls = FILTER urls BY pagerank > 0.2groups = GROUP good_urls BY categorybig_groups = FILTER groups BY COUNT(good_urls) > 106output = FOREACH big_groups GENERATEcategory, AVG(good_urls.pagerank)Pig Latin combines •high-level declarative querying in the spirit of SQL,and8highlevel declarative querying in the spirit of SQL, and• low-level, procedural programming a la map-reduce.Types in Pig-Latin• Atomic: string or number, e.g. ‘Alice’ or 55• Tuple: (‘Alice’, 55, ‘salesperson’)• Bag: {(‘Alice’, 55, ‘salesperson’),(‘Betty’,44, ‘manager’), …}(y,, g),}•Maps: we will try not to use theseaps e y o o use ese9CSE 444 - Summer 2010Types in Pig-LatinBags can be nested !• {(‘a’, {1,4,3}), (‘c’,{ }), (‘d’, {2,2,5,3,2})}Tuple components can be referenced by number•$0 $1 $2$0, $1, $2, …10CSE 444 - Summer 201011Loading data• Input data = FILES !– Heard that before ?• The LOAD command parses an input file into a bag of records• Both parser (=“deserializer”) and output type are provided by user12CSE 444 - Summer 2010Loading dataqueries = LOAD ‘query_log.txt’USING userfuction( )AS (userID, queryString, timeStamp)(qy gp)13CSE 444 - Summer 2010Loading data• USING userfuction( ) -- is optional– Default deserializer expects tab-delimited file• AS type – is optional– Default is a record with unnamed fields; refer to them $$as $0, $1, …• The return value of LOAD is just a handle to a bagTh t l di i d i ll d ll li d–The actual reading is done in pull mode, or parallelized14CSE 444 - Summer 2010FOREACHexpanded_queries = FOREACH queriesGENERATE userId, expandQuery(queryString)expandQuery( ) is a UDF* that produces likely expansionsNote: it returns a bag henceexpanded queriesis a nested bagNote: it returns a bag, hence expanded_queriesis a nested bag*UDF = User Defined Function15CSE 444 - Summer 2010FOREACHexpanded_queries = FOREACH queriesGENERATE userId, flatten(expandQuery(queryString))Now we get a flat collection16Now we get a flat collectionCSE 444 - Summer 201017CSE 444 - Summer 2010FLATTENNote that it is NOT a first class function !(that’s one thing I* don’t like about Pig-latin)• First class FLATTEN:– FLATTEN({{2,3},{5},{},{4,5,6}}) = {2,3,5,4,5,6}{{ }}{}–Type: {{T}} {T}• Pig-latin FLATTENFLATTEN({4 5 6}) 4 5 6–FLATTEN({4,5,6}) = 4, 5, 6– Type: {T} T, T, T, …, T ?????* “I” = original author of these slides. Opinions might or might not be consistent from quarter to quarter. ☺18CSE 444 - Summer 2010FILTERliFILTER i BYId‘bt’Remove all queries from Web bots:real_queries = FILTER queries BY userIdneq‘bot’Better: use a complex UDF to detect Web bots:real_queries = FILTER queries BY NOTisBot(userId)19BY NOT isBot(userId)CSE 444 - Summer 2010JOINres lts {(q er String rl position)}results: {(queryString, url, position)}revenue: {(queryString, adSlot, amount)}join_result = JOIN results BY queryStringrevenue BYqueryStringrevenue BY queryStringjoin_result : {(queryString, url, position, adSlot, amount)}20CSE 444 - Summer 201021CSE 444 - Summer 2010GROUP BYre en e {(q er String adSlot amo nt)}grouped_revenue = GROUP revenue BY queryStringrevenue: {(queryString, adSlot, amount)}query_revenues =FOREACH grouped_revenueGENERATE queryString,SUM(revenue.amount) AS totalRevenuegrouped revenue: {(queryString {(adSlot amount)})}22grouped_revenue: {(queryString, {(adSlot, amount)})}query_revenues: {(queryString, totalRevenue)}Simple Map-Reduceinp t {(field1 field2 field3 )}map_result = FOREACH input input : {(field1, field2, field3, . . . .)}GENERATE FLATTEN(map(*))key_groups = GROUP map_result BY $0output = FOREACH key_groupsGENERATE reduce($1)mapresult : {(a1, a2, a3, . . .)}23p_{( , , , )}key_groups : {(a1, {(a2, a3, . . .)})}Where we are…• Previously…– LOAD – read data– FOREACH – with and without flatten–FILTER–JOIN– GROUP BYNow•Now…– COGROUP: A generic way to group tuples from two datasets togethertwo datasets together24CSE 444 - Summer 2010Co-GroupDataset 1 results:{(queryStringurlposition)}grouped data=Dataset 1 results: {(queryString, url, position)}Dataset 2 revenue: {(queryString, adSlot, amount)}grouped_data= COGROUP results BY queryString,revenue BY queryString;qy g;grouped_data: {(queryString, results:{(url, position)}, revenue:{(adSlot, amount)})}revenue:{(adSlot, amount)})}What is the output type in general ?{group idbag dataset 1 bag dataset 2}25CSE 444 - Summer 2010{group_id, bag dataset 1, bag dataset 2}Co-GroupIs this an inner join or an outer join ?26CSE 444 - Summer 2010Co-Groupgrouped_data: {(queryString, results:{(url, position)}, revenue:{(adSlot, amount)})}url_revenues = FOREACH grouped_dataGENERATEFLATTEN(distributeRevenue(results, revenue));hdi t ib t Ri UDF th t t h…where distributeRevenueis a UDF that accepts search results and revenue information for a query string at a time, and outputs a bag of urls and the revenue
View Full Document