DOC PREVIEW
UW CSE 444 - Study Notes

This preview shows page 1-2-15-16-17-32-33 out of 33 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Introduction to Database SystemsCSE 444Lecture 22-23: Pig Latin1CSE 444 - Summer 2010Outline• Based entirely on Pig Latin: A not-so-foreign language for data processing, by Olston, Reed, SrivastavaKumar and Tomkins 2008Srivastava, Kumar, and Tomkins, 20082CSE 444 - Summer 2010Why Pig Latin?• Map-reduce is a low-level programming environment• In most applications need more complex queriesPi thih l l i itt iPiLti•Pig accepts higher level queries written in Pig Latin, translates them into ensembles of MapReduce jobs– Pig is the system– Pig Latin is the language3CSE 444 - Summer 2010Pig Engine Overview•Data model = loosely typednested relations•Data model = loosely typed nested relations• Query model = a sql-like, dataflow language• Execution model:Option 1: run locally on your machine–Option 1: run locally on your machine– Option 2: compile into sequence of map/reduce, run on a cluster supporting Hadoop(e.g., AWS)pp gp(g, )• Main idea: use Opt1 to debug, Opt2 to execute4CSE 444 - Summer 2010Pig Engine OverviewPi L tiPig Latin program5Example• Input: a table of urls: (url, category, pagerank)• Compute the average pagerank of all sufficiently high pageranks, for each category• Return the answers only for categories with sufficiently many such pages6CSE 444 - Summer 2010First in SQL…SELECT category, AVG(pagerank)gy,(p g)FROM urlsWHEREpagerank>0.2WHERE pagerank 0.2GROUP By categoryHAVING COUNT(*)>106HAVING COUNT( ) > 107CSE 444 - Summer 2010…then in Pig-Latingood_urls = FILTER urls BY pagerank > 0.2groups = GROUP good_urls BY categorybig_groups = FILTER groups BY COUNT(good_urls) > 106output = FOREACH big_groups GENERATEcategory, AVG(good_urls.pagerank)Pig Latin combines •high-level declarative querying in the spirit of SQL,and8highlevel declarative querying in the spirit of SQL, and• low-level, procedural programming a la map-reduce.Types in Pig-Latin• Atomic: string or number, e.g. ‘Alice’ or 55• Tuple: (‘Alice’, 55, ‘salesperson’)• Bag: {(‘Alice’, 55, ‘salesperson’),(‘Betty’,44, ‘manager’), …}(y,, g),}•Maps: we will try not to use theseaps e y o o use ese9CSE 444 - Summer 2010Types in Pig-LatinBags can be nested !• {(‘a’, {1,4,3}), (‘c’,{ }), (‘d’, {2,2,5,3,2})}Tuple components can be referenced by number•$0 $1 $2$0, $1, $2, …10CSE 444 - Summer 201011Loading data• Input data = FILES !– Heard that before ?• The LOAD command parses an input file into a bag of records• Both parser (=“deserializer”) and output type are provided by user12CSE 444 - Summer 2010Loading dataqueries = LOAD ‘query_log.txt’USING userfuction( )AS (userID, queryString, timeStamp)(qy gp)13CSE 444 - Summer 2010Loading data• USING userfuction( ) -- is optional– Default deserializer expects tab-delimited file• AS type – is optional– Default is a record with unnamed fields; refer to them $$as $0, $1, …• The return value of LOAD is just a handle to a bagTh t l di i d i ll d ll li d–The actual reading is done in pull mode, or parallelized14CSE 444 - Summer 2010FOREACHexpanded_queries = FOREACH queriesGENERATE userId, expandQuery(queryString)expandQuery( ) is a UDF* that produces likely expansionsNote: it returns a bag henceexpanded queriesis a nested bagNote: it returns a bag, hence expanded_queriesis a nested bag*UDF = User Defined Function15CSE 444 - Summer 2010FOREACHexpanded_queries = FOREACH queriesGENERATE userId, flatten(expandQuery(queryString))Now we get a flat collection16Now we get a flat collectionCSE 444 - Summer 201017CSE 444 - Summer 2010FLATTENNote that it is NOT a first class function !(that’s one thing I* don’t like about Pig-latin)• First class FLATTEN:– FLATTEN({{2,3},{5},{},{4,5,6}}) = {2,3,5,4,5,6}{{ }}{}–Type: {{T}} {T}• Pig-latin FLATTENFLATTEN({4 5 6}) 4 5 6–FLATTEN({4,5,6}) = 4, 5, 6– Type: {T}  T, T, T, …, T ?????* “I” = original author of these slides. Opinions might or might not be consistent from quarter to quarter. ☺18CSE 444 - Summer 2010FILTERliFILTER i BYId‘bt’Remove all queries from Web bots:real_queries = FILTER queries BY userIdneq‘bot’Better: use a complex UDF to detect Web bots:real_queries = FILTER queries BY NOTisBot(userId)19BY NOT isBot(userId)CSE 444 - Summer 2010JOINres lts {(q er String rl position)}results: {(queryString, url, position)}revenue: {(queryString, adSlot, amount)}join_result = JOIN results BY queryStringrevenue BYqueryStringrevenue BY queryStringjoin_result : {(queryString, url, position, adSlot, amount)}20CSE 444 - Summer 201021CSE 444 - Summer 2010GROUP BYre en e {(q er String adSlot amo nt)}grouped_revenue = GROUP revenue BY queryStringrevenue: {(queryString, adSlot, amount)}query_revenues =FOREACH grouped_revenueGENERATE queryString,SUM(revenue.amount) AS totalRevenuegrouped revenue: {(queryString {(adSlot amount)})}22grouped_revenue: {(queryString, {(adSlot, amount)})}query_revenues: {(queryString, totalRevenue)}Simple Map-Reduceinp t {(field1 field2 field3 )}map_result = FOREACH input input : {(field1, field2, field3, . . . .)}GENERATE FLATTEN(map(*))key_groups = GROUP map_result BY $0output = FOREACH key_groupsGENERATE reduce($1)mapresult : {(a1, a2, a3, . . .)}23p_{( , , , )}key_groups : {(a1, {(a2, a3, . . .)})}Where we are…• Previously…– LOAD – read data– FOREACH – with and without flatten–FILTER–JOIN– GROUP BYNow•Now…– COGROUP: A generic way to group tuples from two datasets togethertwo datasets together24CSE 444 - Summer 2010Co-GroupDataset 1 results:{(queryStringurlposition)}grouped data=Dataset 1 results: {(queryString, url, position)}Dataset 2 revenue: {(queryString, adSlot, amount)}grouped_data= COGROUP results BY queryString,revenue BY queryString;qy g;grouped_data: {(queryString, results:{(url, position)}, revenue:{(adSlot, amount)})}revenue:{(adSlot, amount)})}What is the output type in general ?{group idbag dataset 1 bag dataset 2}25CSE 444 - Summer 2010{group_id, bag dataset 1, bag dataset 2}Co-GroupIs this an inner join or an outer join ?26CSE 444 - Summer 2010Co-Groupgrouped_data: {(queryString, results:{(url, position)}, revenue:{(adSlot, amount)})}url_revenues = FOREACH grouped_dataGENERATEFLATTEN(distributeRevenue(results, revenue));hdi t ib t Ri UDF th t t h…where distributeRevenueis a UDF that accepts search results and revenue information for a query string at a time, and outputs a bag of urls and the revenue


View Full Document

UW CSE 444 - Study Notes

Documents in this Course
XML

XML

48 pages

SQL

SQL

25 pages

SQL

SQL

42 pages

Recovery

Recovery

30 pages

SQL

SQL

36 pages

Indexes

Indexes

35 pages

Security

Security

36 pages

Wrap-up

Wrap-up

6 pages

SQL

SQL

37 pages

More SQL

More SQL

48 pages

SQL

SQL

35 pages

XML

XML

46 pages

Triggers

Triggers

26 pages

Load more
Download Study Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Study Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Study Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?