UW CSE 444 - Study Notes - D3018893

Home> Schools> University of Washington> Computer Science and Engineering (CSE) > CSE 444> Study Notes

DOC PREVIEW

UW CSE 444 - Study Notes

School name University of Washington

Course Cse 444- Database Internals

Pages 33

This preview shows page 1-2-15-16-17-32-33 out of 33 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Introduction to Database SystemsCSE 444Lecture 22-23: Pig Latin1CSE 444 - Summer 2010Outline• Based entirely on Pig Latin: A not-so-foreign language for data processing, by Olston, Reed, SrivastavaKumar and Tomkins 2008Srivastava, Kumar, and Tomkins, 20082CSE 444 - Summer 2010Why Pig Latin?• Map-reduce is a low-level programming environment• In most applications need more complex queriesPi thih l l i itt iPiLti•Pig accepts higher level queries written in Pig Latin, translates them into ensembles of MapReduce jobs– Pig is the system– Pig Latin is the language3CSE 444 - Summer 2010Pig Engine Overview•Data model = loosely typednested relations•Data model = loosely typed nested relations• Query model = a sql-like, dataflow language• Execution model:Option 1: run locally on your machine–Option 1: run locally on your machine– Option 2: compile into sequence of map/reduce, run on a cluster supporting Hadoop(e.g., AWS)pp gp(g, )• Main idea: use Opt1 to debug, Opt2 to execute4CSE 444 - Summer 2010Pig Engine OverviewPi L tiPig Latin program5Example• Input: a table of urls: (url, category, pagerank)• Compute the average pagerank of all sufficiently high pageranks, for each category• Return the answers only for categories with sufficiently many such pages6CSE 444 - Summer 2010First in SQL…SELECT category, AVG(pagerank)gy,(p g)FROM urlsWHEREpagerank>0.2WHERE pagerank 0.2GROUP By categoryHAVING COUNT(*)>106HAVING COUNT( ) > 107CSE 444 - Summer 2010…then in Pig-Latingood_urls = FILTER urls BY pagerank > 0.2groups = GROUP good_urls BY categorybig_groups = FILTER groups BY COUNT(good_urls) > 106output = FOREACH big_groups GENERATEcategory, AVG(good_urls.pagerank)Pig Latin combines •high-level declarative querying in the spirit of SQL,and8highlevel declarative querying in the spirit of SQL, and• low-level, procedural programming a la map-reduce.Types in Pig-Latin• Atomic: string or number, e.g. ‘Alice’ or 55• Tuple: (‘Alice’, 55, ‘salesperson’)• Bag: {(‘Alice’, 55, ‘salesperson’),(‘Betty’,44, ‘manager’), …}(y,, g),}•Maps: we will try not to use theseaps e y o o use ese9CSE 444 - Summer 2010Types in Pig-LatinBags can be nested !• {(‘a’, {1,4,3}), (‘c’,{ }), (‘d’, {2,2,5,3,2})}Tuple components can be referenced by number•$0 $1 $2$0, $1, $2, …10CSE 444 - Summer 201011Loading data• Input data = FILES !– Heard that before ?• The LOAD command parses an input file into a bag of records• Both parser (=“deserializer”) and output type are provided by user12CSE 444 - Summer 2010Loading dataqueries = LOAD ‘query_log.txt’USING userfuction( )AS (userID, queryString, timeStamp)(qy gp)13CSE 444 - Summer 2010Loading data• USING userfuction( ) -- is optional– Default deserializer expects tab-delimited file• AS type – is optional– Default is a record with unnamed fields; refer to them $$as $0, $1, …• The return value of LOAD is just a handle to a bagTh t l di i d i ll d ll li d–The actual reading is done in pull mode, or parallelized14CSE 444 - Summer 2010FOREACHexpanded_queries = FOREACH queriesGENERATE userId, expandQuery(queryString)expandQuery( ) is a UDF* that produces likely expansionsNote: it returns a bag henceexpanded queriesis a nested bagNote: it returns a bag, hence expanded_queriesis a nested bag*UDF = User Defined Function15CSE 444 - Summer 2010FOREACHexpanded_queries = FOREACH queriesGENERATE userId, flatten(expandQuery(queryString))Now we get a flat collection16Now we get a flat collectionCSE 444 - Summer 201017CSE 444 - Summer 2010FLATTENNote that it is NOT a first class function !(that’s one thing I* don’t like about Pig-latin)• First class FLATTEN:– FLATTEN({{2,3},{5},{},{4,5,6}}) = {2,3,5,4,5,6}{{ }}{}–Type: {{T}} {T}• Pig-latin FLATTENFLATTEN({4 5 6}) 4 5 6–FLATTEN({4,5,6}) = 4, 5, 6– Type: {T}  T, T, T, …, T ?????* “I” = original author of these slides. Opinions might or might not be consistent from quarter to quarter. ☺18CSE 444 - Summer 2010FILTERliFILTER i BYId‘bt’Remove all queries from Web bots:real_queries = FILTER queries BY userIdneq‘bot’Better: use a complex UDF to detect Web bots:real_queries = FILTER queries BY NOTisBot(userId)19BY NOT isBot(userId)CSE 444 - Summer 2010JOINres lts {(q er String rl position)}results: {(queryString, url, position)}revenue: {(queryString, adSlot, amount)}join_result = JOIN results BY queryStringrevenue BYqueryStringrevenue BY queryStringjoin_result : {(queryString, url, position, adSlot, amount)}20CSE 444 - Summer 201021CSE 444 - Summer 2010GROUP BYre en e {(q er String adSlot amo nt)}grouped_revenue = GROUP revenue BY queryStringrevenue: {(queryString, adSlot, amount)}query_revenues =FOREACH grouped_revenueGENERATE queryString,SUM(revenue.amount) AS totalRevenuegrouped revenue: {(queryString {(adSlot amount)})}22grouped_revenue: {(queryString, {(adSlot, amount)})}query_revenues: {(queryString, totalRevenue)}Simple Map-Reduceinp t {(field1 field2 field3 )}map_result = FOREACH input input : {(field1, field2, field3, . . . .)}GENERATE FLATTEN(map(*))key_groups = GROUP map_result BY $0output = FOREACH key_groupsGENERATE reduce($1)mapresult : {(a1, a2, a3, . . .)}23p_{( , , , )}key_groups : {(a1, {(a2, a3, . . .)})}Where we are…• Previously…– LOAD – read data– FOREACH – with and without flatten–FILTER–JOIN– GROUP BYNow•Now…– COGROUP: A generic way to group tuples from two datasets togethertwo datasets together24CSE 444 - Summer 2010Co-GroupDataset 1 results:{(queryStringurlposition)}grouped data=Dataset 1 results: {(queryString, url, position)}Dataset 2 revenue: {(queryString, adSlot, amount)}grouped_data= COGROUP results BY queryString,revenue BY queryString;qy g;grouped_data: {(queryString, results:{(url, position)}, revenue:{(adSlot, amount)})}revenue:{(adSlot, amount)})}What is the output type in general ?{group idbag dataset 1 bag dataset 2}25CSE 444 - Summer 2010{group_id, bag dataset 1, bag dataset 2}Co-GroupIs this an inner join or an outer join ?26CSE 444 - Summer 2010Co-Groupgrouped_data: {(queryString, results:{(url, position)}, revenue:{(adSlot, amount)})}url_revenues = FOREACH grouped_dataGENERATEFLATTEN(distributeRevenue(results, revenue));hdi t ib t Ri UDF th t t h…where distributeRevenueis a UDF that accepts search results and revenue information for a query string at a time, and outputs a bag of urls and the revenue

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-15-16-17-32-33 out of 33 pages.

UW CSE 444 - Study Notes

Sign up for free to view:

Please select your school