Hyun Duk Kim and Chia-Chi LinJanuary 16, 20101 Introduction Pig Latin DryadLINQ Comparison between Pig Latin and DryadLINQ Wave computing Related work Discussion2 Huge Amount of data analysisEspecially web service companies Need of parallel/distributed system Parallel DB Expensive at web scale, Limited SQL3 Map/Reduce◦ More procedural programming model. Popular cloud computing environment Emergence of parallel computing tools◦ Ease of programming User can just submit tasks in the specific form,then tools execute them in distributed manner. Ex. Hadoop, Dryad, …4- Too low-level, Rigid- Hard to maintain, Hard to reuse code- Re-implement common queries- Poor debugging environment- Redundant computing- Load imbalance- Success rate vs. Window sizePower of programming Optimization across jobsPig Latin, DryadLINQ Wave Computing5Christopher Olston, Benjamin Reed, Utkarsh Srivastava,Ravi Kumar, and Andrew Tomkins6DeclarativeSQL Low-level, ProceduralMap/reducePig Latin7 Find the users who tend to visit high-pagerank pagesSQLSELECT user FROM visits, user WHERE avgpr > 0.6IN ( SELECT user, AVG(pagerank) … one nested SQL queryPig LatinV_p = JOIN visits BY url, pages BY url;Users = GROUP v_p BY user;Useravg = FOREACH users GENERATE group, AVG(v_p.pagerank) AS avgpr;Answer = FILTER useravg BY avgpr > „0.5‟; … sequence of commandsJava Map/Reducepublic static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {… more than 100 lines8 Execution engine on atop Hadoop Open source project Mainly developing/using in YahooHadoopPigM/R codePig Latin code9 Find the users who tend to visit high-pagerankpagesURL Category PageRankcnn.com News 0.9bbc.com News 0.8flickr.com Photos 0.7espn.com Sports 0.9Visits URL InfoUserURLTimeAmycnn.com8:00Amybbc.com10:00Amyflickr.com10:05Fredcnn.com12:0010visits = LOAD „visits.txt‟ AS (user, url, time);pages = LOAD „pages.txt‟ AS (url, pagerank);v_p = JOIN visits BY url, pages BY url;users = GROUP v_p BY user;useravg = FOREACH users GENERATE group, AVG(v_p.pagerank) AS avgpr;answer = FILTER useravg BY avgpr > „0.5‟; 11visits = LOAD „visits.txt‟ AS (user, url, time);pages = LOAD „pages.txt‟ AS (url, pagerank);V_p = JOIN visits BY url, pages BY url;Users = GROUP v_p BY user;Useravg = FOREACH users GENERATE group, AVG(v_p.pagerank) AS avgpr;Answer = FILTER useravg BY avgpr > „0.5‟; visits: (Amy, cnn.com, 8am)(Amy, frogs.com, 9am)(Fred, snails.com, 11am)pages: (cnn.com, 0.8)(frogs.com, 0.8)(snails.com, 0.3)12visits = LOAD „visits.txt‟ AS (user, url, time);pages = LOAD „pages.txt‟ AS (url, pagerank);v_p = JOIN visits BY url, pages BY url;Users = GROUP v_p BY user;Useravg = FOREACH users GENERATE group, AVG(v_p.pagerank) AS avgpr;Answer = FILTER useravg BY avgpr > „0.5‟; visits: (Amy, cnn.com, 8am)(Amy, frogs.com, 9am)(Fred, snails.com, 11am)pages: (cnn.com, 0.8)(frogs.com, 0.8)(snails.com, 0.3)v_p: (Amy, cnn.com, 8am, cnn.com, 0.8)(Amy, frogs.com, 9am, frogs.com, 0.8)(Fred, snails.com, 11am, snails.com, 0.3)13visits = LOAD „visits.txt‟ AS (user, url, time);pages = LOAD „pages.txt‟ AS (url, pagerank);v_p = JOIN visits BY url, pages BY url;users = GROUP v_p BY user;Useravg = FOREACH users GENERATE group, AVG(v_p.pagerank) AS avgpr;Answer = FILTER useravg BY avgpr > „0.5‟; v_p: (Amy, cnn.com, 8am, cnn.com, 0.8)(Amy, frogs.com, 9am, frogs.com, 0.8)(Fred, snails.com, 11am, snails.com, 0.3)users: (Amy, { (Amy, cnn.com, 8am, cnn.com, 0.8)(Amy, frogs.com, 9am, frogs.com, 0.8) } )(Fred, { (Fred, snails.com, 11am, snails.com, 0.3) } )14visits = LOAD „visits.txt‟ AS (user, url, time);pages = LOAD „pages.txt‟ AS (url, pagerank);v_p = JOIN visits BY url, pages BY url;users = GROUP v_p BY user;Useravg = FOREACH users GENERATE group, AVG(v_p.pagerank) AS avgpr;Answer = FILTER useravg BY avgpr > „0.5‟; v_p: (Amy, cnn.com, 8am, cnn.com, 0.8)(Amy, frogs.com, 9am, frogs.com, 0.8)(Fred, snails.com, 11am, snails.com, 0.3)users: (Amy, { (Amy, cnn.com, 8am, cnn.com, 0.8)(Amy, frogs.com, 9am, frogs.com, 0.8) } )(Fred, { (Fred, snails.com, 11am, snails.com, 0.3) } )Nested data model15visits = LOAD „visits.txt‟ AS (user, url, time);pages = LOAD „pages.txt‟ AS (url, pagerank);v_p = JOIN visits BY url, pages BY url;users = GROUP v_p BY user;useravg = FOREACH users GENERATE group, AVG(v_p.pagerank) AS avgpr;Answer = FILTER useravg BY avgpr > „0.5‟; users: (Amy, { (Amy, cnn.com, 8am, cnn.com, 0.8)(Amy, frogs.com, 9am, frogs.com, 0.8) } )(Fred, { (Fred, snails.com, 11am, snails.com, 0.3) } )useravg: (Amy, 0.8)(Fred, 0.3)16visits = LOAD „visits.txt‟ AS (user, url, time);pages = LOAD „pages.txt‟ AS (url, pagerank);v_p = JOIN visits BY url, pages BY url;users = GROUP v_p BY user;useravg = FOREACH users GENERATE group, AVG(v_p.pagerank) AS avgpr;Answer = FILTER useravg BY avgpr > „0.5‟; users: (Amy, { (Amy, cnn.com, 8am, cnn.com, 0.8)(Amy, frogs.com, 9am, frogs.com, 0.8) } )(Fred, { (Fred, snails.com, 11am, snails.com, 0.3) } )useravg: (Amy, 0.8)(Fred, 0.3)Can use any UDFs17visits = LOAD „visits.txt‟ AS (user, url, time);pages = LOAD „pages.txt‟ AS (url, pagerank);v_p = JOIN visits BY url, pages BY url;users = GROUP v_p BY user;useravg = FOREACH users GENERATE group, AVG(v_p.pagerank) AS avgpr;answer = FILTER useravg BY avgpr > „0.5‟; useravg: (Amy, 0.8)(Fred, 0.3)answer: (Amy, 0.8)18Load visitsLoad pagesJoin by urlGroup by userForeach categorygenerate avg…19Load visitsLoad pagesJoin by urlGroup by userForeach categorygenerate avgMap1Reduce1Reduce2Map2Every group or join operation forms a map-reduce boundaryOther operations pipelined into map and reduce phases…20 Atom„alice‟ Tuple(„alice‟ , „lakers‟) Bag(„alice‟, „lakers‟)(„alice‟, („iPod‟, „apple‟)) Map[ „age‟ 20 ] Nested Data Model(Amy, { (Amy, cnn.com, 8am, cnn.com, 0.8) (Amy, frogs.com, 9am, frogs.com, 0.8) } )21 Specifying Input Data: LOAD Per-tuple Processing: FOREACH Discarding Unwanted Data: FILTER Getting Related Data Together: COGROUP Other Commends◦ UNION, CROSS, ORDER, DISTINCT Asking for Output: STOREVery Similar to SQL commands22 Pig Pen23 “Safe” optimizer Performs only
View Full Document