Cloud Computing - I

Slide titles:
  Cloud Computing
  MapReduce: A group-by-aggregate
  Shortcomings
  Pig Latin: A Not-So-Foreign Language for Data Processing
    Pig Philosophy
    Features
    Pig Latin
    Example Data Analysis Task
    Data Flow
    In Pig Latin
    Quick Start and Interoperability
    Optional Schemas
    UDFs as First-class citizens
    Operators
    COGROUP Vs JOIN
    Compilation into MapReduce
    Debugging Environment
    Future Work
  DryadLINQ: A System for General Purpose Distributed Data-Parallel Computing Using a High-Level Language
    Dryad System Architecture
    LINQ
    DryadLINQ Constructs
    Dryad + LINQ = DryadLINQ
    DryadLINQ Execution Overview
    System Implementation
    Static Optimizations
    Dynamic Optimizations
    Evaluation
    Main Benefits
    Discussion
    Comparison
  Improving MapReduce Performance in Heterogeneous Environments
    Hadoop Speculative Execution Overview
    Hadoop's Assumptions
    Breaking Down the Assumptions
    LATE Scheduler
    Performance Comparison Without Stragglers
    Performance Comparison With Stragglers
    Sensitivity
    Takeaways
  Further questions

Presenters: Abhishek Verma, Nicolas Zea

Cloud Computing
  - Map Reduce
      - Clean abstraction
      - Extremely rigid 2 stage group-by aggregation
      - Code reuse and maintenance difficult
  - Google → MapReduce, Sawzall
  - Yahoo → Hadoop, Pig Latin
  - Microsoft → Dryad, DryadLINQ
  - Improving MapReduce in heterogeneous environment

MapReduce: A group-by-aggregate
  [Figure: Input records → Split → map → Local QSort → shuffle → reduce → Output records. The map outputs (k1,v1), (k2,v2), (k1,v3), (k2,v4), (k1,v5) are sorted locally into (k1,v1), (k1,v3), (k2,v2) and (k1,v5), (k2,v4), then shuffled so that one reduce receives (k1,v1), (k1,v3), (k1,v5) and the other receives (k2,v2), (k2,v4).]

Shortcomings
  - Extremely rigid data flow
      - Other flows hacked in: Stages, Joins, Splits
  - Common operations must be coded by hand
      - Join, filter, projection, aggregates, sorting, distinct
  - Semantics hidden inside map-reduce fns
      - Difficult to maintain, extend, and optimize
  [Figure: three M R (map, reduce) pairs]

Pig Latin: A Not-So-Foreign Language for Data Processing
  Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
  Yahoo! Research

Pig Philosophy
  - Pigs Eat Anything
      - Can operate on data w/o metadata: relational, nested, or unstructured.
  - Pigs Live Anywhere
      - Not tied to one particular parallel framework
  - Pigs Are Domestic Animals
      - Designed to be easily controlled and modified by its users.
      - UDFs: transformation functions, aggregates, grouping functions, and conditionals.
  - Pigs Fly
      - Processes data quickly(?)

Features
  - Dataflow language
      - Procedural: different from SQL
  - Quick Start and Interoperability
  - Nested Data Model
  - UDFs as First-Class Citizens
  - Parallelism Required
  - Debugging Environment

Pig Latin
  - Data Model
      - Atom:  'cs'
      - Tuple: ('cs', 'ece', 'ee')
      - Bag:   { ('cs', 'ece'), ('cs') }
      - Map:   [ 'courses' → { ('523', '525', '599') } ]
  - Expressions
      - Fields by position: $0
      - Fields by name: f1
      - Map lookup: #

Example Data Analysis Task
  Find the top 10 most visited pages in each category

  Visits
    User   URL          Time
    Amy    cnn.com      8:00
    Amy    bbc.com      10:00
    Amy    flickr.com   10:05
    Fred   cnn.com      12:00

  URL Info
    URL          Category   PageRank
    cnn.com      News       0.9
    bbc.com      News       0.8
    flickr.com   Photos     0.7
    espn.com     Sports     0.9

Data Flow
  Load Visits → Group by url → Foreach url generate count
  Load Url Info
  Join on url → Group by category → Foreach category generate top10 urls

In Pig Latin
    visits      = load '/data/visits' as (user, url, time);
    gVisits     = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo     = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls     = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';

Quick Start and Interoperability
  - Operates directly over files

Optional Schemas
  - Schemas optional; can be assigned dynamically

UDFs as First-class citizens
  - UDFs can be used in every construct

Operators
  - LOAD: specifying input data
  - FOREACH: per-tuple processing
  - FLATTEN: eliminate nesting
  - FILTER: discarding unwanted data
  - COGROUP: getting related data together
      - GROUP, JOIN
  - STORE: asking for output
  - Other: UNION, CROSS, ORDER, DISTINCT

Compilation into MapReduce
  - Every group or join operation forms a map-reduce boundary
  - Other operations pipelined into map and reduce phases
  [Figure: the example data flow (Load Visits, Group by url, Foreach url generate count, Load Url Info, Join on url, Group by category, Foreach category generate top10 urls) annotated with the stages Map1, Reduce1, Map2, Reduce2, Map3, Reduce3]

Debugging Environment
  - Write-run-debug cycle
  - Sandbox dataset
  - Objectives:
      - Realism
      - Conciseness
      - Completeness
  - Problems:
      - UDFs

Future Work
  - Optional "safe" query optimizer
      - Performs only high-confidence rewrites
  - User interface
      - Boxes and arrows UI
      - Promote collaboration, sharing code fragments and UDFs
  - Tight integration with a scripting language
      - Use loops, conditionals of host language

DryadLINQ: A System for General Purpose Distributed Data-Parallel Computing Using a High-Level Language
  Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey

Dryad System Architecture
  [Figure labels: Job manager; cluster; NS; PD; V; job schedule; control plane; data plane (Files, TCP, FIFO, Network)]

LINQ
    Collection<T> collection;
    bool IsLegal(Key k);
    string Hash(Key);

    var results = from c in collection
                  where IsLegal(c.key)
                  select new { Hash(c.key), c.value };

DryadLINQ Constructs
  - Partition
  - Collection
  - C# objects
  - Partitioning: Hash, Range, RoundRobin
  - Apply, Fork
  - Hints

Dryad + LINQ = DryadLINQ
  [Figure: the LINQ query from the previous slide over a C# collection, shown with the labels Query plan (Dryad job), Vertex code (C#), Data, and results]

DryadLINQ Execution Overview
  [Figure labels: Client machine; DryadLINQ; C#; Query Expr; Invoke; Query; Distributed query plan; Data center; Input Tables; Output Tables; Output; Results]
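
To make the LINQ query from the slides more concrete (keep the records whose key passes IsLegal, then project each one to a hashed key and its value), here is a rough Python imitation of what could happen per partition. It is only a sketch under stated assumptions: the Record layout, the is_legal rule, the hash_key function and the round_robin helper are hypothetical stand-ins, and nothing here reflects how DryadLINQ actually generates or schedules vertex code.

```python
import hashlib
from typing import Iterable, List, NamedTuple, Tuple


class Record(NamedTuple):
    key: str
    value: int


def is_legal(key: str) -> bool:
    # Hypothetical stand-in for IsLegal(Key): accept non-empty lowercase keys.
    return bool(key) and key.islower()


def hash_key(key: str) -> str:
    # Hypothetical stand-in for Hash(Key): a short, deterministic hex digest.
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]


def round_robin(records: Iterable[Record], n: int) -> List[List[Record]]:
    # Split the collection into n partitions, one record at a time.
    parts: List[List[Record]] = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts


def run_query(partition: List[Record]) -> List[Tuple[str, int]]:
    # from c in collection where IsLegal(c.key)
    # select new { Hash(c.key), c.value }
    return [(hash_key(c.key), c.value) for c in partition if is_legal(c.key)]


collection = [Record("a", 1), Record("B", 2), Record("c", 3), Record("", 4)]
partitions = round_robin(collection, 2)

# Each partition is filtered and projected independently;
# the per-partition outputs are then concatenated.
results = [row for part in partitions for row in run_query(part)]
```

Because the filter and the projection look at one record at a time, the partitions never need to exchange data, which is why this kind of query parallelizes so easily.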
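
The Pig Latin example in the slides (top 10 most visited pages in each category) can also be traced step by step in plain Python, using the sample rows from the Visits and URL Info tables. This is a single-process sketch meant only to show the data flow, not how Pig executes it. The function name top_urls_per_category is hypothetical, and ranking by visit count is an assumption about what the slide's top(...) function does.

```python
from collections import Counter, defaultdict

# Sample rows copied from the Visits and URL Info tables in the slides.
visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]


def top_urls_per_category(visits, url_info, n=10):
    # group visits by url; foreach group generate url, count(visits)
    visit_counts = Counter(url for _user, url, _time in visits)

    # join visitCounts by url, urlInfo by url
    # (inner join: URLs without any visit drop out)
    joined = [
        (url, category, visit_counts[url])
        for url, category, _rank in url_info
        if url in visit_counts
    ]

    # group the joined rows by category
    by_category = defaultdict(list)
    for url, category, count in joined:
        by_category[category].append((url, count))

    # foreach category generate the n most visited urls
    # (assumed meaning of top(visitCounts, 10))
    return {
        category: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for category, rows in by_category.items()
    }


top_urls = top_urls_per_category(visits, url_info)
```

On the sample data, espn.com has no visits and therefore disappears at the join, so no Sports category appears in the result. The two grouping steps and the join are the points where a MapReduce implementation would need a shuffle, which matches the slide's remark that every group or join forms a map-reduce boundary.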