Big Data Mining and Analytics Lecture
Parallel and Distributed Computing
Dr. Latifur Khan
Department of Computer Science
University of Texas at Dallas

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details. Material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under a Creative Commons Attribution 3.0 License).

Big Data processing boils down to…
- Divide-and-conquer
- Throwing more hardware at the problem
- Simple to understand… a lifetime to master…

Parallel vs. Distributed
- Parallel computing generally means:
  - Vector processing of data
  - Multiple CPUs in a single computer
- Distributed computing generally means:
  - Multiple CPUs across many computers

Flynn's Taxonomy

                        Single Instruction (SI)      Multiple Instruction (MI)
  Single Data (SD)      SISD:                        MISD:
                        single-threaded process      pipeline architecture
  Multiple Data (MD)    SIMD:                        MIMD:
                        vector processing            multi-threaded programming

[Diagram: SISD — one processor applies one instruction stream to one data stream]

[Diagram: SIMD — one processor applies each instruction to many data elements (D0 … Dn) at once]

[Diagram: MIMD — multiple processors, each with its own instruction stream and its own data stream]
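The MIMD quadrant of the taxonomy corresponds to everyday multi-threaded programming: several independent instruction streams, each working on its own data. A toy sketch (the worker/chunk names and the sum-of-squares task are illustrative, not from the slides):

```python
import threading

# MIMD in miniature: each thread is an independent instruction
# stream operating on its own slice of the data.

def worker(chunk, results, i):
    results[i] = sum(x * x for x in chunk)

data = list(range(1, 11))
chunks = [data[:5], data[5:]]              # partition the work
results = [0] * len(chunks)
threads = [threading.Thread(target=worker, args=(c, results, i))
           for i, c in enumerate(chunks)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(results))  # 385, i.e. 1^2 + 2^2 + ... + 10^2
```

The same partition/combine shape reappears below in the divide-and-conquer slide.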
Parallel vs. Distributed
- Parallel: multiple CPUs within a shared-memory machine
- Distributed: multiple machines, each with its own memory, connected over a network (the network connection is used for data transfer)

Divide and Conquer
- Partition the "Work" into units w1, w2, w3, …
- A "worker" processes each unit, producing partial results r1, r2, r3, …
- Combine the partial results into the final "Result"

Parallelization Problems
- How do we assign work units to workers?
- What if we have more work units than workers?
- What if workers need to share partial results?
- How do we aggregate partial results?
- How do we know all the workers have finished?
- What if workers die?
What is the common theme of all of these problems?

General Theme?
- Parallelization problems arise from:
  - Communication between workers
  - Access to shared resources (e.g., data)
- Thus, we need a synchronization system!
- This is tricky:
  - Finding bugs is hard
  - Solving bugs is even harder

Cloud Computing Lecture #2
From Lisp to MapReduce and GFS

Today's Topics
- Functional Programming
- MapReduce
- Google File System (GFS)

Functional Programming
- MapReduce = functional programming meets distributed processing on steroids
  - Not a new idea… dates back to the '50s (or even the '30s)
- What is functional programming?
  - Computation as the application of functions
  - Theoretical foundation provided by the lambda calculus
- How is it different?
  - Traditional notions of "data" and "instructions" are not applicable
  - Data flows are implicit in the program
  - Different orders of execution are possible
  - Exemplified by Lisp and ML

Lisp MapReduce?
- What does this have to do with MapReduce?
  - After all, Lisp is about processing lists
- Two important concepts in functional programming:
  - Map: do something to everything in a list
  - Fold: combine the results of a list in some way

Map
- Map is a higher-order function
- How map works:
  - The function is applied to every element in a list
  - The result is a new list

[Diagram: f applied to each element of a list, producing a new list]

Fold
- Fold is also a higher-order function
- How fold works:
  - An accumulator is set to an initial value
  - The function is applied to a list element and the accumulator
  - The result is stored in the accumulator
  - This repeats for every item in the list
  - The result is the final value in the accumulator

[Diagram: f folds each element into the accumulator, from the initial value to the final value]

Map/Fold in Action
- Simple map example:

    (map (lambda (x) (* x x)) '(1 2 3 4 5))  ; => '(1 4 9 16 25)

- Fold examples:

    (fold + 0 '(1 2 3 4 5))  ; => 15
    (fold * 1 '(1 2 3 4 5))  ; => 120

- Sum of squares:

    (define (sum-of-squares v)
      (fold + 0 (map (lambda (x) (* x x)) v)))
    (sum-of-squares '(1 2 3 4 5))  ; => 55

Lisp MapReduce
- Let's assume a long list of records: imagine if…
  - We could distribute the execution of map operations to multiple nodes
  - We had a mechanism for bringing the map results back together in the fold operation
- That's MapReduce! (and Hadoop)
- Implicit parallelism:
  - We can parallelize the execution of map operations, since they are isolated
  - We can reorder the folding if the fold function is commutative and associative

Typical Problem
- Iterate over a large number of records
- Map: extract something of interest from each
- Shuffle and sort the intermediate results
- Reduce: aggregate the intermediate results
- Generate the final output
Key idea: provide an abstraction at the point of these two operations

MapReduce
- Programmers specify two functions:
  - map (k, v) → <k', v'>*
  - reduce (k', v') → <k', v'>*
  - All v' with the same k' are reduced together
- Usually, programmers also specify:
  - partition (k', number of partitions) → partition for k'
  - Often a simple hash of the key, e.g., hash(k') mod n
  - Allows reduce operations for different keys to proceed in parallel
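The partition function described above can be sketched in a few lines. This is an assumption-laden toy, not Hadoop's implementation; crc32 stands in for hash(k') because Python's built-in hash() is randomized per process for strings:

```python
import zlib

# A simple hash of the intermediate key, mod the number of
# partitions, so every (k', v') pair with the same k' is routed
# to the same reduce partition.
def partition(key, num_partitions):
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p = partition("apple", 4)
assert 0 <= p < 4                        # always a valid partition index
assert partition("apple", 4) == p        # deterministic per key
```

Because the routing depends only on the key, reducers for different partitions can run independently.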
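The whole map → shuffle → reduce pipeline fits in a short single-process sketch using the classic word-count example. Function names (map_fn, reduce_fn, map_reduce) are illustrative, not Hadoop's API:

```python
from collections import defaultdict

# Toy single-process model: map emits (k', v') pairs, an in-memory
# "shuffle" groups them by key, and reduce aggregates each group.

def map_fn(key, line):
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    return key, sum(values)

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):      # map phase
            groups[k2].append(v2)        # shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = [(0, "the quick fox"), (1, "the lazy dog")]
counts = map_reduce(docs, map_fn, reduce_fn)
print(counts)  # {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Each map call touches only one record, so the map phase parallelizes trivially; and because addition is commutative and associative, the per-key reductions can run in any order, exactly as the Lisp MapReduce slide argues.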