UT Dallas CS 6350 - 02. Lecture_on_Mapreduce_BigData#1

Big Data Mining and Analytics Lecture
Parallel and Distributed Computing

Dr. Latifur Khan
Department of Computer Science
University of Texas at Dallas

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under a Creative Commons Attribution 3.0 License).

Big Data processing boils down to…
- Divide-and-conquer
- Throwing more hardware at the problem
- Simple to understand… a lifetime to master…

Parallel vs. Distributed
- Parallel computing generally means: vector processing of data, or multiple CPUs in a single computer
- Distributed computing generally means: multiple CPUs across many computers

Flynn's Taxonomy
Architectures are classified by their instruction and data streams:

                       Single Instruction (SI)         Multiple Instruction (MI)
  Single Data (SD)     SISD: single-threaded process   MISD: pipeline architecture
  Multiple Data (MD)   SIMD: vector processing         MIMD: multi-threaded programming

[Diagrams: SISD, one processor applies one instruction stream to one data stream; SIMD, one instruction stream operates on many data streams (D0 … Dn) in lockstep; MIMD, several processors each run their own instruction stream over their own data.]

Parallel vs. Distributed
- Parallel: multiple CPUs within a shared-memory machine
- Distributed: multiple machines, each with its own memory, connected over a network; data moves between machines over the network connection

Divide and Conquer
The "work" is partitioned into units w1, w2, w3, each handled by a "worker"; the partial results r1, r2, r3 are then combined into the final "result".

Parallelization Problems
- How do we assign work units to workers?
- What if we have more work units than workers?
- What if workers need to share partial results?
- How do we aggregate partial results?
- How do we know all the workers have finished?
- What if workers die?
What is the common theme of all of these problems?

General Theme?
- Parallelization problems arise from:
  - Communication between workers
  - Access to shared resources (e.g., data)
- Thus, we need a synchronization system!
- This is tricky: finding bugs is hard, and fixing them is even harder

Cloud Computing Lecture #2
From Lisp to MapReduce and GFS

Today's Topics
- Functional Programming
- MapReduce
- Google File System (GFS)

Functional Programming
- MapReduce = functional programming meets distributed processing on steroids
- Not a new idea… it dates back to the '50s (or even the '30s)
- What is functional programming?
  - Computation as the application of functions
  - Theoretical foundation provided by the lambda calculus
- How is it different?
  - Traditional notions of "data" and "instructions" are not applicable
  - Data flows are implicit in the program
  - Different orders of execution are possible
- Exemplified by Lisp and ML

Lisp → MapReduce?
- What does this have to do with MapReduce? After all, Lisp is about processing lists
- Two important concepts in functional programming:
  - Map: do something to everything in a list
  - Fold: combine the results of a list in some way

Map
- Map is a higher-order function
- How map works:
  - The function is applied to every element in a list
  - The result is a new list

[Diagram: f applied element by element, producing a new list of the same length.]

Fold
- Fold is also a higher-order function
- How fold works:
  - The accumulator is set to an initial value
  - The function is applied to a list element and the accumulator
  - The result is stored back in the accumulator
  - This repeats for every item in the list
  - The result is the final value in the accumulator

[Diagram: f applied repeatedly, threading the accumulator from the initial value to the final value.]

Map/Fold in Action
Simple map example:

  (map (lambda (x) (* x x)) '(1 2 3 4 5))  →  '(1 4 9 16 25)

Fold examples:

  (fold + 0 '(1 2 3 4 5))  →  15
  (fold * 1 '(1 2 3 4 5))  →  120

Sum of squares:

  (define (sum-of-squares v)
    (fold + 0 (map (lambda (x) (* x x)) v)))
  (sum-of-squares '(1 2 3 4 5))  →  55
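To see the mechanics from the Map and Fold slides spelled out, here is a minimal sketch of hand-rolled versions, assuming the left-fold behavior of the examples above (as in SRFI-1's fold, where the combining function receives the element and the running accumulator). The names my-map and my-fold are illustrative, not part of any standard library; this should run in any R7RS Scheme:

  ; my-fold: apply f to each element and the accumulator, left to right,
  ; so (my-fold + 0 '(1 2 3)) computes (+ 3 (+ 2 (+ 1 0))).
  (define (my-fold f acc lst)
    (if (null? lst)
        acc                                     ; final value lives in the accumulator
        (my-fold f (f (car lst) acc) (cdr lst))))

  ; my-map: apply f to every element, producing a new list.
  (define (my-map f lst)
    (if (null? lst)
        '()
        (cons (f (car lst)) (my-map f (cdr lst)))))

  (display (my-map (lambda (x) (* x x)) '(1 2 3 4 5))) (newline)  ; (1 4 9 16 25)
  (display (my-fold + 0 '(1 2 3 4 5))) (newline)                  ; 15
  (display (my-fold * 1 '(1 2 3 4 5))) (newline)                  ; 120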
Lisp → MapReduce
- Let's assume a long list of records. Imagine if…
  - we could distribute the execution of map operations to multiple nodes, and
  - we had a mechanism for bringing the map results back together in the fold operation
- That's MapReduce! (and Hadoop)
- Implicit parallelism:
  - We can parallelize the execution of map operations, since they are isolated from one another
  - We can reorder the folding if the fold function is commutative and associative

Typical Problem
- Iterate over a large number of records
- Map: extract something of interest from each
- Shuffle and sort intermediate results
- Reduce: aggregate intermediate results
- Generate final output
Key idea: provide an abstraction at the point of these two operations

MapReduce
- Programmers specify two functions:
  - map (k, v) → <k', v'>*
  - reduce (k', v') → <k', v'>*
  - All v' with the same k' are reduced together
- Usually, programmers also specify:
  - partition (k', number of partitions) → partition for k'
  - Often a simple hash of the key, e.g. hash(k') mod n
  - Allows reduce operations for different keys to run in parallel
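To make the model above concrete, here is a minimal single-machine sketch of the map → shuffle → reduce pipeline, using word count as the example (word count is a standard illustration, not named in the slides above; wc-map, wc-reduce, group-by-key, and run-mapreduce are illustrative names, and a real framework distributes these steps across many machines):

  ; wc-map: map (k, v) -> <k', v'>*
  ; Input records are (doc-id . list-of-words); emit (word . 1) for each word.
  (define (wc-map k v)
    (map (lambda (w) (cons w 1)) v))

  ; wc-reduce: reduce (k', list of v') -> (k', v')
  ; All counts that share one key are summed together.
  (define (wc-reduce k vs)
    (cons k (apply + vs)))

  ; add-pair: file (k . v) into an association list of (k . list-of-vs).
  (define (add-pair k v groups)
    (cond ((null? groups) (list (cons k (list v))))
          ((equal? (caar groups) k)
           (cons (cons k (cons v (cdar groups))) (cdr groups)))
          (else (cons (car groups) (add-pair k v (cdr groups))))))

  ; group-by-key: the "shuffle" -- gather all v' sharing the same k'.
  (define (group-by-key pairs)
    (if (null? pairs)
        '()
        (add-pair (caar pairs) (cdar pairs)
                  (group-by-key (cdr pairs)))))

  ; run-mapreduce: map every record, shuffle, then reduce each group.
  (define (run-mapreduce mapper reducer records)
    (map (lambda (g) (reducer (car g) (cdr g)))
         (group-by-key
          (apply append
                 (map (lambda (r) (mapper (car r) (cdr r))) records)))))

  (display (run-mapreduce wc-map wc-reduce
                          '((doc1 . ("big" "data" "big"))
                            (doc2 . ("data" "mining")))))
  ; prints ((mining . 1) (data . 2) (big . 2)); key order may vary

Because each call to the mapper touches only its own record, the outer map over records could run on many nodes at once, and because + is commutative and associative, the per-key reductions can proceed independently, which is exactly the implicit parallelism noted above.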

