UT Dallas CS 6350 - Lecture_on_Mapreduce_BigData#1

Big Data Mining and Analytics
Lecture: Parallel and Distributed Computing
Dr. Latifur Khan, Department of Computer Science, University of Texas at Dallas

Material adapted from slides by Christophe Bisciglia, Aaron Kimball, and Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007, licensed under the Creative Commons Attribution 3.0 License. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License; see http://creativecommons.org/licenses/by-nc-sa/3.0/us for details.

Big Data processing boils down to:
- Divide and conquer
- Throwing more hardware at the problem
Simple to understand; a lifetime to master.

Parallel vs. Distributed
- Parallel computing generally means vector processing of data, or multiple CPUs in a single computer.
- Distributed computing generally means multiple CPUs across many computers.

Flynn's Taxonomy classifies architectures by instruction and data streams:

                      Single Data (SD)          Multiple Data (MD)
Single Instr. (SI)    SISD                      SIMD
                      single-threaded process   vector processing
Multiple Instr. (MI)  MISD                      MIMD
                      pipeline architecture     multi-threaded programming

[Figures omitted: schematic diagrams of SISD (one processor, one instruction stream, one data stream), SIMD (one instruction stream applied in lock-step to data elements D0..Dn), and MIMD (multiple processors, each with its own instruction and data streams); a further diagram contrasts distributed machines, which hold private memory and exchange data over a network connection, with parallel shared-memory machines.]

- Parallel: multiple CPUs within a shared-memory machine.
- Distributed: multiple machines, each with its own memory, connected over a network.

Divide and Conquer
The work is partitioned into units w1, w2, w3, ...; each worker computes a partial result r1, r2, r3, ...; the partial results are then combined into the final result.

Parallelization Problems
- How do we assign work units to workers?
- What if we have more work units than workers?
- What if workers need to share partial results?
- How do we aggregate partial results?
- How do we know all the workers have finished?
- What if workers die?
What is the common theme of all of these problems?

General Theme
Parallelization
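The divide-and-conquer pattern above (partition the work, run workers, combine partial results) can be sketched in a few lines of Python. This is not from the slides: the names partition, worker, and combine are illustrative, and summing a range stands in for real work.

```python
# Divide-and-conquer sketch: partition -> workers -> combine.
from concurrent.futures import ThreadPoolExecutor

def partition(work, n):
    """Split the work list into n roughly equal units w1..wn."""
    size = (len(work) + n - 1) // n
    return [work[i:i + size] for i in range(0, len(work), size)]

def worker(unit):
    """Each worker computes a partial result r_i (here, a partial sum)."""
    return sum(unit)

def combine(results):
    """Aggregate the partial results into the final answer."""
    return sum(results)

work = list(range(1, 101))                    # the full job: sum 1..100
units = partition(work, 3)                    # w1, w2, w3
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(worker, units))  # r1, r2, r3
print(combine(partials))                      # 5050
```

Note that even this toy version raises the questions from the slide: the pool assigns units to workers, but nothing here handles a worker dying or workers needing to share partial results.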
problems arise from:
- Communication between workers
- Access to shared resources (e.g., data)
Thus we need a synchronization system. This is tricky: finding bugs is hard, and solving them is even harder.

Cloud Computing, Lecture 2: From Lisp to MapReduce and GFS
Material adapted from slides by Christophe Bisciglia, Aaron Kimball, and Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007, under the same Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License as above.

Today's Topics
- Functional programming
- MapReduce
- Google File System (GFS)

Functional Programming
MapReduce = functional programming meets distributed processing, on steroids.

What is functional programming?
- Computation as application of functions
- Theoretical foundation provided by the lambda calculus

How is it different?
- Not a new idea: dates back to the 50's (or even the 30's)
- Traditional notions of data and instructions are not applicable
- Data flows are implicit in the program
- Different orders of execution are possible
- Exemplified by LISP and ML

Lisp and MapReduce
What does this have to do with MapReduce? After all, Lisp is about processing lists. Two important concepts in functional programming:
- Map: do something to everything in a list
- Fold: combine the results of a list in some way

Map
Map is a higher-order function. How map works:
- A function f is applied to every element in a list
- The result is a new list

Fold
Fold is also a higher-order function. How fold works:
- An accumulator is set to an initial value
- The function is applied to a list element and the accumulator
- The result is stored in the accumulator
- This is repeated for every item in the list
- The result is the final value in the accumulator

Map/Fold in Action
A simple map example:
  (map (lambda (x) (* x x)) '(1 2 3 4 5))  =>  (1 4 9 16 25)
Fold examples:
  (fold + 0 '(1 2 3 4 5))  =>  15
  (fold * 1 '(1 2 3 4 5))  =>  120
Sum of squares:
  (define (sum-of-squares v)
    (fold + 0 (map (lambda (x) (* x x)) v)))
  (sum-of-squares '(1 2 3 4 5))  =>  55

Lisp -> MapReduce
Let's assume a long list of records. Imagine if:
- We could distribute the execution of map operations to multiple nodes
- We had a mechanism for bringing map results back together in the fold operation
That's MapReduce! (and Hadoop)

Implicit parallelism:
- We can parallelize execution of map operations, since they are isolated
- We can reorder folding if the fold function is commutative and associative

Typical Problem
- Iterate over a large number of records
- Map: extract something of interest from each
- Shuffle and sort intermediate results
- Reduce: aggregate intermediate results
- Generate final output
Key idea: provide an abstraction at the point of these two operations.

MapReduce
Programmers specify two functions:
  map (k, v) -> [(k', v')]
  reduce (k', [v']) -> [(k', v'')]
All values with the same key k' are reduced together. Usually, programmers also specify:
  partition (k', number of partitions) -> partition for k'
Often a simple hash of the key, e.g., hash(k') mod n. This allows reduce operations for different keys to run in parallel.
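The map/shuffle/reduce phases above can be simulated in plain Python using word count as the canonical example. This is a sketch, not Hadoop: map_reduce, map_fn, and reduce_fn are illustrative names, but they follow the signatures map (k, v) -> [(k', v')] and reduce (k', [v']) -> [(k', v'')] from the slides.

```python
# Single-process simulation of the MapReduce phases: map, shuffle, reduce.
from collections import defaultdict

def map_fn(key, value):
    # key: document name (unused here); value: document text.
    # Emits one (word, 1) pair per word: map (k, v) -> [(k', v')].
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # Aggregates all counts for one word: reduce (k', [v']) -> [(k', v'')].
    return [(key, sum(values))]

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    intermediate = []
    for k, v in inputs:
        intermediate.extend(map_fn(k, v))
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: each key's group is reduced independently --
    # this is exactly what a real system runs in parallel.
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
print(map_reduce(docs, map_fn, reduce_fn))
# [('brown', 1), ('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1),
#  ('quick', 1), ('the', 3)]
```

In a real deployment the shuffle step is where partition(k', n) comes in: each intermediate key is hashed to one of n reduce tasks, so different keys can be reduced on different machines at the same time.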

