DOC PREVIEW
USC CSCI 585 - MapReduce

This preview shows page 1-2-3-27-28-29 out of 29 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

MapReduce Dean and Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, Vol. 51, No. 1, January 2008.A Shared-Nothing FrameworkOverview: Map/Reduce (Hadoop)ExampleOther ExamplesMapReduceImplementationExecutionOutput of ExecutionSlide 10MasterWorker FailuresMaster FailureSlide 14Sequential versus Parallel ExecutionSlide 16Slide 17Load BalancingLoad Imbalance: StagglersFunction ShippingHow to Debug?Slide 22Slide 23Monitoring of MapReducePerformance NumbersSlide 26Startup with GrepSortSort ResultsMapReduceMapReduceDean and Ghemawat. MapReduce: Simplified Data Dean and Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of Processing on Large Clusters. Communications of the ACM, Vol. 51, No. 1, January 2008.the ACM, Vol. 51, No. 1, January 2008.Shahram GhandeharizadehShahram GhandeharizadehComputer Science DepartmentComputer Science DepartmentUniversity of Southern CaliforniaUniversity of Southern CaliforniaA Shared-Nothing FrameworkA Shared-Nothing FrameworkShared-nothing architecture consisting of Shared-nothing architecture consisting of thousands of nodes!thousands of nodes!A node is an off-the-shelf, commodity PC.A node is an off-the-shelf, commodity PC.Google File SystemGoogle File SystemGoogle’s Bigtable Data ModelGoogle’s Bigtable Data ModelGoogle’s Map/Reduce FrameworkGoogle’s Map/Reduce FrameworkYahoo’s Pig Latin Yahoo’s Pig Latin …………..Overview: Map/Reduce (Hadoop)Overview: Map/Reduce (Hadoop)A programming model to make parallelism A programming model to make parallelism transparent to a programmer.transparent to a programmer.Programmer specifies:Programmer specifies:a map function that processes a key/value pair to a map function that processes a key/value pair to generate a set of intermediate key/value pairs.generate a set of intermediate key/value pairs.Divides the problem into smaller “intermediate key/value” Divides the problem into smaller “intermediate key/value” sub-problems.sub-problems.a reduce function to merge all intermediate values a reduce function to merge all intermediate values associated with the same intermediate key.associated with the same intermediate key.Solve each sub-problem.Solve each sub-problem.Final results might be stored across R files.Final results might be stored across R files.Run-time system takes care of:Run-time system takes care of:Partitioning the input data across nodes,Partitioning the input data across nodes,Scheduling the program’s execution,Scheduling the program’s execution,Node failures,Node failures,Coordination among multiple nodes.Coordination among multiple nodes.Example Example Counting word occurrences:Counting word occurrences:Input document is NameList and its content is: “Jim Shahram Betty Jim Input document is NameList and its content is: “Jim Shahram Betty Jim Shahram Jim Shahram”Shahram Jim Shahram”Desired output:Desired output:Jim: 3Jim: 3Shahram: 3Shahram: 3Betty: 1Betty: 1How?How?Map(String doc_name, String doc_content)Map(String doc_name, String doc_content)// doc_name is document name, NameList// doc_name is document name, NameList//doc_content is document content, “Jim Shahram …”//doc_content is document content, “Jim Shahram …”For each word w in valueFor each word w in valueEmitIntermediate(w, “1”);EmitIntermediate(w, “1”);Map (NameList, “Jim Shahram Betty …”) emits:Map (NameList, “Jim Shahram Betty …”) emits:[Jim, 1], [Shahram, 1], [Betty, 1][Jim, 1], [Shahram, 1], [Betty, 1]A hash function may split different tokens across M different “Worker” processes.A hash function may split different tokens across M different “Worker” processes.Reduce (String key, Iterator values)Reduce (String key, Iterator values)// key is a word// key is a word// values is a list of counts// values is a list of countsInt result = 0;Int result = 0;For each v in valuesFor each v in valuesresult += ParseInt(v);result += ParseInt(v);Emit(AsString(result));Emit(AsString(result));Reduce (“Jim”, “1 1 1”) emits “3”Reduce (“Jim”, “1 1 1”) emits “3”Other ExamplesOther ExamplesDistributed Grep:Distributed Grep:Map function emits a line if it matches a supplied pattern.Map function emits a line if it matches a supplied pattern.Reduce function is an identity function that copies the supplied Reduce function is an identity function that copies the supplied intermediate data to the output.intermediate data to the output.Count of URL accesses:Count of URL accesses:Map function processes logs of web page requests and outputs Map function processes logs of web page requests and outputs <URL, 1>,<URL, 1>,Reduce function adds together all values for the same URL, Reduce function adds together all values for the same URL, emitting <URL, total count> pairs.emitting <URL, total count> pairs.Reverse Web-Link graph; e.g., all URLs with reference to Reverse Web-Link graph; e.g., all URLs with reference to http://dblab.usc.edu:http://dblab.usc.edu:Map function outputs <tgt, src> for each link to a tgt in a page Map function outputs <tgt, src> for each link to a tgt in a page named src,named src,Reduce concatenates the list of all src URLS associated with a Reduce concatenates the list of all src URLS associated with a given tgt URL and emits the pair: <tgt, list(src)>.given tgt URL and emits the pair: <tgt, list(src)>.Inverted Index; e.g., all URLs with 585 as a word:Inverted Index; e.g., all URLs with 585 as a word:Map function parses each document, emitting a sequence of Map function parses each document, emitting a sequence of <word, doc_ID>,<word, doc_ID>,Reduce accepts all pairs for a given word, sorts the Reduce accepts all pairs for a given word, sorts the corresponding doc_IDs and emits a <word, list(doc_ID)> pair.corresponding doc_IDs and emits a <word, list(doc_ID)> pair.Set of all output pairs forms a simple inverted index.Set of all output pairs forms a simple inverted index.MapReduceMapReduceInput: R = {r1, r2, …, rn}, user provided Input: R = {r1, r2, …, rn}, user provided functions M and Rfunctions M and RM(ri) M(ri)  {[K1, V1], [K2, V2], … } {[K1, V1], [K2, V2], … }[Jim, 1], [Shahram, 1], [Betty, 1], …[Jim, 1], [Shahram, 1], [Betty, 1], …[Jim, “1 1 1”], [Shahram, “1 1 1”], [Betty, “1”][Jim, “1 1 1”], [Shahram,


View Full Document

USC CSCI 585 - MapReduce

Download MapReduce
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view MapReduce and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view MapReduce 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?