CSE 3302Lecture 24: MapReduceApril 27, 2010Nate NystromUTATuesday, April 27, 2010• its = the third person, neuter, possessive pronoun• it’s = a contraction of “it is”Tuesday, April 27, 2010Task parallelism• Intelligent task design eliminates as many synchronization points as possible, but some will be inevitable• Independent tasks can operate on different physical machines in distributed fashion• Good task design requires identifying common data and functionality to move as a unit Tuesday, April 27, 2010Tuesday, April 27, 2010Tuesday, April 27, 2010• 378,000 results in 0.17 seconds• including images and video• communicates with 1000s of machines• web server• index servers• document servers• spell checker• ad system• news, book, images, maps, ...Tuesday, April 27, 2010How?• Indexing• Google has a copy of the web on its servers• Google computes an index mapping search terms to URLs• blue -> { www.foo.com, www.bar.com, ... }• heeler -> { www.bar.com, ... }Tuesday, April 27, 2010How big is the web?Tuesday, April 27, 2010How big is the web?• Many billions of documents• One trillion URLs (2008)• Google has a copy of the web on its serversTuesday, April 27, 2010How many servers?Tuesday, April 27, 2010How many servers?• Google has ~450,000 servers (2007)• 20 megawatts (~12500 homes)• $2M / month for electricity• 20–100 petaflopsTuesday, April 27, 2010Tuesday, April 27, 2010Indexing the web• Want to index billions of documents stored on thousands of machines• ... in a reasonable timeTuesday, April 27, 2010Problem• Large amounts of raw data –– petabytes• web documents• logs• indices• the web graph• web crawl metadata• queries• More data than can fit on single machineTuesday, April 27, 2010SolutionTuesday, April 27, 2010Solution• Functional programming!Tuesday, April 27, 2010MapReduce• Framework for distributed computing on large data sets• Introduced by Google• Jeffrey Dean and Sanjay Ghemawat, OSDI 2004• Many other implementations• for: C++, Python, Erlang, Scala, Java, ...• Based on functional programming•Users provide two functions: map and reduceTuesday, April 27, 2010• 2008:• 100000 MapReduce jobs per day• Average of 400 machines per job• Average of 5–10 minutes per jobTuesday, April 27, 2010MapReduce• Goal: reduce complexity of distributed computation•Programmers just write map and reduce functions• Runtime system takes care of:• parallelizing the computation• data partitioning• scheduling• failures• communication• executing on a clusterTuesday, April 27, 2010Programming model• Computation takes a set of key–value pairs•Outputs a set of key–value pairs• Map:• takes a key–value pair, generates intermediate pairs• Reduce:• merges intermediate values associated with same keyTuesday, April 27, 2010Map• map: (Kin, Vin) => Seq[(Kout, Vtmp)]• Takes a key–value pair of one type• Returns a list of intermediate key–value pairs of another type• Applied in parallel to every item in the input set• produces a list for each callTuesday, April 27, 2010Reduce• reduce : (Kout, Seq[Vtmp]) => Seq[Vout]• Applied in parallel to each group• Typically returns an option type (some value or none)Tuesday, April 27, 2010Counting wordsdef map(name: String, doc: String): Seq[(String,Int)] = { doc.split(“ “).map(w => (w, 1)) }def reduce(word: String, counts: Seq[Int]): (String,Int) = { val count = counts.foldLeft(0)(_ + _) (word, count)}Tuesday, April 27, 2010grav.txt,“A screaming comesacross the sky.”1984.txt,“It was a bright cold dayin April, and the clockswere striking thirteen.”neuro.txt,“The sky above the portwas the color oftelevision, tuned to adead channel.”“a”, 1“screaming”, 1“comes”, 1“across”, 1“the”, 1“sky”, 1“it”, 1“was”, 1“a”, 1“bright”, 1“cold”, 1“day”, 1“in”, 1“april”, 1“and”, 1“the”, 1“clocks”, 1“were”, 1“striking”, 1“thirteen”, 1“the”, 1“sky”, 1“above”, 1“the”, 1“port”, 1“was”, 1“the”, 1“color”, 1“of”, 1“television”, 1“tuned”, 1“to”, 1“a”, 1“dead”, 1“channel”, 1mapTuesday, April 27, 2010“a”, 1“screaming”, 1“comes”, 1“across”, 1“the”, 1“sky”, 1“it”, 1“was”, 1“a”, 1“bright”, 1“cold”, 1“day”, 1“in”, 1“April”, 1“and”, 1“the”, 1“clocks”, 1“were”, 1“striking”, 1“thirteen”, 1“a”, [1,1,1]“above”, [1]“across”, [1]“and”, [1]“april”, [1]“bright”, [1]“channel”, [1]“clocks”, [1]“cold”, [1]“color”, [1]“comes”, [1]“day”, [1]“dead”, [1]“in”, [1]“it”, [1]“of”, [1]“port”, [1]“screaming”, [1]“sky”, [1,1]“striking”, [1]“television”, [1]“the”, [1,1,1,1,1]“thirteen”, [1]“to”, [1]“tuned”, [1]“was”, [1,1]“were”, [1]“the”, 1“sky”, 1“above”, 1“the”, 1“port”, 1“was”, 1“the”, 1“color”, 1“of”, 1“television”, 1“tuned”, 1“to”, 1“a”, 1“dead”, 1“channel”, 1partitionTuesday, April 27, 2010reduce“a”, [1,1,1]“above”, [1]“across”, [1]“and”, [1]“april”, [1]“bright”, [1]“channel”, [1]“clocks”, [1]“cold”, [1]“color”, [1]“comes”, [1]“day”, [1]“dead”, [1]“in”, [1]“it”, [1]“of”, [1]“port”, [1]“screaming”, [1]“sky”, [1,1]“striking”, [1]“television”, [1]“the”, [1,1,1,1,1]“thirteen”, [1]“to”, [1]“tuned”, [1]“was”, [1,1]“were”, [1]“a”, 3“above”, 1“across”, 1“and”, 1“april”, 1“bright”, 1“channel”, 1“clocks”, 1“cold”, 1“color”, 1“comes”, 1“day”, 1“dead”, 1“in”, 1“it”, 1“of”, 1“port”, 1“screaming”, 1“sky”, 2“striking”, 1“television”, 1“the”, 5“thirteen”, 1“to”, 1“tuned”, 1“was”, 2“were”, 1Tuesday, April 27, 2010Parallelism•If each map operation is independent, all mappers can be done in parallel•reduce requires all outputs of map with same key presented to same reducer at same timeTuesday, April 27, 2010Examples• grep• map emits a line if it matches a pattern• reduce copies the inputs to outputTuesday, April 27, 2010Examples• count url access frequency• map outputs (url,1) for each url in log• reduce adds counts and emits (url,
View Full Document