Wright CS 707 - Map Reduce Architecture


Map Reduce Architecture
Adapted from lectures by Anand Rajaraman (Stanford Univ.) and Dan Weld (Univ. of Washington).

Outline: Single-node architecture; Commodity Clusters; Cluster Architecture; Stable storage; Distributed File System; Motivation for MapReduce (why); What is Map/Reduce; Map in LISP (Scheme); Reduce in LISP (Scheme); Warm up: Word Count; Word Count (2); Word Count (3); MapReduce; Word Count using MapReduce; Count, Illustrated; Model is Widely Applicable; MapReduce Programs In Google Source Tree; Implementation Overview; Distributed Execution Overview; Data flow; Coordination; Failures; Execution; Parallel Execution; How many Map and Reduce jobs?; Combiners; Partition Function; Execution Summary; Exercise 1: Host size; Exercise 2: Distributed Grep; Grep; Exercise 3: Graph reversal; Reverse Web-Link Graph; Exercise 4: Frequent Pairs; Hadoop; Reading; Conclusions.

Single-node architecture
- A single machine (CPU, memory, disk) is the traditional setting for machine learning, statistics, and "classical" data mining.

Commodity Clusters
- Web data sets can be very large: tens to hundreds of terabytes.
- They cannot be mined on a single server (why?).
- The standard architecture that has emerged: a cluster of commodity Linux nodes with a gigabit Ethernet interconnect.
- The questions: how do we organize computations on this architecture, and how do we mask issues such as hardware failure?

Cluster Architecture
- Each rack contains 16-64 nodes, each with its own CPU, memory, and disk.
- A switch within each rack gives roughly 1 Gbps between any pair of nodes in the rack.
- A backbone switch gives 2-10 Gbps between racks.

Stable storage
- First-order problem: if nodes can fail, how can we store data persistently?
- Answer: a distributed file system, which provides a global file namespace.
- Examples: Google GFS; Hadoop HDFS; Kosmix KFS.
- Typical usage pattern: huge files (100s of GB to TB); data is rarely updated in place; reads and appends are common.

Distributed File System
- Chunk servers: a file is split into contiguous chunks, typically 16-64 MB each; each chunk is replicated (usually 2x or 3x), and the system tries to keep replicas in different racks.
- Master node (a.k.a. Name Node in HDFS): stores metadata; might itself be replicated.
- Client library for file access: talks to the master to find the chunk servers, then connects directly to the chunk servers to access the data (a small sketch of this read path follows).
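The read path in the last bullet can be made concrete with a short sketch. The Python code below is only an illustration of the idea under assumed, simplified interfaces; the Master, ChunkServer, and read_file names are hypothetical and are not the GFS or HDFS API.

    # Hypothetical, simplified sketch of a DFS read: the client asks the master
    # for chunk locations (metadata only), then reads each chunk directly from
    # a chunk server holding a replica. Real chunks are 16-64 MB, not a few bytes.

    class ChunkServer:
        def __init__(self):
            self.chunks = {}              # chunk_id -> bytes stored on this server

        def read(self, chunk_id):
            return self.chunks[chunk_id]

    class Master:
        """Holds metadata only: which chunks make up a file, and where replicas live."""
        def __init__(self):
            self.file_chunks = {}         # filename -> [chunk_id, ...] in order
            self.chunk_locations = {}     # chunk_id -> [ChunkServer, ...] (replicas)

        def lookup(self, filename):
            return [(cid, self.chunk_locations[cid]) for cid in self.file_chunks[filename]]

    def read_file(master, filename):
        data = b""
        for chunk_id, replicas in master.lookup(filename):  # 1. metadata from the master
            data += replicas[0].read(chunk_id)               # 2. data directly from a chunk server
        return data

    # Tiny demo: one file split into two chunks, each replicated on two servers.
    s1, s2 = ChunkServer(), ChunkServer()
    s1.chunks.update({"c0": b"hell", "c1": b"o!"})
    s2.chunks.update({"c0": b"hell", "c1": b"o!"})
    master = Master()
    master.file_chunks["greeting.txt"] = ["c0", "c1"]
    master.chunk_locations = {"c0": [s1, s2], "c1": [s2, s1]}
    print(read_file(master, "greeting.txt"))                 # b'hello!'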
Motivation for MapReduce (why)
- Large-scale data processing: we want to use 1000s of CPUs, but without the hassle of managing them.
- The MapReduce architecture provides automatic parallelization and distribution, fault tolerance, I/O scheduling, and monitoring and status updates.

What is Map/Reduce
- A programming model taken from LISP (and other functional languages).
- Many problems can be phrased this way.
- It is easy to distribute across nodes and has nice retry/failure semantics.

Map in LISP (Scheme)
- (map f list [list2 list3 ...]) applies the unary operator f to each element of the list.
- Example: (map square '(1 2 3 4)) => (1 4 9 16)

Reduce in LISP (Scheme)
- (reduce f id list) folds the list with the binary operator f, starting from the identity element id.
- Example: (reduce + 0 '(1 4 9 16)) => (+ 16 (+ 9 (+ 4 (+ 1 0)))) => 30
- The two compose: (reduce + 0 (map square (map - l1 l2))) sums the squared differences of two lists.

Warm up: Word Count
- We have a large file of words, one word per line.
- Count the number of times each distinct word appears in the file.
- Sample application: analyzing web server logs to find popular URLs.

Word Count (2)
- Case 1: the entire file fits in memory.
- Case 2: the file is too large for memory, but all <word, count> pairs fit in memory.
- Case 3: the file is on disk, and there are too many distinct words to fit in memory:
  sort datafile | uniq -c

Word Count (3)
- To make it slightly harder, suppose we have a large corpus of documents and want to count the number of times each distinct word occurs in the corpus:
  words(docs/*) | sort | uniq -c
  where words takes a file and outputs the words in it, one per line.
- This pipeline captures the essence of MapReduce; the great thing is that it is naturally parallelizable.

MapReduce
- Input: a set of key/value pairs.
- The user supplies two functions:
  map(k, v) -> list(k1, v1)
  reduce(k1, list(v1)) -> v2
- (k1, v1) is an intermediate key/value pair.
- The output is the set of (k1, v2) pairs.

Word Count using MapReduce
    map(key, value):
      // key: document name; value: text of document
      for each word w in value:
        emit(w, 1)

    reduce(key, values):
      // key: a word; values: an iterator over counts
      result = 0
      for each count v in values:
        result += v
      emit(key, result)
(A runnable version of this pseudocode is sketched after the next slide.)

Count, Illustrated
- map(key=url, val=contents): for each word w in contents, emit (w, "1").
- reduce(key=word, values=uniq_counts): sum all the "1"s in the values list and emit (word, sum).
- Input: "see bob run" and "see spot throw"
  Map output: (see, 1) (bob, 1) (run, 1) (see, 1) (spot, 1) (throw, 1)
  Reduce output: (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)
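As a companion to the pseudocode in the two slides above, here is a minimal, self-contained Python sketch of the same word count. It runs everything in one process; the grouping step merely stands in for the framework's shuffle/sort phase, so this is an illustration of the model rather than the Hadoop or Google MapReduce API.

    # In-memory illustration of the word-count map/reduce above.
    from collections import defaultdict

    def map_fn(doc_name, text):
        # key: document name; value: text of the document
        for word in text.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # key: a word; values: an iterable of counts
        yield (word, sum(counts))

    def mapreduce(inputs, mapper, reducer):
        # Map phase: apply the mapper to every input record.
        intermediate = []
        for key, value in inputs:
            intermediate.extend(mapper(key, value))
        # "Shuffle": group intermediate values by key (done by the framework in practice).
        groups = defaultdict(list)
        for k1, v1 in intermediate:
            groups[k1].append(v1)
        # Reduce phase: apply the reducer to each key and its list of values.
        output = []
        for k1, values in groups.items():
            output.extend(reducer(k1, values))
        return output

    docs = [("doc1", "see bob run"), ("doc2", "see spot throw")]
    print(sorted(mapreduce(docs, map_fn, reduce_fn)))
    # [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]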
Model is Widely Applicable: MapReduce Programs In Google Source Tree
- Example uses: distributed grep; distributed sort; web link-graph reversal; term-vector per host; web access log statistics; inverted index construction; document clustering; machine learning; statistical machine translation; ...

Implementation Overview
- Typical cluster: 100s/1000s of 2-CPU x86 machines, each with 2-4 GB of memory.
- Limited bisection bandwidth.
- Storage is on local IDE disks.
- GFS, the distributed file system, manages the data (SOSP '03).
- Job scheduling system: jobs are made up of tasks, and the scheduler assigns tasks to machines.
- The implementation is a C++ library linked into user programs.

Distributed Execution Overview
[Diagram: the user program forks a master and worker processes; the master assigns map and reduce tasks to workers; map workers read the input splits (Split 0, Split 1, Split 2) and write intermediate data to local disk; reduce workers remote-read and sort that data, then write Output File 0 and Output File 1.]

Data flow
- The input and final output are stored on a distributed file system.
- The scheduler tries to schedule map tasks "close" to the physical storage location of their input data.
- Intermediate results are stored on the local file systems of the map and reduce workers.
- The output is often the input to another MapReduce task.

Coordination
- Master data structures track task status: idle, in-progress, or completed.
- Idle tasks get scheduled as workers become available.
- When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer; the master pushes this information to the reducers.
- The master pings workers periodically to detect failures.

Failures
- Map worker failure: map tasks completed or in-progress at the worker are reset to idle. Reduce workers are notified when the task is ...

