MapReduce Algorithm Design
Big Data Mining and Analytics

Dr. Latifur Khan
Department of Computer Science
University of Texas at Dallas

(Chapter 3: Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
http://lintool.github.com/MapReduceAlgorithms/
Source: Wikipedia (Mahout)

MapReduce: Recap

Programmers must specify:
- map (k, v) → <k', v'>*
- reduce (k', v') → <k', v'>*
- All values with the same key are reduced together

Optionally, also:
- partition (k', number of partitions) → partition for k'
  - Often a simple hash of the key, e.g., hash(k') mod n
  - Divides up the key space for parallel reduce operations
- combine (k', v') → <k', v'>*
  - Mini-reducers that run in memory after the map phase
  - Used as an optimization to reduce network traffic

The execution framework handles everything else…

[Diagram: map tasks emit (key, value) pairs, combiners pre-aggregate them, and partitioners route them; "shuffle and sort" aggregates values by key before the reduce tasks run]

"Everything Else"

The execution framework handles everything else…
- Scheduling: assigns workers to map and reduce tasks
- "Data distribution": moves processes to data
- Synchronization: gathers, sorts, and shuffles intermediate data
- Errors and faults: detects worker failures and restarts

Limited control over data and execution flow:
- All algorithms must be expressed in m, r, c, p

You don't know:
- Where mappers and reducers run
- When a mapper or reducer begins or finishes
- Which input a particular mapper is processing
- Which intermediate key a particular reducer is processing

Tools for Synchronization

- Cleverly-constructed data structures: bring partial results together
- Sort order of intermediate keys: control the order in which reducers process keys
- Partitioner: control which reducer processes which keys
- Preserving state in mappers and reducers: capture dependencies across multiple keys and values

Preserving State

- Mapper object: configure (API initialization hook) → map (one call per input key-value pair) → close (API cleanup hook); state is preserved in one object per task
- Reducer object: configure → reduce (one call per intermediate key) → close; state is likewise preserved in one object per task

Scalable Hadoop Algorithms: Themes

- Avoid object creation
  - Inherently costly operation
  - Garbage collection
- Avoid buffering
  - Limited heap size
  - Works for small datasets, but won't scale!

Importance of Local Aggregation

Ideal scaling characteristics:
- Twice the data, twice the running time
- Twice the resources, half the running time

Why can't we achieve this?
- Synchronization requires communication
- Communication kills performance

Thus… avoid communication!
- Reduce intermediate data via local aggregation
- Combiners can help

[Diagram: physical view of the dataflow — mappers feed partitioners, intermediates are shuffled to reducers; combiners omitted here. Source: redrawn from a slide by Cloudera, cc-licensed]

Word Count: Baseline

What's the impact of combiners?

Word Count: Version 1

Are combiners still needed?

Word Count: Version 2

Are combiners still needed?
Key: preserve state across input key-value pairs!

Design Pattern for Local Aggregation

"In-mapper combining"
- Fold the functionality of the combiner into the mapper by preserving state across multiple map calls

Advantages:
- Speed
- Why is this faster than actual combiners?

Disadvantages:
- Explicit memory management required
- Potential for order-dependent bugs

Combiner Design

- Combiners and reducers share the same method signature
  - Sometimes, reducers can serve as combiners
  - Often, not…
- Remember: combiners are optional optimizations
  - Should not affect algorithm correctness
  - May be run 0, 1, or multiple times
- Example: find the average of all integers associated with the same key

Computing the Mean: Version 1

Why can't we use the reducer as the combiner?

Computing the Mean: Version 2

Why doesn't this work?

Computing the Mean: Version 3

Fixed?

Computing the Mean: Version 4

Are combiners still needed?

Algorithm Design: Running Example

Term co-occurrence matrix for a text collection:
- M = N x N matrix (N = vocabulary size)
- Mij: number of times i and j co-occur in some context (for concreteness, let's say context = sentence)

Why?
- Distributional profiles as a way of measuring semantic distance
- Semantic distance useful for many language processing tasks

MapReduce: Large Counting Problems

Term co-occurrence matrix for a text collection = a specific instance of a large counting problem:
- A large event space (number of terms)
- A large number of observations (the collection itself)
- Goal: keep track of interesting statistics about the events

Basic approach:
- Mappers generate partial counts
- Reducers aggregate partial counts

How do we aggregate partial counts efficiently?

First Try: "Pairs"

Each mapper takes a sentence:
- Generate all co-occurring term pairs
- For all pairs, emit (a, b) → count

Reducers sum up counts associated with these pairs.
Use combiners!
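The recap slide's default partitioning scheme, a simple hash of the key mod the number of reducers, can be sketched as follows. This is an illustrative stand-in, not Hadoop's actual `HashPartitioner`: Python randomizes string hashes between runs, so a stable byte-sum substitutes for `hashCode()` here.

```python
def partition(key, num_partitions):
    """Route a key to one of num_partitions reducers, as in hash(k') mod n."""
    # Stand-in for a stable hashCode(); Python's built-in hash() of strings
    # varies between runs unless PYTHONHASHSEED is fixed.
    stable_hash = sum(key.encode())
    # Divide up the key space for parallel reduce operations.
    return stable_hash % num_partitions
```

The property that matters is determinism: every occurrence of the same intermediate key, from every mapper, must land on the same reducer.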
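The word-count progression on the slides culminates in "Version 2", the in-mapper combining pattern: the mapper preserves an associative array across all its input key-value pairs and emits partial counts only in the cleanup hook. A minimal sketch in Python (class and method names mirror the configure/map/close lifecycle from the "Preserving State" slide, not Hadoop's real API):

```python
class InMapperCombiningWordCount:
    """Word count with the combiner folded into the mapper ("Version 2")."""

    def configure(self):
        # API initialization hook: one dictionary per mapper task.
        self.counts = {}

    def map(self, docid, text):
        # Tally terms locally instead of emitting (term, 1) per occurrence;
        # state persists across multiple map() calls.
        for term in text.split():
            self.counts[term] = self.counts.get(term, 0) + 1

    def close(self, emit):
        # API cleanup hook: emit one (term, partial_count) pair per term.
        for term, count in self.counts.items():
            emit(term, count)


def reduce_counts(term, partial_counts):
    # The reducer simply sums the partial counts for each term.
    return term, sum(partial_counts)
```

This is faster than an actual combiner because partial aggregation happens in the mapper itself, with no extra object creation, serialization, or local disk I/O; the cost is explicit memory management (the dictionary must fit in the heap) and the potential for order-dependent bugs noted on the slide.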
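The "Computing the Mean" slides hinge on one fact: the mean is not associative, so the Version 1 reducer cannot double as a combiner, while having the combiner emit (sum, count) pairs restores correctness. A sketch under that reading (function names are illustrative):

```python
def mean_reduce(key, values):
    # Version 1 reducer: correct as a reducer, wrong as a combiner,
    # because mean(mean(a, b), c) != mean(a, b, c) in general.
    return key, sum(values) / len(values)


def combine_sum_count(key, values):
    # Fixed combiner: emit a partial (sum, count) pair, not a partial mean.
    return key, (sum(values), len(values))


def reduce_sum_count(key, pairs):
    # Fixed reducer: aggregate partial sums and counts, divide once at the end.
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return key, total / count
```

Concretely, for values 1, 2 on one mapper and 3, 4, 5 on another, the mean of the partial means is (1.5 + 4.0) / 2 = 2.75, while the true mean is 3.0; the (sum, count) pairs (3, 2) and (12, 3) recover 15 / 5 = 3.0 exactly, and they still do so whether the combiner runs 0, 1, or multiple times.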
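The "pairs" approach described above can be sketched as a mapper that emits ((a, b), 1) for every ordered co-occurring pair in a sentence and a reducer that sums the counts for each cell of the matrix. A minimal sketch, assuming context = sentence and ignoring a word's co-occurrence with repeats of itself:

```python
from itertools import permutations


def pairs_map(sentence):
    """Emit ((a, b), 1) for every ordered co-occurring term pair in a sentence."""
    terms = sentence.split()
    # permutations pairs every position with every other position; the
    # a != b filter drops self-co-occurrence, a simplifying assumption.
    return [((a, b), 1) for a, b in permutations(terms, 2) if a != b]


def pairs_reduce(pair, counts):
    # Sum the partial counts for one (a, b) cell of the co-occurrence matrix.
    return pair, sum(counts)
```

The benefit of combiners here is limited, as the slides go on to discuss: the key space of pairs is enormous, so few partial counts share a key within any single mapper.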