UT Dallas CS 6350 - 05.Lecture_on_HadoopBigData#3

MapReduce Algorithm Design
Big Data Mining and Analytics
Dr. Latifur Khan, Department of Computer Science, University of Texas at Dallas
(Chapter 3: Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.) http://lintool.github.com/MapReduceAlgorithms/
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license; see http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Title image source: Wikipedia (Mahout).

Outline (slide titles)
- MapReduce: Recap
- "Everything Else"
- Tools for Synchronization
- Preserving State
- Scalable Hadoop Algorithms: Themes
- Importance of Local Aggregation
- Word Count: Baseline / Version 1 / Version 2
- Design Pattern for Local Aggregation
- Combiner Design
- Computing the Mean: Versions 1-4
- Algorithm Design: Running Example
- MapReduce: Large Counting Problems
- First Try: "Pairs" / Pairs: Pseudo-Code / "Pairs" Analysis
- Another Try: "Stripes" / Stripes: Pseudo-Code / "Stripes" Analysis
- Relative Frequencies
- f(B|A): "Stripes" / f(B|A): "Pairs"
- "Order Inversion"
- Synchronization: Pairs vs. Stripes
- Secondary Sorting / Secondary Sorting: Solutions
- Recap: Tools for Synchronization
- Issues and Tradeoffs
- Debugging at Scale
- Questions?

MapReduce: Recap
Programmers must specify:
- map (k, v) → <k', v'>*
- reduce (k', v') → <k', v'>*
- All values with the same key are reduced together.
Optionally, also:
- partition (k', number of partitions) → partition for k'. Often a simple hash of the key, e.g., hash(k') mod n; divides up the key space for parallel reduce operations.
- combine (k', v') → <k', v'>*. Mini-reducers that run in memory after the map phase; used as an optimization to reduce network traffic.
The execution framework handles everything else.

[Figure: MapReduce data flow. Each input split goes through map, combine, and partition; shuffle and sort aggregates values by key; reduce produces the final output.]

"Everything Else"
The execution framework handles everything else:
- Scheduling: assigns workers to map and reduce tasks.
- "Data distribution": moves processes to data.
- Synchronization: gathers, sorts, and shuffles intermediate data.
- Errors and faults: detects worker failures and restarts them.
You have limited control over data and execution flow: all algorithms must be expressed in m, r, c, p (map, reduce, combine, partition). You don't know:
- where mappers and reducers run,
- when a mapper or reducer begins or finishes,
- which input a particular mapper is processing,
- which intermediate key a particular reducer is processing.

Tools for Synchronization
- Cleverly constructed data structures: bring partial results together.
- Sort order of intermediate keys: control the order in which reducers process keys.
- Partitioner: control which reducer processes which keys.
- Preserving state in mappers and reducers: capture dependencies across multiple keys and values.

Preserving State
- Mapper object: one object per task, with state carried across calls; configure (API initialization hook), map (one call per input key-value pair), close (API cleanup hook).
- Reducer object: likewise configure / reduce (one call per intermediate key) / close, with state carried across calls.

Scalable Hadoop Algorithms: Themes
- Avoid object creation: an inherently costly operation that also drives garbage collection.
- Avoid buffering: heap size is limited; buffering works for small datasets, but won't scale!

Importance of Local Aggregation
Ideal scaling characteristics:
- Twice the data, twice the running time.
- Twice the resources, half the running time.
Why can't we achieve this? Synchronization requires communication, and communication kills performance. Thus: avoid communication! Reduce intermediate data via local aggregation; combiners can help.

[Figure: mappers feed partitioners, which route intermediate data to reducers (combiners omitted here); redrawn from a slide by Cloudera, cc-licensed.]

Word Count: Baseline
What's the impact of combiners?

Word Count: Version 1
Are combiners still needed?

Word Count: Version 2
Are combiners still needed? Key: preserve state across input key-value pairs!

Design Pattern for Local Aggregation
"In-mapper combining": fold the functionality of the combiner into the mapper by preserving state across multiple map calls.
- Advantages: speed. (Why is this faster than actual combiners?)
- Disadvantages: explicit memory management is required, and there is potential for order-dependent bugs.
A sketch of the pattern follows.
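To make the in-mapper combining pattern concrete, here is a minimal sketch in plain Python rather than the slides' listings or Hadoop's Java API. The class names, the emit callback, and the tiny local driver are illustrative assumptions; the map/close methods simply mirror the per-task lifecycle hooks described under Preserving State.

```python
from collections import defaultdict

class BaselineWordCountMapper:
    """Baseline word count: emit (term, 1) for every token; a separate
    combiner (or the reducer itself) would then aggregate map-side output."""
    def map(self, docid, doc, emit):
        for term in doc.split():
            emit(term, 1)

class InMapperCombiningWordCountMapper:
    """In-mapper combining: partial counts are held in a per-task dictionary
    (state preserved across map() calls) and emitted once, in close()."""
    def __init__(self):                    # plays the role of configure()
        self.counts = defaultdict(int)

    def map(self, docid, doc, emit):
        for term in doc.split():
            self.counts[term] += 1         # aggregate locally, emit nothing yet

    def close(self, emit):                 # plays the role of the cleanup hook
        for term, count in self.counts.items():
            emit(term, count)

def sum_reducer(term, values):
    """Reducer (also usable as a combiner here): sum the partial counts."""
    return term, sum(values)

if __name__ == "__main__":
    # Tiny local stand-in for map -> shuffle/sort -> reduce.
    docs = [("d1", "a rose is a rose"), ("d2", "a rose")]
    intermediate = defaultdict(list)
    emit = lambda k, v: intermediate[k].append(v)
    mapper = InMapperCombiningWordCountMapper()
    for docid, doc in docs:
        mapper.map(docid, doc, emit)
    mapper.close(emit)
    print(dict(sum_reducer(t, vs) for t, vs in intermediate.items()))
    # {'a': 3, 'rose': 3, 'is': 1}
```

Compared with an actual combiner, the in-mapper version never materializes the individual (term, 1) pairs, at the cost of managing the dictionary's memory footprint explicitly; this matches the advantages and disadvantages listed above.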
this?Synchronization requires communicationCommunication kills performanceThus… avoid communication!Reduce intermediate data via local aggregationCombiners can helpSource: redrawn from a slide by Cloduera, cc-licensedMapperMapper Mapper Mapper MapperPartitionerPartitioner Partitioner Partitioner PartitionerIntermediatesIntermediates Intermediates Intermediates IntermediatesReducerReducer ReduceIntermediatesIntermediates Intermediates(combiners omitted here)combinecombine combine combineba 1 2 c 9 a c5 2 b c7 8partition partition partition partitionmapmap map mapk1k2k3k4k5k6v1v2v3v4v5v6ba 1 2 c c3 6 a c5 2 b c7 8Shuffle and Sort: aggregate values by keysreduce reduce reducea 1 5 b 2 7 c 2 9 8r1s1r2s2r3s3Word Count: BaselineWhat’s the impact of combiners?Word Count: Version 1Are combiners still needed?Word Count: Version 2Are combiners still needed?Key: preserve state acrossinput key-value pairs!Design Pattern for Local Aggregation“In-mapper combining”Fold the functionality of the combiner into the mapper by preserving state across multiple map callsAdvantagesSpeedWhy is this faster than actual combiners?DisadvantagesExplicit memory management requiredPotential for order-dependent bugsCombiner DesignCombiners and reducers share same method signatureSometimes, reducers can serve as combinersOften, not…Remember: combiner are optional optimizationsShould not affect algorithm correctnessMay be run 0, 1, or multiple timesExample: find average of all integers associated with the same keyComputing the Mean: Version 1Why can’t we use reducer as combiner?Computing the Mean: Version 2Why doesn’t this work?Computing the Mean: Version 3Fixed?Computing the Mean: Version 4Are combiners still needed?Algorithm Design: Running ExampleTerm co-occurrence matrix for a text collectionM = N x N matrix (N = vocabulary size)Mij: number of times i and j co-occur in some context (for concreteness, let’s say context = sentence)Why?Distributional profiles as a way of measuring semantic distanceSemantic distance useful for many language processing tasksMapReduce: Large Counting ProblemsTerm co-occurrence matrix for a text collection= specific instance of a large counting problemA large event space (number of terms)A large number of observations (the collection itself)Goal: keep track of interesting statistics about the eventsBasic approachMappers generate partial countsReducers aggregate partial countsHow do we aggregate partial counts efficiently?First Try: “Pairs”Each mapper takes a sentence:Generate all co-occurring term pairsFor all pairs, emit (a, b) → countReducers sum up counts associated with these pairsUse combiners!Pairs:


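Returning to the Computing the Mean progression above: the reducer cannot double as the combiner, because an average of partial averages is not, in general, the overall average. A standard fix is to pass (sum, count) pairs through the combine step and divide only in the reducer. The sketch below is a reconstruction of that idea in plain Python with illustrative names; it is not the slides' Version 3 or Version 4 listing.

```python
from collections import defaultdict

def mean_mapper(key, value, emit):
    """Emit a (sum, count) pair per observation instead of the raw value,
    so partial results can be merged associatively."""
    emit(key, (value, 1))

def mean_combiner(key, pairs):
    """Combiner: merge partial (sum, count) pairs. Safe to run 0, 1, or many
    times, because its output has the same type as its input."""
    return key, (sum(p[0] for p in pairs), sum(p[1] for p in pairs))

def mean_reducer(key, pairs):
    """Reducer: merge the partial (sum, count) pairs, dividing once at the end."""
    total = sum(p[0] for p in pairs)
    count = sum(p[1] for p in pairs)
    return key, total / count

# Why the reducer itself cannot be the combiner:
# mean(mean(1, 2), mean(3, 4, 5)) = mean(1.5, 4.0) = 2.75, but
# mean(1, 2, 3, 4, 5) = 3.0; averaging partial averages loses the counts.

if __name__ == "__main__":
    records = [("k", 1), ("k", 2), ("k", 3), ("k", 4), ("k", 5)]
    grouped = defaultdict(list)
    for k, v in records:
        mean_mapper(k, v, lambda key, pair: grouped[key].append(pair))
    # Pretend the combiner ran on an arbitrary split of this key's values.
    combined = [mean_combiner("k", grouped["k"][:2])[1],
                mean_combiner("k", grouped["k"][2:])[1]]
    print(mean_reducer("k", combined))   # ('k', 3.0)
```

Because the combiner's output type matches its input type, the framework may run it zero, one, or many times without changing the result, which is exactly the correctness requirement stated under Combiner Design.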
