DOC PREVIEW
UT Dallas CS 6350 - mapreducejoins-140222133821-phpapp02

This preview shows page 1-2-15-16-17-32-33 out of 33 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 33 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Slide 1Slide 2Slide 3Slide 4Slide 5Slide 6Slide 7Slide 8Slide 9Slide 10Slide 11Slide 12Slide 13Slide 14Slide 15Slide 16Slide 17Slide 18Slide 19Slide 20Slide 21Slide 22Slide 23Slide 24Slide 25Slide 26Slide 27Slide 28Slide 29Slide 30Slide 31Slide 32Slide 33MapReduce JoinsShalish.V.JA Refresher on JoinsA join is an operation that combines records from two or more data sets based on a field or set of fields, known as the foreign keyThe foreign key is the field in a relational table that matches the column of another tableSample Data Sets : A & BTwo data sets A and B, with the foreign key defined as fInner JoinOuter Joins : Left & RightFull Outer & Anti JoinsReduce Side JoinJoin large multiple data sets together by some foreign keyCan be used to execute any of the types of joinsNo limitation on the size of data setsRequire a large amount of network bandwidthReduce Side Join : StructureMapper prepares the join operation•Takes each input record from each of the data sets•Extracts foreign key from each record•output key : foreign key•output value : entire input record which is flagged by some unique identifier for the data setReducer performs the desired join operation by collecting the values of each input group into temporary lists•lists are then iterated over and the records from both sets are joined together•output is a number of part files equivalent to the number of reduce tasksReduce Side Join : StructureSample Data Sets : A & BTwo data sets A and B, with the foreign key defined as fReduce Side Join : Driver CodeMultipleInputs data types : allows to create a mapper class and input format fordifferent data sourcesReduce Side Join : MappersEach mapper class outputs the user ID as the foreign key and entire record as the value along with a single character to flag which record came from what setUserJoinMapperCommentJoinMapperReduce Side Join : Reducer Reducer copies all values for each group in memory, keeping track of which record came from what data setReduce Side Join : Reducer – Inner Join For an inner join, a joined record is output if all the lists are not emptyReduce Side Join : Reducer – Left Outer Join If the right list is not empty, join A with B. If the right list is empty, output each record of A with an empty string.Reduce Side Join : Reducer – Right Outer Join If the left list is not empty, join A with B. If the left list is empty, output each record of A with an empty string.Reduce Side Join : Reducer – Full Outer Join If list A is not empty, then for every element in A, join with B when the B list is not empty, or output A by itself. If A is empty, then just output B.Reduce Side Join : Reducer – AntiJoin For an antijoin, if at least one of the lists is empty, output the records from the nonempty list with an empty Text object.Performance AnalysisA plain reduce side join puts a lot of strain on the cluster’s networkReplicated JoinJoin operation between one large and many small data sets that can be performed on the map-sideCompletely eliminates the need to shuffle any data to the reduce phaseAll the data sets except the very large one are essentially read into memory during the setup phase of each map task, which is limited by the JVM heapJoin is done entirely in the map phase, with the very large data set being the input for the MapReduce job.Restriction : a replicated join is really useful only for an inner or a left outer join where the large data set is the “left” data setOutput is a number of part files equivalent to the number of map tasksReplicated Join : ApplicabilityThe type of join to execute is an inner join or a left outer join, with the large input data set being the “left” part of the operationAll of the data sets, except for the large one, can be fit into main memory of each map taskReplicated Join : StructureThe mapper is responsible for reading all files from the distributed cache during the setup phase and storing them into in-memory lookup tablesMapper processes each record and joins it with all the data stored in-memoryReplicated Join : StructureReplicated user-comment exampleSmall set of user information and a large set of commentjoin operation between one large and many data sets performed on the map-sideseliminates the need to shuffle any data to the reduce phaseuseful only for an inner or a left outer join where the large data set is the “left” data setReplicated Join : Mapper : setup All the data sets except the very large one are read into memory during the setup phase of each map task which is limited by the JVM heapuser ID is pulled out of the record.user ID and record are added to a HashMap for retrievalin the map methodReplicated Join : Mapper : mapconsecutive calls to the map method are performed. For each input record,the user ID is pulled from the comment. This user ID is then used to retrieve a value from the HashMap If a value is found, the input value is output along with the retrieved value. If a value is not found, but a left outer join is being executedComposite JoinJoin operation that can be performed on the map-side with many very large formatted inputsEliminates the need to shuffle and sort all the data to the reduce phaseData sets must first be sorted by foreign key, partitioned by foreign keyHadoop has built in support for a composite join using the CompositeInputFormat.This join utility is restricted to only inner and full outer joinsThe inputs for each mapper must be partitioned and sorted in a specific way, and each input dataset must be divided into the same number of partitions.All the records for a particular foreign key must be in the same partitionComposite JoinDriver code handles most of the work in the job configuration stageIt sets up the type of input format used to parse the data sets, as well asthe join type to executeThe framework then handles executing the actual join when the data is readMapper is very trivial. The two values are retrieved from the input tuple andsimply output to the file systemOutput is a number of part files equivalent to the number of map tasksPerformance AnalysisComposite join can be executed relatively quickly over large data setsData preparation needs to taken into account in the performance of thisanalyticComposite user comment joinuser and comment data sets have been preprocessed by MapReduce and output using the TextOutputFormatkey of each data set is the user ID, and the value is the user


View Full Document

UT Dallas CS 6350 - mapreducejoins-140222133821-phpapp02

Documents in this Course
HW3

HW3

5 pages

NOSQL-CAP

NOSQL-CAP

23 pages

BigTable

BigTable

39 pages

HW3

HW3

5 pages

Load more
Download mapreducejoins-140222133821-phpapp02
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view mapreducejoins-140222133821-phpapp02 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view mapreducejoins-140222133821-phpapp02 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?