MR Language Reference Manual

Siyang Dai (sd2694), Jinxiong Tan (jt2649), Zhi Zhang (zz2219), Zeyang Yu (zy2156), Shuai Yuan (sy2420)

1 Introduction
    1.1 Concept of MapReduce
    1.2 Data flow of MapReduce
    1.3 The MR Programming Language
    1.4 Input and Output of MR Program
2 Lexical Elements
    2.1 Tokens
    2.2 Constants
    2.3 Keywords
    2.4 Identifiers
    2.5 Operators
    2.6 Separators
    2.7 Comments
3 Data Types
    3.1 Int
    3.2 Double
    3.3 Boolean
    3.4 List
    3.5 Conversions
4 Program Structure
    4.1 Configuration
    4.2 Mapper Reducer Definition
    4.3 Scope
5 Expression
    5.1 Operators
    5.2 Primary Expression
    5.3 Unary Negative Operator
    5.4 Binop Operation
    5.5 Split Operation
    5.6 Assignment Expression
    5.7 Declaration Expression
6 Statements
    6.1 Expression Statement
    6.2 Block statement
    6.3 Emit Statement
    6.4 Conditional Statement
    6.5 Iteration Statement
7 Reference

1 Introduction

MapReduce is a programming paradigm that supports distributed computing on large data sets across clusters of computers. The paradigm is inspired by the map and reduce functions commonly used in functional programming. The MR programming language is designed specifically for MapReduce.

1.1 Concept of MapReduce

1.1.1 List Processing

Essentially, a MapReduce program converts lists of input data elements into lists of output data elements. The transformation is done in two phases: map and reduce.

1.1.2 Map

The first phase of a MapReduce program is called mapping. A list of data pairs is provided, one pair at a time, to a function called the Mapper, which transforms each input element individually into an output data element. Logically, a map function has the following form:

    Map(k1, v1) -> list(k2, v2)

Figure 1: Map [1]

After that, all pairs with the same key, from all lists generated by the map function, are grouped together, creating one group for each distinct generated key. These groups are the input of the next phase.

1.1.3 Reduce

Reducing lets you aggregate values together. A reduce function receives a list of values with the same key and combines these values. Logically, a reduce function has the following form:

    Reduce(k2, list(v2)) -> (k3, v3)

Figure 2: Reduce

[1] Figures 1, 2, and 3 are from the Hadoop Tutorial on Yahoo Developer Network.

As a result, we get one (k, v) pair for each distinct key generated by the map function.

1.2 Data flow of MapReduce

Combining map and reduce gives the following overview of the data flow of a MapReduce program on a cluster consisting of three nodes.

Figure 3: MapReduce

1.3 The MR Programming Language

MR is designed to support the MapReduce paradigm. It hides the details of the MapReduce framework from programmers. All a programmer needs to do is define a map function and a reduce function. The program is then run according to the data flow of MapReduce.

1.4 Input and Output of MR Program

An MR program takes two arguments from the command line. The first is the input directory, and the second is the output directory.

1.4.1 Input

All files under the input directory are used as input files. MR treats each line of each input file as a separate record and performs no parsing. It feeds the map function with the byte offset of the line as the key and the line content as the value. Therefore, for the map function, k1 is always an integer and v1 is always one line of text.

1.4.2 Output

The output directory must not exist before the MR program runs; the MR program creates it automatically. The output of the reduce function is written to files under the output directory, in the form key\tvalue, one pair per line.

2 Lexical Elements

2.1 Tokens

There are five kinds of tokens in MR: literals, keywords, identifiers, operators, and other separators. Blanks, newlines, and comments are ignored during lexical analysis, except that they separate tokens.

2.2 Constants

2.2.1 Text Constant

A Text constant is a string containing a sequence of characters surrounded by a pair of double quotes, i.e. "". For example, "hello world" is a Text constant. Identical Text constants are the same. All Text literals are immutable. Note that MR has no character type. Even a single character is of Text constant type, which can be regarded as an extended character set.

2.2.2 Int Constant

An Int constant is an integer consisting of a sequence of digits. Both signed and unsigned integers are supported. An Int constant cannot start with a 0 (digit zero). All integers are decimal (base 10) by default. For example, 15 and 2012 are valid Int constants.

2.2.3 Double Constant

In MR, a Double constant is a floating-point constant that consists of an integer part, a decimal point, and a fraction part. In addition, it supports an e followed by an optionally signed integer exponent. The integer part and the fraction part can each be one digit or a sequence of digits. Either of them can be missing, but not both. Also, either the decimal point or the e and the exponent (not both) may be missing. The following are valid Double constants: 1 or 0.5e15 or 3e 3 or 2 or 1e5

2.3 Keywords

The following words are reserved as keywords and cannot be used otherwise:

    Text Int Double Boolean List def if else foreach emit and or Mapper Reducer split by true false

2.4 Identifiers

Identifiers are used for naming variables, parameters, and functions. An identifier consists of a sequence of letters, digits, and the underscore, but it must start with a letter. An identifier must not be one of the keywords listed above. Identifiers are case sensitive.

2.5 Operators

An operator is a special token that performs an operation, such as addition or subtraction, on either one or two operands. More details are covered in a later section.

2.6 Separators

A separator separates tokens. Other separators (blanks, newlines, and comments) are ignored during lexical analysis, except the following:

2.7 Comments

// is used to indicate that the rest of the line is a comment (C++/Java style comment).

3 Data Types

3.1 Int

The 64-bit Int data type can hold integer values in the range of -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.

3.2 Double

The Double type covers a range from 4.94065645841246544e-324d to 1.79769313486231570e+308d (positive or negative).

3.3 Boolean

A variable of type Boolean may take on the values true and false only.

3.4 List

It is used as List T; for example, List Int represents a list of Int values. It has unlimited size.

3.5 Conversions

When a value of Double type is converted to Int type, the fractional part is discarded. When a value of integral type is converted to Double type and the value is not exactly representable, then the result may be either the next higher or next …
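
The map, group, and reduce phases of Section 1.1, together with the input and output conventions of Section 1.4, can be sketched in Python. This is a single-process illustration under stated assumptions, not MR syntax and not the MR implementation: the word-count mapper and reducer and every function name below (read_records, word_count_map, word_count_reduce, run_mapreduce) are hypothetical, and sorting the keys is only a choice made here to get deterministic output.

```python
from collections import defaultdict


def read_records(text):
    """Yield (byte_offset, line) pairs, one per input line (Section 1.4.1)."""
    offset = 0
    for raw in text.encode("utf-8").splitlines(keepends=True):
        yield offset, raw.decode("utf-8").rstrip("\r\n")
        offset += len(raw)  # advance by the line's length in bytes


def word_count_map(key, value):
    """Hypothetical mapper: Map(k1, v1) -> list(k2, v2). The offset key is unused."""
    return [(word, 1) for word in value.split()]


def word_count_reduce(key, values):
    """Hypothetical reducer: Reduce(k2, list(v2)) -> (k3, v3)."""
    return key, sum(values)


def run_mapreduce(text, mapper, reducer):
    """Run map, group-by-key, and reduce in one process; return output lines."""
    groups = defaultdict(list)
    for k1, v1 in read_records(text):
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)  # one group per distinct generated key
    # Sorting is an assumption of this sketch, used only for stable output.
    results = [reducer(k2, values) for k2, values in sorted(groups.items())]
    return ["{}\t{}".format(k3, v3) for k3, v3 in results]


if __name__ == "__main__":
    sample = "to be or not\nto be\n"
    for line in run_mapreduce(sample, word_count_map, word_count_reduce):
        print(line)
```

Run as a script, the sketch prints one key\tvalue line per distinct word in the two-line sample. The mapper receives the byte offset as k1 and the line text as v1, which mirrors the input convention described in Section 1.4.1.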


Columbia COMS W4115 - MR Language Reference Manual
