Developing a Simple MapReduce Program for Hadoop

The University of Texas at Dallas, Big Data Course CS6350
Professor: Dr. Latifur Khan
TA: Gbadebo Ayoade (gga110020@utdallas.edu)
Release Date: Spring 2015
Content courtesy of Mohammad Ridwanur Rahman and Mohammad Ali Ghaderi. Revised by Gbadebo Ayoade.

Introduction

The purpose of this document is to help students who are not familiar with Hadoop to develop their first MapReduce program for Hadoop.

So far, from HW 0, we have a Hadoop cluster on our machine and we know how to run a jar. The next questions are: how do we write a MapReduce program, and how do we get the jar of that program? We will demonstrate both and explain the WordCount example code.

The process

We assume that you already have Hadoop on your own machine and that you are ready to develop your first Hadoop program. This document is based on Ubuntu 14.04 and Hadoop 2.6.0. In the following, we discuss the steps in detail.

1. Preparing the IDE

Hadoop programs are Java programs, so you may use any Java IDE, such as Eclipse, NetBeans, or IntelliJ IDEA, to develop your MapReduce program. We are going to use Eclipse in this document. If you already have Eclipse on your machine, you can skip this section. To install Eclipse, run this command in the shell:

sudo apt-get install eclipse

Wait for it to be downloaded, then run the environment with the eclipse command:

eclipse

The default workspace should be fine. Click on OK, then go to Workbench.

2. New Java Project

Hadoop projects are simple Java projects. Create a new Java project, write the project name as MyWordCount, and click on Finish to create the project.

3. Creating the main file

Create a new file named WordCount.java and write the following lines there:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text(); // type of the output key

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // split the input line into tokens
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken()); // set word to each input keyword
                context.write(word, one);  // emit the pair <keyword, 1>
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0; // initialize the sum for each keyword
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit the pair <keyword, number of occurrences>
        }
    }

    // Driver program
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // get all command-line args
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: WordCount <in> <out>");
            System.exit(2);
        }

        // create a job with the name "wordcount"
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // uncomment the following line to add the Combiner
        // job.setCombinerClass(Reduce.class);

        // set the output key type
        job.setOutputKeyClass(Text.class);
        // set the output value type
        job.setOutputValueClass(IntWritable.class);

        // set the HDFS path of the input data
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        // set the HDFS path for the output
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        // wait till job completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
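To make the data flow concrete, here is a small worked example (the input text is just an illustration). Suppose the input file contains the two lines:

Hello Hadoop
Hello World

For each line, map() tokenizes the text and emits one <keyword, 1> pair per token: <Hello, 1>, <Hadoop, 1>, <Hello, 1>, <World, 1>. The framework then sorts and groups these pairs by key, so reduce() is called with <Hadoop, [1]>, <Hello, [1, 1]>, and <World, [1]>, and the final output is:

Hadoop	1
Hello	2
World	1

If you uncomment the Combiner line in the driver, the same Reduce class also runs locally on each mapper's output before the shuffle, which reduces the data sent over the network. Reusing the reducer as a combiner is safe here because summing counts is associative and commutative.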
4. Download Hadoop to your development machine

Please download Hadoop to your development machine. This is required to get the dependent jar files for Hadoop compilation:

http://mirror.tcpdiag.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

5. Adding Hadoop references (very important)

In order to compile Hadoop projects, you need to add the Hadoop libraries as references to your project. Right-click on the project, select Build Path > Configure Build Path, and select the Libraries tab. Click on Add External JARs, find hadoop-mapreduce-client-core-2.6.0.jar in the <your hadoop folder>/share/hadoop/mapreduce folder, and add it. Click on Add External JARs again, find hadoop-common-2.6.0.jar in the <your hadoop folder>/share/hadoop/common folder, and add it. You also need to add commons-cli-1.2.jar from the <your hadoop folder>/share/hadoop/common/lib folder. Your build path configuration should now be similar to this screen.

6. Creating the JAR file for Hadoop

All you need to do now is to create the JAR file and run it in Hadoop. Right-click on the project and choose Export. Then use the Browse button in front of the JAR file label to specify the name of the export file. For example, you may use /home/user/WordCountSample/wordcount.jar (you can use any other path). There should now be two files inside the WordCountSample folder.

7. Executing the example in Hadoop

Start the Hortonworks VM as shown in the installation document and ensure the VM is properly started. Get the IP of the VM following the steps in the installation document. I am using the NAT configuration for the VM network, so my IP is 127.0.0.1.

7.1 Copy wordcount.jar to the Hortonworks Hadoop sandbox VM

Change directory to where you exported wordcount.jar and use the scp command to copy the jar file from your development machine to the Hortonworks VM. Windows users can download WinSCP to load files onto the Hortonworks VM.

scp -P 2222 wordcount.jar hue@127.0.0.1:

NB: use user hue, password hadoop.

7.2 Log in to the Hortonworks VM directly, or use ssh or PuTTY

The user should be hue. To ssh from your development machine, use the command below (password: hadoop):

ssh hue@127.0.0.1 -p 2222

Removing old folders

If you run the program again without deleting the previous output folder, you will get this error:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /user/hue/output already exists

So let's delete the output directory if one already exists. You do not need to delete the input directory, but the command is shown here as well:

hdfs dfs -rmr output
hdfs dfs -rmr input
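With any old output directory removed, you can run the job. The following is a minimal sketch; it assumes wordcount.jar sits in hue's home directory on the VM, that the main class is WordCount as in the code above, and that mytext.txt is a hypothetical local text file you want to count words in:

# upload a local text file into an HDFS input directory (file name is illustrative)
hdfs dfs -mkdir -p input
hdfs dfs -put mytext.txt input

# run the job; the two remaining arguments are the HDFS input and output paths
hadoop jar wordcount.jar WordCount input output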
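Once the job completes, the results are written to the output directory in HDFS, with one file per reducer; part-r-00000 is the conventional name of the first reducer's output file. You can inspect the results like this:

hdfs dfs -ls output
hdfs dfs -cat output/part-r-00000

Each line of the output file is one <keyword, count> pair, tab-separated, exactly as written by context.write() in the reducer.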