Wright CS 707 - Programming Hadoop Map-Reduce


Programming Hadoop Map/Reduce
Programming, Tuning & Debugging
Arun C Murthy, Yahoo! CCDI (acm@yahoo-inc.com)
ApacheCon US 2008

Existential angst: Who am I?
- Yahoo! Grid Team (CCDI)
- Apache Hadoop developer since April 2006
  - Core committer (Map/Reduce)
  - Member of the Hadoop PMC

Hadoop Overview
- Hadoop includes:
  - Distributed File System — distributes data
  - Map/Reduce — distributes application
- Open source, from Apache
- Written in Java
- Runs on Linux, Mac OS/X, Windows, and Solaris
- Commodity hardware

Distributed File System
- Designed to store large files
- Stores files as large blocks (64 to 128 MB)
- Each block stored on multiple servers
- Data is automatically re-replicated as needed
- Accessed from the command line, the Java API, or the C API:

    bin/hadoop fs -put my-file hdfs://node1:50070/foo/bar

    Path p = new Path("hdfs://node1:50070/foo/bar");
    FileSystem fs = p.getFileSystem(conf);
    DataOutputStream file = fs.create(p);
    file.writeUTF("hello\n");
    file.close();

Map/Reduce
- Map/Reduce is a programming model for efficient distributed computing
- It works like a Unix pipeline:
    cat input | grep | sort | uniq -c | cat > output
    Input  |  Map  |  Shuffle & Sort  |  Reduce  |  Output
- Efficiency from:
  - Streaming through data, reducing seeks
  - Pipelining
- A good fit for a lot of applications:
  - Log processing
  - Web index building

Map/Reduce features
- Fine-grained Map and Reduce tasks
  - Improved load balancing
  - Faster recovery from failed tasks
- Automatic re-execution on failure
  - In a large cluster, some nodes are always slow or flaky
  - This introduces long tails or failures in computation
  - The framework re-executes failed tasks
- Locality optimizations
  - With big data, bandwidth to the data is a problem
  - Map/Reduce + HDFS is a very effective solution
  - Map/Reduce queries HDFS for the locations of the input data
  - Map tasks are scheduled local to their inputs when possible

Mappers and Reducers
- Every Map/Reduce program must specify a Mapper and, typically, a Reducer
- The Mapper has a map method that transforms input (key, value) pairs into any number of intermediate (key, value) pairs
- The Reducer has a reduce method that transforms intermediate (key, value*) aggregates into any number of output (key, value) pairs

Map/Reduce Dataflow

Example
- "45% of all Hadoop tutorials count words. 25% count sentences. 20% are about paragraphs. 10% are log parsers. The remainder are helpful."
  — jandersen, http://twitter.com/jandersen/statuses/926856631

Example: Wordcount Mapper

    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }

Example: Wordcount Reducer

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

Input and Output Formats
- A Map/Reduce may specify how its input is to be read by specifying an InputFormat to be used (InputSplit, RecordReader)
- A Map/Reduce may specify how its output is to be written by specifying an OutputFormat to be used
- These default to TextInputFormat and TextOutputFormat, which process line-based text data
- SequenceFile: SequenceFileInputFormat and SequenceFileOutputFormat
- These are file-based, but they are not required to be

Configuring a Job
- Jobs are controlled by configuring a JobConf
- JobConfs are maps from attribute names to string values
- The framework defines attributes to control how the job is executed:
    conf.set("mapred.job.name", "MyApp");
- Applications can add arbitrary values to the JobConf:
    conf.set("my.string", "foo");
    conf.setInt("my.integer", 12);
- The JobConf is available to all of the tasks

Putting it all together
- Create a launching program for your application
- The launching program configures:
  - The Mapper and Reducer to use
  - The output key and value types (input types are inferred from the InputFormat)
  - The locations of your input and output
  - Optionally, the InputFormat and OutputFormat to use
- The launching program then submits the job and typically waits for it to complete

    public class WordCount {
      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        // the keys are words (strings)
        conf.setOutputKeyClass(Text.class);
        // the values are counts (ints)
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(MapClass.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

Non-Java Interfaces
- Streaming
- Pipes (C++)
- Pig
- Hive
- Jaql
- Cascading

Streaming
- What about Unix hacks?
  - Can define the Mapper and Reducer using Unix text filters
  - Typically use grep, sed, python, or perl scripts
- Format for input and output is: key \t value \n
- Allows for easy debugging and experimentation
- Slower than Java programs

    bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir \
        -mapper streamingMapper.sh -reducer streamingReducer.sh

- Mapper: /bin/sed -e 's| |\n|g' | /bin/grep .
- Reducer: /usr/bin/uniq -c | /bin/awk '{print $2 "\t" $1}'

Pipes (C++)
- C++ API and library to link the application with
- The C++ application is launched as a sub-process of the Java task
- Keys and values are std::string, with binary data
- Word count map looks like:

    class WordCountMap : public HadoopPipes::Mapper {
    public:
      WordCountMap(HadoopPipes::TaskContext& context) {}
      void map(HadoopPipes::MapContext& context) {
        std::vector<std::string> words =
          HadoopUtils::splitString(context.getInputValue(), " ");
        for (unsigned int i = 0; i < words.size(); ++i) {
          context.emit(words[i], "1");
        }
      }
    };

Pipes (C++)
- The reducer looks like:

    class WordCountReduce : public HadoopPipes::Reducer {
    public:
      WordCountReduce(HadoopPipes::TaskContext& context) {}
      void reduce(HadoopPipes::ReduceContext& context) {
        int sum = 0;
        while (context.nextValue()) {
          sum += HadoopUtils::toInt(context.getInputValue());
        }
        context.emit(context.getInputKey(), HadoopUtils::toString(sum));
      }
    };

Pipes (C++)
- And define a main function to invoke the tasks:

    int main(int argc, char *argv[]) {
      return HadoopPipes::runTask(
        HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce,
                                     void, WordCountReduce>());
    }

Pig
- Hadoop sub-project
- Scripting language that generates Map/Reduce jobs
- The user uses higher-level operations:
  - Group by
  - Foreach
- Word count:

    input = LOAD 'in-dir' USING TextLoader();
    words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
    grouped = GROUP words BY $0;
    counts = FOREACH grouped GENERATE group, COUNT(words);
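The Input → Map → Shuffle & Sort → Reduce → Output pipeline described above can be sketched without Hadoop at all. The following plain-Java toy (the class `DataflowSketch` and its method names are illustrative, not Hadoop API) runs the same word-count logic as the slides' example, with each phase as a separate step:

```java
import java.util.*;

// A minimal, Hadoop-free sketch of the Map -> Shuffle/Sort -> Reduce
// dataflow, using plain Java collections. Illustrative only.
public class DataflowSketch {

    // "Map" phase: emit a (word, 1) pair for every token of every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // "Shuffle & Sort" phase: group intermediate pairs by key,
    // with keys in sorted order (what the framework does between phases).
    static SortedMap<String, List<Integer>> shuffle(
            List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                   .add(p.getValue());
        }
        return grouped;
    }

    // "Reduce" phase: sum the grouped values for each key.
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        grouped.forEach((k, vs) -> {
            int sum = 0;
            for (int v : vs) sum += v;
            out.put(k, sum);
        });
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog", "the fox");
        Map<String, Integer> counts = reduce(shuffle(map(input)));
        System.out.println(counts); // {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
    }
}
```

In the real framework the three phases run on different machines and the shuffle moves data over the network, but the contract per phase is the same as in this sketch.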
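To make the TextInputFormat default concrete: the wordcount Mapper above receives a LongWritable key and a Text value because TextInputFormat presents each line as a record whose key is the line's byte offset in the file and whose value is the line's text. This hypothetical helper (`LineRecordSketch` is not Hadoop code) mimics that record model for an in-memory string, assuming single-byte characters:

```java
import java.util.*;

// Illustrative sketch (not Hadoop API) of the (offset, line) records
// that TextInputFormat hands to a Mapper.
public class LineRecordSketch {

    // Split raw text into (byte-offset, line) records, mimicking the
    // LongWritable/Text pairs a mapper receives. Assumes ASCII text.
    static LinkedHashMap<Long, String> records(String text) {
        LinkedHashMap<Long, String> recs = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n", -1)) {
            if (offset < text.length()) recs.put(offset, line);
            offset += line.length() + 1; // +1 for the '\n' delimiter
        }
        return recs;
    }

    public static void main(String[] args) {
        // "world" starts at byte 6, after "hello" (5 bytes) plus '\n'.
        System.out.println(records("hello\nworld")); // {0=hello, 6=world}
    }
}
```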
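One practical detail of the key \t value \n streaming contract mentioned above: a streaming reducer reads the already-sorted lines on stdin and must detect runs of equal keys itself, which is exactly why the slide's shell reducer is uniq -c plus awk. A rough sketch of that run-detection logic in plain Java (the class name is made up for illustration):

```java
import java.util.*;

// Sketch of what a streaming-style reducer does with its stdin:
// sorted "key\tvalue" lines in, one "key\tsum" line out per run of
// identical keys. Not part of Hadoop; for illustration only.
public class StreamingReduceSketch {

    static List<String> reduceSorted(List<String> sortedLines) {
        List<String> out = new ArrayList<>();
        String current = null;
        int sum = 0;
        for (String line : sortedLines) {
            String[] kv = line.split("\t", 2);
            if (current != null && !current.equals(kv[0])) {
                out.add(current + "\t" + sum); // key changed: flush the run
                sum = 0;
            }
            current = kv[0];
            sum += Integer.parseInt(kv[1]);
        }
        if (current != null) out.add(current + "\t" + sum); // flush last run
        return out;
    }

    public static void main(String[] args) {
        // Input is sorted by key, as the framework guarantees.
        List<String> in = List.of("cat\t1", "cat\t1", "dog\t1");
        System.out.println(reduceSorted(in)); // one "key\tcount" entry per key
    }
}
```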

