Hadoop/MapReduce Computing Paradigm
http://developer.yahoo.com/hadoop/tutorial/module4.html
Spring 2014
Taken from WPI, Mohamed Eltabakh

Large-Scale Data Analytics
- MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems
- Many enterprises are turning to Hadoop
  - Especially applications generating big data
  - Web applications, social networks, scientific applications

Why Hadoop is able to compete?
Hadoop:
- Scalability (petabytes of data, thousands of machines)
- Flexibility in accepting all data formats (no schema)
- Efficient and simple fault-tolerant mechanism
- Commodity inexpensive hardware
Database:
- Performance (tons of indexing, tuning, data organization tech.)
- Features:
  - Provenance tracking
  - Annotation management

What is Hadoop
- Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
  - Large datasets: terabytes or petabytes of data
  - Large clusters: hundreds or thousands of nodes
- Hadoop is an open-source implementation of Google's MapReduce
- Hadoop is based on a simple programming model called MapReduce
- Hadoop is based on a simple data model: any data will fit

Single Node Architecture
[Figure: a single machine with CPU, memory, and disk. Labels: Machine Learning, Statistics; "Classical" Data Mining.]

Slides from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Motivation: Google Example
- 20 billion web pages x 20 KB = 400 TB
- 1 computer reads 30-35 MB/sec from disk
  - About 4 months to read the web
- About 1,000 hard drives to store the web
- Takes even more to do something useful with the data
- Today, a standard architecture for such problems is emerging:
  - Cluster of commodity Linux nodes
  - Commodity network (Ethernet) to connect them

Cluster Architecture
- 2-10 Gbps backbone between racks
- 1 Gbps between any pair of nodes in a rack
- Each rack contains 16-64 nodes
- In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO
[Figure: racks of nodes, each node with CPU, memory, and disk; one switch per rack, with the rack switches joined by a backbone switch.]

Large-scale Computing
- Large-scale computing for data mining problems on commodity hardware
- Challenges:
  - How do you distribute computation?
  - How can we make it easy to write distributed programs?
  - Machines fail:
    - One server may stay up 3 years (1,000 days)
    - If you have 1,000 servers, expect to lose 1 per day
    - People estimated Google had 1M machines in 2011
      - 1,000 machines fail every day

Idea and Solution
- Issue: Copying data over a network takes time
- Idea:
  - Bring computation close to the data
  - Store files multiple times for reliability
- Map-reduce addresses these problems
  - Google's computational/data manipulation model
  - Elegant way to work with big data
  - Storage infrastructure: file system (Google: GFS, Hadoop: HDFS)
  - Programming model: Map-Reduce

Storage Infrastructure
- Problem: If nodes fail, how to store data persistently?
- Answer: Distributed File System
  - Provides global file namespace
  - Google GFS, Hadoop HDFS
- Typical usage pattern
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common

Distributed File System
- Chunk servers
  - File is split into contiguous chunks
  - Typically each chunk is 16-64 MB
  - Each chunk replicated (usually 2x or 3x)
  - Try to keep replicas in different racks
- Master node (a.k.a. Name Node in Hadoop's HDFS)
  - Stores metadata about where files are stored
  - Might be replicated
- Client library for file access
  - Talks to master to find chunk servers
  - Connects directly to chunk servers to access data

Distributed File System
- Reliable distributed file system
- Data kept in "chunks" spread across machines
- Each chunk replicated on different machines
  - Seamless recovery from disk or machine failure
- Bring computation directly to the data
- Chunk servers also serve as compute servers
[Figure: chunks C0, C1, C2, C3, C5, D0, D1 replicated across chunk servers 1, 2, 3, ..., N.]

What is Hadoop
- Hadoop framework consists of two main layers
  - Distributed file system (HDFS)
  - Execution engine (MapReduce)

Hadoop Master/Slave Architecture
- Hadoop is designed as a master-slave, shared-nothing architecture
  - Master node (single node)
  - Many slave nodes

Design Principles of Hadoop
- Need to process big data
- Need to parallelize computation across thousands of nodes
- Commodity hardware
  - Large number of low-end cheap machines working in parallel to solve a computing problem
- This is in contrast to Parallel DBs
  - Small number of high-end expensive machines

Commodity Clusters
- MapReduce is designed to efficiently process large volumes of data by connecting many commodity computers together to work in parallel
- A theoretical 1000-CPU machine would cost a very large amount of money, far more than 1000 single-CPU or 250 quad-core machines
- MapReduce ties smaller and more reasonably priced machines together into a single cost-effective commodity cluster

Design Principles of Hadoop
- Automatic parallelization and distribution
  - Hidden from the end user
- Fault tolerance and automatic recovery
  - Nodes/tasks will fail and will recover automatically
- Clean and simple programming abstraction
  - Users only provide two functions, "map" and "reduce"

Who Uses MapReduce/Hadoop
- Google: inventors of the MapReduce computing paradigm
- Yahoo: developing Hadoop, the open-source implementation of MapReduce
- IBM, Microsoft, Oracle
- Facebook, Amazon, AOL, Netflix
- Many others, plus universities and research labs

Hadoop: How it Works

Hadoop Architecture
- Distributed file system (HDFS)
- Execution engine (MapReduce)
- Master node (single node)
- Many slave nodes

Hadoop Distributed File System (HDFS)
- Centralized namenode
  - Maintains metadata info about files
- Many datanodes (1000s)
  - Store the actual data
  - Files are divided into blocks
  - Each block is replicated N times (default = 3)
[Figure: File F divided into blocks 1, 2, 3, 4, 5 (64 MB each).]

HDFS Architecture
[Figure: the Client (1) sends a filename to the NameNode, (2) receives block IDs and DataNodes, and (3) reads data from the DataNodes. A Secondary NameNode sits alongside the NameNode.]
- NameNode: maps a file to a file-id and a list of DataNodes
- DataNode: maps a block-id to a physical location on disk

Main Properties of HDFS
- Large: An HDFS instance may consist of thousands of server
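
The block layout in the HDFS slides above (files divided into 64 MB blocks, each block replicated N times, default 3) can be illustrated with a small sketch. This is a toy model only: the function names `split_into_blocks` and `place_replicas`, the datanode names, and the round-robin placement are assumptions made for illustration, not the HDFS API or its actual rack-aware placement policy.

```python
import math

BLOCK_SIZE_MB = 64   # block size shown on the slides
REPLICATION = 3      # default replication factor shown on the slides


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return (block_id, size_mb) pairs covering a file of the given size."""
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    blocks = []
    for block_id in range(n_blocks):
        remaining = file_size_mb - block_id * block_size_mb
        # Every block is full-size except possibly the last one.
        blocks.append((block_id, min(block_size_mb, remaining)))
    return blocks


def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes.

    Hypothetical round-robin policy, used here only to show that the
    replicas of one block land on different machines.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for block_id, _size in blocks:
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement


# A 300 MB file on a hypothetical four-datanode cluster.
blocks = split_into_blocks(300)
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])

for block_id, size in blocks:
    print(block_id, size, placement[block_id])
```

In this model the `placement` dictionary plays the role of the namenode's metadata (which datanodes hold which block), while the datanodes themselves would hold the block contents.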
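
The design principle that users only provide two functions, "map" and "reduce", can be made concrete with the usual word-count example. The sketch below is a single-process stand-in written for illustration: `run_mapreduce` is a hypothetical helper that imitates the grouping step a framework such as Hadoop would perform across many machines, and none of these names come from the Hadoop API.

```python
from collections import defaultdict


def map_fn(_key, line):
    """User-supplied map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User-supplied reduce: sum all the counts collected for one word."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy, in-memory stand-in for the framework (an assumption, not Hadoop).

    It applies map_fn to each record, groups the emitted values by key,
    then applies reduce_fn once per key.
    """
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


# Input records are (line_number, text) pairs.
lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The fox")]
result = run_mapreduce(lines, map_fn, reduce_fn)
print(result)
```

Only `map_fn` and `reduce_fn` are what a user would write; parallelization, distribution, and recovery from failed tasks are the parts the slides describe as hidden from the end user.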