UT Dallas CS 6350 - 04. HadoopRevised2015Slide

Spring 2014. Taken from: WPI, Mohamed Eltabakh. Additional material from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets (http://www.mmds.org), and from http://developer.yahoo.com/hadoop/tutorial/module4.html.

Outline
• Large-Scale Data Analytics
• Why Hadoop is able to compete?
• What is Hadoop
• Single Node Architecture
• Motivation: Google Example
• Cluster Architecture
• Slide 8
• Large-scale Computing
• Idea and Solution
• Storage Infrastructure
• Distributed File System (two slides)
• What is Hadoop
• Hadoop Master/Slave Architecture
• Design Principles of Hadoop
• Commodity Clusters
• Design Principles of Hadoop
• Who Uses MapReduce/Hadoop
• Hadoop: How it Works
• Hadoop Architecture
• Hadoop Distributed File System (HDFS)
• Slide 23
• Main Properties of HDFS
• Map-Reduce Execution Engine (Example: Color Count)
• Properties of MapReduce Engine
• Properties of MapReduce Engine (Cont'd)
• Isolated Tasks
• Key-Value Pairs
• Programming Model: MapReduce
• Task: Word Count
• MapReduce: Overview
• MapReduce: The Map Step
• MapReduce: The Reduce Step
• More Specifically
• MapReduce: Word Counting
• Word Count Using MapReduce
• Example: Word Count
• Map-Reduce: Environment
• Map-Reduce: A Diagram
• Map-Reduce: In Parallel
• Map-Reduce
• Data Flow
• Coordination: Master
• Refinement: Combiners (two slides)
• Combiner Functions
• Slide 48
• Shuffle and Sort in Hadoop
• Partitions
• Refinement: Partition Function
• Example 2: Color Count
• Example 3: Color Filter
• Task Scheduling in MapReduce
• Map and Reduce Task Scheduling
• Job Scheduling in MapReduce
• Execution Overview
• Dealing with Failures
• How many Map and Reduce jobs?
• Fault Tolerance in Hadoop
• Worker Failure
• Master Failure
• Fault Tolerance / Workers
• Refinements: Backup Tasks
• Speculative Execution
• Locating Stragglers
• Task Granularity & Pipelining
• Bigger Picture: Hadoop vs. Other Systems
• Hadoop/MapReduce Computing Paradigm

Large-Scale Data Analytics
• MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems
• Many enterprises are turning to Hadoop, especially for applications generating big data: web applications, social networks, scientific applications

Why Hadoop is able to compete?
• On Hadoop's side:
  - Scalability (petabytes of data, thousands of machines)
  - Flexibility in accepting all data formats (no schema)
  - Commodity, inexpensive hardware
  - Efficient and simple fault-tolerance mechanism
• On the databases' side:
  - Performance (extensive indexing, tuning, and data-organization techniques)
  - Features: provenance tracking, annotation management, …

What is Hadoop
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
  - Large datasets: terabytes or petabytes of data
  - Large clusters: hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's MapReduce
• Hadoop is based on a simple programming model called MapReduce (a word-count sketch of this model follows the Cluster Architecture slide below)
• Hadoop is based on a simple data model: any data will fit

Single Node Architecture
[Figure: a single node with CPU, memory, and disk — the setting for machine learning, statistics, and "classical" data mining]

Motivation: Google Example
• 20+ billion web pages × 20 KB = 400+ TB
• One computer reads 30-35 MB/sec from disk, so reading the web takes ~4 months (400 TB ÷ 35 MB/s ≈ 11.4 million seconds ≈ 132 days)
• ~1,000 hard drives just to store the web
• It takes even more to do something useful with the data!
• Today, a standard architecture for such problems is emerging:
  - Cluster of commodity Linux nodes
  - Commodity network (Ethernet) to connect them

Cluster Architecture
[Figure: racks of nodes, each node with its own CPU, memory, and disk, joined by per-rack switches and a backbone switch]
• Each rack contains 16-64 nodes
• 1 Gbps bandwidth between any pair of nodes in a rack; 2-10 Gbps backbone between racks
• In 2011 it was guesstimated that Google had 1M machines (http://bit.ly/Shh0RO)
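Because the deck's word-count slides are not part of this preview, here is a minimal sketch of that canonical job written against the standard Hadoop Java MapReduce API, to make the programming model concrete before the slides that follow. It is illustrative only: the class names and the combiner choice are assumptions, not taken from the slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: for each input line, emit (word, 1) for every word in it.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the shuffle groups all counts for the same word; sum them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional; see the Combiners slides
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The setCombinerClass line previews the "Refinement: Combiners" slides listed in the outline: because summing counts is associative and commutative, the reducer can also run on the map side to pre-aggregate counts before the shuffle.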
Large-scale Computing
• Large-scale computing for data mining problems on commodity hardware
• Challenges:
  - How do you distribute computation?
  - How can we make it easy to write distributed programs?
  - Machines fail:
    - One server may stay up 3 years (~1,000 days)
    - If you have 1,000 servers, expect to lose one per day
    - People estimated Google had ~1M machines in 2011, so ~1,000 machines fail every day!

Idea and Solution
• Issue: copying data over a network takes time
• Idea:
  - Bring computation close to the data
  - Store files multiple times for reliability
• Map-reduce addresses these problems
  - Google's computational/data-manipulation model
  - Elegant way to work with big data
  - Storage infrastructure: a file system (Google: GFS; Hadoop: HDFS)
  - Programming model: Map-Reduce

Storage Infrastructure
• Problem: if nodes fail, how do we store data persistently?
• Answer: a distributed file system
  - Provides a global file namespace
  - Google GFS; Hadoop HDFS
• Typical usage pattern:
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common

Distributed File System
• Chunk servers
  - A file is split into contiguous chunks
  - Typically each chunk is 16-64 MB
  - Each chunk is replicated (usually 2x or 3x)
  - Try to keep replicas in different racks
• Master node
  - a.k.a. Name Node in Hadoop's HDFS
  - Stores metadata about where files are stored
  - Might be replicated
• Client library for file access
  - Talks to the master to find chunk servers
  - Connects directly to chunk servers to access data (see the client sketch at the end of this section)

Distributed File System
• Reliable distributed file system
• Data kept in "chunks" spread across machines
• Each chunk replicated on different machines, giving seamless recovery from disk or machine failure
[Figure: chunks C0-C5 of files D0 and D1 replicated across chunk servers 1 through N]
• Bring computation directly to the data! Chunk servers also serve as compute servers.

What is Hadoop
• The Hadoop framework consists of two main layers:
  - Distributed file system (HDFS)
  - Execution engine (MapReduce)

Hadoop Master/Slave Architecture
• Hadoop is designed as a master-slave, shared-nothing architecture: a single master node and many slave nodes

Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity …
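To ground the "client library for file access" bullets above, here is a minimal sketch of reading a file through Hadoop's standard FileSystem API. The NameNode URI and file path are hypothetical placeholders; only the API calls themselves (FileSystem.get, fs.open) are the real ones.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Connect to the master (the HDFS NameNode) — hypothetical address.
    // The client asks it only for metadata: which chunk servers (DataNodes)
    // hold each block of the file.
    FileSystem fs = FileSystem.get(
        URI.create("hdfs://namenode.example.com:9000"), conf);

    // open() returns a stream that fetches block data directly from the
    // DataNodes; the NameNode is not on the data path.
    Path file = new Path("/data/sample.txt"); // hypothetical path
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}

This mirrors the read path on the slide: one metadata round trip to the master, then bulk data transfer straight from the chunk servers that hold the replicas.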

