DOC PREVIEW
Berkeley COMPSCI C267 - Cloud Computing with MapReduce and Hadoop

This preview shows page 1-2-3-4-30-31-32-33-34-62-63-64-65 out of 65 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 65 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Cloud Computing with MapReduce and HadoopWhat is Cloud Computing?What is MapReduce?What is MapReduce Used For?Example: Facebook LexiconSlide 6Slide 7OutlineMapReduce GoalsTypical Hadoop ClusterSlide 11ChallengesHadoop ComponentsHadoop Distributed File SystemMapReduce Programming ModelExample: Word CountWord Count ExecutionAn Optimization: The CombinerWord Count with CombinerMapReduce Execution DetailsFault Tolerance in MapReduceSlide 22Slide 23TakeawaysSlide 251. Search2. Sort3. Inverted IndexInverted Index Example4. Most Popular Words5. Numerical IntegrationSlide 32Introduction to HadoopWord Count in JavaSlide 35Slide 36Word Count in Python with Hadoop StreamingAmazon Elastic MapReduceElastic MapReduce UIElastic MapReduce UISlide 41Slide 42MotivationPigAn Example ProblemIn MapReduceIn Pig LatinTranslation to MapReduceSlide 49HiveConclusionsSlide 52Cloud ResearchSelf-Serving Example: SparkData FlowExample: Logistic RegressionSerial VersionSpark VersionPerformanceCrazy Idea: Interactive SparkSlide 61Can HPC Run in the Cloud?EC2 Latency vs InfinibandHPC Cloud ProjectsResourcesCloud Computing with MapReduce and HadoopMatei ZahariaUC Berkeley RAD [email protected] is Cloud Computing?•“Cloud” refers to large Internet services that run on 10,000’s of machines (Google, Yahoo!, etc)•More recently, “cloud computing” refers to services by these companies that let external customers rent cycles–Amazon EC2: virtual machines at 8.5¢/hour, billed hourly–Amazon S3: storage at 15¢/GB/month–Windows Azure: special applications using Azure API•Attractive features:–Scale: 100’s of nodes available in minutes–Fine-grained billing: pay only for what you use–Ease of use: sign up with credit card, get root accessWhat is MapReduce?•Data-parallel programming model for clusters of commodity machines•Pioneered by Google–Processes 20 PB of data per day•Popularized by open-source Hadoop project–Used by Yahoo!, Facebook, Amazon, …What is MapReduce Used For?•At Google:–Index building for Google Search–Article clustering for Google News–Statistical machine translation•At Yahoo!:–Index building for Yahoo! Search–Spam detection for Yahoo! Mail•At Facebook:–Data mining–Ad optimization–Spam detectionExample: Facebook Lexiconwww.facebook.com/lexiconExample: Facebook Lexiconwww.facebook.com/lexiconWhat is MapReduce Used For?•In research:–Analyzing Wikipedia conflicts (PARC)–Natural language processing (CMU) –Bioinformatics (Maryland)–Particle physics (Nebraska)–Ocean climate simulation (Washington)–<Your application here>Outline•MapReduce architecture•Sample applications•Introduction to Hadoop•Higher-level query languages: Pig & Hive•Current research•Clouds and HPCMapReduce Goals•Scalability to large data volumes:–Scan 100 TB on 1 node @ 50 MB/s = 24 days–Scan on 1000-node cluster = 35 minutes•Cost-efficiency:–Commodity nodes (cheap, but unreliable)–Commodity network–Automatic fault-tolerance (fewer admins)–Easy to use (fewer programmers)Typical Hadoop Cluster•40 nodes/rack, 1000-4000 nodes in cluster•1 Gbps bandwidth in rack, 8 Gbps out of rack•Node specs (Facebook):8 cores, 16 GB RAM, 8 x 1.5 TB disks, no RAIDAggregation switchRack switchTypical Hadoop ClusterChallenges•Cheap nodes fail, especially if you have many–Mean time between failures for 1 node = 3 years–MTBF for 1000 nodes = 1 day–Solution: Build fault-tolerance into system•Commodity network = low bandwidth–Solution: Push computation to the data•Programming distributed systems is hard–Solution: Users write data-parallel “map” and “reduce” functions, system handles work distribution and failuresHadoop Components•Distributed file system (HDFS)–Single namespace for entire cluster–Replicates data 3x for fault-tolerance•MapReduce framework–Runs jobs submitted by users–Manages work distribution & fault-tolerance–Colocated with file systemHadoop Distributed File System•Files split into 128MB blocks•Blocks replicated across several datanodes (usually 3)•Namenode stores metadata (file names, locations, etc)•Optimized for large files, sequential reads•Files are append-onlyNamenodeDatanodes11223344112244221133114433332244File1MapReduce Programming Model•Data type: key-value records•Map function:(Kin, Vin)  list(Kinter, Vinter)•Reduce function:(Kinter, list(Vinter))  list(Kout, Vout)Example: Word Countdef mapper(line): foreach word in line.split(): output(word, 1)def reducer(key, values): output(key, sum(values))Word Count Executionthe quickbrown foxthe fox atethe mousehow nowbrown cowMapMapMapMapMapMapReduceReduceReduceReducebrown, 2fox, 2how, 1now, 1the, 3ate, 1cow, 1mouse, 1quick, 1the, 1brown, 1fox, 1quick, 1the, 1fox, 1the, 1how, 1now, 1brown, 1ate, 1mouse, 1cow, 1Input Map Shuffle & Sort Reduce OutputAn Optimization: The Combiner•Local reduce function for repeated keys produced by same map•For associative ops. like sum, count, max•Decreases amount of intermediate data•Example: local counting for Word Count:def combiner(key, values): output(key, sum(values))Word Count with Combinerthe quickbrown foxthe fox atethe mousehow nowbrown cowMapMapMapMapMapMapReduceReduceReduceReducebrown, 2fox, 2how, 1now, 1the, 3ate, 1cow, 1mouse, 1quick, 1the, 1brown, 1fox, 1quick, 1the, 2fox, 1how, 1now, 1brown, 1ate, 1mouse, 1cow, 1Input Map Shuffle & Sort Reduce OutputMapReduce Execution Details•Mappers preferentially scheduled on same node or same rack as their input block–Push computation to data, minimize network use•Mappers save outputs to local disk before serving to reducers–Allows running more reducers than # of nodes–Allows recovery if a reducer crashesFault Tolerance in MapReduce•1. If a task crashes:–Retry on another node•OK for a map because it had no dependencies•OK for reduce because map outputs are on disk–If the same task repeatedly fails, fail the job or ignore that input blockNote: For fault tolerance to work, your tasks must be deterministic and side-effect-freeFault Tolerance in MapReduce•2. If a node crashes:–Relaunch its current tasks on other nodes–Relaunch any maps the node previously ran•Necessary because their output files were lost along with the crashed nodeFault Tolerance in MapReduce•3. If a task is going slowly (straggler):–Launch second copy of task on another node–Take the output of whichever copy finishes first, and


View Full Document

Berkeley COMPSCI C267 - Cloud Computing with MapReduce and Hadoop

Documents in this Course
Lecture 4

Lecture 4

52 pages

Split-C

Split-C

5 pages

Lecture 5

Lecture 5

40 pages

Load more
Download Cloud Computing with MapReduce and Hadoop
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Cloud Computing with MapReduce and Hadoop and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Cloud Computing with MapReduce and Hadoop 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?