Cloud Tools Overview
UT Dallas, Erik Jonsson School of Engineering & Computer Science

Hadoop

Outline
• Hadoop - Basics
• HDFS
  – Goals
  – Architecture
  – Other functions
• MapReduce
  – Basics
  – Word Count example
  – Handy tools
  – Finding shortest path example
• Related Apache sub-projects (Pig, HBase, Hive)

Hadoop - Why?
• Need to process huge datasets on large clusters of computers
• Very expensive to build reliability into each application
• Nodes fail every day
  – Failure is expected, rather than exceptional
  – The number of nodes in a cluster is not constant
• Need a common infrastructure
  – Efficient, reliable, easy to use
  – Open source, Apache License

Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• New York Times
• Veoh
• Yahoo!
• … many more

Commodity Hardware
• Typically a 2-level architecture
  – Nodes are commodity PCs
  – 30-40 nodes/rack
  – Uplink from the rack is 3-4 gigabit
  – Rack-internal is 1 gigabit
[Figure: rack switches uplinked to an aggregation switch]

Hadoop Distributed File System (HDFS)
Original slides by Dhruba Borthakur, Apache Hadoop Project Management Committee

Goals of HDFS
• Very large distributed file system
  – 10K nodes, 100 million files, 10 PB
• Assumes commodity hardware
  – Files are replicated to handle hardware failure
  – Detect failures and recover from them
• Optimized for batch processing
  – Data locations are exposed so that computations can move to where the data resides
  – Provides very high aggregate bandwidth

Distributed File System
• Single namespace for the entire cluster
• Data coherency
  – Write-once-read-many access model
  – Clients can only append to existing files
• Files are broken up into blocks
  – Typically 64 MB block size
  – Each block replicated on multiple DataNodes
• Intelligent client
  – Client can find the location of blocks
  – Client accesses data directly from the DataNode (see the sketch after this slide)
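To make the intelligent-client model concrete, the sketch below uses Hadoop's Java FileSystem API to ask the NameNode where a file's blocks live and then read the data, which the client streams directly from DataNodes. It is a minimal sketch, not part of the original slides; the file path is hypothetical and error handling is omitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();    // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/foodir/myfile.txt");  // hypothetical file

            // Ask the NameNode which DataNodes hold each block of the file
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset " + b.getOffset()
                        + " -> hosts " + String.join(", ", b.getHosts()));
            }

            // Read the file; block data is served directly by the DataNodes
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[4096];
                int n = in.read(buf);
                System.out.println("read " + n + " bytes");
            }
        }
    }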
HDFS Architecture
[Figure: HDFS architecture diagram]

Functions of a NameNode
• Manages the file system namespace
  – Maps a file name to a set of blocks
  – Maps a block to the DataNodes where it resides
• Cluster configuration management
• Replication engine for blocks

NameNode Metadata
• Metadata in memory
  – The entire metadata is kept in main memory
  – No demand paging of metadata
• Types of metadata
  – List of files
  – List of blocks for each file
  – List of DataNodes for each block
  – File attributes, e.g. creation time, replication factor
• A Transaction Log
  – Records file creations, file deletions, etc.

DataNode
• A block server
  – Stores data in the local file system (e.g. ext3)
  – Stores metadata of a block (e.g. CRC)
  – Serves data and metadata to clients
• Block Report
  – Periodically sends a report of all existing blocks to the NameNode
• Facilitates pipelining of data
  – Forwards data to other specified DataNodes

Block Placement
• Current strategy
  – One replica on the local node
  – Second replica on a remote rack
  – Third replica on the same remote rack
  – Additional replicas are placed randomly
• Clients read from the nearest replica
• Would like to make this policy pluggable

Heartbeats
• DataNodes send heartbeats to the NameNode
  – Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure

Replication Engine
• NameNode detects DataNode failures
  – Chooses new DataNodes for new replicas
  – Balances disk usage
  – Balances communication traffic to DataNodes

Data Correctness
• Use checksums to validate data
  – Uses CRC32
• File creation
  – Client computes a checksum per 512 bytes
  – DataNode stores the checksums
• File access
  – Client retrieves the data and checksums from the DataNode
  – If validation fails, the client tries other replicas (a sketch of the idea follows this slide)
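The per-chunk checksum scheme above can be sketched in plain Java with java.util.zip.CRC32: the writer computes one CRC32 value per 512-byte chunk, and the reader recomputes and compares before trusting a replica. This only illustrates the idea, not HDFS's actual checksum code; the class and method names are invented for the example.

    import java.util.zip.CRC32;

    public class ChunkChecksums {
        static final int BYTES_PER_CHECKSUM = 512;  // per the slide: one checksum per 512 bytes

        // Writer side: one CRC32 per 512-byte chunk (the DataNode stores these)
        static long[] checksum(byte[] data) {
            int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
            long[] sums = new long[chunks];
            CRC32 crc = new CRC32();
            for (int i = 0; i < chunks; i++) {
                int off = i * BYTES_PER_CHECKSUM;
                int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
                crc.reset();
                crc.update(data, off, len);
                sums[i] = crc.getValue();
            }
            return sums;
        }

        // Reader side: recompute and compare; on a mismatch the client would try another replica
        static boolean verify(byte[] data, long[] expected) {
            long[] actual = checksum(data);
            if (actual.length != expected.length) return false;
            for (int i = 0; i < actual.length; i++) {
                if (actual[i] != expected[i]) return false;
            }
            return true;
        }
    }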
NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
  – A directory on the local file system
  – A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution

Data Pipelining
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes the block to the first DataNode
• The first DataNode forwards the data to the next node in the pipeline
• When all replicas are written, the client moves on to write the next block in the file

Rebalancer
• Goal: % disk full on DataNodes should be similar
  – Usually run when new DataNodes are added
  – Cluster stays online while the Rebalancer is active
  – Rebalancer is throttled to avoid network congestion
  – Command-line tool

Secondary NameNode
• Copies the FsImage and Transaction Log from the NameNode to a temporary directory
• Merges the FsImage and Transaction Log into a new FsImage in the temporary directory
• Uploads the new FsImage to the NameNode
  – The Transaction Log on the NameNode is purged

User Interface
• Commands for HDFS users (programmatic equivalents are sketched after this slide):
  – hadoop dfs -mkdir /foodir
  – hadoop dfs -cat /foodir/myfile.txt
  – hadoop dfs -rm /foodir/myfile.txt
• Commands for HDFS
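The shell commands above also have programmatic equivalents in the same Java FileSystem API used earlier. A minimal sketch, with the /foodir paths taken from the slide:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsCommands {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // hadoop dfs -mkdir /foodir
            fs.mkdirs(new Path("/foodir"));

            // hadoop dfs -cat /foodir/myfile.txt
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/foodir/myfile.txt"))))) {
                for (String line; (line = r.readLine()) != null; ) {
                    System.out.println(line);
                }
            }

            // hadoop dfs -rm /foodir/myfile.txt
            fs.delete(new Path("/foodir/myfile.txt"), false);  // false = not recursive
        }
    }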


