Big Data and Hadoop Developer
Lesson 11: ZooKeeper, Sqoop, and Flume

Copyright 2014 Simplilearn. All rights reserved.

Objectives
By the end of this lesson, you will be able to:
- Explain ZooKeeper and its role
- List the challenges faced in distributed processing
- Install and configure ZooKeeper
- Explain the concept of Sqoop
- Install and configure Sqoop
- Explain the concept of Flume
- Configure and run Flume

Introduction to ZooKeeper
ZooKeeper is an open-source, high-performance coordination service for distributed applications. It offers the following services:
- Naming
- Locks and synchronization
- Configuration management
- Group services

Features of ZooKeeper
Some salient features of ZooKeeper are as follows:
- Provides a simple, high-performance kernel for building complex clients
- Provides distributed coordination services for distributed applications
- Follows a first-in, first-out approach for executing jobs
- Allows synchronization, serialization, and coordination of nodes in a Hadoop cluster
- Comes with a pipeline architecture to achieve a wait-free approach
- Takes care of problems by using inbuilt algorithms for deadlock detection and prevention
- Applies a multi-processing approach to avoid wait time for process execution
- Allows for distributed processing

Challenges Faced in Distributed Applications
The following are the common challenges faced in distributed applications:
- Error-prone coordination
- Race conditions
- Deadlocks
- Partial failures
- Inconsistencies

Coordination
The key points related to coordination are:
- Group membership
- Leader election
- Dynamic configuration
- Status monitoring
- Queuing
- Critical sections

Goals of ZooKeeper
Following are the goals of ZooKeeper:
- Serialization ensures avoidance of delay in read or
  write operations.
- Reliability ensures that an update persists once it is applied by a user in the cluster.
- Atomicity does not allow partial results. Any user update can either succeed or fail.
- Simple Application Programming Interface (API) provides an interface for development and implementation.

Uses of ZooKeeper
The uses of ZooKeeper are as follows:
- Configuration
- Message queue
- Notification
- Synchronization

ZooKeeper Entities
ZooKeeper comprises the following three entities:
- Leader
- Follower
- Observer

ZooKeeper Data Model
ZooKeeper has a hierarchical namespace. Each node in the namespace is called a Znode.

ZooKeeper Services
The following points are related to Znode.
Znode:
- In-memory data node
- Hierarchical namespace
- Follows UNIX-like notation
Types of Znode:
- Regular
- Ephemeral
Flags of Znode:
- Sequential flag

ZooKeeper Services (contd.)
Some features of Znode are as follows.
Watch mechanism:
- Receives notification from nodes
- Enables one-time triggers
Timeout mechanism:
- Permits allocation of resources for limited time
Other features:
- Stores metadata or configuration
- Stores information like period, timestamp, version

Client API Functions
Given below is a list of client API functions:
- create(path, data, flags)
- delete(path, version)
- setData(path, data, version)
- exists(path, watch)
- getChildren(path, watch)
- getData(path, watch)
- sync(path)

Recipe 1: Cluster Management
Recipes are guidelines for using ZooKeeper to implement higher-order functions. The recipe for cluster management (e.g., in cloud environments) is given below.
For each client host i, where i = 1 to N:
- Watch on /members
- Create /members/host-i as ephemeral nodes
- Node join/leave generates an alert
- Keep updating /members/host-i periodically for
  node status changes (load, memory, CPU, etc.)

Recipe 2: Leader Election
The recipe for leader election is as follows:
- All participants create an ephemeral, sequential node on the same election path.
- The node with the smallest sequence number is the leader.
- Each follower node listens to the node with the next lower sequence number.
- When the leader is removed, go to the election path and find a new leader.
- When a session expires, check the election state and go to election if needed.

Recipe 3: Distributed Exclusive Lock
The recipe for a distributed exclusive lock function, assuming there are N web crawler clients trying to acquire a lock on links data, is as follows:
- Clients create an ephemeral, sequential znode under the path /Cluster/locknode.
- Clients request a list of children for the lock znode (i.e., locknode).
- The client with the least ID, according to natural ordering, holds the lock.
- Each other client sets a watch on the znode with the ID immediately preceding its own and checks for the lock again when notified.
- The client wishing to release the lock deletes its node, which triggers the next client in line to acquire the lock.

Business Scenario
As part of his current project, Tim Burnet, the AVP of IT infra ops, anticipates that his superior, Olivia Tyler, the EVP of IT operations, would ask him to work on a high-performance coordination service for distributed applications. Tim knows that he has to use ZooKeeper for this task. He wants to be prepared and decides to install ZooKeeper. The demo in the next section illustrates how to install and configure ZooKeeper.

Demo 1: View ZooKeeper nodes using CLI

Why Sqoop
Sqoop is an Apache Hadoop ecosystem project whose responsibility is to perform import and export operations across relational databases. Some reasons to use Sqoop are as follows:
- SQL servers are
  deployed worldwide
- Nightly processing is done on SQL servers
- Allows moving certain parts of data from a traditional SQL DB to Hadoop
- Transferring data using scripts is inefficient and time-consuming
- To handle large data through the ecosystem
- To bring processed data from Hadoop to the applications

Why Sqoop (contd.)
Sqoop is required when a database is imported from a Relational Database (RDB) to Hadoop or vice versa.
Users must consider details such as ensuring Users
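
Returning to Recipe 2 (leader election) from the ZooKeeper part of this lesson, the steps can be illustrated with a minimal Python sketch. It is an in-memory simulation only: it is not a ZooKeeper client and does not contact a server. The class ElectionPath and its methods join, leave, leader, and watched_by are hypothetical names introduced for this example, and the host names are made up.

```python
class ElectionPath:
    """In-memory stand-in for a ZooKeeper election path (hypothetical, not a real client)."""

    def __init__(self):
        self._next_seq = 0   # counter that plays the role of the sequential flag
        self._nodes = {}     # sequence number -> participant name

    def join(self, participant):
        """Create an 'ephemeral sequential' node and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._nodes[seq] = participant
        return seq

    def leave(self, seq):
        """Simulate a session ending: the ephemeral node disappears."""
        self._nodes.pop(seq, None)

    def leader(self):
        """The node with the smallest sequence number is the leader."""
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]

    def watched_by(self, seq):
        """Return the sequence number a follower listens to (the next lower one)."""
        lower = [s for s in self._nodes if s < seq]
        return max(lower) if lower else None


if __name__ == "__main__":
    path = ElectionPath()
    a = path.join("host-a")
    b = path.join("host-b")
    c = path.join("host-c")
    print("leader:", path.leader())
    print("host-c watches sequence:", path.watched_by(c))
    path.leave(a)  # the leader's node is removed
    print("new leader:", path.leader())
```

Because each follower watches only the node just below its own, removing the leader notifies a single participant rather than all of them. The same structure applies to Recipe 3, where the holder of the smallest sequence number owns the lock.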