UT Dallas CS 6350 - BigDataHadoop_PPT_Lesson12 - D3105857

School name University of Texas at Dallas

Course CS 6350 - Big Data Management and Analytics

Pages 47

This preview shows page 1-2-3-22-23-24-45-46-47 out of 47 pages.

Big Data and Hadoop Developer
Lesson 12: Ecosystem and Its Components
Copyright 2014 Simplilearn. All rights reserved.

Objectives

By the end of this lesson, you will be able to:
o Explain the Hadoop ecosystem structure
o Describe the different components of the Hadoop ecosystem and their roles

Apache Hadoop Ecosystem

The image displays the Hadoop ecosystem components as part of Apache Software Foundation projects. The components are categorized into file system and data store, serialization, job execution, and others, as shown in the image.
Image source: hadoopshere.com

File System Component

The file system component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS):
o A distributed file system that provides high-throughput access
o Supported by the NameNode, Secondary NameNode, and DataNodes in the Hadoop cluster

Data Store Components

Following are the data store components of the Hadoop ecosystem:
HBase: Distributed, scalable, big data store
Cassandra: Highly scalable, eventually consistent, distributed, and structured key-value store
Accumulo: Sorted, distributed key-value data storage and retrieval system

Serialization Components

Following are the serialization components of the Hadoop ecosystem:
Avro: Data serialization system
Trevni: A column file format to permit compatible, independent implementations that read and/or write files in this format
Thrift: Framework for scalable cross-language services development

Job Execution Components

Following are the job execution components of the Hadoop ecosystem:
MapReduce: Framework which performs distributed data processing and comprises the JobTracker, the TaskTracker, and the JobHistoryServer
YARN: Framework that facilitates the writing of arbitrary distributed processing frameworks and applications
Hama: Pure BSP (Bulk Synchronous Parallel) computing framework for massive scientific computations such as matrix, graph, and network algorithms

Work Management, Operations, and Development Components

Following are the components related to work management, operations, and development:
Oozie: Workflow or coordination system to manage Hadoop jobs
ZooKeeper: Centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services
Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
Vaidya: A performance diagnostic tool for MapReduce jobs
BigTop: A project for developing the packaging and tests, and for ensuring interoperability among Apache Hadoop-related projects
Whirr: A set of libraries for running cloud services, for example, running Hadoop clusters on EC2
Crunch: A framework for writing, testing, and running MapReduce pipelines
MRUnit: A Java library that helps developers unit test Hadoop MapReduce jobs
HDT: Hadoop Development Tools (HDT) comprise Eclipse-based tools for developing applications on the Hadoop platform

Security Components

Following are the security components of the Hadoop ecosystem:
Knox: A system that provides a single point of secure access for Apache Hadoop clusters
Sentry: A system for providing fine-grained, role-based authorization to both data and metadata stored on an Apache Hadoop cluster

Data Transfer Components

Following are the data transfer tools of the Hadoop ecosystem:
Flume: A distributed, reliable service available for efficiently collecting, aggregating, and moving large amounts of log data
Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores, namely relational databases
Chukwa: An open source data collection system for monitoring large distributed systems
Kafka: A distributed publish-subscribe messaging system
Falcon: Configuration of data motion with replication, lifecycle management, lineage, and traceability

Components Related to Data Interactions

Following are the components related to data interaction:
Pig: A platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating these programs
Hive: A data warehouse system that facilitates easy data summarization, ad hoc queries, and analysis of large datasets stored in Hadoop-compatible file systems
HCatalog: A table and storage management service for data created using Apache Hadoop
Tez: A generic application framework that can be used to process complex DAGs (Directed Acyclic Graphs) of data processing tasks; it runs natively on Apache Hadoop YARN
Gora: A framework for in-memory data model and persistence with MapReduce support

Components Related to Data Interactions (continued)

Following are the components related to data interaction:
Spark: Powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming
Storm: A system to process unbounded streams of data for real-time processing
MRQL: A query processing and optimization system for large-scale distributed data analysis, which is built on top of Apache Hadoop, Hama, and Spark
Tajo: Data warehouse system for Apache Hadoop which supports low-latency, scalable ad hoc queries, online aggregation, and the ETL (extract, transform, load) process

Components Related to Analytics and Intelligence

Following are the components related to analytics and intelligence:
Mahout: A scalable machine learning and data mining algorithm library. Supports the following:
o Recommendation mining
o Clustering
o Classification
o Frequent itemset mining
Drill: A distributed system for interactive analysis of large-scale datasets. Comprises the following:
o User interface: CLI (Command Line Interface) and REST (Representational State Transfer)
o Pluggable query language
o Pluggable data source

Search Frameworks

Following are the search frameworks of the Hadoop ecosystem:
Lucene: Open source search software, including a Java-based indexing and search component (Lucene Core) and a high-performance search server component (Solr)
Blur
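
Illustration: the MapReduce model

The Job Execution Components slide describes MapReduce only as a framework for distributed data processing. The sketch below is a single-process Python imitation of the map, shuffle, and reduce steps for a word count. It is not Hadoop code and does not use any Hadoop API; the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical and chosen for this illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this across machines;
    here it is an in-memory dictionary (an assumption of this sketch).
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    # Keys are processed in sorted order, mirroring sorted reducer input.
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

In an actual Hadoop job, the map and reduce functions would run as separate tasks on different nodes and the framework would perform the shuffle over the network; only the data flow is the same as in this sketch.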
