UT Dallas CS 6350 - BigDataHadoop_PPT_Lesson12

Big Data and Hadoop Developer
Lesson 12—Ecosystem and Its Components
Copyright 2014, Simplilearn. All rights reserved.

Objectives
By the end of this lesson, you will be able to:
● Explain the structure of the Hadoop ecosystem
● Describe the different components of the Hadoop ecosystem and their roles

Apache Hadoop Ecosystem
The accompanying image (source: hadoopshere.com) displays the Hadoop ecosystem components as Apache Software Foundation projects. The components are categorized into file system and data store, serialization, job execution, and other groups, as shown in the image.

File System Component
The file system component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS):
● A distributed file system that provides high-throughput access to application data
● Supported by the NameNode, the Secondary NameNode, and the DataNodes in the Hadoop cluster

Data Store Components
Following are the data store components of the Hadoop ecosystem:
● HBase: a distributed, scalable big data store
● Cassandra: a highly scalable, eventually consistent, distributed, structured key-value store
● Accumulo: a sorted, distributed key-value data storage and retrieval system

Serialization Components
Following are the serialization components of the Hadoop ecosystem:
● Avro: a data serialization system
● Trevni: a column file format designed to permit compatible, independent implementations that read and/or write files in this format
● Thrift: a framework for scalable cross-language services development

Job Execution Components
Following are the job execution components of the Hadoop ecosystem:
● MapReduce: a framework that performs distributed data processing and comprises the JobTracker, the TaskTracker, and the JobHistoryServer (see the word-count sketch after these component lists)
● YARN: a framework that facilitates the writing of arbitrary distributed processing frameworks and applications
● Hama: a pure BSP (Bulk Synchronous Parallel) computing framework for massive scientific computations such as matrix, graph, and network algorithms

Work Management, Operations, and Development Components
Following are the components related to work management, operations, and development:

Work management
● Oozie: a workflow and coordination system to manage Hadoop jobs
● ZooKeeper: a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services

Operations
● Ambari: a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
● Vaidya: a performance diagnostic tool for MapReduce jobs
● BigTop: a project for developing packaging and tests, and for ensuring interoperability among Apache Hadoop-related projects
● Whirr: a set of libraries for running cloud services, for example, running Hadoop clusters on EC2

Development
● Crunch: a framework for writing, testing, and running MapReduce pipelines
● MRUnit: a Java library that helps developers unit test Hadoop MapReduce jobs
● HDT: Hadoop Development Tools, a set of Eclipse-based tools for developing applications on the Hadoop platform
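The MapReduce entry above names the framework's moving parts but not what a job actually looks like. Below is a minimal, illustrative sketch (not taken from these slides) of the classic word-count job written against the standard org.apache.hadoop.mapreduce Java API; the class name and the input/output paths passed as program arguments are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the partial counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (for example, on HDFS)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory; must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this is typically submitted with "hadoop jar wordcount.jar WordCount <input> <output>"; the JobTracker (or, under YARN, the ResourceManager together with an application master) then schedules the map and reduce tasks across the cluster.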
Security Components
Following are the security components of the Hadoop ecosystem:
● Knox: a system that provides a single point of secure access for Apache Hadoop clusters
● Sentry: a system for providing fine-grained, role-based authorization to both data and metadata stored on an Apache Hadoop cluster

Data Transfer Components
Following are the data transfer tools of the Hadoop ecosystem:
● Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores, namely relational databases
● Flume: a distributed, reliable service for efficiently collecting, aggregating, and moving large amounts of log data
● Chukwa: an open-source data collection system for monitoring large distributed systems
● Kafka: a distributed publish-subscribe messaging system
● Falcon: a system for configuring data motion with replication, lifecycle management, lineage, and traceability

Components Related to Data Interaction
Following are the components related to data interaction:
● Hive: a data warehouse system that facilitates easy data summarization, ad-hoc queries, and analysis of large datasets stored in Hadoop-compatible file systems
● Pig: a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating those programs
● HCatalog: a table and storage management service for data created using Apache Hadoop
● Tez: a generic application framework that can be used to process complex DAGs (Directed Acyclic Graphs) of data-processing tasks; it runs natively on Apache Hadoop YARN
● Gora: a framework for in-memory data models and persistence with MapReduce support

Further components related to data interaction:
● Storm: a system for processing unbounded streams of data in real time
● Spark: a general data-processing engine that powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming (see the sketch after these component lists)
● MRQL: a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark
● Tajo: a data warehouse system for Apache Hadoop that supports low-latency, scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load)

Components Related to Analytics and Intelligence
Following are the components related to analytics and intelligence:
● Mahout: a scalable machine learning and data mining algorithm library that supports the following:
  o Recommendation mining
  o Clustering
  o Classification
  o Frequent itemset mining
● Drill: a distributed system for interactive analysis of large-scale datasets that comprises the following:
  o A user interface (CLI* and REST**)
  o A pluggable query language
  o Pluggable data sources
*Command Line Interface; **Representational State Transfer
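The Spark entry above lists the high-level tools it powers; the snippet below is a small, illustrative sketch (not from these slides, and assuming the Spark 2.x Java API) of the same word count expressed as Spark transformations over a file in HDFS. The application name and the input/output paths passed as program arguments are placeholders.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    // JavaSparkContext is Closeable, so try-with-resources shuts the context down cleanly.
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile(args[0]);                       // e.g. an HDFS path
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // split lines into words
          .mapToPair(word -> new Tuple2<>(word, 1))                       // (word, 1) pairs
          .reduceByKey(Integer::sum);                                     // sum counts per word
      counts.saveAsTextFile(args[1]);                                     // write results back to HDFS
    }
  }
}

The same computation could also be written against Spark SQL's DataFrame API or fed from a live stream via Spark Streaming, which is what the "stack of high-level tools" in the slide refers to.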

