UT Dallas CS 6350 - BigDataHadoop_PPT_Lesson08
Copyright 2014, Simplilearn, All rights reserved.

Lesson 8—Hive
Big Data and Hadoop Developer

Objectives
By the end of this lesson, you will be able to:
● Describe Hive and its importance
● Explain Hive architecture and its various components
● Identify the steps to install and configure Hive
● Describe the basics of Hive programming

Need for Additional Data Warehousing System
The table shows the problems related to data inflow and expressiveness, and the solutions adopted to address the need for an additional data warehousing system:

Problem: Extensive data inflow
Solution: The Hadoop experiment:
● uses the Hadoop Distributed File System (HDFS)
● has a scalable, accessible architecture

Problem: The data lacked expressiveness, and it was difficult to develop MapReduce programs to express queries over it
Solution: Use the Hive data warehouse

Hive—Introduction
Hive can be defined as follows: Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large data sets stored in Hadoop.

Following are the facts related to Hive:
● It provides a SQL-like language called HiveQL (HQL). Due to its SQL-like interface, Hive is a popular choice for Hadoop analytics.
● It provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.
● Relying on MapReduce for execution, Hive is batch-oriented and has high latency for query execution.
(Image source: hive.apache.org)

Hive—Characteristics
Hive is a system for managing and querying data by imposing a structured format on it. It uses the concept of:
● MapReduce for execution; and
● HDFS for storage and retrieval of data.
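As an illustration of the SQL-like interface described above, here is a sketch of a hypothetical ad-hoc HiveQL query; the table `page_views` and its columns are invented for this example and do not appear in the original slides:

```sql
-- Hypothetical ad-hoc query: count page views per country,
-- assuming a table page_views(userid STRING, country STRING, ds STRING).
SELECT country, COUNT(*) AS views
FROM page_views
WHERE ds >= '2014-01-01'
GROUP BY country
ORDER BY views DESC
LIMIT 10;
```

Because Hive relies on MapReduce for execution, even a small query like this is compiled into one or more batch jobs, which accounts for the high query latency mentioned above.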
Principles of Hive
● Familiarity: Hive commands are similar to those of SQL, the standard query language for data warehousing tools.
● Extensibility: pluggable MapReduce scripts in the language of your choice, rich user-defined data types, and user-defined functions.
● Interoperability: an extensible framework to support different file and data formats.
● Performance: the Hive engine optimizes query execution to reduce execution time while sustaining high throughput.

System Architecture and Components of Hive
The image illustrates the architecture of the Hive system. It also shows the role of Hive and Hadoop in the development process.

Metastore
The metastore is the component that stores the system catalog and metadata about tables, columns, partitions, and so on. Metadata is stored in a traditional RDBMS. Apache Hive uses the Derby database by default; any JDBC-compliant database, such as MySQL, can be used for the metastore.

Metastore Configuration
The key attributes to configure for the Hive metastore are the JDBC connection URL, the JDBC driver class, and the database user name and password.

Metastore Configuration—Template
The hive-site.xml file is used to configure the metastore.

Driver
The driver is the component that:
● manages the lifecycle of a Hive Query Language (HiveQL) statement as it moves through Hive; and
● maintains a session handle and any session statistics.

Query Compiler
The query compiler compiles HiveQL into a Directed Acyclic Graph (DAG) of MapReduce tasks.

Query Optimizer
The query optimizer:
● consists of a chain of transformations, so that the operator DAG resulting from one transformation is passed as input to the next transformation; and
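The slide's hive-site.xml template image is not preserved in this text version. The following is a minimal sketch of such a template for pointing the metastore at MySQL; the host name, database name, user name, and password values are placeholders, not values from the original slide:

```xml
<configuration>
  <!-- JDBC connection string for the metastore database (placeholder host/db) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
  </property>
  <!-- JDBC driver class for the chosen database -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- Database credentials (placeholder values) -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```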
● performs optimizations such as column pruning, partition pruning, and repartitioning of data.

Execution Engine
The execution engine:
● executes the tasks produced by the compiler in proper dependency order; and
● interacts with the underlying Hadoop instance to ensure synchronization with Hadoop services.

Hive Server
The Hive server provides a Thrift interface and a Java Database Connectivity/Open Database Connectivity (JDBC/ODBC) server. It enables the integration of Hive with other applications.

Client Components
A developer uses the client components to perform development in Hive. The client components include the command-line interface (CLI), the web user interface (UI), and the JDBC/ODBC driver.

Basics of the Hive Query Language
Hive Query Language (HQL) is the query language for the Hive engine. Hive supports basic SQL queries such as:
● FROM-clause subqueries;
● ANSI JOIN (equi-join only);
● multi-table insert;
● multi-group-by;
● sampling; and
● object traversal.
Note: HQL supports pluggable MapReduce scripts through TRANSFORM.

Data Model—Tables
Tables in Hive are analogous to tables in relational databases. A Hive table logically comprises the data being stored and the associated metadata. Each table has a corresponding directory in HDFS.

There are two types of tables in Hive: managed tables and external tables.

Data Model—Tables (contd.)
The command used to create a table in Hive is:

CREATE TABLE t1(ds string, ctry float, li list<map<string, struct<p1:int, p2:int>>>);

The HDFS directory of the table: /apps/hive/warehouse/t1
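Two of the HQL features listed above, multi-table insert and the FROM-clause subquery, can be sketched as follows. The tables `pv_users`, `pv_gender_sum`, and `pv_age_sum` and their columns are hypothetical names for illustration; only `t1` comes from the slides:

```sql
-- Multi-table insert: scan the source table once and write
-- aggregates to two destination tables in a single statement.
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT gender, COUNT(1) GROUP BY gender
INSERT OVERWRITE TABLE pv_age_sum
  SELECT age, COUNT(1) GROUP BY age;

-- FROM-clause subquery: aggregate over a derived result set,
-- using the t1 table defined above.
SELECT t.ctry, COUNT(*)
FROM (SELECT * FROM t1 WHERE ds = '2014-01-01') t
GROUP BY t.ctry;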
Data Model—External Tables
Listed below are the key considerations for an external table:
● It can point to existing data directories in HDFS.
● Tables and partitions can be created over that data.
● The data must be in a Hive-compatible format.
● Dropping an external table drops only the metadata; the underlying data remains.

The command used to create an external table is:

CREATE EXTERNAL TABLE test_extern(c1 string, c2 int) LOCATION '/user/mytables/mydata';

Data Types in Hive
The data types in Hive fall into two groups: primitive types and complex types.
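The slide names only the two categories of types. As a sketch, a table declaration mixing common primitive types (STRING, DOUBLE, INT) with complex types (ARRAY, MAP, STRUCT) might look like this; the table and field names are invented for illustration:

```sql
-- Hypothetical table combining primitive and complex Hive types.
CREATE TABLE employees (
  name     STRING,                         -- primitive
  salary   DOUBLE,                         -- primitive
  age      INT,                            -- primitive
  skills   ARRAY<STRING>,                  -- complex: ordered list of values
  perks    MAP<STRING, INT>,               -- complex: key/value pairs
  address  STRUCT<city:STRING, zip:INT>    -- complex: named fields
);
```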