UT Dallas CS 6350 - BigDataHadoop_PPT_Lesson08 - D3105853

School name University of Texas at Dallas

Course Cs 6350- Big Data Management and Analytics

Pages 75

Big Data and Hadoop Developer
Lesson 8: Hive

Copyright 2014 Simplilearn. All rights reserved.

Objectives

By the end of this lesson, you will be able to:
- Describe Hive and its importance
- Explain Hive architecture and its various components
- Identify the steps to install and configure Hive
- Describe the basics of Hive programming

Need for Additional Data Warehousing System

The table shows the problems related to data inflow and expressiveness, and the solutions adopted to address the need for an additional data warehousing system.

Problem: Extensive data inflow
Solution: The Hadoop experiment, which uses the Hadoop Distributed File System (HDFS) and its scalable, accessible architecture

Problem: Data lacked expressiveness, and it was difficult to develop a MapReduce program to express the data
Solution: Using Hive (data warehouse)

Hive Introduction

Hive can be defined as follows: Hive is a data warehouse system for Hadoop that facilitates ad hoc queries and the analysis of large data sets stored in Hadoop.

Following are the facts related to Hive:
- It provides a SQL-like language called HiveQL (HQL). Due to its SQL-like interface, Hive is a popular choice for Hadoop analytics.
- It provides massive scale-out and fault tolerance capabilities for data storage and processing on commodity hardware.
- Relying on MapReduce for execution, Hive is batch oriented and has high latency for query execution.

Image source: hive.apache.org

Hive Characteristics

- Hive is a system for organizing unstructured data into a structured format and then managing and querying it.
- It uses MapReduce for execution and HDFS for storage and retrieval of data.
- The Hive engine applies built-in optimizations that aim to reduce execution time while keeping throughput high.
- Hive commands are similar to SQL, the query language used by traditional data warehousing tools.

Principles of Hive

- Interoperability: an extensible framework to support different file and data formats
- Extensibility: pluggable MapReduce scripts in the language of your choice, rich user-defined data types, and user-defined functions

System Architecture and Components of Hive

The image illustrates the architecture of the Hive system. It also displays the role of Hive and Hadoop in the development process.

Metastore

The metastore is the component that stores the system catalog and metadata about tables, columns, partitions, and so on.
- Metadata is stored in a traditional RDBMS format.
- Apache Hive uses the Derby database by default.
- Any JDBC-compliant database, such as MySQL, can be used for the metastore.

Metastore Configuration

The key attributes that should be configured for the Hive metastore are given below.

Metastore Configuration Template

The hive-site.xml file is used to configure the metastore. A template for the file is displayed here.

Driver

The driver is the component that manages the lifecycle of a Hive Query Language (HiveQL) statement as it moves through Hive. It also maintains a session handle and any session statistics.

Query Compiler

The query compiler compiles HiveQL into a Directed Acyclic Graph (DAG) of MapReduce tasks.

Query Optimizer

The query optimizer:
- consists of a chain of transformations, so that the operator DAG resulting from one transformation is passed as input to the next transformation
- performs tasks such as column pruning, partition pruning, and repartitioning of data

Execution Engine

The execution engine:
- executes the tasks produced by the compiler in proper dependency order
- interacts with the underlying Hadoop instance to stay synchronized with Hadoop services

Hive Server

Hive Server provides a Thrift interface and a Java Database Connectivity / Open Database Connectivity (JDBC/ODBC) server. It enables the integration of Hive with other applications.

Client Components

A developer uses the client components to perform development in Hive. The client components include the Command Line Interface (CLI), the web user interface (UI), and the JDBC/ODBC driver.

Basics of the Hive Query Language

Hive Query Language (HQL) is the query language for the Hive engine. Hive supports basic SQL features such as:
- subqueries in the FROM clause
- ANSI JOIN (equi-join only)
- multi-table insert
- multi group-by
- sampling
- objects traversal

HQL also supports pluggable MapReduce scripts through TRANSFORM.

Data Model: Tables

Tables in Hive are analogous to tables in relational databases. A Hive table logically comprises the data being stored and the associated metadata. Each table has a corresponding directory in HDFS.

There are two types of tables in Hive:
- Managed tables
- External tables

Data Model: Tables (contd.)

The command used to create a table in Hive is:

CREATE TABLE t1(ds string, ctry float, li list<map<string, struct<p1:int, p2:int>>>);

The HDFS directory of the table is /apps/hive/warehouse/t1.

Data Model: External Tables

Listed below are the key considerations for an external table:
- It points to an existing data directory in HDFS.
- Tables and partitions can be created on it.
- The data is assumed to be available in a Hive-compatible format.
- Dropping the external table drops only the metadata.

The command used to create an external table is:

CREATE EXTERNAL TABLE test_extern(c1 string, c2 int) LOCATION '/user/mytables/mydata';

Hadoop Configuration: Data Types in Hive

The data types in Hive are as follows.

Primitive types
- Integers: TINYINT, SMALLINT, INT, and BIGINT
- Boolean: BOOLEAN
- Floating point numbers: FLOAT and DOUBLE
- String: STRING

Complex types
- Structs: {a INT; b INT}
- Maps: M['group']
- Arrays: ['a', 'b', 'c']; A[1] returns 'b'

User-defined types
- Structures with attributes
- Attributes can be of any type

Data Model: Partitions

Partitions are analogous to dense indexes on columns. Following are the
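
As a hedged illustration of the complex types listed under Data Types in Hive, the HiveQL sketch below declares one column of each kind and reads a value from each. The table name type_demo and its column names are hypothetical and do not come from the lesson. The statements assume a working Hive session and use Hive's STRUCT, MAP, and ARRAY type keywords; note that Hive's DDL spells the list type as ARRAY.

```sql
-- Hypothetical table; all names here are illustrative only.
CREATE TABLE type_demo (
  s   STRUCT<a:INT, b:INT>,   -- struct with two INT attributes
  m   MAP<STRING, STRING>,    -- map with string keys and string values
  arr ARRAY<STRING>           -- ordered list of strings
);

-- Struct fields are read with dot notation, map values by key,
-- and array elements by zero-based index (arr[1] is the second element).
SELECT s.a, m['group'], arr[1]
FROM type_demo;
```

For a row whose arr column holds 'a', 'b', 'c', the expression arr[1] yields 'b', which matches the A[1] example in the lesson.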
