
Thursday, March 17, 2022

Introduction to HIVE Architecture (Class -7)

HIVE, developed at Facebook and open-sourced in 2008, is a data warehouse infrastructure tool to process structured data in Hadoop. It is a framework built on top of Hadoop to summarize Big Data, and it makes querying and analysis easy.

Without learning SQL, you cannot work on HIVE. Facebook later contributed the HIVE project to the Apache Software Foundation, where it became "Apache Hive", often seen as a competitor to Apache Pig, which was developed at Yahoo.

Data present in the part files of Hadoop is converted into the form of tables (rows and columns) by HIVE, which then processes the tabular data using a SQL-like language popularly known as HQL (Hive Query Language), whose syntax largely follows the SQL-92 standard.

Whatever we do in HIVE, we see reflected in HDFS, but not vice versa. In this setup, Hive does not process XML / JSON / RC / Sequence data; only the AVRO / PARQUET / ORC file formats are allowed.

Hive itself does not have storage; only HDFS has storage. Hive provides a schema for the structured data inside HDFS so that we can create tables and do the analysis (schema-on-read), as sketched below. Hive processing is always comparatively slow.
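
As a minimal sketch (the table name, columns, and HDFS path below are hypothetical), this is how Hive lays a schema over data that already sits in HDFS:

    -- External table: Hive supplies only the schema; the data stays in HDFS.
    CREATE EXTERNAL TABLE orders (
        order_id  INT,
        customer  STRING,
        amount    DOUBLE
    )
    STORED AS PARQUET                     -- one of the formats used here
    LOCATION '/user/hive/data/orders';    -- hypothetical HDFS path

    -- Dropping an external table removes only the schema, not the HDFS files.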

From 2020 onwards, Cloudera Hadoop (CDH) clusters have been migrating to CDP (Cloudera Data Platform), following Cloudera's merger with Hortonworks; CDP is costlier, partly because it runs Hive on the Tez execution engine instead of classic MapReduce.

The earlier MapReduce process is slower because of repeated input and output (disk I/O) operations between stages.

Internal Architecture of Apache Hive:

[Diagram: internal architecture of Apache Hive]

Hive Consists of Mainly 3 core parts:

  1. Hive Clients
  2. Hive Services
  3. Hive Storage and Computing
1. Hive Clients:

          Hive provides different drivers for communication with different types of applications. For instance, Thrift clients are available for Python, Ruby, C++, and other languages to execute Hive commands.

These client applications are useful for executing queries on Hive. Hive has three types of client categorization: Thrift clients, JDBC clients, and ODBC clients.

2. Hive Services:

               
          To process all the queries, Hive has various services, each with a well-defined function. Let's see all those services in brief:

  • Command-line interface (User Interface): It enables interaction between the user and Hive and is the default shell. It provides a command line for executing Hive queries and inspecting the results. We can also use the Hive Web Interface (HWI) to submit queries and interact through a web browser.

  • Hive Driver: It receives queries from different sources and clients, such as the Thrift server, and from the JDBC and ODBC drivers that connect to Hive. This component performs semantic analysis, checking the tables in the Metastore after parsing a query. The driver takes the help of the compiler to perform functions such as parsing, planning, optimization, and the execution of MapReduce jobs.

  • Compiler: Parsing and semantic analysis of the query are done by the compiler. It converts the query into an abstract syntax tree (AST) and then into a DAG of stages. The optimizer, in turn, splits the available tasks; the job of the executor is to run the tasks and monitor their pipeline schedule. (A small EXPLAIN sketch follows this list.)
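
To see the plan the compiler produces, Hive's EXPLAIN command prints the DAG of stages without actually running the query; a small sketch against the hypothetical orders table:

    -- EXPLAIN shows the stage plan the compiler generates for a query.
    EXPLAIN
    SELECT customer, SUM(amount)
    FROM orders
    GROUP BY customer;

    -- The output lists the dependent stages (e.g. a map-reduce stage
    -- followed by a fetch stage) that the execution engine runs in order.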
3. Hive Storage and Computing:

      Hive services such as the Metastore, the file system, and the Job Client in turn communicate with Hive storage and perform the following actions:

  • Metastore: It acts as a central repository to store all the structural information (metadata). It is an important part of Hive, holding details such as table and partition definitions and the locations of the HDFS files; in other words, the Metastore acts as a namespace for tables. The Metastore is considered a separate database that is shared by other components too, and it has two pieces: a service and the backing storage. (See the sketch after this list.)

  • Query results and data loaded into the tables are stored on the Hadoop cluster in HDFS.

  • Execution Engine: All the queries are processed by the execution engine. The engine executes the DAG stage plan, manages the dependencies between the stages, and runs each stage on the correct component.
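
A quick way to see what the Metastore holds is to ask Hive for a table's metadata; a sketch (the table name is hypothetical):

    -- Metadata served from the Metastore, not from the data files themselves:
    SHOW TABLES;                  -- table names in the current database
    DESCRIBE FORMATTED orders;    -- columns, HDFS location, file format,
                                  -- partition information, table type, etc.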
The Hive data model is structured into tables, partitions, and buckets. All of these can be filtered and carry partition keys that help evaluate a query. Hive queries work on the Hadoop framework, not on a traditional database. The Hive server is an interface between remote client queries and Hive, and the execution engine is completely embedded in the Hive server. You can find Hive applied in machine learning and business intelligence, for example in detection processes.


We have 4 Clusters:

a)  DEV  → 70 nodes

b)  SIT  → 70 nodes

c)  UAT  → 160 nodes

d)  PROD → 180 nodes

Just type 'hive' in a Cloudera terminal to enter the Hive shell.
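
A minimal session sketch (the database name is hypothetical):

    -- From a Cloudera terminal, typing 'hive' opens the Hive shell:
    $ hive
    hive> SHOW DATABASES;    -- confirm the shell is connected
    hive> USE sales_db;      -- switch to a (hypothetical) database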


Hadoop will only understand MapReduce code, regardless of which framework (PIG / HIVE / SPARK) resides on top of it.

Once data gets stored in the data warehouse, it is connected to reporting tools to visualize the data. That is the reason Hive is OLAP (OnLine Analytical Processing). There are no row-level DML operations in classic Hive.
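
Data is therefore written in bulk rather than row by row; a sketch of the usual load patterns (paths and table names are hypothetical):

    -- Bulk-load a file from HDFS into a table (no row-level UPDATE/DELETE):
    LOAD DATA INPATH '/user/hive/staging/orders_new' INTO TABLE orders;

    -- Or rewrite a table wholesale from a query:
    INSERT OVERWRITE TABLE orders_summary
    SELECT customer, SUM(amount) FROM orders GROUP BY customer;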

Connect to HiveServer2 through Beeline.
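
A minimal Beeline sketch; the host is a placeholder, and 10000 is HiveServer2's default port:

    $ beeline
    beeline> !connect jdbc:hive2://<host>:10000
    -- after entering the cluster credentials, regular HQL works as usual:
    SHOW DATABASES;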


Whatever command scripts we run in Hive will create corresponding changes in HDFS.
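
For example, creating a database and a managed table produces matching directories under the Hive warehouse path in HDFS (/user/hive/warehouse is the common default; the names below are hypothetical):

    -- In Hive:
    CREATE DATABASE sales_db;
    CREATE TABLE sales_db.orders (order_id INT, amount DOUBLE);

    -- Reflected in HDFS under the default warehouse directory:
    $ hdfs dfs -ls /user/hive/warehouse/sales_db.db
    -- (an 'orders' subdirectory appears for the new table)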


[Diagram: data flow between Hive and the Hadoop system]

From the above diagram, we can get a glimpse of the data flow between Hive and the Hadoop system.

Steps include:

1. Execute the query from the User Interface (UI).

2. Query execution plan: the driver interacts with the compiler to get the plan.

3. The compiler creates the plan, first sending a metadata request to the Metastore.

4. The Metastore sends the metadata information back to the compiler.

5. The compiler communicates the proposed plan for executing the query back to the driver.

6. The driver sends the execution plan to the Execution Engine (EE).

7. The Execution Engine acts as a bridge between Hive and Hadoop HDFS to process the query.

8. The EE first contacts the Name Node and then the Data Nodes to get the values stored in the tables.

9. The Execution Engine, in turn, communicates with Hadoop daemons such as the Name Node, Data Nodes, and Job Tracker to execute the query on top of the Hadoop file system.

10. The Execution Engine communicates bi-directionally with the Metastore in Hive to perform DDL (Data Definition Language) operations, such as CREATE, DROP, and ALTER on tables and databases. The Metastore stores only information such as database names, table names, and column names; it fetches the metadata related to the query in question.

Hive is all about partitioning and bucketing of tables, as in the closing sketch below.
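
A sketch of a partitioned and bucketed table (all names are hypothetical):

    -- Partitioning creates a subdirectory per partition value;
    -- bucketing splits each partition into a fixed number of files.
    CREATE TABLE orders_part (
        order_id INT,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (order_id) INTO 8 BUCKETS
    STORED AS ORC;

    -- Loading into a specific partition
    -- (older Hive versions need: SET hive.enforce.bucketing = true;):
    INSERT INTO TABLE orders_part PARTITION (order_date = '2022-03-17')
    SELECT order_id, amount FROM orders;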
