Hive, developed at Facebook, is a data warehouse infrastructure tool to process structured data in Hadoop. It is a framework built on top of Hadoop to summarize Big Data, and it makes querying and analysis easy. Without knowing SQL, you cannot work on Hive. Facebook later donated the Hive project to the Apache Software Foundation, where it became "Apache Hive" (a top-level Apache project since 2010), and it is often seen as a competitor to Apache Pig, which was developed at Yahoo.
Data present in the part files of Hadoop is exposed in the form of tables (rows and columns) by Hive, which then processes the tabular data using a SQL dialect popularly known as HQL (Hive Query Language), which largely follows the SQL-92 syntax.
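As a minimal sketch of what an HQL query looks like (the sales table and its columns below are hypothetical), the syntax is plain SQL-92 style:

    SELECT   region,
             SUM(amount) AS total_amount
    FROM     sales
    WHERE    sale_date >= '2020-01-01'
    GROUP BY region
    ORDER BY total_amount DESC;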
Whatever we do in Hive, we see the reflection in HDFS, but not vice versa. Hive will not process XML / JSON / RC / Sequence data here; only the AVRO / PARQUET / ORC file formats are allowed.
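A hedged sketch of declaring one of those formats at table-creation time; the orders table and its columns are hypothetical:

    CREATE TABLE orders (
        order_id  BIGINT,
        customer  STRING,
        amount    DOUBLE
    )
    STORED AS ORC;    -- could equally be STORED AS PARQUET or STORED AS AVRO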
Hive itself has no storage; only HDFS has storage. Hive provides the schema over the structured data inside HDFS so that we can create tables and do the analysis. Hive processing is always comparatively slow.
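Since the data already lives in HDFS, a common pattern is an external table that simply lays a schema over an existing directory; the path, table, and column names below are placeholders:

    CREATE EXTERNAL TABLE customers_ext (
        customer_id  BIGINT,
        name         STRING,
        city         STRING
    )
    STORED AS PARQUET
    LOCATION '/data/warehouse/customers';   -- existing HDFS directory (placeholder path)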
From 2020 onwards, all Cloudera Hadoop (CDH) clusters are migrating to CDP (Cloudera Data Platform), as Cloudera merged with Hortonworks; CDP is costly and runs Hive on the Tez engine rather than classic MapReduce. The earlier MapReduce processing is slower because of its repeated input and output (disk I/O) operations between stages.
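On clusters where Tez is installed, the engine used for a session can be checked and switched with the hive.execution.engine property (the available values depend on the installation):

    SET hive.execution.engine;        -- show the current engine (mr, tez, or spark)
    SET hive.execution.engine=tez;    -- use Tez for this session to avoid MapReduce's repeated disk I/O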
Internal Architecture of Apache Hive:
Hive consists of mainly three core parts:
- Hive Clients
- Hive Services
- Hive Storage and Computing
1. Hive Clients:
Hive provides different drivers for communication with different types of applications. These include Thrift clients for executing Hive commands, available for Python, Ruby, C++ and other languages, along with JDBC and ODBC drivers. These client applications are used to execute queries on Hive. Hive has three types of clients: Thrift clients, JDBC clients, and ODBC clients.
2. Hive Services:
- Command-Line Interface (User Interface): It is the default shell that enables interaction between the user and Hive, where Hive commands are executed and results inspected. We can also use the Hive Web Interface (HWI) to submit queries and interact through a web browser.
- Hive Driver: It receives queries from different sources and clients, such as the Thrift server and the JDBC and ODBC drivers that connect to Hive. For each query it performs semantic analysis by looking up the table information in the metastore after the query has been parsed. The driver takes the help of the compiler and coordinates functions such as parsing, planning, optimization, and execution of the MapReduce jobs.
- Compiler: Parsing and semantic analysis of the query are done by the compiler. It converts the query into an abstract syntax tree and then into a DAG of execution stages. The optimizer, in turn, splits the work into tasks, and the job of the executor is to run those tasks and monitor their pipeline schedule (see the EXPLAIN sketch after this list).
- Metastore: It acts as a central repository to store all the structural metadata. It is an important part of Hive, as it holds information such as table and partition definitions and the location of the underlying HDFS files. In other words, the metastore acts as a namespace for tables. The metastore is a separate database that is shared by other components too. It has two pieces: the metastore service and the backing database storage (see the DESCRIBE sketch after this list).
- Query results and the data loaded into tables are stored on HDFS in the Hadoop cluster.
- Execution Engine: All the queries are processed by the execution engine. It executes the DAG stage plan produced by the compiler, manages the dependencies between the stages, and runs each stage on the correct component.
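As a small illustration of the compiler and metastore at work, EXPLAIN prints the stage plan the compiler builds, and DESCRIBE FORMATTED shows the metadata the metastore keeps for a table; the sales table is hypothetical:

    EXPLAIN
    SELECT region, COUNT(*) FROM sales GROUP BY region;   -- prints the stage/DAG plan

    DESCRIBE FORMATTED sales;   -- shows columns, HDFS location, and file format held in the metastore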
We have 4 Clusters:
a) DEV → 70 Nodes
b) SIT → 70 Nodes
c) UAT → 160 Nodes
d) PROD → 180 Nodes
Just type 'hive' in the Cloudera terminal to enter the Hive shell.
Hadoop will only understand MapReduce code, regardless of which framework (Pig / Hive / Spark) resides on top of it.
Once data gets stored in the data warehouse, it is connected to reporting tools to visualize the data. That is the reason Hive is OLAP (OnLine Analytical Processing). There are no row-level DML operations (UPDATE / DELETE) in classic Hive.
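A sketch of the usual OLAP-style pattern in Hive: data is appended or rewritten in bulk rather than updated row by row, and aggregate queries feed the reporting tools. The table names are hypothetical:

    -- Append new data instead of UPDATE-ing existing rows
    INSERT INTO TABLE sales SELECT * FROM sales_staging;

    -- Rewrite a table in bulk instead of DELETE-ing individual rows
    INSERT OVERWRITE TABLE sales_clean
    SELECT * FROM sales WHERE amount IS NOT NULL;

    -- Typical analytical query feeding a reporting tool
    SELECT region, SUM(amount) AS total_amount
    FROM sales_clean
    GROUP BY region;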
Connect to HiveServer2 through the Beeline client.
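A minimal sketch of a Beeline session; the host, port, and user are placeholders for whatever your cluster uses (10000 is the common default HiveServer2 port):

    -- From the shell (placeholder host/port/user):
    --   beeline -u "jdbc:hive2://hiveserver2-host:10000/default" -n hive_user
    -- Once connected, HQL runs exactly as in the old hive CLI:
    SHOW DATABASES;
    USE default;
    SHOW TABLES;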
From the above diagram, we can get a glimpse of the data flow in Hive within the Hadoop system. The steps include:
1. Execute the query from the User Interface (UI).
2. Get the query execution plan: the driver interacts with the compiler to obtain the plan.
3. The compiler creates the plan, first sending a metadata request to the metastore.
4. The metastore sends the metadata information back to the compiler.
5. The compiler communicates the proposed plan for executing the query back to the driver.
6. The driver sends the execution plan to the Execution Engine (EE).
7. The Execution Engine acts as a bridge between Hive and Hadoop HDFS to process the query.
8. The EE first contacts the Name Node and then the Data Nodes to get the values stored in the tables.
9. The Execution Engine, in turn, communicates with Hadoop daemons such as the Name Node, Data Nodes, and Job Tracker to execute the query on top of the Hadoop file system.
10. The Execution Engine communicates bi-directionally with the metastore present in Hive to perform DDL (Data Definition Language) operations. DDL operations such as CREATE, DROP, and ALTER on tables and databases are done here. The metastore stores only information such as database names, table names, and column names; the EE fetches the data related to the query.
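A hedged sketch of the DDL operations mentioned in step 10; the database and table names are hypothetical:

    CREATE DATABASE IF NOT EXISTS retail_db;
    USE retail_db;

    CREATE TABLE sales (
        sale_id  BIGINT,
        region   STRING,
        amount   DOUBLE
    )
    STORED AS ORC;

    ALTER TABLE sales ADD COLUMNS (sale_date STRING);

    DROP TABLE IF EXISTS sales;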
Hive is all about partitioning and bucketing of tables.
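A brief sketch of a partitioned and bucketed table; the table, columns, and bucket count below are hypothetical choices:

    CREATE TABLE sales_part (
        sale_id  BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (region STRING)           -- each region value becomes its own HDFS sub-directory
    CLUSTERED BY (sale_id) INTO 8 BUCKETS    -- rows hashed on sale_id into 8 bucket files
    STORED AS ORC;

    -- Dynamic-partition load from a hypothetical staging table
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT INTO TABLE sales_part PARTITION (region)
    SELECT sale_id, amount, region FROM sales_staging;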