There are two types of partitions in Hive - Static and Dynamic - and both operate on the same basic principles of partitioning. Once the partitions are created, the tables show no difference between static and dynamic partitions. All partitions are treated as one and the same.
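As a minimal sketch of the difference, assuming hypothetical `sales` and `staging_sales` tables: in a static partition the value is named in the statement, while in a dynamic partition Hive derives it from the data.

```sql
-- Hypothetical table partitioned by country (names are illustrative).
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (country STRING);

-- Static partition: the partition value is stated explicitly.
INSERT INTO TABLE sales PARTITION (country='US')
SELECT id, amount FROM staging_sales WHERE country='US';

-- Dynamic partition: Hive derives the partition value from the last
-- column of the SELECT. Nonstrict mode allows fully dynamic inserts.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sales PARTITION (country)
SELECT id, amount, country FROM staging_sales;
```

Either way, the resulting partitions are identical on HDFS; only the loading mechanism differs.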
Wednesday, March 30, 2022
Tuesday, March 29, 2022
Partitioning vs Bucketing - Key Differences in Hive (Class -19)
For data storage, Hive has four main components for organizing data: databases, tables, partitions and buckets.
Internal vs External Tables - Key Differences in Hive (Class -18)
The main difference between an internal table and an external table in Hive:
An Internal table is also called a managed table, meaning it's "managed" by Hive. When you 'DROP' an internal table, Hive deletes the schema/definition (the metadata) and also physically deletes the data/rows (truncation) associated with that table from the Hadoop Distributed File System (HDFS).
An External table is not “managed” by Hive. When you drop an external table, the schema/table definition is deleted and gone, but the data/rows associated with it are left alone. The table’s rows are not deleted.
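A short sketch of the contrast, using hypothetical table names and an illustrative HDFS location:

```sql
-- Internal (managed) table: DROP removes both metadata and data in HDFS.
CREATE TABLE emp_internal (id INT, name STRING);

-- External table: DROP removes only the metadata; files under LOCATION survive.
CREATE EXTERNAL TABLE emp_external (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/ext/emp';

DROP TABLE emp_internal;   -- schema and data removed from HDFS
DROP TABLE emp_external;   -- schema removed; /user/hive/ext/emp left intact
```

External tables are the usual choice when other tools also read the same HDFS files.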
Wednesday, March 23, 2022
Performance Optimization Techniques in Hive (Class -17)
There are several Hive query optimization techniques available to make Hive queries execute faster on a Hadoop cluster. Hive is a query language similar to SQL, built on the Hadoop ecosystem, and it can process petabytes of data.
1) Partitions
2) Bucketing
3) Tez Execution
4) CBO → Cost-Based Optimization
5) Vectorization
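The engine-level techniques in the list (Tez, CBO, vectorization) are typically switched on through session settings; a sketch of the commonly used properties:

```sql
-- Switch the execution engine from MapReduce to Tez.
SET hive.execution.engine=tez;

-- Enable cost-based optimization and the statistics it relies on.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- Enable vectorized query execution (rows processed in batches).
SET hive.vectorized.execution.enabled=true;
```

Partitions and bucketing, by contrast, are applied at table-design time rather than via settings.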
JOINS in Hive (Class -16)
HIVE JOIN is a clause used for combining specific fields from two tables by using values common to each. It combines records from two or more tables in the database, and it is more or less similar to SQL JOIN.
Following is the syntax of Hive Join - HiveQL Select Clause.
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition]
Basically, there are 4 types of Hive Join. Such as:
1. Inner Join in Hive
2. Left Outer Join in Hive
3. Right Outer Join in Hive
4. Full Outer Join in Hive
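The four join types can be sketched against hypothetical `customers` and `orders` tables (names are illustrative):

```sql
-- Inner join: only customers that have matching orders.
SELECT c.id, c.name, o.amount
FROM customers c JOIN orders o ON (c.id = o.customer_id);

-- Left outer join: all customers, NULL order columns where no match.
SELECT c.id, c.name, o.amount
FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.customer_id);

-- Right outer join: all orders, NULL customer columns where no match.
SELECT c.id, c.name, o.amount
FROM customers c RIGHT OUTER JOIN orders o ON (c.id = o.customer_id);

-- Full outer join: all rows from both sides.
SELECT c.id, c.name, o.amount
FROM customers c FULL OUTER JOIN orders o ON (c.id = o.customer_id);
```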
How to execute SQL script through Hive? (Class -15)
The Hive scripts are used to collectively execute a series of SQL Commands. In other words, multiple SQL statements can be grouped and executed all at once. This helps to reduce the time and effort expended in manually drafting and executing any command.
a) Create a database → CREATE DATABASE dbname;
b) Create a table
c) Load the data
d) Select the data
We need to do the above steps in a script and run it in Cloudera / CloudxLab.
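The four steps above can be sketched as one script file (table and path names are illustrative):

```sql
-- sample.hql: create a database, create a table, load data, select data.
CREATE DATABASE IF NOT EXISTS demo_db;
USE demo_db;

CREATE TABLE IF NOT EXISTS employees (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/cloudera/employees.csv' INTO TABLE employees;

SELECT * FROM employees LIMIT 10;
```

The whole file then runs in one shot with `hive -f sample.hql` (or `beeline -f sample.hql`).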
How to handle the Incremental Data in Hive? (Class -14)
Incremental data is dynamic data: updated data that arrives on a day-to-day basis.
We use VIEWS to handle the incremental data.
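One common way to do this - a sketch, assuming a hypothetical `orders_incremental` table with a `load_date` column recording when each row arrived - is a view that exposes only the latest version of each record:

```sql
-- View over a daily-loaded table that keeps only the newest row per id.
CREATE VIEW latest_orders AS
SELECT id, amount, load_date
FROM (
  SELECT id, amount, load_date,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY load_date DESC) AS rn
  FROM orders_incremental
) t
WHERE rn = 1;
```

Queries then read from `latest_orders` and always see the most recent state, without rewriting the base table.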
SerDe properties of Hive (Class -13)
SerDe means Serializer and Deserializer. Hive uses SerDe and FileFormat to read and write table rows. The main use of the SerDe interface is for IO operations.
The process of compressing the huge data and serializing it to a binary value format before transferring it to HDFS over the network is called "Serialization".
Data transferred through the serialization process finally reaches HDFS in the form of an AVRO or Parquet file format.
Binary values of data in HDFS are converted to a human-readable format through the deserialization process.
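A SerDe is attached to a table in its DDL; a sketch using Hive's built-in OpenCSVSerde (column names are illustrative):

```sql
-- ROW FORMAT SERDE picks the (de)serializer; SERDEPROPERTIES configure it.
CREATE TABLE csv_table (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE;
```

Note that OpenCSVSerde exposes every column as STRING, so casts are applied at query time when numeric types are needed.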
Tuesday, March 22, 2022
ACID Transactions in Hive (Class -12)
Hive supports all ACID properties which enable us to use transactions, create transactional tables, and run queries like Insert, Update, and Delete on tables.
A → Atomicity (an operation either succeeds completely or fails; it does not leave partial data)
C → Consistency (once an application performs an operation, the results of that operation are visible to it in every subsequent operation)
I → Isolation (an incomplete operation by one user does not cause unexpected side effects for other users)
D → Durability (once an operation is complete, it will be preserved even in the face of machine or system failure)
ACID properties apply only to internal (managed) tables, not external tables.
We can do DML (Insert, Delete & Update) operations in Hive if the version is greater than 0.13.
To perform DML operations, certain conditions need to be followed for ACID properties:
a) The table must be bucketed
b) The table must be stored in ORC file format
c) The transactional table property must be set to true
Create a Hive table in bucketed format following the ORC properties:
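A sketch of such a DDL, using a hypothetical `emp_txn` table:

```sql
-- Transactional table: bucketed, stored as ORC, transactional = true.
CREATE TABLE emp_txn (id INT, name STRING, salary DOUBLE)
CLUSTERED BY (id) INTO 3 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```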
The table above is created in 3-bucket form, stored in ORC file format, with the transactional property set to true.
Now do the Insert operations.
Now update a record in Hive.
Now delete a record from Hive:
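The three DML operations can be sketched as follows, assuming a hypothetical transactional bucketed ORC table named `emp_txn`:

```sql
-- Insert rows into the transactional table.
INSERT INTO emp_txn VALUES (1, 'Ravi', 50000.0), (2, 'Priya', 60000.0);

-- Update a record in place.
UPDATE emp_txn SET salary = 65000.0 WHERE id = 2;

-- Delete a record.
DELETE FROM emp_txn WHERE id = 1;
```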
Go to the default warehouse path in Cloudera / CloudxLab:
For every transaction, One Delta File will be created.
Sunday, March 20, 2022
Hive Metastore (Class -11)
The central repository location where Hive stores all the schemas [column names, data types, etc.] is called the 'Metastore' (by default, Derby). It stores metadata for Hive tables (like their schema and location) and partitions in a relational database, and it provides clients access to this information by using the metastore service API.
By default, Hive uses a built-in Derby SQL server.
Now when you run your Hive query using the default Derby database, you will find that your current directory contains a new sub-directory, metastore_db.
The default value of the connection property (javax.jdo.option.ConnectionURL) is jdbc:derby:;databaseName=metastore_db;create=true. This value specifies that you will be using the embedded Derby as your Hive metastore, and that the location of the metastore is metastore_db.
We can also configure the directory where Hive stores table information. By default, the location of the warehouse is file:///user/hive/warehouse, and we can use the hive-site.xml file to configure a local or remote metastore.
Most projects use a MySQL server as the metastore (holding the Hive table schema details).
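A sketch of how hive-site.xml points the metastore at a MySQL server (hostname, database name, and credentials below are illustrative placeholders):

```xml
<!-- hive-site.xml: MySQL-backed metastore (values are illustrative). -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```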
Friday, March 18, 2022
Bucketing in Hive (Class -10)
Bucketing in Hive is a data-organizing technique quite similar to partitioning, with the added functionality that it divides large datasets into more manageable parts known as buckets. It is used when the implementation of partitioning becomes difficult.
The concept of bucketing is based on the hashing technique.
Bucketing creates files, while partitioning creates folders.
Bucketing works on the principle of modulo: the hash of the current column value modulo the number of required buckets determines the bucket (say, hash(x) % 2).
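A sketch of a bucketed table definition, with hypothetical names:

```sql
-- 4 buckets on user_id: each row lands in bucket hash(user_id) % 4,
-- so Hive writes 4 files (not folders) for the table.
CREATE TABLE users_bucketed (user_id INT, name STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC;

-- Older Hive versions need this before inserting into bucketed tables:
SET hive.enforce.bucketing = true;
INSERT INTO users_bucketed SELECT user_id, name FROM users_raw;
```

Bucketing on the join key also enables bucket map joins between tables bucketed the same way.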
Partitions in Hive (Class -9)
Apache Hive allows us to organize the table into multiple partitions where we can group the same kind of data together. It is used for distributing the load horizontally. These smaller logical tables are not visible to users and users still access the data from just one table.
Hive organizes tables into partitions → a way of dividing a table into related parts based on the values of particular columns like date, city, and department.
Each table in Hive can have one or more partition keys to identify a particular partition. Using partitions, it is easy to run queries on slices of the data.
Partitioning in Hive distributes execution load horizontally.
With partitions, queries execute faster because they scan a lower volume of data.
When you load data into a partitioned table, Hive internally splits the records based on the partition key and stores each partition's data in a sub-directory of the table's directory on HDFS. The name of the directory is the partition key and its value.
· Static Partitioning
· Dynamic Partitioning
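A sketch of the directory layout this produces, with an illustrative `logs` table and file path:

```sql
CREATE TABLE logs (msg STRING)
PARTITIONED BY (dt STRING, city STRING);

-- Static load into one partition; on HDFS the layout becomes
--   .../warehouse/logs/dt=2022-03-20/city=hyd/
LOAD DATA LOCAL INPATH '/tmp/logs_hyd.txt'
INTO TABLE logs PARTITION (dt='2022-03-20', city='hyd');
```

A query filtering on `dt` and `city` then reads only the matching sub-directories.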
Thursday, March 17, 2022
How to create a Table in Hive? (Class -8)
A table in Hive is a set of data that uses a schema to sort the data by given identifiers.
The way of creating tables in Hive is very similar to the way we create tables in SQL.
We can perform the various operations with these tables like Joins, Filtering, etc.
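A minimal sketch of a table definition (names and delimiter are illustrative):

```sql
-- Simple managed table over comma-delimited text.
CREATE TABLE IF NOT EXISTS students (
  id    INT,
  name  STRING,
  grade STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```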
Introduction to HIVE Architecture (Class -7)
HIVE, developed at Facebook, is a data warehouse infrastructure tool to process structured data in Hadoop. It is a framework built on top of Hadoop to summarize Big Data, and it makes querying and analyzing easy.
Without learning SQL, you cannot work on HIVE. Facebook later donated the HIVE project to Apache, where it became "Apache Hive", launched in competition with Apache PIG, developed by Yahoo.
Data present in the part files of Hadoop is converted into the form of tables [rows & columns] by HIVE, which then processes the tabular data using SQL, popularly known as 'HQL' (Hive Query Language), which follows SQL-92 syntax.
Whatever we do in HIVE, we see reflected in HDFS, but not vice versa. Hive cannot process raw XML or JSON data without a custom SerDe; file formats such as AVRO / PARQUET / ORC are supported directly.