SK DATA SHARE

Wednesday, March 23, 2022

JOINS in Hive (Class -16)

HIVE JOIN is a clause that is used for combining specific fields from two tables by using values common to each one. It is used to combine records from two or more tables in the database.

In other words, to combine records from two or more tables in the database we use JOIN clause. However, it is more or less similar to SQL JOIN. Also, we use it to combine rows from multiple tables.

Following is a syntax of Hive Join – HiveQL Select Clause.
join_table:
  table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference
  join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition]

Basically, there are 4 types of Hive Join. Such as:

1. Inner join in Hive

2. Left Outer Join in Hive

3. Right Outer Join in Hive

4. Full Outer Join in Hive

How to execute SQL script through Hive? (Class -15)

The Hive scripts are used to collectively execute a series of SQL Commands. In other words, multiple SQL statements can be grouped and executed all at once. This helps to reduce the time and effort expended in manually drafting and executing any command.

Hive supports scripting from versions of Hive 0.10.0 and beyond.

Steps need to follow such as:

a) Create a Database à create db name;

b) Create a table

c) Load the data

d) Select the data

We need to do above steps in script.

Go to Cloudera / Cloudxlab:

Develop the HQL script.

How to handle the Incremental Data in Hive? (Class -14)

Incremental data is Dynamic data which is updated data receives on day-to-day basis.

We use VIEWS to handle the incremental data.

2 types of VIEWS:

a) SIMPLE VIEW à creates view on single table.

b) COMPLEX VIEW à creates view on multiple tables.

SerDe properties of Hive (Class -13)

SerDe means Serializer and Deserializer. Hive uses SerDe and FileFormat to read and write table rows. Main use of SerDe interface is for IO operations.

Process of compressing the huge Data and serialize it to binary value format before transfer through HDFS though Network is called “Serialization”.

Data transferred through Serialization process finally reach the HDFS in the form of AVRO file format / Parquet file format.

Binary values of data in HDFS are converted to Human readable format through Deserialization process.

Hive supports various file formats like CSV , TEXT, ORC , PARQUET etc. We can change the file formats using the SET FILEFORMAT statement.

a) Location where AVRO file exists.

b) Scheme location - AVRO Schema.

ACID Transactions in Hive (Class -12)

Hive supports all ACID properties which enable us to use transactions, create transactional tables, and run queries like Insert, Update, and Delete on tables.

ACID stands for four traits of database transactions:

A à Atomicity (an operation either succeeds completely or fails, it does not leave partial data)

C à Consistency (once an application performs an operation the results of that operation are visible to it in every subsequent operation),

I à Isolated (an incomplete operation by one user doesn’t cause unexpected side effects for other users)

D à Durability (once an operation is complete it will be preserved even in the face of machine or system failure).

Acid Properties can be done on Internal tables not External.

We can do DML (Insert, Delete & Update) operations in Hive if version is greater than > 0.13.

To perform DML Operations, Certain conditions need to be followed in ACID properties:

a) Bucketed table

b) b) ORC

c) Transaction Property has to be true

Create Hive table in bucketed format following ORC properties:

hive> create table Hive_DML(EmployeeID Int,FirstName String,Designation String, Salary Int,Department String) clustered by (department) into 3 buckets stored as orc TBLPROPERTIES (‘transactional’=true’);