There are two types of partitions in Hive - Static and Dynamic - and both operate on the same basic principles of partitioning. Once the partitions are created, the tables show no difference between static and dynamic partitions. All partitions are treated as one and the same.
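As a minimal sketch of the difference, assuming hypothetical `sales` and `staging_sales` tables: in a static partition the value is named in the statement, while in a dynamic partition Hive derives it from the data.

```sql
-- Hypothetical table partitioned by country (names are illustrative).
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (country STRING);

-- Static partition: the partition value is stated explicitly.
INSERT INTO TABLE sales PARTITION (country='US')
SELECT id, amount FROM staging_sales WHERE country='US';

-- Dynamic partition: Hive derives the partition value from the last
-- column of the SELECT. Nonstrict mode allows fully dynamic inserts.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sales PARTITION (country)
SELECT id, amount, country FROM staging_sales;
```

Either way, the resulting partitions are identical on HDFS; only the loading mechanism differs.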
Wednesday, March 30, 2022
Tuesday, March 29, 2022
Partitioning vs Bucketing - Key Differences in Hive (Class -19)
For data storage, Hive has four main components for organizing data: databases, tables, partitions and buckets.
Internal vs External Tables - Key Differences in Hive (Class -18)
The main difference between an internal table and an external table in Hive:
An Internal table is also called a managed table, meaning it's "managed" by Hive. When you 'DROP' an internal table, Hive deletes the schema/definition (the metadata) and also physically deletes the data/rows (truncation) associated with that table from the Hadoop Distributed File System (HDFS).
An External table is not “managed” by Hive. When you drop an external table, the schema/table definition is deleted and gone, but the data/rows associated with it are left alone. The table’s rows are not deleted.
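A short sketch of the contrast, using hypothetical table names and an illustrative HDFS location:

```sql
-- Internal (managed) table: DROP removes both metadata and data in HDFS.
CREATE TABLE emp_internal (id INT, name STRING);

-- External table: DROP removes only the metadata; files under LOCATION survive.
CREATE EXTERNAL TABLE emp_external (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/ext/emp';

DROP TABLE emp_internal;   -- schema and data removed from HDFS
DROP TABLE emp_external;   -- schema removed; /user/hive/ext/emp left intact
```

External tables are the usual choice when other tools also read the same HDFS files.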
Wednesday, March 23, 2022
Performance Optimization Techniques in Hive (Class -17)
There are several Hive query optimization techniques available to make Hive queries execute faster on a Hadoop cluster. Hive is a query language similar to SQL, built on the Hadoop ecosystem, and it can process petabytes of data.
1) Partitions
2) Bucketing
3) Tez Execution
4) CBO → Cost-Based Optimization
5) Vectorization
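The engine-level techniques in the list (Tez, CBO, vectorization) are typically switched on through session settings; a sketch of the commonly used properties:

```sql
-- Switch the execution engine from MapReduce to Tez.
SET hive.execution.engine=tez;

-- Enable cost-based optimization and the statistics it relies on.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- Enable vectorized query execution (rows processed in batches).
SET hive.vectorized.execution.enabled=true;
```

Partitions and bucketing, by contrast, are applied at table-design time rather than via settings.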
JOINS in Hive (Class -16)
HIVE JOIN is a clause used for combining specific fields from two tables by using values common to each. It combines records from two or more tables in the database, and it is more or less similar to SQL JOIN.
Following is the syntax of Hive Join - HiveQL Select Clause.
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition]
Basically, there are 4 types of Hive Join. Such as:
1. Inner Join in Hive
2. Left Outer Join in Hive
3. Right Outer Join in Hive
4. Full Outer Join in Hive
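The four join types can be sketched against hypothetical `customers` and `orders` tables (names are illustrative):

```sql
-- Inner join: only customers that have matching orders.
SELECT c.id, c.name, o.amount
FROM customers c JOIN orders o ON (c.id = o.customer_id);

-- Left outer join: all customers, NULL order columns where no match.
SELECT c.id, c.name, o.amount
FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.customer_id);

-- Right outer join: all orders, NULL customer columns where no match.
SELECT c.id, c.name, o.amount
FROM customers c RIGHT OUTER JOIN orders o ON (c.id = o.customer_id);

-- Full outer join: all rows from both sides.
SELECT c.id, c.name, o.amount
FROM customers c FULL OUTER JOIN orders o ON (c.id = o.customer_id);
```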
How to execute SQL script through Hive? (Class -15)
The Hive scripts are used to collectively execute a series of SQL Commands. In other words, multiple SQL statements can be grouped and executed all at once. This helps to reduce the time and effort expended in manually drafting and executing any command.
a) Create a database → CREATE DATABASE dbname;
b) Create a table
c) Load the data
d) Select the data
We need to do the above steps in a script and run it in Cloudera / CloudxLab.
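The four steps above can be sketched as one script file (table and path names are illustrative):

```sql
-- sample.hql: create a database, create a table, load data, select data.
CREATE DATABASE IF NOT EXISTS demo_db;
USE demo_db;

CREATE TABLE IF NOT EXISTS employees (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/cloudera/employees.csv' INTO TABLE employees;

SELECT * FROM employees LIMIT 10;
```

The whole file then runs in one shot with `hive -f sample.hql` (or `beeline -f sample.hql`).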
How to handle the Incremental Data in Hive? (Class -14)
Incremental data is dynamic data: updated data that arrives on a day-to-day basis.
We use VIEWS to handle the incremental data.
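One common way to do this - a sketch, assuming a hypothetical `orders_incremental` table with a `load_date` column recording when each row arrived - is a view that exposes only the latest version of each record:

```sql
-- View over a daily-loaded table that keeps only the newest row per id.
CREATE VIEW latest_orders AS
SELECT id, amount, load_date
FROM (
  SELECT id, amount, load_date,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY load_date DESC) AS rn
  FROM orders_incremental
) t
WHERE rn = 1;
```

Queries then read from `latest_orders` and always see the most recent state, without rewriting the base table.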
SerDe properties of Hive (Class -13)
SerDe means Serializer and Deserializer. Hive uses SerDe and FileFormat to read and write table rows. The main use of the SerDe interface is for IO operations.
The process of compressing the huge data and serializing it to a binary value format before transferring it to HDFS over the network is called "Serialization".
Data transferred through the serialization process finally reaches HDFS in the form of an AVRO or Parquet file format.
Binary values of data in HDFS are converted to a human-readable format through the deserialization process.
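A SerDe is attached to a table in its DDL; a sketch using Hive's built-in OpenCSVSerde (column names are illustrative):

```sql
-- ROW FORMAT SERDE picks the (de)serializer; SERDEPROPERTIES configure it.
CREATE TABLE csv_table (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE;
```

Note that OpenCSVSerde exposes every column as STRING, so casts are applied at query time when numeric types are needed.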
Tuesday, March 22, 2022
ACID Transactions in Hive (Class -12)
Hive supports all ACID properties which enable us to use transactions, create transactional tables, and run queries like Insert, Update, and Delete on tables.
A → Atomicity (an operation either succeeds completely or fails; it does not leave partial data)
C → Consistency (once an application performs an operation, the results of that operation are visible to it in every subsequent operation)
I → Isolation (an incomplete operation by one user does not cause unexpected side effects for other users)
D → Durability (once an operation is complete, it will be preserved even in the face of machine or system failure)
ACID properties apply only to internal (managed) tables, not external tables.
We can do DML (Insert, Delete & Update) operations in Hive if the version is greater than 0.13.
To perform DML operations, certain conditions need to be followed for ACID properties:
a) The table must be bucketed
b) The table must be stored in ORC file format
c) The transactional table property must be set to true
Create a Hive table in bucketed format following the ORC properties:
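A sketch of such a DDL, using a hypothetical `emp_txn` table:

```sql
-- Transactional table: bucketed, stored as ORC, transactional = true.
CREATE TABLE emp_txn (id INT, name STRING, salary DOUBLE)
CLUSTERED BY (id) INTO 3 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```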
The table above is created in 3-bucket form, stored in ORC file format, with the transactional property set to true.
Now do the Insert operations.
Now update a record in Hive.
Now delete a record from Hive:
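The three DML operations can be sketched as follows, assuming a hypothetical transactional bucketed ORC table named `emp_txn`:

```sql
-- Insert rows into the transactional table.
INSERT INTO emp_txn VALUES (1, 'Ravi', 50000.0), (2, 'Priya', 60000.0);

-- Update a record in place.
UPDATE emp_txn SET salary = 65000.0 WHERE id = 2;

-- Delete a record.
DELETE FROM emp_txn WHERE id = 1;
```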
Go to the default warehouse path in Cloudera / CloudxLab:
For every transaction, One Delta File will be created.
Sunday, March 20, 2022
Hive Metastore (Class -11)
The central repository location where Hive stores all the schemas [column names, data types, etc.] is called the 'Metastore' (by default, Derby). It stores metadata for Hive tables (like their schema and location) and partitions in a relational database, and it provides clients access to this information by using the metastore service API.
By default, Hive uses a built-in Derby SQL server.
Now when you run your Hive query using the default Derby database, you will find that your current directory contains a new sub-directory, metastore_db.
The default value of the connection property (javax.jdo.option.ConnectionURL) is jdbc:derby:;databaseName=metastore_db;create=true. This value specifies that you will be using the embedded Derby as your Hive metastore, and that the location of the metastore is metastore_db.
We can also configure the directory where Hive stores table information. By default, the location of the warehouse is file:///user/hive/warehouse, and we can use the hive-site.xml file to configure a local or remote metastore.
Most projects use a MySQL server as the metastore (holding the Hive table schema details).
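A sketch of how hive-site.xml points the metastore at a MySQL server (hostname, database name, and credentials below are illustrative placeholders):

```xml
<!-- hive-site.xml: MySQL-backed metastore (values are illustrative). -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```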
Friday, March 18, 2022
Bucketing in Hive (Class -10)
Bucketing in Hive is a data-organizing technique quite similar to partitioning, with the added functionality that it divides large datasets into more manageable parts known as buckets. It is used when the implementation of partitioning becomes difficult.
The concept of bucketing is based on the hashing technique.
Bucketing creates files, while partitioning creates folders.
Bucketing works on the principle of modulo: the hash of the current column value modulo the number of required buckets determines the bucket (say, hash(x) % 2).
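A sketch of a bucketed table definition, with hypothetical names:

```sql
-- 4 buckets on user_id: each row lands in bucket hash(user_id) % 4,
-- so Hive writes 4 files (not folders) for the table.
CREATE TABLE users_bucketed (user_id INT, name STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC;

-- Older Hive versions need this before inserting into bucketed tables:
SET hive.enforce.bucketing = true;
INSERT INTO users_bucketed SELECT user_id, name FROM users_raw;
```

Bucketing on the join key also enables bucket map joins between tables bucketed the same way.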
Partitions in Hive (Class -9)
Apache Hive allows us to organize the table into multiple partitions where we can group the same kind of data together. It is used for distributing the load horizontally. These smaller logical tables are not visible to users and users still access the data from just one table.
Hive organizes tables into partitions → a way of dividing a table into related parts based on the values of particular columns like date, city, and department.
Each table in Hive can have one or more partition keys to identify a particular partition. Using partitions, it is easy to run queries on slices of the data.
Partitioning in Hive distributes execution load horizontally.
With partitions, queries execute faster because they scan a lower volume of data.
When you load data into a partitioned table, Hive internally splits the records based on the partition key and stores each partition's data in a sub-directory of the table's directory on HDFS. The name of the directory is the partition key and its value.
· Static Partitioning
· Dynamic Partitioning
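A sketch of the directory layout this produces, with an illustrative `logs` table and file path:

```sql
CREATE TABLE logs (msg STRING)
PARTITIONED BY (dt STRING, city STRING);

-- Static load into one partition; on HDFS the layout becomes
--   .../warehouse/logs/dt=2022-03-20/city=hyd/
LOAD DATA LOCAL INPATH '/tmp/logs_hyd.txt'
INTO TABLE logs PARTITION (dt='2022-03-20', city='hyd');
```

A query filtering on `dt` and `city` then reads only the matching sub-directories.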
Thursday, March 17, 2022
How to create a Table in Hive? (Class -8)
A table in Hive is a set of data that uses a schema to sort the data by given identifiers.
The way of creating tables in Hive is very similar to the way we create tables in SQL.
We can perform the various operations with these tables like Joins, Filtering, etc.
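A minimal sketch of a table definition (names and delimiter are illustrative):

```sql
-- Simple managed table over comma-delimited text.
CREATE TABLE IF NOT EXISTS students (
  id    INT,
  name  STRING,
  grade STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```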
Introduction to HIVE Architecture (Class -7)
HIVE, developed at Facebook, is a data warehouse infrastructure tool to process structured data in Hadoop. It is a framework built on top of Hadoop to summarize Big Data, and it makes querying and analyzing easy.
Without learning SQL, you cannot work on HIVE. Facebook later donated the HIVE project to Apache, where it became "Apache Hive", launched in competition with Apache PIG, developed by Yahoo.
Data present in the part files of Hadoop is converted into the form of tables [rows & columns] by HIVE, which then processes the tabular data using SQL, popularly known as 'HQL' (Hive Query Language), which follows SQL-92 syntax.
Whatever we do in HIVE, we see reflected in HDFS, but not vice versa. Hive cannot process raw XML or JSON data without a custom SerDe; file formats such as AVRO / PARQUET / ORC are supported directly.