SK DATA SHARE: Performance Optimization Techniques in Hive (Class -17)

There are several types of Hive Query Optimization techniques are available to optimize hive queries to execute them faster on Hadoop Cluster.

Hive is a query language which is similar to SQL built on Hadoop ecosystem can process Penta bytes of data.

1) 1) Partitions

1) 2) Bucketing

1) 3) Tez-Execution

1) 4) CBO à Cost based optimization

1) 5) Vectorization

1) Hive Partitions:

Partitioning divides the table into parts based on the values of particular columns. A table can have multiple partition columns to identify a particular partition. Using partition it is easy to do queries on slices of the data. The data of the partition columns are not saved in the files.

When we query data on a partitioned table, it will only scan the relevant partitions to be queried and skips irrelevant partitions.

CREATE TABLE table_name (column1 data_type, column2 data_type, …) PARTITIONED BY (partition1 data_type, partition2 data_type,….);

Partitioning is basically of two types: Static and Dynamic. Well, names are very much self-explanatory.

2) Bucketing in Hive:

Bucketing provides flexibility to further segregate the data into more manageable sections called buckets or clusters. Bucketing is based on the hash function, which depends on the type of the bucketing column.

CLUSTERED BY clause is used to divide the table into buckets. It works well for the columns having high cardinality.

CREATE TABLE table_name (column1 data_type, column2 data_type, …) PARTITIONED BY (partition1 data_type, partition2 data_type,….) CLUSTERED BY (clus_col1) SORTED BY (sort_col2) INTO n BUCKETS;

Property of Hive Bucketing:

set hive.enforce.bucketing = true;

In Hive Partition, each partition will be created as a directory. But in Hive Buckets, each bucket will be created as a file.

3) Tez-Execution:

Apache Tez is a client-side library which operates like an execution engine is a new application framework built on Hadoop Yarn. It executes complex-directed acyclic graphs of general data processing tasks. However, we can consider it to be a much more flexible and powerful successor to the map-reduce framework.

4) Cost-Based Optimization in Hive (CBO):

Before submitting for final execution Hive optimizes each Query’s logical and physical execution plan.

To use CBO, set the following parameters at the beginning of your query:

set hive.cbo.enable=true;

set hive.compute.query.using.stats=true;

set hive.stats.fetch.column.stats=true;

set hive.stats.fetch.partition.stats=true;

5) Vectorization in Hive:

Vectorization operations refer to scans, aggregations, filters, and joins. It happens by performing them in batches of 1024 rows at once instead of single row each time. This feature is introduced in Hive 0.13.

It significantly improves query execution time, and is easily enabled with two parameters settings:

set hive.vectorized.execution = true;

set hive.vectorized.execution.enabled = true;

hive> set hive.vectorized.execution.enabled = true;

hive> create table vectorizedtable(state string,id int) stored as orc;

hive> explain select count(*) from vectorised table;

SK DATA SHARE

Labels

Wednesday, March 23, 2022

Performance Optimization Techniques in Hive (Class -17)

No comments:

Post a Comment

About Me

Data Posts

Contact Form

Followers

Data Analytics (DA) Course

Wikipedia

Total Pageviews