There are several types of Hive Query Optimization techniques are
available to optimize hive queries to execute them faster on
Hadoop Cluster.
Hive is a query
language which is similar to SQL built on Hadoop ecosystem can process Penta bytes of data.
1) 1) Partitions
1) 2) Bucketing
1) 3) Tez-Execution
1) 4) CBO à
Cost based optimization
1) 5) Vectorization
1) Hive Partitions:
Partitioning divides the table into parts based on the
values of particular columns. A table can have multiple partition columns to
identify a particular partition. Using partition it is easy to do queries on
slices of the data. The data of the partition columns are not saved in the
files.
When we query data on a partitioned table, it will
only scan the relevant partitions to be queried and skips irrelevant
partitions.
CREATE TABLE table_name (column1 data_type, column2
data_type, …) PARTITIONED BY (partition1 data_type, partition2 data_type,….);
Partitioning is basically of two types: Static and Dynamic. Well, names are very much self-explanatory.
2) Bucketing in Hive:CLUSTERED BY clause is used to divide the table into buckets. It
works well for the columns having high cardinality.
CREATE TABLE table_name (column1 data_type, column2
data_type, …) PARTITIONED BY (partition1 data_type, partition2 data_type,….)
CLUSTERED BY (clus_col1) SORTED BY (sort_col2) INTO n BUCKETS;
Property of Hive Bucketing:
set hive.enforce.bucketing = true;
In Hive Partition, each partition will be created as a directory. But in Hive Buckets, each bucket will be created as a file.
3) Tez-Execution:
Apache Tez is a client-side library which operates like an execution engine is a new application framework built on Hadoop Yarn. It executes complex-directed acyclic graphs of general data processing tasks. However, we can consider it to be a much more flexible and powerful successor to the map-reduce framework.
4) Cost-Based Optimization in Hive (CBO):
Before submitting for final execution Hive optimizes each
Query’s logical and physical execution plan.
To use CBO, set the following parameters at the beginning of
your query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
5) Vectorization in Hive:
Vectorization operations refer to scans, aggregations, filters, and
joins. It happens by performing them in batches of 1024 rows at once instead of
single row each time. This feature is introduced in Hive 0.13.
It significantly improves query execution time, and is easily enabled
with two parameters settings:
set hive.vectorized.execution = true;
set hive.vectorized.execution.enabled = true;
hive> set
hive.vectorized.execution.enabled = true;
hive> create table vectorizedtable(state
string,id int) stored as orc;
hive>
explain select count(*) from vectorised table;
No comments:
Post a Comment