SK DATA SHARE: Bucketing in Hive (Class -10)

The Bucketing in Hive is a data organizing technique which is quite similar to partitioning in Hive with an added functionality that it divides large datasets into more manageable parts known as Buckets, when the implementation of partitioning becomes difficult.

In other words, Hive Bucketing is a way to split the table into a managed number of clusters with or without partitions.

The concept of bucketing is based on the hashing technique.

Bucketing will create Files while Partition will create Folders.

Bucketing works on the principle of Modulo. Here, modules of current column value and the number of required buckets is calculated (let say, F(x) % 2).

Bucketing comes into play when partitioning hive data sets into segments is not effective and can overcome over partitioning. “CLUSTERED BY” clause is used to do bucketing in Hive. The SORTED BY clause ensures local ordering in each bucket, by keeping the rows in each bucket ordered by one or more columns.

To read and store data in buckets, a hashing algorithm is used to calculate the bucketed column value (simplest hashing function is modulus). For example, if we decide to have a total number of buckets to be 10, data will be stored in column value % 10, ranging from 0-9 (0 to n-1) buckets.

Bucketing is preferred for high cardinality columns as files are physically split into buckets. Each bucket in the Hive is created as a file. Bucket numbering is 1- based.

Steps for Bucketing Table:

First, select the Database in which we want to create a table.
Use Database.
Create a Table to Load the Table.
Load the Data (Local Inpath) into Table.
Enable the bucketing by using the following command: -

hive> set hive.enforce.bucketing = true;

Create a bucketing table by using the following command: -

hive> create table emp_bucket(Id int, Name string , Salary float) clustered by (Id) into 2 buckets row format delimited fields terminated by ',' ;

Insert the Data table into the bucketed table.
We can see that the Data is divided into two buckets.
Let's retrieve the data of buckets 0 & 1.

Example Data below:

Hive> set hive.enforce.bucketing = true;

Create a Bucketing Table:

To create a Hive table with bucketing, use CLUSTERED BY clause with the column name you wanted to bucket and the count of the buckets.

Loading/inserting data into the Bucketing table would be the same as inserting data into the table.

If you are using Hive < 2.x version, you need to set the hive.enforce.bucketing property to true. You don’t have to set this if you are using Hive 2.x or later.