The Bucketing in Hive is a data organizing technique which is quite similar to partitioning in Hive with an added functionality that it divides large datasets into more manageable parts known as Buckets, when the implementation of partitioning becomes difficult.
The
concept of bucketing is based on the hashing technique.
Bucketing
will create Files while Partition will create Folders.
Bucketing
works on the principle of Modulo. Here, modules of current column value
and the number of required buckets is calculated (let say, F(x) % 2).
Bucketing comes into play when partitioning hive data sets into segments is not effective and can overcome over partitioning. “CLUSTERED BY” clause is used to do bucketing in Hive. The SORTED BY clause ensures local ordering in each bucket, by keeping the rows in each bucket ordered by one or more columns.
To read and store data in buckets, a hashing algorithm is used to calculate the bucketed column value (simplest hashing function is modulus). For example, if we decide to have a total number of buckets to be 10, data will be stored in column value % 10, ranging from 0-9 (0 to n-1) buckets.
Bucketing is preferred for high cardinality columns as files are physically split into buckets. Each bucket in the Hive is created as a file. Bucket numbering is 1- based.
Steps for
Bucketing Table:
- First, select the Database in which we want to create a table.
- Use Database.
- Create a Table to Load the Table.
- Load the Data (Local Inpath) into Table.
- Enable the bucketing by using the following command: -
hive> set hive.enforce.bucketing = true;
- Create a bucketing table by using the following command: -
hive> create table emp_bucket(Id int, Name string , Salary float) clustered by (Id) into 2 buckets row format delimited fields terminated by ',' ;
- Insert the Data table into the bucketed table.
- We can see that the Data is divided into two buckets.
- Let's retrieve the data of buckets 0 & 1.
Example Data below:
Create a
Bucketing Table:
To create a Hive table with
bucketing, use CLUSTERED BY clause
with the column name you wanted to bucket and the count of the buckets.
Loading/inserting data into the Bucketing table would be
the same as inserting data into the table.
If you are using Hive < 2.x version, you
need to set the hive.enforce.bucketing
property
to true
. You don’t have to set this
if you are using Hive 2.x or later.
See 1
Mapper and 2 Reducers.
See 1
Mapper and 3 Reducers
Now use
product and String in Bucketing:
Now Create Bucketing on Date Column:
Bucketing
for Category
No comments:
Post a Comment