Labels

Tuesday, March 29, 2022

Partitioning vs Bucketing - Key Differences in Hive (Class -19)

Hive is a distributed Data Warehouse system that manages the data stored in HDFS (Hadoop Distributed File System) and provides a SQL-like language (HiveQL) for querying the data

For data storage, Hive has four main components for organizing data: databases, tables, partitions and buckets.



Partitioning is a mechanism that roughly divides the table based on the value of the partition column. Each partition in Hive corresponds to many subdirectories of the table, and all data is put into different subdirectories according to the partition column. Entire Table does not need to be scanned when searching for data in the partition, which is very helpful for improving search efficiency.

Buckets are implemented by performing hash calculations on specified columns. The data in the partition can be further divided into buckets. Unlike partitions that directly split the columns, the buckets often use the hash value of the column to break up the data and distribute it to different buckets to complete the data bucketing process. When the number of partitions is so large that it may cause the file system to crash , we need to use bucketing to solve the problem.

Differences

Partitioning

Bucketing

Meaning

Hive Partition is a way to split the large table into smaller tables based on the values of a column.

Bucket is Clustering technique to split the data into more manageable files (by specify how many buckets you want).

Similarity

Divide the Data into Multiple Parts & then Scan only One Part of it.

 improve performance by eliminating table scans when dealing with a large set of data on HDFS.

Columns

You can have one or more Partition columns.

You can have only one Bucketing column.

Commands

Uses PARTITIONED BY

Uses CLUSTERED BY

Cardinality

partitioning on a column with Low Cardinality (no. of Partitions = no. of Distinct Values)

Cardinality of a Column is high (fixed no. of Buckets)

Based on

Logical Division

Hash Function of a Column.

Usage

It is effective when the data volume in each partition is not very high.

If some map-side joins are involved in your queries, then bucketed tables are a good option.

Storage

Partition is a Folder.

Bucket is a File.



No comments:

Post a Comment