SK DATA SHARE: Partitioning vs Bucketing - Key Differences in Hive (Class -19)

Tuesday, March 29, 2022

Hive is a distributed Data Warehouse system that manages the data stored in HDFS (Hadoop Distributed File System) and provides a SQL-like language (HiveQL) for querying the data.

For data storage, Hive has four main components for organizing data: databases, tables, partitions and buckets.

Partitioning is a mechanism that divides a table into coarse-grained parts based on the values of one or more partition columns. Each partition in Hive corresponds to a subdirectory of the table's directory, and every row is stored in the subdirectory that matches its partition column value. When a query filters on the partition column, Hive reads only the matching subdirectories instead of scanning the entire table, which is very helpful for improving search efficiency.
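The directory-per-partition idea can be sketched in a few lines of Python. This is only a toy model, not Hive itself: the "sales" table, its rows, the warehouse/sales path and the helper names partition_rows and scan are all made up for illustration. The one detail taken from Hive is the column=value naming of partition directories.

```python
from collections import defaultdict

# Hypothetical rows of a "sales" table; "country" plays the partition column.
rows = [
    {"id": 1, "amount": 120, "country": "US"},
    {"id": 2, "amount": 80, "country": "IN"},
    {"id": 3, "amount": 45, "country": "US"},
    {"id": 4, "amount": 60, "country": "UK"},
]

def partition_rows(rows, column, table_dir="warehouse/sales"):
    """Group rows under one directory path per distinct partition value."""
    partitions = defaultdict(list)
    for row in rows:
        # Hive names partition directories column=value.
        path = f"{table_dir}/{column}={row[column]}"
        # The partition value lives in the path, so drop it from the stored row.
        partitions[path].append({k: v for k, v in row.items() if k != column})
    return dict(partitions)

def scan(partitions, column, value, table_dir="warehouse/sales"):
    """Read only the directory matching the filter; skip all the others."""
    return partitions.get(f"{table_dir}/{column}={value}", [])

partitions = partition_rows(rows, "country")
us_rows = scan(partitions, "country", "US")

print(sorted(partitions))  # one directory per distinct country
print(us_rows)             # only the rows stored under country=US
```

A filter on country touches a single directory here, while a filter on any other column would still have to look at every directory.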

Buckets are implemented by computing a hash of the specified columns. The data in a table, or in each partition, can be further divided into buckets. Unlike partitioning, which splits the data directly by column value, bucketing uses the hash value of the column, taken modulo the number of buckets, to distribute the rows across a fixed number of buckets. When a column has so many distinct values that partitioning on it would create an excessive number of small directories and files and overload the file system (in HDFS, the NameNode has to track every one of them), bucketing is the better choice because the number of buckets is fixed in advance.
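As a rough illustration of hash bucketing, the Python sketch below assigns keys to a fixed number of buckets. It assumes a non-negative integer key that hashes to itself, which is a simplification: the hash Hive actually applies depends on the column type and the Hive version. The names NUM_BUCKETS, bucket_for and bucketize, and the sample user IDs, are hypothetical.

```python
NUM_BUCKETS = 4  # chosen up front; it does not grow with the number of distinct keys

def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map an integer key to a bucket index.

    Assumption: the key's hash is the key itself. Masking to 31 bits keeps
    the value non-negative before the modulo is taken.
    """
    return (key & 0x7FFFFFFF) % num_buckets

def bucketize(keys, num_buckets=NUM_BUCKETS):
    """Distribute keys over a fixed list of buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[bucket_for(key, num_buckets)].append(key)
    return buckets

# Eight distinct user IDs still end up in only four buckets.
user_ids = [101, 102, 103, 104, 205, 306, 407, 508]
buckets = bucketize(user_ids)

for index, members in enumerate(buckets):
    print(f"bucket {index}: {members}")
```

Partitioning the same data on user ID would need eight directories, one per distinct value, whereas the bucket count stays at four however many IDs arrive.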

Differences	Partitioning	Bucketing
Meaning	Partitioning is a way to split a large table into smaller parts based on the values of a column.	Bucketing is a clustering technique that splits the data into more manageable files (you specify how many buckets you want).
Similarity	Divides the data into multiple parts so that a query can scan only the relevant part.	Improves performance by avoiding full table scans when dealing with a large set of data on HDFS.
Columns	You can have one or more partition columns.	You can have one or more bucketing columns (CLUSTERED BY accepts a column list).
Commands	Uses PARTITIONED BY	Uses CLUSTERED BY
Cardinality	Suited to a column with low cardinality (no. of partitions = no. of distinct values).	Suited to a column with high cardinality (the no. of buckets is fixed).
Based on	Logical Division	Hash Function of a Column.
Usage	It is effective when the data volume in each partition is not very high.	If some map-side joins are involved in your queries, then bucketed tables are a good option.
Storage	Partition is a Folder.	Bucket is a File.
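The Usage row mentions map-side joins. A minimal Python sketch of why bucketing helps, assuming both tables are bucketed on the join key with the same number of buckets: bucket i of one table can only match bucket i of the other, so each pair is joined independently and the rest is never compared. The tables, column names and helpers here are hypothetical, and the modulo hash is a simplification of what Hive does.

```python
def bucket_of(key, num_buckets):
    """Simplified bucket function: assumes an integer key hashes to itself."""
    return (key & 0x7FFFFFFF) % num_buckets

def to_buckets(rows, key, num_buckets):
    """Split rows into num_buckets lists by the bucket of their join key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets

def bucketed_join(left_buckets, right_buckets, key):
    """Join bucket i of the left table only with bucket i of the right table."""
    joined = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        # Build a small in-memory lookup for just this one bucket.
        lookup = {}
        for right_row in right_bucket:
            lookup.setdefault(right_row[key], []).append(right_row)
        for left_row in left_bucket:
            for right_row in lookup.get(left_row[key], []):
                joined.append({**left_row, **right_row})
    return joined

orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 2, "item": "pen"},
    {"user_id": 5, "item": "lamp"},
]
users = [
    {"user_id": 1, "name": "Asha"},
    {"user_id": 2, "name": "Ben"},
    {"user_id": 3, "name": "Chen"},
]

result = bucketed_join(
    to_buckets(orders, "user_id", 2),
    to_buckets(users, "user_id", 2),
    "user_id",
)
print(result)
```

Only one bucket's lookup table is held in memory at a time, which is the property that makes a join on the map side practical for large tables.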
