Labels

Wednesday, March 23, 2022

JOINS in Hive (Class -16)

HIVE JOIN is a clause that is used for combining specific fields from two tables by using values common to each one. It is used to combine records from two or more tables in the database.

In other words, to combine records from two or more tables in the database we use JOIN clause. However, it is more or less similar to SQL JOIN. Also, we use it to combine rows from multiple tables.

Following is a syntax of Hive Join – HiveQL Select Clause.
join_table:
  table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference
  join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition]

Basically, there are 4 types of Hive Join. Such as:

1.       Inner join in Hive

2.       Left Outer Join in Hive 

3.       Right Outer Join in Hive

4.      Full Outer Join in Hive 


3 other JOINS of Hive are:

A) Map Side Join
B) Bucket Map Join
C) SMB Join

A) Map Side Join:

Apache Hive Map Join is also known as Auto Map Join, or Map Side Join, or Broadcast Join used to speed up Hive Queries, when there is huge difference in Size of Data between Two (2) Tables esp. one of the data size of tables is less than 40 MB.

             It is the type of join where a smaller table is loaded into memory and the join is done in the map phase of the MapReduce job.

Basically, before the original MapReduce task, its first step is to create a MapReduce local task. However from HDFS, these map/reduce task read data of the small table. Further save it into an in-memory hash table, then into a hash table file.

 Parameters of Hive Map Side Join:

 set hive.auto.convert.join=true;

set hive.auto.convert.join.noconditionaltask=true;

set hive.auto.convert.join.noconditionaltask.size=10000000;



 SELECT /*+ MAPJOIN(dataset2) */ dataset1.first_name, dataset1.eid,dataset2.eid FROM dataset1 JOIN dataset2 ON dataset1.first_name = dataset2.first_name;

 Output:

 Query ID = gadirajumidhun6255_20200312080956_071a4b72-50a8-4052-882c-69d7b6c3ade6

 Total jobs = 1

 Execution log at: /tmp/gadirajumidhun6255/gadirajumidhun6255_20200312080956_071a4b72-50a8-4052-882c-69d7b6c3ade6.log

 2020-03-12 08:10:00 Starting to launch local task to process map join; maximum memory = 523239424

 2020-03-12 08:10:01 Dump the side-table for tag: 1 with group count: 197 into file: file:/tmp/gadirajumidhun6255/898d9d1d-2149-4bc5-a676-daf69a8b6664/hive_2020-03-12_08-09-56_408_8491807554864277797-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable

 2020-03-12 08:10:01 Uploaded 1 File to: file:/tmp/gadirajumidhun6255/898d9d1d-2149-4bc5-a676-daf69a8b6664/hive_2020-03-12_08-09-56_408_8491807554864277797-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable (11826 bytes)

 2020-03-12 08:10:01 End of local task; Time Taken: 1.406 sec.

 Execution completed successfully

 MapredLocal task succeeded

 Launching Job 1 out of 1

 Number of reduce tasks is set to 0 since there's no reduce operator

 Starting Job = job_1582215230149_5225, Tracking URL = http://cxln2.c.thelab-240901.internal:8088/proxy/application_1582215230149_5225/

 Kill Command = /usr/hdp/2.6.2.0-205/hadoop/bin/hadoop job -kill job_1582215230149_5225

 Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0

 2020-03-12 08:10:11,342 Stage-3 map = 0%, reduce = 0%

 2020-03-12 08:10:17,640 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 3.98 sec.


B) Bucket Map Join:

In Apache Hive, when tables are large and all the tables used in the join are bucketed on the join columns we use Hive Bucket Map Join feature. Moreover, one table should have buckets in multiples of the number of buckets in another table in this type of join.


For suppose if one table has 2 buckets then the other table must have either 2 buckets or a multiple of 2 buckets (2, 4, 6, and so on). Further, since the preceding condition is satisfied then the joining can be done on the mapper side only.

Also, note that by default Hive does not support a bucket map join.

So, we need to set the following property to true for the query to work as this join:

set hive.optimize.bucketmapjoin = true;

 hive> CREATE TABLE IF NOT EXISTS dataset1_bucketed ( eid int,first_name String, last_name String, email String, gender String, ip_address String) clustered by(first_name) into 4 buckets row format delimited fields terminated BY ', ';


C) SMB Join:

When input data sizes of both the tables are large and same, and then go for Hive Sort Merge Bucket join (SMB) for good optimization. Although, make sure in SMB join all tables should have the same number of buckets.

Basically, in Mapper, only Join is done. Moreover, all the buckets are joined with each other at the mappers who are corresponding.

Properties need to be set before run SMB Join:

hive> Set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

hive> set hive.optimize.bucketmapjoin = true;

hive> set hive.optimize.bucketmapjoin.sortedmerge = true;

hive> insert overwrite table dataset1_bucketed select * from dataset1 sort by first_name;




No comments:

Post a Comment