HIVE JOIN is a clause that is used for combining specific fields from two tables by using values common to each one. It is used to combine records from two or more tables in the database.
In other words, to combine records from two or more tables in the
database we use JOIN clause. However, it is more or less similar to SQL JOIN.
Also, we use it to combine rows from multiple tables.
Following is a syntax
of Hive Join – HiveQL Select Clause.
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN
table_reference
join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition]
Basically, there are 4 types of Hive Join. Such as:
1.
Inner join in Hive
2. Left Outer Join in Hive
3.
Right Outer Join in Hive
4.
Full Outer Join in Hive
Apache Hive Map Join is also known as Auto Map Join, or Map Side Join,
or Broadcast Join used to speed up Hive Queries, when there is huge difference in Size
of Data between Two (2) Tables esp. one of the data size of tables is less than
40 MB.
It
is the type of join where a smaller table is loaded into memory and the join is
done in the map phase of the MapReduce job.
Basically, before the original MapReduce task,
its first step is to create a MapReduce local task. However from HDFS, these map/reduce task read data of the
small table. Further save it into an in-memory hash table, then into a hash
table file.
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
SELECT /*+ MAPJOIN(dataset2) */ dataset1.first_name, dataset1.eid,dataset2.eid FROM dataset1 JOIN dataset2 ON dataset1.first_name = dataset2.first_name;
B) Bucket Map Join:
In Apache
Hive, when tables are large and all the tables used in
the join are bucketed on the join columns we use Hive Bucket Map Join feature.
Moreover, one table should have buckets in multiples of the number of buckets
in another table in this type of join.
For suppose if one table has 2 buckets then the other table must have
either 2 buckets or a multiple of 2 buckets (2, 4, 6, and so on). Further,
since the preceding condition is satisfied then the joining can be done on the
mapper side only.
Also, note
that by default Hive does not support a bucket map join.
So, we need to
set the following property to true for the query to work as this join:
set hive.optimize.bucketmapjoin = true;
When input data sizes of both the tables are large and same, and then go for Hive Sort Merge Bucket join (SMB) for good optimization. Although, make sure in SMB join all tables should have the same number of buckets.
Basically, in Mapper, only Join is done. Moreover, all the buckets are
joined with each other at the mappers who are corresponding.
Properties need to be set before run SMB Join:
hive> Set
hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
hive> set
hive.optimize.bucketmapjoin = true;
hive> set
hive.optimize.bucketmapjoin.sortedmerge = true;
hive> insert overwrite table dataset1_bucketed select * from dataset1 sort by first_name;
No comments:
Post a Comment