SK DATA SHARE


Sunday, March 20, 2022

Hive Metastore (Class -11)

The central repository where Hive stores all its schemas [column names, data types, etc.] is called the ‘Metastore’. It stores metadata for Hive tables (such as their schema and location) and partitions in a relational database, and it provides clients access to this information through the metastore service API.

By default, Hive uses an embedded Apache Derby database as its metastore.

Now, when you run a Hive query while using the default Derby database, you will find that your current directory contains a new sub-directory, metastore_db.

The default value of the javax.jdo.option.ConnectionURL property is jdbc:derby:;databaseName=metastore_db;create=true. This value specifies that you will be using embedded Derby as your Hive metastore, and that the location of the metastore is metastore_db.
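In hive-site.xml, this property looks like the following (a minimal sketch showing the documented default value):

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
      <description>JDBC connection string for the embedded Derby metastore</description>
    </property>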

We can also configure the directory where Hive stores table data. By default, the warehouse is located at file:///user/hive/warehouse, and we can use the hive-site.xml file to configure either a local or a remote metastore.
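The warehouse directory is controlled by the hive.metastore.warehouse.dir property in hive-site.xml; a minimal sketch using the default location mentioned above:

    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/user/hive/warehouse</value>
      <description>Base directory where Hive stores table data</description>
    </property>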

Most projects, however, use a MySQL server as the metastore to hold the Hive table schema details.
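A typical MySQL metastore configuration in hive-site.xml looks roughly like this (the host name, database name, and credentials below are placeholders for illustration, not values from this post):

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://metastore-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>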


Friday, March 18, 2022

Bucketing in Hive (Class -10)

Bucketing in Hive is a data-organizing technique quite similar to partitioning, with the added functionality that it divides large datasets into more manageable parts known as buckets; it is useful when partitioning alone becomes difficult to implement.

 In other words, Hive Bucketing is a way to split the table into a managed number of clusters with or without partitions.

The concept of bucketing is based on the hashing technique.

Bucketing creates files, while partitioning creates folders.

Bucketing works on the principle of modulo arithmetic: the hash of the bucketing column value, modulo the number of required buckets, determines the bucket for each row (say, hash(x) % 2 for two buckets).
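As a sketch, a bucketed table can be created as follows (the table and column names are made up for illustration; on older Hive versions you may also need SET hive.enforce.bucketing=true before inserting):

    CREATE TABLE emp_bucketed (
      id INT,
      name STRING,
      salary FLOAT
    )
    CLUSTERED BY (id) INTO 4 BUCKETS
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- Each row goes to bucket number hash(id) % 4, so the table
    -- is stored as 4 files instead of one.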

Partitions in Hive (Class -9)

Apache Hive allows us to organize a table into multiple partitions where we can group the same kind of data together. It is used for distributing the load horizontally. These smaller logical divisions are not visible to users, who still access the data as just one table.

Hive organizes tables into partitions: a way of dividing a table into related parts based on the values of particular columns, like date, city, and department.

Each table in Hive can have one or more partition keys to identify a particular partition. Using partitions, it is easy to run queries on slices of the data.

Partitioning in Hive distributes execution load horizontally.

With partitioning, queries execute faster because each query scans a lower volume of data.


When you load data into a partitioned table, Hive internally splits the records based on the partition key and stores each partition's data in a sub-directory of the table's directory on HDFS. The name of each sub-directory is the partition key and its value.
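For example, assuming a hypothetical sales table partitioned by city, the layout on HDFS would look roughly like:

    /user/hive/warehouse/sales/city=Delhi/000000_0
    /user/hive/warehouse/sales/city=Mumbai/000000_0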

There are two types of Partitioning in Apache Hive-

·  Static Partitioning

·  Dynamic Partitioning
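The sketch below illustrates both styles on the same hypothetical sales table (all table and column names, including staging_sales, are assumptions for illustration):

    -- Static partitioning: the partition value is spelled out by hand.
    CREATE TABLE sales (id INT, amount FLOAT)
    PARTITIONED BY (city STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    INSERT INTO TABLE sales PARTITION (city='Delhi')
    SELECT id, amount FROM staging_sales WHERE city = 'Delhi';

    -- Dynamic partitioning: Hive derives the partition value from the data.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    INSERT INTO TABLE sales PARTITION (city)
    SELECT id, amount, city FROM staging_sales;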

Thursday, March 17, 2022

How to create a Table in Hive? (Class -8)

A table in Hive is a set of data that uses a schema to organize the data by given identifiers.

The way of creating tables in Hive is very much similar to the way we create tables in SQL.

We can perform various operations on these tables, like joins, filtering, etc.
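A minimal sketch (the employee table, file path, and column names here are made up for illustration):

    CREATE TABLE IF NOT EXISTS employee (
      id INT,
      name STRING,
      salary FLOAT,
      dept STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Load a local CSV file into the table.
    LOAD DATA LOCAL INPATH '/tmp/employee.csv' INTO TABLE employee;

    -- Filtering and aggregation work just as in SQL.
    SELECT dept, AVG(salary) AS avg_salary
    FROM employee
    GROUP BY dept;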


Introduction to HIVE Architecture (Class -7)

HIVE, developed at Facebook, is a data warehouse infrastructure tool for processing structured data in Hadoop. It is a framework built on top of Hadoop to summarize Big Data, and it makes querying and analysis easy.

Without learning SQL, you cannot work on HIVE. Facebook later donated the HIVE project to the Apache Software Foundation, where it became “Apache Hive”, often seen as a competitor to Apache Pig, which was developed at Yahoo.

Data present in the part files of Hadoop is exposed in the form of tables [rows & columns] by HIVE, which then processes the tabular data using a SQL dialect popularly known as ‘HQL’ (Hive Query Language), which largely follows the SQL-92 syntax.
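For instance, an external table can be laid over part files already sitting in HDFS, after which they are queryable with HQL (the table name, columns, and path are assumptions for illustration):

    CREATE EXTERNAL TABLE page_views (
      user_id STRING,
      url STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/page_views/';

    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url;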

Whatever we do in HIVE is reflected in HDFS, but not vice versa. Hive will not process raw XML/JSON data directly (a suitable SerDe is needed), whereas file formats such as AVRO, PARQUET, and ORC are supported out of the box.