Tuesday, February 22, 2022

Apache SQOOP Import (SQL to Hadoop to SQL) Internal Architecture (Class -6)

Apache Sqoop (SQL-to-Hadoop) is a tool designed to import bulk data into HDFS from structured data stores such as relational databases (RDBMS), enterprise data warehouses, and NoSQL systems.

At the beginning of the ingestion phase, we pull structured data from the RDBMS (source layer) into HDFS with the help of data ingestion tools (listed with example companies that use them) such as:

a) SQOOP (Bank of America)

b) SPARK (TCS)

c) TALEND (Bank of England)

d) KAFKA (Cap Gemini)

e) Rest API (HCL)

f) SOAP Services

In other words, importing data from an RDBMS into HDFS in the Hadoop system is called the "Sqoop import process". Data imported by Sqoop is saved in the form of "part files".


Moreover, Sqoop can transfer bulk data efficiently between Hadoop and external data stores such as enterprise data warehouses and relational databases.
The Sqoop export tool transfers a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which become rows in the target table.

This is how Sqoop got its name: "SQL to Hadoop & Hadoop to SQL". Sqoop does not perform any aggregations; it only imports and exports data.


Structured data in an RDBMS is processed with the help of Structured Query Language (SQL).


Applications of RDBMS server:

a) Education [Universities]

b) Retail [Amazon, Flipkart, Walmart, etc.]

c) Automobile [BMW, Audi, etc.]

d) Insurance

An RDBMS has an advantage over a plain DBMS in removing duplication of data through normalization.

We store transactional data in an RDBMS because it provides ACID properties for DML operations (insert, update and delete).

Different Kinds of RDBMS server:

a) Oracle

b) MySQL

c) Postgres

d) Teradata

e) Netezza

f) DB2

g) Informix

JDBC (Java Database Connectivity) is the connector that connects Sqoop to the RDBMS server.
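As a minimal sketch (using the same host and database that appear in the commands later in this class), the JDBC URL that Sqoop hands to the driver has the form:

jdbc:mysql://<hostname>:<port>/<database>

For example, jdbc:mysql://localhost/test (the MySQL port defaults to 3306 when omitted).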

Sqoop is rarely used nowadays for data import.


Steps in SQOOP Import Architecture:


a)   Create Record Container Class

Internally, RDBMS is built on SQL while HDFS is built on JAVA.

We know that SQL is a Query Language while JAVA is a Programming Language.


In the RDBMS, ID → integer and firstname → varchar; while in

HDFS, ID → Integer and firstname → String.
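To see the record container class Sqoop generates, the sqoop codegen tool can be run on its own; this is a sketch using the class's connection details, and the output path printed by the tool will vary:

sqoop codegen --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data

Sqoop prints where it wrote the generated user_data.java, in which each column is mapped to a Java field (userid INT → Integer, firstname VARCHAR → String, and so on).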


b) Boundary Query Creation


This step helps split a large number of records (say 1 lakh, i.e. 100,000) across the available mappers (say 4), as sketched below.
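As a sketch of what happens internally (assuming userid is the split column and the ids run from 1 to 100,000), Sqoop first issues a boundary query such as:

SELECT MIN(userid), MAX(userid) FROM user_data;

With 4 mappers, the range 1..100000 is then divided into four roughly equal slices of about 25,000 ids each (1-25000, 25001-50000, 50001-75000, 75001-100000), and each mapper imports one slice.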


c) Data Import [with available 4 Mappers]

Of the three steps, the data import step takes the most time.


Based on the number of mappers, Sqoop opens the same number of connections to the RDBMS to import the data.

The default file format for a Sqoop import is a text file, which is how the data is stored in HDFS.

A Sqoop import requires the following:

a) Database
b) Username
c) Password
d) Table Name
e) Table Directory

Any of the above values that can change in the future should be saved in a config file rather than hard-coded, as sketched below.
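One common way to do this is Sqoop's --options-file, which reads one option or value per line; a minimal sketch, where the file name import_opts.txt is illustrative:

# import_opts.txt (one option or value per line; # lines are comments)
import
--connect
jdbc:mysql://localhost/test
--username
root
--table
user_data

It is then run as:

sqoop --options-file import_opts.txt --password cloudera -m 1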

Let's work through an example:


1) Use retail_db and create tables:


2) Create and select the test database:

create database test;

 use test;


3) CREATE TABLE user_data(userid INT, firstname VARCHAR(20), lastname VARCHAR(20), city VARCHAR(50), age INT, createdt DATE, transactamt INT);

insert into user_data values(1,'siva','Raj','chennai',33,'2015-09-20',100000);

insert into user_data values(2,'Prasad','babu','chennai',33,'2015-09-21',10000);

insert into user_data values(3,'harish','kumar','banglore',39,'2015-09-23',90000);

insert into user_data values(4,'azar','imran','hyderabad',33,'2015-09-24',1000);

insert into user_data values(5,'Kiran','kumar','chennai',23,'2015-09-20',200000);



4) select * from user_data;   [table name]

The above data is now stored in the RDBMS.


The host ID will be given by the client. Press "Ctrl+C" to come out of mysql.

Check with HDFS:



Cloudera QuickStart is a single-node cluster, while CloudxLab is a 5-node cluster.

5) Import the table from SQL to HDFS:

a) sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 1

hadoop fs -rm -r /user/cloudera/user_data   (clears the default target directory so the table can be imported again)


Map Reduce will run:


b) sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 1 --target-dir /user/cloudera/second_imp


The data is stored in HDFS in the form of "part files".
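To look at them (a sketch; with a single mapper the file is typically named part-m-00000):

hadoop fs -ls /user/cloudera/second_imp

hadoop fs -cat /user/cloudera/second_imp/part-m-00000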



c) sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 1 --direct

hadoop fs -rm -r /user/cloudera/user_data   (clears the default target directory again)





d) sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 1 --target-dir sqoop_import






In all of these Sqoop jobs, only mappers run; there is no reducer.

The success marker (the _SUCCESS file) tells whether the job completed successfully. It is written only once per job.


e) Check whether the import below works:

sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 2

(It fails: with more than one mapper, Sqoop needs either a primary key on the table or a --split-by column.)


f)  Import with --split-by option


If you want to use more than one mapper, use --split-by with a column of the table (here userid):

sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 2 --split-by userid





If we create the table with a primary key, there is no need to specify the --split-by option.

The primary key of the RDBMS table is used to split the data across mappers even without --split-by.

Sqoop also has native (direct-mode) connectors for some databases; --direct is the Sqoop import option that enables them.


g)   Sqoop Eval:


Using the 'sqoop eval' command we can view, from the edge node, the data present in the MySQL server. We can also insert and update data with 'sqoop eval'.

sqoop eval --connect jdbc:mysql://localhost/test --username root --password cloudera --query "insert into user_data values(6,'ramesh','babu','manglore',39,'2015-09-21',100000);"
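Sqoop eval can also run a query to view the data; a sketch with the same connection details:

sqoop eval --connect jdbc:mysql://localhost/test --username root --password cloudera --query "select * from user_data limit 5"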


h) Sqoop data: in real-time projects, we deal with two types of data:


A) Static data [new records every day]

B) Dynamic data / delta data / incrementally modified data, i.e. CDC (Change Data Capture)

A) Static Data:

Data where new records are added every day, but existing records are never modified, is called 'static data'.

Static data is handled in Sqoop with incremental append, which imports the records whose check-column value (the userid) is greater than the given last value (starting from 0).

The incremental option takes care of the new records, while append inserts those new records into the same target directory.

The append option limits data movement to only the new records.

Example:

Execute the insert below in mysql:

insert into user_data values(7,'md','irfan1','hyderabad',33,'2015-09-28',10000);




The existing table is shown below:


Add the new 7th row:


First press "Ctrl+C" to exit mysql, then run the command below in the Cloudera terminal:

sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 1 --target-dir incrimport --incremental append --check-column userid --last-value 0



Now the 7th record is imported into Hadoop HDFS through the MapReduce process above.

sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 1 --target-dir incrimport --incremental append --check-column userid --last-value 6

Now enter mysql again:


Important parameters in the incremental append command are:

a) --check-column

b) --last-value

By giving the last-value number (the userid), only the rows after that value are imported into HDFS by the Sqoop import. Finally, all the records are imported into Hadoop HDFS.


The problem with incremental append is that we need to change the last value every time; use a "Sqoop job" to automate this, so the last value ('0') is given only once and the job keeps track of it afterwards.

i) Create the Sqoop job first:

sqoop job --create Class10_SqoopJb -- import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 1 --target-dir incrimport --incremental append --check-column userid --last-value 0


ii) Now list the Sqoop jobs using the --list option:


iii) Now execute the Sqoop job using the --exec option:

It will ask for the password; type 'cloudera'.







[cloudera@quickstart ~]$ cd .sqoop





Now create a Sqoop job without having to type the password.

[cloudera@quickstart ~]$ echo -n "cloudera" > sqp.password

To check the last lines of the file, give the command below:


Give the command below so that only you can read the password file:

[cloudera@quickstart ~]$ chmod 400 sqp.password


As seen above, only we can read the password file. Finally, move the password file to HDFS using the "put" command (a sketch follows).
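A sketch of those two steps, where the HDFS path and the job name Class10_SqoopJb_Pwdfile are illustrative; the job is created with --password-file instead of --password:

hadoop fs -put sqp.password /user/cloudera/sqp.password

sqoop job --create Class10_SqoopJb_Pwdfile -- import --connect jdbc:mysql://localhost/test --username root --password-file /user/cloudera/sqp.password --table user_data -m 1 --target-dir incrimport --incremental append --check-column userid --last-value 0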

B) Dynamic Data:

Dynamic data (delta load, i.e. records modified from day to day) is handled in Sqoop with the '--incremental lastmodified' option; when a merge key is supplied, a merge job (which, unlike a plain import, runs a reduce phase) combines the updated rows in HDFS during the import.
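A sketch of such an import, assuming createdt is the column that records the modification date and userid is the merge key (both taken from the example table; the last value shown is illustrative):

sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 1 --target-dir incrimport --incremental lastmodified --check-column createdt --last-value "2015-09-25" --merge-key userid

The --merge-key option is what triggers the merge job that replaces the old version of a row with the updated one.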


A sequence file is a binary format, not human-readable, used to store key-value pairs.

Transferring data from the RDBMS to HDFS in sequence file format is faster than using the text file format.
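A sketch of a sequence file import (the target directory name seq_imp is illustrative):

sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 1 --target-dir seq_imp --as-sequencefile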

The Avro file format is another data transfer format: a row-oriented binary storage format in HDFS that is widely used with Hive and Spark.

Command: --as-avrodatafile

An Avro data set has two parts:

a) the Avro schema (human-readable)

b) the Avro data (in binary format)
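A sketch of an Avro import (the target directory avro_imp is illustrative); the data files land in HDFS and a .avsc schema file is generated locally:

sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 1 --target-dir avro_imp --as-avrodatafile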

The Parquet file format, used by Spark to store data, was developed by Cloudera together with Twitter.

All our outputs are stored in the Parquet file format, which is a columnar storage format where retrieval is very fast.

Command: --as-parquetfile
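A sketch of a Parquet import (the target directory parquet_imp is illustrative):

sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 1 --target-dir parquet_imp --as-parquetfile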

 So far, we have learned Sqoop import of one (1) table only.

There is also a command to import all tables (more than one table) in one go.

The Sqoop import-all-tables tool imports a set of tables from the relational database into the Hadoop Distributed File System; the data from each table is stored in a separate directory in HDFS.

 The syntax for Sqoop Import All Tables is:

sqoop import-all-tables

To import a single table, --target-dir is used.

To import multiple tables, --warehouse-dir should be used.

The --exclude-tables option imports all tables except the excluded ones.

The --validate option validates the copied data between the RDBMS and HDFS. See the sketch below.
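A sketch under this class's setup (retail_db is the database mentioned earlier; the excluded table name departments and the directory names are illustrative):

sqoop import-all-tables --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --warehouse-dir /user/cloudera/retail_all -m 1 --exclude-tables departments

sqoop import --connect jdbc:mysql://localhost/test --username root --password cloudera --table user_data -m 1 --target-dir val_imp --validate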

j) Sqoop export transfers data from HDFS back to the RDBMS (in many projects this step is now done with Spark/PySpark instead).

Sqoop Export has 3 steps:

a) Record container class [HDFS (Java) file to RDBMS (SQL) database]

b) Mapper launch (1 mapper for every block)

1 GB → 1024 / 128 = 8 blocks → 8 mappers to transfer the data.

c) Data export takes place.

sqoop export --connect jdbc:mysql://localhost/test --username root --password cloudera --table customer_data --export-dir imp_del;

Saving the data in RDBMS is called ‘Committing’ the Data.

Sqoop performs the export row by row if we don't use the batch mode option. Enabling batch mode exports multiple rows at a time as a batch.

Data from the HDFS part files is first exported to a staging table and then moved to the production table; this gives the export transactionality.
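A sketch combining batch mode and a staging table (the staging table name customer_data_stage is illustrative and must already exist in MySQL with the same schema as customer_data):

sqoop export --connect jdbc:mysql://localhost/test --username root --password cloudera --table customer_data --export-dir imp_del --staging-table customer_data_stage --clear-staging-table --batch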
