Tuesday, February 22, 2022

Apache SQOOP Import (SQL to Hadoop to SQL) Internal Architecture (Class -6)

Apache SQOOP (SQL-to-Hadoop) is a tool designed to import bulk data into HDFS from structured data stores such as relational databases (RDBMS), enterprise data warehouses, and NoSQL systems.

At the beginning of the Ingestion Phase, we pull structured data from the RDBMS (Source Layer) into HDFS in Hadoop with the help of Data Ingestion tools (example companies that use each one are given in brackets) such as:

a) SQOOP (Bank of America)

b) SPARK (TCS)

c) TALEND (Bank of England)

d) KAFKA (Cap Gemini)

e) Rest API (HCL)

f) SOAP Services

In other words, importing data from an RDBMS into HDFS in the Hadoop system is called the "SQOOP Import Process". Data imported by SQOOP is saved in the form of "part files".
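For instance, a minimal SQOOP import might look like the sketch below (the MySQL host, database name, table, credentials, and target directory are hypothetical placeholders, not values from the class):

    # Hypothetical source database and HDFS target directory
    sqoop import \
      --connect jdbc:mysql://dbserver/retail_db \
      --username sqoop_user \
      --password '******' \
      --table customers \
      --target-dir /user/cloudera/customers \
      -m 4
    # With -m 4, four parallel mappers write four part files in HDFS:
    # /user/cloudera/customers/part-m-00000 ... part-m-00003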


Moreover, SQOOP can transfer bulk data efficiently between Hadoop and external data stores such as enterprise data warehouses, relational databases, etc.
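Conversely, moving data out of HDFS back into an RDBMS table is a SQOOP export. A minimal sketch, again with hypothetical connection details and table names:

    sqoop export \
      --connect jdbc:mysql://dbserver/retail_db \
      --username sqoop_user \
      --password '******' \
      --table daily_sales \
      --export-dir /user/cloudera/daily_sales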

Tuesday, February 15, 2022

Shell Script Commands in Hadoop HDFS (Class -5)

A Shell is the interface between a human and a particular OS (e.g. Linux).

In the Ingestion Phase, we extract data from the Source Layer with the help of tools like SQOOP, SPARK, TALEND, etc., and pull it into Hadoop (HDFS), where the data is distributed in the form of blocks.

SQOOP commands in the Ingestion Phase can be wrapped in shell scripts to load data into Hadoop.

Once the data is present in Hadoop, we create tables over it and query the blocks, again by using shell scripts.

When shell commands are placed in a file and executed together, this is called shell scripting.
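For example, saving a few HDFS commands in a file such as ingest.sh (a hypothetical name) and running it with bash ingest.sh turns individual commands into a shell script:

    #!/bin/bash
    # ingest.sh - wrap a small HDFS command sequence in one script
    hdfs dfs -mkdir -p /user/cloudera/staging           # create a landing directory
    hdfs dfs -put customers.csv /user/cloudera/staging  # copy a local file into HDFS
    hdfs dfs -ls /user/cloudera/staging                 # verify the upload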

Data in blocks is analysed and processed with the help of frameworks such as Hive, Spark, Spark SQL, etc., which are likewise driven through shell scripts.

All the commands on the Edge Node need to be executed with the help of shell scripts. The shell converts user input (commands) into machine language in the Linux OS environment.

Different kinds of shell are:

a) Bourne shell (sh)

b) Bourne Again shell (Bash)

c) C shell (csh)

d) Tenex shell (tcsh)

e) Korn shell (ksh)

The Bash shell is the most widely used default shell and comes along with every Linux OS platform.

$: the $ sign is used in the shell to retrieve the value of a variable.

echo: the echo command is used to print text or a string to the shell or to an output file.
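A tiny Bash sketch combining both:

    #!/bin/bash
    name="Hadoop"               # assign a variable (no spaces around =)
    echo "Welcome to $name"     # $name retrieves the variable's value
    echo "Done" >> output.log   # echo can also append text to an output file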


Wednesday, February 9, 2022

LINUX COMMANDS in Big Data (Class -4)

Linux is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds.

An Operating System is the software that directly manages a system's hardware and resources, like CPU, memory, and storage. The OS sits between applications and hardware and makes the connections between all of your software and the physical resources that do the work.

To store data in Hadoop or to process it (add / remove / move / execute scripts), a Data Analyst has to go through the Edge Node / Gateway, which runs a Linux OS.

So, Big Data Developers need to know at least 20 Linux Commands, while ETL Developers need to know 50 to 60 Linux Commands.

But most people are inclined towards the Windows OS because its User Interface (UI) is more powerful than the Linux OS UI.

In real-time cluster creation, we use the Linux OS because of its security and compatibility.

We perform these Linux commands in the CLOUDERA distribution, which needs to be installed along with VMware Workstation 16 Player.

Ctrl + L – to clear (refresh) the Cloudera terminal screen.

In a long listing, the first character of each entry tells you what it is:

d – directory (folder)

- – regular file (stores data)
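For example, in an illustrative long listing (ls -l; the entries shown here are hypothetical):

    $ ls -l
    drwxr-xr-x 2 cloudera cloudera 4096 Feb  9 10:00 scripts     <- d = directory
    -rw-r--r-- 1 cloudera cloudera  120 Feb  9 10:05 notes.txt   <- - = regular file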

Monday, February 7, 2022

Map Reduce (MR) versus YARN in Hadoop (Class -3)

MapReduce is a component of the Apache Hadoop ecosystem, a framework that enables massive data processing. Other components of Apache Hadoop include the Hadoop Distributed File System (HDFS), YARN, and Apache Pig.

MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop.

In other words, the MR component speeds up the processing of massive data using distributed and parallel algorithms in the Hadoop ecosystem. This programming model is applied in social platforms and e-commerce to analyse the huge volumes of data collected from online users.
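The classic word-count illustration of this model can be sketched with plain shell stages (this is only an analogy, not an actual Hadoop job, and input.txt is a hypothetical file):

    # "map":     split each line of input into one word per line
    # "shuffle": sort so that identical keys (words) become adjacent
    # "reduce":  count the occurrences of each word
    cat input.txt | tr -s '[:space:]' '\n' | sort | uniq -c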

Whereas YARN stands for “Yet Another Resource Negotiator“.

YARN also allows different data processing engines (graph processing, interactive processing, stream processing, as well as batch processing) to run and process data stored in HDFS (Hadoop Distributed File System), thus making the system much more efficient.
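On a machine with the Hadoop client configured, a few standard YARN CLI commands show this resource-negotiator role in action (the application ID is a placeholder):

    yarn application -list              # list applications running on the cluster
    yarn node -list                     # list the NodeManagers YARN is managing
    yarn logs -applicationId <appId>    # fetch the logs of a completed application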

Sunday, February 6, 2022

Architecture of HADOOP Cluster (Class -2)

HADOOP is a framework which handles (stores & processes) Big Data. Here, a framework means an already-defined set of rules & procedures for processing data.

A framework cannot act both as an Operating System (OS) and as a Processing Technique (PT).

Hadoop can handle only Structured and Semi-structured data.

A cluster is a group of systems (servers) interconnected together.

In real time, the capacity of a Hadoop cluster can range from 3 nodes to 30,000 interconnected server machines.

Only the company's Admins create the Hadoop cluster, not developers. Admins maintain Hadoop clusters and run balancing commands with a replication factor of 3.
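A minimal sketch of the kind of commands such Admins run (the HDFS path is hypothetical):

    hdfs balancer -threshold 10        # rebalance blocks evenly across DataNodes
    hdfs dfs -setrep -w 3 /user/data   # enforce a replication factor of 3 on a path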

Admins give developers indirect access to the Hadoop cluster (HDFS) through a Linux OS system that links to the Edge Node / Gateway Terminal / Gateway Node.


Saturday, February 5, 2022

BIG DATA Introduction (Class -1)

Big Data is a technology which handles huge amounts of data of 3 types - Structured, Semi-structured & Unstructured.

Data is processed into information, which can be measured in Kilobytes (KB) / Megabytes (MB) / Gigabytes (GB) / Terabytes (TB) / Petabytes (PB) / Exabytes (EB) / Zettabytes (ZB) / Yottabytes (YB).

Approximately 55 YBs of data are being generated every day.

The entire Data Science workload is usually handled by, and divided among:

A) Data Engineers

B) Data Analysts

C) Data Scientists