Sunday, February 6, 2022

Architecture of HADOOP Cluster (Class -2)

HADOOP is a framework that handles both storing and processing Big Data. A framework here means a predefined set of rules and components used to process data.

A framework cannot act as both an Operating System (OS) and a Processing Technique (PT).

Hadoop can handle structured, semi-structured, and unstructured data.

A cluster is a group of interconnected systems (servers).

In practice, Hadoop clusters range from 3 nodes up to around 30,000 interconnected server nodes.

Only a company's admins, not developers, create Hadoop clusters. Admins run balancing commands, keep the default replication factor of 3, and maintain the Hadoop clusters.

Admins give developers indirect access to the Hadoop cluster [HDFS] through a Linux system that links to the cluster, called the Edge Node / Gateway Terminal / Gateway Node.


The 3 core components of HADOOP are -

A) HDFS [Hadoop Distributed File System], built in Java, used for data storage.

B) MR [MapReduce] for data processing. MapReduce processing is comparatively slow and resource-costly.

C) YARN - Yet Another Resource Negotiator (YARN) is a resource management unit.


MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.

HDFS is a distributed file system created by Doug Cutting in 2005. It follows the principle of data locality: processing is moved to the nodes where the data is stored, rather than moving the data across the network.

Hadoop HDFS splits large files into small chunks known as blocks. A block is the physical representation of data: the minimum amount of data that can be read or written. HDFS stores each file as blocks. The Hadoop framework breaks files into 128 MB blocks and then stores them in the Hadoop file system.
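As a rough illustration of the splitting rule above (not Hadoop's actual implementation), the blocks a file occupies can be computed like this:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default in Hadoop 2.x

def split_into_blocks(file_size_bytes):
    """Return the sizes of the HDFS blocks a file would occupy (toy model)."""
    full_blocks = file_size_bytes // BLOCK_SIZE
    remainder = file_size_bytes % BLOCK_SIZE
    sizes = [BLOCK_SIZE] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes

# A 300 MB file -> two full 128 MB blocks plus one 44 MB block
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                   # 3
print(blocks[-1] // (1024 * 1024))   # 44
```

Note that, as in real HDFS, the last block occupies only as much space as the remaining data, not a full 128 MB.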


In 2008, Cloudera came into the picture as the first commercial distribution of Hadoop.

In 2012, Hortonworks [HW] built on the original Hadoop codebase, offering both Open Source (free) and Enterprise (licensed) versions.

In 2019, Cloudera merged with Hortonworks to create the CDP platform and release new Hadoop versions.

The main configuration file for HADOOP is core-site.xml.

core-site.xml is the config file where we have the Internet Protocol (IP) details of the Name Node and Data Nodes.
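A minimal core-site.xml might look like the sketch below. The host name and port are placeholders, but `fs.defaultFS` is the real property that points clients at the NameNode:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- Name Node address; namenode-host is a placeholder for your cluster -->
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```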

NameNode is the master node in Hadoop Distributed File System that manages the file system metadata while the DataNode is a slave node in Hadoop distributed file system that stores the actual data as instructed by the NameNode.

For example, an 8-node Hadoop cluster could be divided into:

         5 Data Nodes (1 TB disk, 32 GB RAM each) - slave systems

         2 Name Nodes (500 GB disk, 16 GB RAM each) - master systems

         1 Edge Node (500 GB disk, 8 GB RAM)



The Name Node learns the available storage [metadata info] on the Data Nodes through a file called "FsImage". The FsImage is saved alongside an MD5 checksum file.


Every 30 seconds, each Data Node (DN) sends a Block Report to the Name Node with its available storage info.

The Edit Log, kept in both the RAM and hard disk of the Name Node server, holds the last 1 hour of transaction details.

The FsImage copies in RAM and on the hard disk are synced every 1 hour.

Merging the FsImage with the Edit Log transactions [from the hard disk / HDD] on the Secondary Name Node (SNN) server is called "checkpointing".
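Conceptually, checkpointing replays the Edit Log against the FsImage to produce a fresh image. A toy sketch, with dictionaries standing in for the real on-disk formats (which this is not):

```python
def checkpoint(fsimage, edit_log):
    """Merge an edit log into an FsImage snapshot (toy model).

    fsimage: dict mapping path -> metadata
    edit_log: list of (op, path, metadata) transactions
    Returns the new merged FsImage; the edit log can then be truncated.
    """
    merged = dict(fsimage)  # work on a copy, as the SNN does
    for op, path, meta in edit_log:
        if op == "create":
            merged[path] = meta
        elif op == "delete":
            merged.pop(path, None)
    return merged

old_image = {"/data/a.txt": {"blocks": 2}}
log = [("create", "/data/b.txt", {"blocks": 1}),
       ("delete", "/data/a.txt", None)]
new_image = checkpoint(old_image, log)
print(sorted(new_image))  # ['/data/b.txt']
```

The point of doing this on the SNN is that the active Name Node never has to pause to rewrite its own FsImage.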

HADOOP Architecture (Master & Slave) consists of 2 versions - 1.x (2006) and 2.x (2012).

Hadoop 1.0 / 1.x:

  Block size of Hadoop 1.x is 64 MB.

The main drawback of Hadoop 1.x is that the Name Node has no backup; this is called a "Single Point of Failure". If the Name Node breaks down in Hadoop 1.x, recovery depends on the checkpoint produced through the Secondary Name Node's synchronization process.

The heartbeat interval in Hadoop 1.x is 3 seconds. In computer science, a heartbeat is a periodic signal generated by hardware or software to indicate normal operation or to synchronize other parts of a computer system.

Heartbeat messages are typically sent non-stop on a periodic or recurring basis from the originator's start-up until the originator's shutdown.

  If a Data Node in HDFS does not send a heartbeat to the Name Node for ten minutes, the Name Node considers that Data Node to be out of service and the block replicas hosted by it to be unavailable.
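The liveness check described above can be sketched as follows: any Data Node whose last heartbeat is older than the dead-node timeout (10 minutes, matching the note above) is dropped from the live set. A simplified model, not the real NameNode logic:

```python
HEARTBEAT_INTERVAL = 3        # seconds, the Hadoop 1.x default
DEAD_NODE_TIMEOUT = 10 * 60   # 10 minutes, per the rule above

def live_datanodes(last_heartbeat, now):
    """Return the set of Data Nodes still considered alive.

    last_heartbeat: dict mapping node name -> timestamp (seconds) of its
    most recent heartbeat; `now` is the current timestamp.
    """
    return {node for node, ts in last_heartbeat.items()
            if now - ts <= DEAD_NODE_TIMEOUT}

# dn1 reported 3 s ago (alive); dn2 reported 603 s ago (dead)
heartbeats = {"dn1": 1000, "dn2": 400}
print(sorted(live_datanodes(heartbeats, 1003)))  # ['dn1']
```

Once a node falls out of the live set, the Name Node schedules re-replication of its blocks onto the remaining Data Nodes.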


The default replication factor of 3 is configured in the file "hdfs-site.xml", which also holds the block size and heartbeat interval settings.
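A sketch of the relevant hdfs-site.xml entries follows; the values shown are the defaults mentioned above, and `dfs.replication`, `dfs.blocksize`, and `dfs.heartbeat.interval` are the actual property names in Hadoop 2.x:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>             <!-- default replication factor -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>     <!-- 128 MB, in bytes -->
  </property>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>             <!-- seconds between heartbeats -->
  </property>
</configuration>
```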

In Hadoop 1.x, HDFS is the file system, MR is the processing technique, and MR itself also acts as the resource allocator (the "OS" of the cluster).

Hadoop 2.0 / 2.x:

In Hadoop 2.x, HDFS is the file system, MR is the processing technique, and YARN is the resource allocator (the "OS" of the cluster), acting as the cluster manager.

  Block size of Hadoop 2.x is 128 MB.

The 2 kinds of Name Nodes in Hadoop 2.x are the Active NN and the Standby NN.

In Hadoop 2.x, the last 1 hour of transactions is saved in the Journal Nodes (an odd number of them, typically 3 to 5).

In Hadoop 2.x, Zookeeper maintains Hadoop as a Single Unit and is responsible for synchronization of Hadoop tasks. Zookeeper identifies Active / Standby Name Node with the help of Inuse.lock file.


Zookeeper is a centralized unit where information regarding configuration, naming, and group services is stored; Hadoop uses this information to coordinate the cluster.

A Split Brain scenario occurs when both Name Nodes [Active & Standby] hold the Inuse.lock file, i.e., both act as active at the same time.


Applications of Hadoop in different sectors:

a) Banking (e.g. Bank of America - BOA)
b) Retail (e.g. Walmart)
c) Automobile (e.g. MBRDI)
d) Financial (e.g. FinServ)
