Monday, February 7, 2022

Map Reduce (MR) versus YARN in Hadoop (Class -3)

MapReduce is a component of the Apache Hadoop ecosystem, a framework for processing massive data sets. Other components of Apache Hadoop include the Hadoop Distributed File System (HDFS), YARN, and Apache Pig.

MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop.

In other words, the MR component processes massive data sets using distributed, parallel algorithms in the Hadoop ecosystem. This programming model is applied in social platforms and e-commerce to analyze the huge volumes of data collected from online users.

YARN, by contrast, stands for “Yet Another Resource Negotiator”.
     YARN allows different data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run on the same cluster and process data stored in HDFS (Hadoop Distributed File System), which makes the system much more efficient.

Map Reduce (MR):

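Hadoop was originally built around Map Reduce (MR). Google published its MapReduce paper in 2004.

MR has 2 programs: Mapper and Reducer.

MR jobs are written natively in Java. Other languages can be used as well, for example C++ through Hadoop Pipes, or any language that reads standard input and writes standard output through Hadoop Streaming.

By default, Hadoop launches 1 Reducer per job. Word Count, which counts how often each word occurs in the input, is the usual example for explaining Map Reduce.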

Hadoop Mapper is a function or task which is used to process all input records from a file and generate the output which works as input for Reducer. It produces the output by returning new key-value pairs.

The Mapper converts every input line into <key, value> pairs.
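As a minimal sketch of the map step for Word Count, the plain-Python function below turns one line into <key, value> pairs. It is not Hadoop API code: the name `map_line` is hypothetical, and lower-casing the words is an assumption made for illustration. In a real job using the default text input format, the Mapper receives the byte offset of the line as its input key and the line itself as its input value.

```python
def map_line(line):
    """Emit a (word, 1) pair for every whitespace-separated word in one line."""
    # Lower-casing is an illustrative choice so "Hadoop" and "hadoop" count together.
    return [(word.lower(), 1) for word in line.split()]


pairs = map_line("Hadoop runs MapReduce on Hadoop")
print(pairs)
# [('hadoop', 1), ('runs', 1), ('mapreduce', 1), ('on', 1), ('hadoop', 1)]
```

Note that the Mapper does not add anything up: the word "hadoop" appears twice in the output, each time with the value 1. Combining those values is the Reducer's job.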

The process of bringing records with the same key together on a single node is called Shuffle in MR.

The Shuffle phase in Hadoop transfers the map output from the Mappers to the Reducers.

     The Sort phase in MapReduce covers the merging and sorting of map outputs.

     Data from the Mappers is partitioned among the reducers, sorted by key, and grouped by key.

     Every reducer obtains all values associated with the same key.
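The shuffle and sort steps can be imitated in a few lines of plain Python. This is only a sketch under stated assumptions: `partition` and `shuffle_and_sort` are hypothetical names, and the CRC32-based partitioner is a stand-in chosen because it is stable across runs. Hadoop's own default partitioner is based on the key's Java hash code instead.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index for a key (stand-in for Hadoop's hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_output, num_reducers):
    """Group map output by key, split it among reducers, and sort each part by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        # The same key always maps to the same reducer index.
        reducers[partition(key, num_reducers)][key].append(value)
    # Each reducer sees its keys in sorted order, with all values for a key together.
    return [sorted(groups.items()) for groups in reducers]


map_output = [("hadoop", 1), ("runs", 1), ("mapreduce", 1), ("hadoop", 1)]
print(shuffle_and_sort(map_output, 1))
# [[('hadoop', [1, 1]), ('mapreduce', [1]), ('runs', [1])]]
```

With more than one reducer the keys are spread across several lists, but both values for "hadoop" still end up in the same one.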


The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
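A rough way to estimate the number of map tasks is to divide the input size by the split size. The sketch below assumes one input file and a split size equal to the HDFS block size; `estimate_map_tasks` is a hypothetical helper, not a Hadoop function, and it ignores details of the real InputFormat such as the small slack allowed on the last split.

```python
import math

MB = 1024 * 1024


def estimate_map_tasks(file_size_bytes, split_size_bytes):
    """Approximate the number of InputSplits (and so map tasks) for one file."""
    if file_size_bytes <= 0 or split_size_bytes <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(file_size_bytes / split_size_bytes)


print(estimate_map_tasks(1024 * MB, 128 * MB))  # 8
print(estimate_map_tasks(1024 * MB, 64 * MB))   # 16
print(estimate_map_tasks(200 * MB, 128 * MB))   # 2
```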

Reducer reduces a set of intermediate values which share a key to a smaller set of values.
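For Word Count, the reduce step simply adds up the values that share a key. The following is a minimal plain-Python sketch, not Hadoop API code; `reduce_counts` is a hypothetical name, and the `grouped` list stands for what a reducer would receive after shuffle and sort.

```python
def reduce_counts(key, values):
    """Collapse all values for one key into a single total."""
    return (key, sum(values))


# Input as a reducer would see it: each key once, with all of its values.
grouped = [("hadoop", [1, 1]), ("mapreduce", [1]), ("runs", [1])]
totals = [reduce_counts(key, values) for key, values in grouped]
print(totals)
# [('hadoop', 2), ('mapreduce', 1), ('runs', 1)]
```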

YARN:

YARN is the cluster manager introduced with Hadoop 2.x.

An important feature of YARN is that it handles and schedules resource requests from applications and allocates the resources they need to execute.

YARN is a generic platform for running any distributed application. Map Reduce version 2 (MRv2) is one such application that runs on top of YARN, whereas Map Reduce itself is the processing unit of Hadoop: it processes data in parallel in a distributed environment.
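The idea of handling resource requests from different kinds of applications can be illustrated with a toy model. Everything below is hypothetical: `ToyResourceManager` is not a YARN class, it tracks memory only, and it uses a simple first-fit rule, whereas real YARN schedulers are considerably more sophisticated.

```python
class ToyResourceManager:
    """Toy model: hands out memory on cluster nodes to any kind of application."""

    def __init__(self, node_memory_mb):
        # Map of node name -> free memory in MB.
        self.free = dict(node_memory_mb)

    def allocate(self, app_id, memory_mb):
        """Return (app_id, node, memory_mb) for the first node that fits, else None."""
        for node, free_mb in self.free.items():
            if free_mb >= memory_mb:
                self.free[node] = free_mb - memory_mb
                return (app_id, node, memory_mb)
        return None  # No node has enough free memory right now.


rm = ToyResourceManager({"node1": 4096, "node2": 2048})
print(rm.allocate("mr-wordcount", 3072))  # ('mr-wordcount', 'node1', 3072)
print(rm.allocate("stream-job", 2048))    # ('stream-job', 'node2', 2048)
print(rm.allocate("graph-job", 2048))     # None
```

The toy manager does not care whether a request comes from a Map Reduce job or from a stream or graph processing job, which is the sense in which YARN is generic.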

Comparison: Map Reduce (Hadoop 1) versus YARN (Hadoop 2)

Meaning
     Map Reduce: Processing engine of Hadoop that processes and computes vast volumes of data.
     YARN: Allocates system resources to the various applications running in a Hadoop cluster and schedules tasks to be executed on different cluster nodes.

Version
     Map Reduce: Introduced in Hadoop 1.0.
     YARN: Introduced in Hadoop 2.0.

Responsibility
     Map Reduce: Resource management as well as data processing.
     YARN: Resource management only; data processing is left to the applications running on it, such as MRv2.

Execution Model
     Map Reduce: Less generic; runs only applications written to the MR model.
     YARN: More generic; also runs applications that do not follow the MR model.

Daemons
     Map Reduce: Job Tracker and Task Tracker, alongside the HDFS daemons (Name Node, Data Node, Secondary Name Node).
     YARN: Resource Manager and Node Manager, alongside the same HDFS daemons.

Limitations
     Map Reduce: Low resource utilization and limited scalability (commonly cited as about 4,000 nodes per cluster). The single Job Tracker is a single point of failure.
     YARN: The Resource Manager can run as an active/standby pair, so it does not have to be a single point of failure.

Size
     Map Reduce: Default HDFS block size in Hadoop 1 is 64 MB.
     YARN: Default HDFS block size in Hadoop 2 is 128 MB.


Key Difference Between MapReduce and Yarn

a) Hadoop 1 has two components: HDFS (Hadoop Distributed File System) and Map Reduce. Hadoop 2 has HDFS and YARN, with Map Reduce version 2 (MRv2) running as an application on top of YARN. The names YARN and MRv2 are often used interchangeably, although strictly speaking YARN is the resource manager and MRv2 is the processing framework.


b) In Map Reduce (Hadoop 1), when the Job Tracker stops working, the Task Trackers on the slave nodes can no longer receive work, so job execution is interrupted. This is called a single point of failure. YARN overcomes this issue through its architecture: the Resource Manager can run as an active/standby pair (HDFS in Hadoop 2 offers the same for the Name Node). When the active node stops working, the standby node takes over as the active node and execution continues.

c) Map Reduce has a single-master, multiple-slave architecture. If the master goes down, all the slaves stop working; this is the single point of failure in Hadoop 1. Hadoop 2, which is based on the YARN architecture, supports multiple masters: if one master goes down, another master takes over and continues the execution.
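The difference between a single master and multiple masters can be sketched with a toy state machine. `ToyMasterPair` and the names `rm1` and `rm2` are hypothetical and purely illustrative; a real cluster relies on a coordination service and health checks for failover, which this sketch leaves out.

```python
class ToyMasterPair:
    """Toy model of two masters where a standby takes over if the active one fails."""

    def __init__(self):
        self.state = {"rm1": "active", "rm2": "standby"}

    def active(self):
        """Return the name of the active master, or None if no master is up."""
        for name, status in self.state.items():
            if status == "active":
                return name
        return None

    def fail(self, name):
        """Mark a master as down and promote a standby if the active one failed."""
        was_active = self.state[name] == "active"
        self.state[name] = "down"
        if was_active:
            for other, status in self.state.items():
                if status == "standby":
                    self.state[other] = "active"
                    break


masters = ToyMasterPair()
print(masters.active())  # rm1
masters.fail("rm1")
print(masters.active())  # rm2
masters.fail("rm2")
print(masters.active())  # None
```

With only one master, the first failure already leaves the cluster in the last state, where nothing is active. That corresponds to the Hadoop 1 situation.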

d) In Map Reduce, each slave node runs a Task Tracker with a fixed number of map and reduce slots, whereas in YARN each slave node runs a Node Manager that launches and monitors containers on that node.

e) Map Reduce uses the Job Tracker to create tasks and assign them to Task Trackers. Because the Job Tracker also has to manage resources, resource management is inefficient and some slave nodes can sit idle. YARN instead has one Resource Manager per cluster, and each slave node runs a Node Manager. For each job, a container on one slave node acts as the Application Master, which negotiates resources and monitors the job's tasks.
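One way to see why resources sit idle in Hadoop 1 is to compare fixed map and reduce slots with generic containers. The sketch below is a simplified model with hypothetical function names and made-up numbers (2 map slots and 2 reduce slots per node, or 4096 MB split into 1024 MB containers); it counts only how many tasks one node can run at the same time.

```python
def run_with_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Hadoop 1 style: map tasks may only use map slots, reduce tasks reduce slots."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def run_with_containers(map_tasks, reduce_tasks, node_memory_mb, container_mb):
    """YARN style: any task may use any container that fits in the node's memory."""
    capacity = node_memory_mb // container_mb
    return min(map_tasks + reduce_tasks, capacity)


# Four map tasks are waiting and no reduce task is ready yet.
print(run_with_slots(4, 0, map_slots=2, reduce_slots=2))                  # 2
print(run_with_containers(4, 0, node_memory_mb=4096, container_mb=1024))  # 4
```

In the slot model, the two reduce slots stay idle even though map tasks are waiting. In the container model, the same node runs all four map tasks at once.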

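Overall, YARN gives better results than Map Reduce in Hadoop 1: better resource utilization, better scalability, and a lower risk of losing a running job to a single master failure.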


