MapReduce is a component of the Apache Hadoop ecosystem, a framework for distributed processing of massive data sets. Other components of Apache Hadoop include the Hadoop Distributed File System (HDFS), YARN, and Apache Pig.
MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop.
In other words, the MR component processes massive data sets using distributed, parallel algorithms in the Hadoop ecosystem. This programming model is applied in social platforms and e-commerce to analyze the huge volumes of data collected from online users.
Map Reduce (MR):
Hadoop was built on the Map Reduce (MR) model. Google has used MR since 2004, the year it published the original MapReduce paper.
An MR job consists of 2 programs: a Mapper and a Reducer.
The Map Reduce (MR) framework itself is written in Java, but MR jobs can also be written in C++ (via Hadoop Pipes) or in almost any other language, including C (via Hadoop Streaming).
The Mapper converts every input line into <key, value> pairs, as in the sketch below.
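For a concrete picture, here is a minimal word-count Mapper sketch using the standard org.apache.hadoop.mapreduce API (the class name WordCountMapper is just an illustrative choice). Each input line arrives keyed by its byte offset, and the mapper emits a <word, 1> pair for every token:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input:  <byte offset of the line, line text>
// Output: one <word, 1> pair per word in the line
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit a <key, value> pair
        }
    }
}
```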
The process of bringing similar data (records with the same key) onto a single node is called the shuffle in MR.
The shuffle phase in Hadoop transfers the map output from the Mappers to the Reducers.
The sort phase in MapReduce covers the merging and sorting of the map outputs.
Data from the Mappers are grouped by key, split among the reducers, and sorted by key.
Every reducer obtains all values associated with the same key.
The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
A Reducer reduces the set of intermediate values that share a key to a smaller set of values, as in the sketch below.
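The matching word-count Reducer might look like this (again, WordCountReducer is an illustrative name). After the shuffle and sort, each call to reduce() receives one word together with all of its 1s and emits the total:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input:  <word, [1, 1, 1, ...]> grouped by the shuffle/sort phases
// Output: <word, total count>
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result); // e.g. <"hadoop", 42>
    }
}
```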
YARN:
YARN is the cluster manager that ships with Hadoop 2.x.
An important feature of YARN is that it handles and schedules resource requests from applications and helps those processes execute their requests.
YARN is a generic platform for running any distributed application; Map Reduce version 2 is one such distributed application that runs on top of YARN. Map Reduce itself remains the processing unit of Hadoop: it processes data in parallel in the distributed environment.
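To see how the pieces fit together, here is a minimal driver sketch that wires the mapper and reducer above into a job and submits it; on Hadoop 2.x the job runs on the cluster as a YARN application (MapReduce v2). The class names and the input/output paths taken from args are assumptions for this example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures the word-count job and submits it to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional local pre-aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, this would typically be launched with something like `hadoop jar wordcount.jar WordCountDriver /input /output`.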
| Comparison | Map Reduce | YARN |
|------------|------------|------|
| Meaning | Processing engine of Hadoop that processes and computes vast volumes of data. | Allocates system resources to the various applications running in a Hadoop cluster and schedules tasks to be executed on different cluster nodes. |
| Version | Introduced in Hadoop 1.0. | Introduced in Hadoop 2.0. |
| Responsibility | Resource management as well as data processing. | Resource management only. |
| Execution Model | Less generic; executes only applications that follow the MR model. | More generic; can also execute applications that don't follow the MR model. |
| Daemons | NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker. | NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager. |
| Limitations | Low resource utilization and limited scalability (clusters max out at roughly 4,000 nodes); the single JobTracker is a single point of failure. | No single point of failure, because the ResourceManager can run with multiple master nodes for high availability; scales to larger clusters. |
| Size | Default HDFS block size is 64 MB. | Default HDFS block size is 128 MB. |