Spark, introduced by the Apache Software Foundation, is a lightning-fast cluster computing technology designed to speed up Hadoop-style computation.
Spark is not a modified version of Hadoop, and it is not entirely dependent on Hadoop because it has its own cluster management. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark is not tied to the Hadoop ecosystem, as it can run on its own; when it does run with Hadoop, it uses Hadoop in two ways: for storage (HDFS) and for resource management (YARN).
Hadoop frameworks are known for analyzing datasets with a simple programming model (MapReduce); the main concern is processing speed on large datasets, measured as the waiting time between queries and the waiting time to run a program.
Earlier Hadoop Versions:
Hadoop 1.0 was introduced in 2006 and used up to 2012, when Hadoop 2.0 (YARN) came into the picture.
The main drawbacks of Hadoop 1.0 were:
a) Single point of failure (the NameNode and JobTracker).
b) Block size limitations.
c) Relying on MapReduce (MR) for both resource management and the processing engine.
In 2008, Cloudera became the first commercial distributor of Hadoop, offering both open-source and enterprise editions.
Spark began in 2009 as one of Hadoop's sub-projects, developed in UC Berkeley's AMPLab by Matei Zaharia; it was originally built to test the Mesos cluster resource manager, not to process data.
Spark was open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark became a top-level Apache project in February 2014.
Advantages of Spark:
a) A unified framework for different kinds of data processing (a sketch of this follows the list).
b) An in-memory processing framework.
c) Spark Core for general data processing with script-like code (the RDD API).
d) Spark SQL to handle SQL-style analytics.
e) Spark Streaming to handle real-time data.
f) Spark Structured Streaming for real-time data through SQL tables.
g) Spark MLlib to process data through machine learning.
h) Spark GraphX for graph processing of data.
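To make points (a), (c), and (d) concrete, here is a minimal sketch, not from this article: the app name, sample strings, and local master are illustrative. It shows Spark Core (RDDs) and Spark SQL working inside one unified program.

```scala
import org.apache.spark.sql.SparkSession

object UnifiedDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("UnifiedDemo")
      .master("local[*]")          // local mode, just for testing
      .getOrCreate()

    // Spark Core: a word count on an in-memory RDD
    val counts = spark.sparkContext
      .parallelize(Seq("spark is fast", "spark is unified"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)

    // Spark SQL: the same data queried as a table
    import spark.implicits._
    val df = counts.toDF("word", "count")
    df.createOrReplaceTempView("word_counts")
    spark.sql("SELECT word, count FROM word_counts ORDER BY count DESC").show()

    spark.stop()
  }
}
```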
Spark can use:
a) RAM for both processing and storing the data.
b) Hard disk for storing the data.
The sketch below shows how a developer picks between the two.
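This is a small illustrative sketch (names and sizes are placeholders): the StorageLevel passed to persist() decides whether a dataset lives in RAM, on disk, or both.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("StorageDemo").master("local[*]").getOrCreate()
val numbers = spark.sparkContext.parallelize(1 to 1000000)

numbers.persist(StorageLevel.MEMORY_ONLY)        // a) RAM only (what cache() uses)
// numbers.persist(StorageLevel.DISK_ONLY)       // b) hard disk only
// numbers.persist(StorageLevel.MEMORY_AND_DISK) // RAM first, spill overflow to disk

println(numbers.sum())   // the first action materializes and stores the data
```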
Spark applications can be written in four languages:
a) Scala
b) Python
c) Java
d) R
Once we install Spark, we get four libraries by default:
a) Spark Core
b) Spark SQL
c) Spark Streaming
d) Spark MLlib
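As a quick check, one entry point from each of the four bundled libraries can be imported with no extra packages; these are standard Spark API paths:

```scala
import org.apache.spark.SparkContext                          // Spark Core
import org.apache.spark.sql.SparkSession                      // Spark SQL
import org.apache.spark.streaming.{Seconds, StreamingContext} // Spark Streaming
import org.apache.spark.ml.classification.LogisticRegression  // Spark MLlib
```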
We can run Spark (resource allocation) in four modes:
a) YARN (Hadoop)
b) Mesos
c) Spark Standalone (ZooKeeper is used only to make the standalone master highly available; it is not a resource manager itself)
d) Kubernetes
Spark gets its resources from any of these four cluster managers.
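This sketch shows how that choice is made in code (host names and ports below are placeholders): the master URL given to the session builder, or to spark-submit --master, selects the cluster manager.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MasterUrlDemo")
  // .master("yarn")                             // a) YARN (Hadoop)
  // .master("mesos://mesos-master:5050")        // b) Mesos
  // .master("spark://spark-master:7077")        // c) Spark Standalone
  // .master("k8s://https://k8s-apiserver:6443") // d) Kubernetes
  .master("local[*]")                            // local testing, no cluster manager
  .getOrCreate()
```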
Spark can read data from a wide range of storage systems:
a) HDFS
b) Amazon S3
c) ADLS (Microsoft Azure Data Lake Storage)
d) Linux (local) file system
e) RDBMS (MySQL, Oracle, PostgreSQL, etc.)
f) NoSQL (HBase, Cassandra, MongoDB, Elasticsearch, DynamoDB, Cosmos DB)
Note: Spark has no storage of its own; it only processes the data. It reads from the sources above (read operations) and can write results to HDFS, Hive, HBase, Oracle DB, AWS S3, ADLS, Cassandra, etc. (write operations). A read/write sketch follows.
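In this sketch of both directions, every URI, table name, and credential is a placeholder, and the MySQL JDBC driver is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("IoDemo").getOrCreate()

// Read operations: a CSV file from HDFS and an RDBMS table over JDBC
val fromHdfs = spark.read.option("header", "true").csv("hdfs:///data/input.csv")
val fromJdbc = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/shop")
  .option("dbtable", "orders")
  .option("user", "reader")
  .option("password", "secret")
  .load()

// Write operation: Spark stores nothing itself, so results go back out,
// here as Parquet files on S3 (via the s3a connector)
fromHdfs.write.mode("overwrite").parquet("s3a://my-bucket/output/")
```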
Spark as an In-Memory Processing Framework:
Spark uses the same input-output (I/O) model as MapReduce (MR) when it reads from and writes to external storage.
MapReduce (MR) I/O operations for reading and writing data:
MR has the disadvantage of multiple disk I/O operations between processing steps, which increases disk latency and decreases performance.
Spark I/O Operations:
Unlike MR, Spark can read data from both disk and memory, and it can also keep intermediate data in memory.
The I/O layer interacts with the file system to read and write the data. The storage systems for reading data are RDBMS, HDFS, AWS S3, ADLS, and the local (Linux) file system.
Storing the data in memory is a developer responsibility, not a default process, as the next sketch shows.
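This is a minimal sketch of that responsibility (the log path is hypothetical, and a SparkSession named spark is assumed, as in the earlier sketches): nothing is kept in memory until the developer asks for it.

```scala
// Without cache(), every action would re-read the file from storage.
val logs   = spark.read.textFile("hdfs:///logs/app.log")
val errors = logs.filter(_.contains("ERROR"))

errors.cache()                  // developer explicitly requests in-memory storage

println(errors.count())         // first action: reads from disk, fills the cache
println(errors.filter(_.contains("timeout")).count()) // now served from memory
```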
IDEs for writing Spark Scala programs: