
Sunday, April 24, 2022

SPARK Introduction (Class -39)

Apache Spark, a project of the Apache Software Foundation, is a fast cluster computing technology designed to speed up the kind of large-scale computation traditionally done with Hadoop.

Spark is not a modified version of Hadoop and is not entirely dependent on Hadoop, because it has its own cluster management. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

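Spark is not tied to the Hadoop ecosystem, as it can run on its own. Spark can use Hadoop in two ways: for storage (HDFS) and for resource management (YARN). Because Spark has its own processing engine, it does not rely on MapReduce to process the data.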

Hadoop is known for analyzing datasets with a simple programming model (MapReduce). The main concern is processing speed on large datasets, both the waiting time between queries and the waiting time to run a program.

Earlier Hadoop Versions:

Hadoop was introduced in 2006, and its first-generation architecture (Hadoop 1.x) was used until Hadoop 2.0 (YARN) came into the picture around 2012.

The main drawbacks of Hadoop 1.0 are:

a)    Single point of failure (the NameNode)

b)    Small default block size (64 MB)

c)    Reliance on MapReduce (MR) as both the resource manager and the processing engine.

In 2008, Cloudera was founded and began offering a commercial, enterprise distribution of open-source Hadoop.

Spark was started in 2009 at UC Berkeley's AMPLab by Matei Zaharia. It grew out of work on the Mesos cluster resource manager: it was first built to test Mesos, not as a data processing engine in its own right.

Spark was open sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark has been a top-level Apache project since February 2014.

Advantages of Spark:

a)      Unified Framework for different kinds of data processing.

b)      In-memory processing framework.

c)       Spark Core for general-purpose, script-style programs.

d)      Spark SQL for SQL-style analytics.

e)      Spark Streaming to handle real-time data.

f)       Spark Structured Streaming for real-time data through SQL tables.

g)      Spark MLlib to process data with machine learning.

h)      Spark GraphX for graph processing of data.

Spark can use

a)      RAM for both processing & storing the data.

b)      Hard Disk for storing the data.

Spark programs can be written in 4 languages:

a)      Scala

b)      Python

c)       Java and

d)      R

Once we install Spark, we have 4 libraries by default:

a)      Spark Core

b)      Spark SQL

c)       Spark Streaming

d)      Spark MLlib

We can run Spark (resource allocation) in 4 modes:

a)      YARN Cluster (Hadoop)

b)      MESOS

c)       Spark Standalone

d)      Kubernetes

Spark gets its resources from any one of the four above.
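
As a rough illustration of how the resource manager is chosen, the sketch below assembles a `spark-submit` command line for the YARN, Mesos and standalone options above. The helper names (`master_url`, `submit_command`), the host name and the application file are hypothetical; only the `--master` flag and the URL prefixes (`yarn`, `mesos://`, `spark://`) come from Spark itself. The command is only built, not executed.

```python
# Hypothetical helpers: they only build a command line, they do not run Spark.

def master_url(manager, host=None, port=None):
    """Return the --master value for a given cluster manager."""
    if manager == "yarn":
        # YARN finds the cluster through the Hadoop configuration, so no host is given.
        return "yarn"
    if manager == "mesos":
        return f"mesos://{host}:{port}"
    if manager == "standalone":
        return f"spark://{host}:{port}"
    raise ValueError(f"unknown cluster manager: {manager}")


def submit_command(manager, app, host=None, port=None):
    """Assemble a spark-submit argument list for the chosen manager."""
    return ["spark-submit", "--master", master_url(manager, host, port), app]


# "master.example" and "app.py" are placeholder names for illustration.
print(submit_command("yarn", "app.py"))
print(submit_command("standalone", "app.py", host="master.example", port=7077))
```

The application code stays the same in every case; only the master URL changes.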

Spark can read data from many file systems and data stores:

a)      HDFS

b)      Amazon S3

c)       ADLS (Microsoft Azure Data Lake Storage)

d)      Linux file system

e)      RDBMS (MySQL, Oracle, PostgreSQL, etc.)

f)       NoSQL (HBase, Cassandra, MongoDB, Elasticsearch, DynamoDB, Cosmos DB)

Note: Spark has no storage of its own; it only processes data that it reads from external systems (read operations). Spark can write its results to HDFS / Hive / HBase / Oracle DB / AWS S3 / ADLS / Cassandra, etc. (write operations).

Spark as In-Memory Processing Framework:

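Like MapReduce (MR), Spark reads its input from external storage and writes its output back to it; the difference lies in what happens between those two steps.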

MapReduce (MR) Input-Output (I/O) Operations for Read and Write Data:

MR has the disadvantage of multiple I/O operations: each job in a chain writes its output to disk and the next job reads it back, which increases disk latency and decreases performance.

Spark IO Operations:

Unlike MR, Spark can read data from both disk and memory, and it can also keep intermediate data in memory.
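
The difference in I/O can be sketched in plain Python. This is only an analogy, not Spark or MapReduce code: `CountingDisk`, `run_disk_chain` and `run_in_memory` are hypothetical names, and a temporary directory stands in for the file system. The same two steps run once with an intermediate file between them and once entirely in memory, and the reads and writes are counted.

```python
import json
import os
import tempfile


class CountingDisk:
    """Tiny stand-in for a file system that counts reads and writes."""

    def __init__(self, root):
        self.root = root
        self.reads = 0
        self.writes = 0

    def write(self, name, records):
        with open(os.path.join(self.root, name), "w") as f:
            json.dump(records, f)
        self.writes += 1

    def read(self, name):
        with open(os.path.join(self.root, name)) as f:
            records = json.load(f)
        self.reads += 1
        return records


def double(records):
    return [r * 2 for r in records]


def keep_large(records):
    return [r for r in records if r > 4]


def run_disk_chain(disk):
    """MR-style chain: each step reads from disk and writes back to disk."""
    disk.write("stage1", double(disk.read("input")))
    disk.write("output", keep_large(disk.read("stage1")))


def run_in_memory(disk):
    """Spark-style chain: read once, keep intermediates in memory, write once."""
    records = disk.read("input")
    disk.write("output", keep_large(double(records)))


def compare():
    results = {}
    for name, run in [("disk_chain", run_disk_chain), ("in_memory", run_in_memory)]:
        with tempfile.TemporaryDirectory() as root:
            disk = CountingDisk(root)
            disk.write("input", [1, 2, 3, 4])
            disk.reads = disk.writes = 0  # do not count the setup write
            run(disk)
            io_ops = (disk.reads, disk.writes)
            results[name] = (disk.read("output"), io_ops)
    return results


print(compare())
```

Both variants produce the same output; the in-memory variant simply touches the disk fewer times, and the gap grows with every extra step in the chain.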


I/O operations interact with the file system to read and write data.

Data can be read from storage such as RDBMS, HDFS, AWS S3, ADLS, and the local file system (Linux).

Storing data in memory is the developer's responsibility; it is not the default behavior.
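
In Spark the developer opts in by calling `cache()` or `persist()` on a dataset. The toy class below mimics that idea in plain Python; `LazyDataset` and its methods are hypothetical and are not Spark's implementation. Without `cache()`, every action reloads the source; with it, the source is loaded only once.

```python
class LazyDataset:
    """Toy lazy dataset: recomputed from its source on every action unless cached."""

    def __init__(self, load):
        self._load = load              # function that produces the records
        self._cache_requested = False
        self._cached = None
        self.loads = 0                 # how many times the source was actually read

    def cache(self):
        """Opt in to keeping the records in memory after the first computation."""
        self._cache_requested = True
        return self

    def _records(self):
        if self._cached is not None:
            return self._cached
        self.loads += 1
        records = list(self._load())
        if self._cache_requested:
            self._cached = records
        return records

    def count(self):
        return len(self._records())

    def total(self):
        return sum(self._records())


def source():
    # Placeholder for an expensive read from disk.
    return [1, 2, 3]


plain = LazyDataset(source)
plain.count()
plain.total()

cached = LazyDataset(source).cache()
cached.count()
cached.total()

print(plain.loads, cached.loads)
```

The uncached dataset reads its source once per action, while the cached one reads it only for the first action.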

Environments for writing Spark Scala programs:

a)      Cloudera

b)      IntelliJ IDEA


When reading a text file, Spark treats each line as a record and each record as a String.
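
The practical consequence is that fields have to be split and converted by the program itself. The plain-Python sketch below shows the idea without using Spark; the sample text, the `parse` function and the (id, name, age) layout are made up for illustration.

```python
# Hypothetical file contents; in Spark these lines would come from a text file.
text = "1,alice,30\n2,bob,25\n"

# Each line becomes one record, and each record is a plain string.
records = text.splitlines()


def parse(record):
    """Split one string record into typed fields (assumed layout: id, name, age)."""
    id_, name, age = record.split(",")
    return int(id_), name, int(age)


rows = [parse(r) for r in records]

print(records)
print(rows)
```

Until `parse` is applied, a value like `30` is just part of a string and cannot be used in arithmetic.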
