Apache Spark is an open-source cluster-computing framework for real-time processing that is up to 100 times faster in memory and 10 times faster on disk when compared to Apache Hadoop MapReduce.
Apache Spark has a well-defined architecture, integrated with various extensions and libraries, in which all the Spark components and layers are loosely coupled.
Spark is a distributed processing engine that follows a master-slave architecture: for every Spark application, it creates one master process and multiple slave processes.
When you run a Spark application, the Spark Driver creates a context that serves as the entry point to your application. All operations (transformations and actions) are executed on worker nodes, and the resources are managed by the Cluster Manager.
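As a minimal sketch in Scala (the object and application names are illustrative), the driver-side entry point is a SparkSession, which wraps the SparkContext that talks to the cluster manager and executors:

import org.apache.spark.sql.SparkSession

// Illustrative driver program; the application name is arbitrary.
object SparkEntryPoint {
  def main(args: Array[String]): Unit = {
    // The SparkSession is the entry point to the application; it wraps the
    // SparkContext, which coordinates executors through the cluster manager.
    val spark = SparkSession.builder()
      .appName("example-app")
      .getOrCreate()

    // ... transformations and actions are defined here ...

    spark.stop()
  }
}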
Features of Apache Spark:
- Powerful Caching: A simple programming layer provides powerful caching and disk persistence capabilities (see the sketch after this list).
- Deployment: It can be deployed through Mesos, Hadoop via YARN, or Spark’s own cluster manager.
- Real-Time: It offers real-time computation and low latency because of in-memory computation.
- Polyglot: Spark provides high-level APIs in Java, Scala, Python, and R, so Spark code can be written in any of these four languages. It also provides shells in Scala and Python.
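As a brief sketch of the caching feature (the input path is a placeholder and a SparkSession named spark is assumed), an RDD can be kept in memory, with optional spill to disk, so that repeated actions reuse it instead of recomputing it:

import org.apache.spark.storage.StorageLevel

// "logs.txt" is a placeholder path; `spark` is an existing SparkSession.
val lines = spark.sparkContext.textFile("logs.txt")

// Keep partitions in memory and spill to disk if they do not fit;
// cache() is the shorthand for the default MEMORY_ONLY level.
val cached = lines.persist(StorageLevel.MEMORY_AND_DISK)

// The first action materializes the cache; later actions reuse it.
val total  = cached.count()
val errors = cached.filter(_.contains("ERROR")).count()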
In Spark terminology, the Driver is the master, while the Executors are the slaves. Thus, Spark creates one driver and a set of executors for each application. The driver can run in one of two deploy modes:
a) Client Mode: the driver runs on the client machine and the executors run on the cluster.
b) Cluster Mode: both the driver and the executors run on the cluster.
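As a hedged sketch (the master URL, class name, and JAR are placeholders), the deploy mode is selected with the --deploy-mode option of spark-submit:

# Client mode: the driver runs on the machine issuing the command.
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

# Cluster mode: the driver runs inside the cluster (e.g. in a YARN container).
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar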
There are two ways to execute programs on a Spark cluster:
a) Interactive clients (Scala shell, PySpark, notebooks)
b) Submitting a job (the spark-submit utility)
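For the interactive route, the Scala and Python shells start with a ready-made session; the YARN master shown here is only an example:

# Scala shell, with a SparkSession pre-created as `spark`
spark-shell --master yarn

# Python shell, same idea
pyspark --master yarn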
Spark can run on several cluster managers:
1. Apache YARN: the resource manager in Hadoop 2.0.
2. Apache Mesos: a cluster manager that can also run Hadoop MapReduce and Spark applications.
3. Kubernetes: an open-source system for automating deployment, scaling, and management of containerized applications.
4. Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster.
YARN is the most widely used cluster manager for Apache Spark.
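The cluster manager is chosen through the --master URL passed to spark-submit (or to the shells); the host names and ports below are placeholders:

spark-submit --master yarn ...                          # YARN (reads HADOOP_CONF_DIR)
spark-submit --master mesos://mesos-host:5050 ...       # Mesos
spark-submit --master k8s://https://k8s-host:6443 ...   # Kubernetes
spark-submit --master spark://master-host:7077 ...      # Standalone
spark-submit --master local[4] ...                      # Local mode with 4 threads, for testing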
Spark is built around two core abstractions:
- Resilient Distributed Dataset (RDD)
- Directed Acyclic Graph (DAG)
A Spark RDD (Resilient Distributed Dataset) is a resilient, partitioned, distributed, and immutable collection of data that behaves like a Scala collection. Resilient means it can recover from a failure, as RDDs are fault tolerant. RDDs are created by loading data from a source.
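A minimal sketch of creating RDDs in Scala (the HDFS path is a placeholder; a SparkSession named spark is assumed):

// Obtain the SparkContext from an existing SparkSession.
val sc = spark.sparkContext

// Create an RDD by loading data from a source, e.g. a text file on HDFS.
val lines = sc.textFile("hdfs:///data/input.txt")

// An RDD can also be built from an in-memory collection.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))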
How an RDD is processed in parallel is governed by:
a) the number of partitions
b) the number of executors
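As an illustrative sketch (the numbers are arbitrary), the partition count can be inspected and changed in code, while the executor count is usually requested when the application is submitted:

// Assumes the `lines` RDD from the previous sketch.
println(lines.getNumPartitions)           // current number of partitions

val repartitioned = lines.repartition(8)  // shuffle the data into 8 partitions

// Executors are typically requested at submit time, e.g.
//   spark-submit --num-executors 4 --executor-cores 2 ...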
RDDs support two kinds of operations:
a) Transformations (lazy functions)
b) Actions (non-lazy functions)
Transformations create a new distributed dataset from an existing one, i.e. a new RDD from an existing RDD, and they are evaluated lazily. Examples: map, reduceByKey.
Actions are mainly performed to send results back to the driver, and hence they produce a non-distributed result. Example: collect.
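A small word-count sketch tying the two together (the input path is a placeholder; a SparkSession named spark is assumed): the transformations only build up the DAG, and nothing runs until the action collect is called.

// Transformations: each call is lazy and only extends the lineage/DAG.
val counts = spark.sparkContext.textFile("hdfs:///data/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Action: triggers the job on the executors and returns results to the driver.
val result = counts.collect()
result.take(10).foreach(println)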