Monday, May 16, 2022

Spark Architecture (Class - 43)

Apache Spark is an open-source cluster-computing framework for real-time processing that is up to 100 times faster in memory and up to 10 times faster on disk when compared to Hadoop MapReduce.

Apache Spark has a well-defined architecture, integrated with various extensions and libraries, in which all the Spark components and layers are loosely coupled.

Spark is a distributed processing engine that follows the master-slave architecture. So, for every Spark application, it creates one master process and multiple slave processes.

When you run a Spark application, the Spark Driver creates a SparkContext, which is the entry point to your application; all operations (transformations and actions) are executed on worker nodes, and the resources are managed by the Cluster Manager.
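As a rough sketch (assuming PySpark is installed; the application name is arbitrary), the driver side of a PySpark program looks roughly like this:

    from pyspark.sql import SparkSession

    # The driver builds a SparkSession, which wraps the SparkContext,
    # the entry point of the application.
    spark = (SparkSession.builder
             .appName("ExampleApp")
             .getOrCreate())

    sc = spark.sparkContext

    # Transformations and actions issued here are executed on the worker nodes.
    print(sc.parallelize(range(10)).sum())

    spark.stop()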

Features of Apache Spark:

Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. It achieves this speed through controlled partitioning.

Powerful Caching: A simple programming layer provides powerful caching and disk persistence capabilities (a short caching sketch follows this list).

Deployment: It can be deployed through Mesos, Hadoop via YARN, or Spark’s own cluster manager.

Real-Time: It offers real-time computation and low latency because of in-memory computation.

Polyglot: Spark provides high-level APIs in Java, Scala, Python, and R. Spark code can be written in any of these four languages. It also provides a shell in Scala and Python.
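As promised above, here is a minimal caching sketch; the dataset is a toy one, while cache(), persist(), and StorageLevel are standard PySpark APIs:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CachingDemo").getOrCreate()

    # Keep a DataFrame in memory after it is first computed
    df = spark.range(1000000)
    df.cache()
    df.count()            # materializes the cache

    # Or pick an explicit storage level, e.g. spill to disk when memory is full
    rdd = spark.sparkContext.parallelize(range(1000))
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()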



In Spark terminology,

the Driver is the Master, while the Executors are the Slaves. Thus, Spark creates one driver and a batch of executors for each application.

 Spark offers two deployment modes for an application.

a)      Client Mode: Driver on the client machine and executors on the cluster.

b)      Cluster Mode: Driver and executors on the cluster.
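For illustration, the mode is usually chosen with the --deploy-mode option of spark-submit (app.py is just a placeholder script):

    # Client mode: the driver runs on the machine that issued the command
    spark-submit --deploy-mode client app.py

    # Cluster mode: the driver is launched inside the cluster, next to the executors
    spark-submit --deploy-mode cluster app.py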



Two ways to execute programs on a Spark Cluster: 

a)      Interactive Clients (Scala shell, PySpark shell, notebooks)

b)      Submit a Job (the spark-submit utility)
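For example, roughly (the script name is a placeholder):

    # a) Interactive client: start the PySpark shell (or the Scala shell / a notebook)
    pyspark

    # b) Submit a job: package the code in a script and hand it to the spark-submit utility
    spark-submit my_job.py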

Apache Spark supports four different cluster managers:

1.       Apache YARN: the resource manager in Hadoop 2.0.

2.       Apache Mesos: a general cluster manager that can also run Hadoop MapReduce and Spark applications.

3.       Kubernetes: an open-source system for automating deployment, scaling, and management of containerized applications.

4.       Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster.

YARN is the most widely used cluster manager for Apache Spark.
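As a rough guide, the cluster manager is selected through the --master option of spark-submit; the host names and ports below are placeholders for a real cluster:

    spark-submit --master yarn                     app.py   # YARN
    spark-submit --master mesos://host:5050        app.py   # Mesos
    spark-submit --master k8s://https://host:6443  app.py   # Kubernetes
    spark-submit --master spark://host:7077        app.py   # Standalone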

A typical Spark application works with DataFrames and Datasets, which are ultimately compiled down to RDDs.
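A small sketch of that relationship (the column names and rows are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DfToRdd").getOrCreate()

    # High-level API: a DataFrame
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

    # Underneath, the DataFrame is executed in terms of RDD operations;
    # df.rdd exposes the underlying RDD of Row objects.
    print(df.rdd.take(2))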

 Apache Spark Architecture is based on two main abstractions:

  • Resilient Distributed Dataset (RDD)
  • Directed Acyclic Graph (DAG)

Spark RDD (Resilient Distributed Dataset) is a resilient, partitioned, distributed, and immutable collection of data that appears to be a Scala collection. Resilient means it can recover from a failure, as RDDs are fault tolerant. RDDs are created by loading data from a source.

In other words, an RDD is a collection of items distributed across many compute nodes that can be manipulated in parallel. RDDs are Spark’s main programming abstraction.

Spark breaks an RDD into smaller chunks of data called partitions, which are spread across the cluster.
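A minimal sketch of both ideas (the partition count and the file path are placeholders):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("RddPartitions").getOrCreate().sparkContext

    # Create an RDD from an in-memory collection, split into 4 partitions
    rdd = sc.parallelize(range(100), 4)
    print(rdd.getNumPartitions())     # 4

    # An RDD can also be created by loading data from a source, e.g. a text file
    # lines = sc.textFile("hdfs:///data/input.txt")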


 Two main variables to control the degree of parallelism in Apache Spark:

a)      No. of Partitions

b)      No. of Executors

All the partitions are queued to executors. The partitioned data, along with the function to apply, is distributed across the executors, which return the result. Thus Apache Spark breaks up our code and executes it in parallel. This approach emphasizes the importance of functional programming.
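Roughly, the two knobs look like this (all the numbers are arbitrary, and --num-executors applies when running on YARN):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("ParallelismDemo").getOrCreate().sparkContext

    # a) Number of partitions: set when the RDD is created, or change it later
    rdd = sc.parallelize(range(10000), 8)   # 8 partitions
    rdd = rdd.repartition(16)               # reshuffled into 16 partitions
    print(rdd.getNumPartitions())           # 16

    # b) Number of executors: requested when the application is submitted, e.g.
    #    spark-submit --num-executors 4 --executor-cores 2 app.py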

 RDDs offer two types of Operations:

a)      Transformations (Lazy functions)

b)      Actions (Non-Lazy functions)

Transformation operations create a new distributed dataset from an existing distributed dataset. So, they create a new RDD from an existing RDD. E.g.: map, reduceByKey.

Actions are mainly performed to send results back to the driver, and hence they produce a non-distributed dataset. E.g.: collect.

All transformations in Spark are lazy, meaning they do not compute results until an action requires them. Thus, an action on an RDD triggers a job.
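A short sketch of both kinds of operations and of lazy evaluation (the word list is made up):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("LazyDemo").getOrCreate().sparkContext

    words = sc.parallelize(["spark", "hadoop", "spark", "yarn"])

    # Transformations: lazy, they only build new RDDs and record the lineage
    pairs  = words.map(lambda w: (w, 1))              # map
    counts = pairs.reduceByKey(lambda a, b: a + b)    # reduceByKey

    # Nothing has been computed so far. The action below triggers the job and
    # brings a non-distributed result back to the driver.
    print(counts.collect())   # e.g. [('spark', 2), ('hadoop', 1), ('yarn', 1)]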

Spark breaks a job into stages wherever shuffle activity occurs. Each stage is executed as a set of parallel tasks. The number of parallel tasks is directly dependent on the number of partitions.

However, apart from the number of tasks, the degree of parallelism is also limited by the number of available executors.

When a job is submitted, the driver implicitly converts the user code containing transformations and actions into a logical Directed Acyclic Graph (DAG). At this stage, it also performs optimizations such as pipelining transformations. The DAG Scheduler then converts the graph into stages; a new stage is created at each shuffle boundary.
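As a rough way to see the stage split, the lineage of a shuffled RDD can be printed with toDebugString (the exact output format varies between Spark versions):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("StagesDemo").getOrCreate().sparkContext

    counts = (sc.parallelize(["a", "b", "a"])
                .map(lambda w: (w, 1))                 # narrow: stays in the same stage
                .reduceByKey(lambda a, b: a + b))      # shuffle: starts a new stage

    # The indentation in the debug string marks the shuffle boundary between stages.
    print(counts.toDebugString().decode("utf-8"))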




