SK DATA SHARE


Sunday, May 15, 2022

Spark Deployment modes (Class -42)

Spark applications can be deployed and executed on a cluster using the spark-submit shell command. Through its uniform interface, spark-submit works with any of the supported cluster managers, such as YARN, Mesos, or Spark's own standalone cluster manager, with no extra configuration needed for each one separately.

To deploy a Spark application on a cluster, we use Spark's spark-submit script.

The general form of the command is:

                 spark-submit --class <main-class> --master <master-url> --deploy-mode <deploy-mode> <application-jar> [application-args]
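For illustration, a hypothetical submission might look like this (the class name, JAR, and file paths are placeholders, not from any real project):

                 spark-submit \
                   --class com.example.WordCount \
                   --master yarn \
                   --deploy-mode cluster \
                   wordcount.jar \
                   /input/words.txt /output/counts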



The spark-submit command is a utility to run or submit a Spark or PySpark application (or job) to the cluster by specifying options and configurations. The application you are submitting can be written in Scala, Java, or Python (PySpark). You can use this utility to do the following:

1.    Submit a Spark application to different cluster managers such as YARN, Kubernetes, Mesos, and Standalone.

2.    Submit a Spark application in client or cluster deploy mode.

A Spark application can be deployed in three modes (see the sketch after this list):

a)      Local → Spark itself allocates the resources (Standalone).

b)      YARN Client → the driver runs on the edge node. (Dev)

c)      YARN Cluster → the driver runs on one of the data nodes. (Prod)
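As a sketch, these three modes map to spark-submit options roughly as follows (com.example.App and app.jar are placeholders):

                 # Local: driver and executors run in a single JVM on one machine
                 spark-submit --class com.example.App --master local[*] app.jar

                 # YARN client mode: the driver runs on the edge node where the command is issued
                 spark-submit --class com.example.App --master yarn --deploy-mode client app.jar

                 # YARN cluster mode: the driver runs on one of the cluster's nodes
                 spark-submit --class com.example.App --master yarn --deploy-mode cluster app.jar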

We need to set up a Spark environment in an IDE such as Eclipse or IntelliJ IDEA. After creating the environment, we write the code and package it into an executable JAR file using a build tool (Maven). Finally, we take the JAR file and put it on the cluster.
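For illustration, a minimal Maven dependency for Spark Core might look like this (the Scala and Spark versions here are assumptions; match them to your cluster):

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.2.1</version>
        <!-- 'provided' because the cluster supplies Spark at runtime -->
        <scope>provided</scope>
    </dependency>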

Sunday, May 1, 2022

Temp Tables in SQL (Part -8)

As its name indicates, temporary tables are used to store data temporarily, and they support CRUD operations (Create, Read, Update, and Delete).

A temp table no longer exists once the application (session) that created it has been closed.

Temporary tables are dropped when the session that created them closes, or they can be explicitly dropped by users. At the same time, temporary tables can act like physical tables in many ways, which gives us more flexibility: for example, we can create constraints, indexes, or statistics on these tables. SQL Server provides two types of temporary tables according to their scope, as illustrated in the sketch after this list:

  • Local Temporary Table
  • Global Temporary Table
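A minimal T-SQL sketch of both kinds (the table and column names are illustrative only):

    -- Local temp table (# prefix): visible only to the session that creates it
    CREATE TABLE #Customers (Id INT, Name VARCHAR(50));
    INSERT INTO #Customers VALUES (1, 'Alice');
    SELECT * FROM #Customers;
    DROP TABLE #Customers;  -- optional; dropped automatically when the session closes

    -- Global temp table (## prefix): visible to all sessions; dropped when the
    -- creating session closes and no other session is still referencing it
    CREATE TABLE ##Customers (Id INT, Name VARCHAR(50));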

Saturday, April 30, 2022

Spark Core RDD Operations (Class -41)

Resilient Distributed Datasets (RDDs) are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system) or an existing Scala collection in the driver program, and transforming it.

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. 

Apache Spark RDDs support two types of operations:

  • Transformations

  • Actions
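A minimal Scala sketch tying this together (the input path is a placeholder): it creates RDDs from a file and from an existing collection, applies a lazy transformation, and triggers computation with actions.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddOpsDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RddOpsDemo").setMaster("local[*]"))

        // Creating RDDs: from a file, or from an existing Scala collection
        val lines   = sc.textFile("/tmp/input.txt")       // placeholder path
        val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

        // Transformation: lazily defines a new RDD; nothing executes yet
        val lengths = lines.map(_.length)

        // Actions: trigger the actual distributed computation
        println(s"Total characters: ${lengths.reduce(_ + _)}")
        println(s"Sum of numbers: ${numbers.sum()}")

        sc.stop()
      }
    }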

 


Friday, April 29, 2022

Spark Ecosystem (Class -40)

Apache Spark is an open-source analytical framework for large-scale, powerful distributed data processing and machine learning applications. Spark has been a top-level Apache project since February 2014. Spark is a general-purpose, in-memory, fault-tolerant, distributed processing engine that allows you to process data up to 100x faster than traditional systems.

Using Spark we can process data from Hadoop HDFS, AWS S3, Databricks DBFS, Azure Blob Storage, and many other file systems. Spark is also used to process real-time data using Spark Streaming and Kafka.


The Spark ecosystem consists of 5 tightly integrated components:

  • Spark Core

  • Spark SQL

  • Spark Streaming

  • MLlib

  • GraphX

Sunday, April 24, 2022

SPARK Introduction (Class -39)

Spark, introduced under the Apache Software Foundation, is a lightning-fast cluster computing technology, designed to speed up Hadoop's computational processes.

Spark is not a modified version of Hadoop and is not entirely dependent on Hadoop, because it has its own cluster management. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Spark is not part of the Hadoop ecosystem, as it can run on its own. Spark uses Hadoop in two ways: one is storage and the second is processing (MapReduce).

The Hadoop framework is known for analyzing datasets with a simple programming model (MapReduce); its main concern is maintaining speed when processing large datasets, in terms of both the waiting time between queries and the waiting time to run a program.


Earlier Hadoop Versions:

Hadoop 1.0 was introduced in 2006 and used up to 2012, when Hadoop 2.0 (YARN) came into the picture.

The main drawbacks of Hadoop 1.0 were:

a)    Single Point of Failure

b)    Block Size

c)    Relying on MapReduce (MR) for both resource management and the processing engine.

In 2008, Cloudera became the commercial distribution of Hadoop, available in open-source and enterprise editions.

Spark started as one of Hadoop's sub-projects, developed in 2009 at UC Berkeley's AMPLab by Matei Zaharia, originally built to test a resource manager called Mesos rather than to process data.

Spark was open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark has been a top-level Apache project since February 2014.