
Wednesday, May 18, 2022

Spark SQL Introduction (Class -45)

Spark SQL is a powerful Spark library for processing data by running SQL queries.

SparkSession is the starting point of Spark SQL; it provides the required libraries (APIs) to process data using a data structure called a 'DataFrame', which is essentially a Spark table against which we cannot write SQL queries directly. To process the data through SQL queries, we first need to register (convert) the DataFrame as a Spark SQL table.

In Spark, a DataFrame is conventionally assigned to an immutable variable such as 'val df'. To display the data we use 'df.show()' instead of 'println'; by default, the 'df.show()' command displays a maximum of 20 rows.
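
A minimal sketch in Scala of what this looks like end to end (the sample data, column names, and view name below are made up for illustration):

    import org.apache.spark.sql.SparkSession

    object SparkSqlIntro {
      def main(args: Array[String]): Unit = {
        // SparkSession is the entry point for Spark SQL
        val spark = SparkSession.builder()
          .appName("SparkSqlIntro")
          .master("local[*]") // local mode, just for testing
          .getOrCreate()
        import spark.implicits._

        // A DataFrame built from a small in-memory dataset
        val df = Seq((1, "alice", 3000), (2, "bob", 4000)).toDF("id", "name", "salary")
        df.show() // prints at most 20 rows by default

        // Register the DataFrame as a temporary view so SQL queries can run against it
        df.createOrReplaceTempView("employees")
        spark.sql("SELECT name, salary FROM employees WHERE salary > 3500").show()

        spark.stop()
      }
    }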


Monday, May 16, 2022

Spark Broadcast Accumulators (Class -44)

Shared variables are the second abstraction in Spark, after RDDs, that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, however, a variable needs to be shared across tasks, or between tasks and the driver program.

 Spark supports two types of shared variables: Broadcast variables, which can be used to cache a value in memory on all nodes, and Accumulators, which are variables that are only “added” to, such as counters and sums.
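
A small Scala sketch of both kinds of shared variables (the lookup map, record codes, and accumulator name are invented for illustration):

    import org.apache.spark.sql.SparkSession

    object SharedVariablesDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SharedVariablesDemo")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Broadcast variable: a read-only lookup cached in memory on every node
        val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

        // Accumulator: tasks can only add to it; the driver reads the final value
        val badRecords = sc.longAccumulator("badRecords")

        val codes = sc.parallelize(Seq("IN", "US", "XX", "IN"))
        val resolved = codes.map { code =>
          if (!countryNames.value.contains(code)) badRecords.add(1)
          countryNames.value.getOrElse(code, "Unknown")
        }

        resolved.collect().foreach(println)          // the action triggers the tasks
        println(s"Bad records: ${badRecords.value}") // read back on the driver

        spark.stop()
      }
    }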


Spark Architecture (Class -43)

Apache Spark is an open-source cluster-computing framework for real-time processing that is up to 100 times faster in memory and 10 times faster on disk compared to Apache Hadoop (MapReduce).

Apache Spark has a well-defined architecture in which all the Spark components and layers are loosely coupled, and it integrates with various extensions and libraries.

Spark is a distributed processing engine that follows a master-slave architecture: for every Spark application, it creates one master process and multiple slave processes.

When you run a Spark application, the Spark driver creates a context that serves as the entry point to your application; all operations (transformations and actions) are executed on the worker nodes, and the resources are managed by the cluster manager.
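
The division of labour can be seen in a small Scala sketch (the sample data is invented, and the master/cluster manager is assumed to be supplied via spark-submit):

    import org.apache.spark.sql.SparkSession

    object ArchitectureDemo {
      def main(args: Array[String]): Unit = {
        // The driver creates the SparkSession/SparkContext (the entry point);
        // the cluster manager allocates executors on the worker nodes.
        val spark = SparkSession.builder().appName("ArchitectureDemo").getOrCreate()
        val sc = spark.sparkContext

        val lines = sc.parallelize(Seq("spark is fast", "spark is distributed"))

        // Transformations are lazy: they only build up the execution plan (DAG)
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

        // The action triggers execution: tasks run on the executors and
        // the results are returned to the driver
        counts.collect().foreach(println)

        spark.stop()
      }
    }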

Features of Apache Spark:

Sunday, May 15, 2022

Spark Deployment modes (Class -42)

Spark applications can be deployed and executed on a cluster using the spark-submit shell command. Through its uniform interface, spark-submit can use any of the cluster managers, such as YARN, Mesos, or Spark's own standalone cluster manager, with no extra configuration needed for each of them separately.

For deploying our Spark application on a cluster, we need to use the spark-submit script of Spark.

The command takes the following general form:

                 spark-submit --class <main-class> --master <mode> <application-jar> <input-path> <output-path>



The spark-submit command is a utility to run or submit a Spark or PySpark application program (or job) to the cluster by specifying options and configurations; the application being submitted can be written in Scala, Java, or Python (PySpark). You can use this utility to do the following:

1.    Submitting a Spark application to different cluster managers such as YARN, Kubernetes, Mesos, and Standalone.

2.    Submitting a Spark application in client or cluster deployment mode.

A Spark application can be deployed in one of three modes:

a)      Local → Spark itself allocates the resources (Standalone).

b)      YARN Client → The driver runs on the edge node. (Dev)

c)       YARN Cluster → The driver runs on one of the data nodes. (Prod)

We need to set up a Spark environment in an IDE such as Eclipse or IntelliJ IDEA. After writing the code, we package it into an executable file (JAR) using a build tool (Maven). Finally, we copy the JAR file to the cluster and submit it with spark-submit, as shown in the examples below.
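
Hypothetical spark-submit invocations for the three modes (the class name, JAR name, and paths are placeholders):

    # Local / standalone: Spark itself allocates the resources
    spark-submit --class com.example.WordCount --master local[*] wordcount.jar /data/in /data/out

    # YARN client mode: the driver runs on the edge node (Dev)
    spark-submit --class com.example.WordCount --master yarn --deploy-mode client wordcount.jar /data/in /data/out

    # YARN cluster mode: the driver runs on one of the cluster nodes (Prod)
    spark-submit --class com.example.WordCount --master yarn --deploy-mode cluster wordcount.jar /data/in /data/out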

Sunday, May 1, 2022

Temp Tables in SQL (Part -8)

As their name indicates, temporary tables are used to store data temporarily, and they support CRUD operations (Create, Read, Update, and Delete).

A temp table no longer exists once the application (session) that created it has been closed.

Temporary tables are dropped when the session that created them is closed, or they can be explicitly dropped by users. At the same time, temporary tables can act like physical tables in many ways, which gives us more flexibility: for example, we can create constraints, indexes, or statistics on these tables. SQL Server provides two types of temporary tables according to their scope:

  • Local Temporary Table
  • Global Temporary Table
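
A small T-SQL illustration of the two kinds (the table and column names are made up):

    -- Local temporary table: visible only to the session that creates it
    CREATE TABLE #OrderStaging (OrderId INT PRIMARY KEY, Amount DECIMAL(10, 2));
    INSERT INTO #OrderStaging VALUES (1, 250.00);
    SELECT * FROM #OrderStaging;
    DROP TABLE #OrderStaging;   -- can also be dropped explicitly before the session ends

    -- Global temporary table: visible to all sessions; dropped once the creating
    -- session closes and no other session is still referencing it
    CREATE TABLE ##SharedLookup (Code CHAR(2), Description VARCHAR(50));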