SK DATA SHARE


Friday, June 3, 2022

SQL Group By Statement (Part -9)

 Grouping similar data together is called 'Group By'.

The SQL GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups. This GROUP BY clause follows the WHERE clause in a SELECT statement and precedes the ORDER BY clause.

The GROUP BY statement is often used with aggregate functions (COUNT(), MAX(), MIN(), SUM(), AVG()) to group the result set by one or more columns.

Example: display the count of records for each country.

Syntax:
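The general form is SELECT column(s), aggregate_function(column) FROM table GROUP BY column(s). Below is a minimal sketch of this syntax using Spark SQL in Scala; the 'customers' table, the 'country' column, and the input file are assumptions for illustration, not from the original post:

import org.apache.spark.sql.SparkSession

object GroupByExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GroupByExample").master("local[*]").getOrCreate()

    // Hypothetical input: customers.csv with a header row and a 'country' column
    val df = spark.read.option("header", "true").csv("customers.csv")
    df.createOrReplaceTempView("customers")

    // GROUP BY: one result row per country with the number of records in that country
    spark.sql("SELECT country, COUNT(*) AS country_count FROM customers GROUP BY country").show()

    spark.stop()
  }
}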


Wednesday, May 18, 2022

Spark SQL Introduction (Class -45)

Spark SQL is a powerful Spark library for running SQL queries to process data.

SparkSession is the starting point in Spark SQL; it provides the required libraries (APIs) to process data through a data structure called a 'DataFrame', which is essentially a Spark table (but we cannot write SQL queries against it directly). We need to register or convert the DataFrame into a Spark SQL table before we can process the data through SQL queries.

 In Spark, we assign a DataFrame to an immutable variable, e.g. 'val df'. To display the data, we use 'df.show()' instead of 'println'; by default, 'df.show()' displays a maximum of 20 rows.
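A minimal sketch in Scala, assuming a hypothetical input file 'people.json' with 'name' and 'age' fields; it shows creating a SparkSession, loading a DataFrame, registering it as a temporary view, and querying it with SQL:

import org.apache.spark.sql.SparkSession

object SparkSqlIntro {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point to Spark SQL
    val spark = SparkSession.builder().appName("SparkSqlIntro").master("local[*]").getOrCreate()

    // Load data into a DataFrame (hypothetical file and columns)
    val df = spark.read.json("people.json")

    // Register the DataFrame as a temporary view so SQL queries can run against it
    df.createOrReplaceTempView("people")

    val result = spark.sql("SELECT name, age FROM people WHERE age > 21")
    result.show()   // show() displays at most 20 rows by default

    spark.stop()
  }
}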


Monday, May 16, 2022

Spark Broadcast Accumulators (Class -44)

Shared variables are the second abstraction in Spark after RDDs that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program.

 Spark supports two types of shared variables: Broadcast variables, which can be used to cache a value in memory on all nodes, and Accumulators, which are variables that are only “added” to, such as counters and sums.
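A minimal sketch in Scala, assuming a local SparkContext; it broadcasts a small lookup map to all nodes and uses a long accumulator as a counter (the map contents and codes are illustrative):

import org.apache.spark.sql.SparkSession

object SharedVariablesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SharedVariablesDemo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Broadcast variable: a read-only lookup map cached in memory on every node
    val countryCodes = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    // Accumulator: a counter that tasks can only add to
    val unknownCount = sc.longAccumulator("unknownCountries")

    val codes = sc.parallelize(Seq("IN", "US", "UK", "IN"))
    val names = codes.map { code =>
      countryCodes.value.getOrElse(code, { unknownCount.add(1); "Unknown" })
    }

    names.collect().foreach(println)                    // action triggers the computation
    println(s"Unknown codes: ${unknownCount.value}")    // read the accumulator on the driver

    spark.stop()
  }
}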


Spark Architecture (Class -43)

 Apache Spark is an open-source cluster-computing framework for real-time processing that is up to 100 times faster in memory and up to 10 times faster on disk compared to Apache Hadoop MapReduce.

Apache Spark has a well-defined architecture integrated with various extensions and libraries, where all the Spark components and layers are loosely coupled.

Spark is a distributed processing engine and it follows a master-slave architecture. So, for every Spark application, it creates one master (driver) process and multiple slave (executor) processes.

When you run a Spark application, the Spark driver creates a context that is the entry point to your application; all operations (transformations and actions) are executed on worker nodes, and the resources are managed by the cluster manager.
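A minimal sketch in Scala of this flow: the driver creates the SparkSession/SparkContext, transformations define the work lazily, and an action triggers execution on the executors. The 'local[*]' master is an assumption for running it locally:

import org.apache.spark.sql.SparkSession

object ArchitectureDemo {
  def main(args: Array[String]): Unit = {
    // The driver creates the SparkSession/SparkContext: the entry point to the application
    val spark = SparkSession.builder().appName("ArchitectureDemo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 100)     // distributed dataset (RDD)
    val evens = numbers.filter(_ % 2 == 0)     // transformation: defined lazily
    println(evens.count())                     // action: executed on the worker nodes

    spark.stop()
  }
}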

Features of Apache Spark:

Sunday, May 15, 2022

Spark Deployment modes (Class -42)

Spark applications can be deployed and executed on a cluster using spark-submit from a shell command. It can use any of the cluster managers, such as YARN, Mesos, or Spark's own standalone cluster manager, through a uniform interface, so no extra configuration is needed for each of them separately.

To deploy our Spark application on a cluster, we use Spark's spark-submit script.

The spark-submit command has the following general form:

                 spark-submit --class <classname> --master <mode> <application-jar> <input-path> <output-path>



The spark-submit command is a utility to run or submit a Spark or PySpark application (or job) to the cluster by specifying options and configurations. The application you are submitting can be written in Scala, Java, or Python (PySpark). You can use this utility to do the following:

1.    Submitting a Spark application to different cluster managers such as YARN, Kubernetes, Mesos, and Standalone.

2.    Submitting a Spark application in client or cluster deployment mode.

A Spark application can be deployed in one of three modes:

a)      Local → Spark itself allocates the resources (Standalone).

b)      YARN Client → the driver runs on the Edge node (Dev).

c)      YARN Cluster → the driver runs on one of the Data Nodes (Prod).

We need to set up a Spark environment in an IDE such as Eclipse or IntelliJ IDEA. After creating the environment, we write the code and package it into an executable JAR file using a build tool (Maven). Finally, we copy the JAR file to the cluster and submit it with spark-submit.
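A minimal sketch of such an application, assuming a hypothetical word-count job whose input and output paths are passed as arguments on the spark-submit command line (the class name and paths are illustrative, not from the original post):

import org.apache.spark.sql.SparkSession

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val Array(inputPath, outputPath) = args   // paths supplied as spark-submit application arguments

    // No .master() here: the master and deploy mode are supplied by spark-submit (local, YARN client/cluster)
    val spark = SparkSession.builder().appName("WordCountApp").getOrCreate()
    val sc = spark.sparkContext

    sc.textFile(inputPath)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(outputPath)

    spark.stop()
  }
}

Once packaged into a JAR with Maven, it could be submitted with something like: spark-submit --class WordCountApp --master yarn --deploy-mode cluster wordcount.jar <input-path> <output-path>.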