SK DATA SHARE

Friday, June 3, 2022

“WHERE”, “AND”, “OR” Clauses in SQL (Part -10)

The WHERE clause in SQL is a data manipulation language (DML) construct. It is used in SELECT, UPDATE, DELETE statements, etc.

The WHERE clause is not a mandatory part of a SQL DML statement, but it can be used to limit the number of rows affected by a DML statement or returned by a query.
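
As a rough sketch of how WHERE works together with AND and OR (run here through Spark SQL, which is covered in the Spark SQL class below; the ‘employees’ table, its columns and values are only assumed for illustration):

    // Minimal sketch: WHERE limits rows; AND needs both conditions, OR needs either.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("WhereClauseDemo").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, registered so it can be queried with SQL.
    val employees = Seq(
      ("Asha",  "Sales", 60000, "Hyderabad"),
      ("Ravi",  "Sales", 45000, "Chennai"),
      ("Meena", "HR",    70000, "Hyderabad")
    ).toDF("emp_name", "department", "salary", "city")
    employees.createOrReplaceTempView("employees")

    val result = spark.sql("""
      SELECT emp_name, department, salary
      FROM   employees
      WHERE  department = 'Sales'
        AND  (salary > 50000 OR city = 'Hyderabad')
    """)

    result.show()   // only the rows satisfying the combined conditions are returned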


SQL Group By Statement (Part -9)

Grouping similar data together is called ‘Group By’.

The SQL GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups. This GROUP BY clause follows the WHERE clause in a SELECT statement and precedes the ORDER BY clause.

The GROUP BY statement is often used with aggregate functions (COUNT(), MAX(), MIN(), SUM(), AVG()) to group the result set by one or more columns.

Example: display the count of records for each country.

Syntax:
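
A hedged sketch of the country-count example, run through Spark SQL; the ‘customers’ table and its values are assumed purely for illustration:

    // Minimal sketch: GROUP BY collects identical country values into groups,
    // and COUNT(*) returns the number of rows in each group.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("GroupByDemo").getOrCreate()
    import spark.implicits._

    val customers = Seq(
      ("c1", "India"), ("c2", "India"), ("c3", "USA"), ("c4", "Germany")
    ).toDF("customer_id", "country")
    customers.createOrReplaceTempView("customers")

    val countsPerCountry = spark.sql("""
      SELECT   country, COUNT(*) AS customer_count
      FROM     customers
      GROUP BY country
      ORDER BY customer_count DESC
    """)

    countsPerCountry.show()   // one row per country with its count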


Wednesday, May 18, 2022

Spark SQL Introduction (Class -45)

Spark SQL is a powerful library for running SQL queries to process data.

SparkSession is the starting point of Spark SQL; it provides the libraries (APIs) needed to process data using a data structure called a ‘DataFrame’, which is essentially a Spark table but one against which we cannot write SQL queries directly. We need to register or convert the DataFrame into a Spark SQL table before we can process the data through SQL queries.

A Spark DataFrame is typically assigned to an immutable variable, e.g. ‘val df’. To display the data we use ‘df.show()’ instead of ‘println’; by default, ‘df.show()’ displays a maximum of 20 rows.
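
A minimal sketch of the DataFrame-to-SQL flow described above; the file path, schema and column names are assumptions for illustration:

    // Read a CSV into a DataFrame, register it as a Spark SQL table (temp view),
    // and then query it with SQL.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("SparkSQLIntro").getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/employees.csv")        // hypothetical input path

    // A DataFrame alone cannot be queried with SQL; register it as a view first.
    df.createOrReplaceTempView("employees")

    val highPaid = spark.sql("SELECT emp_name, salary FROM employees WHERE salary > 50000")

    highPaid.show()   // show() prints at most 20 rows by default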


Monday, May 16, 2022

Spark Broadcast Variables & Accumulators (Class -44)

Shared variables are the second abstraction in Spark, after RDDs, that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program.

 Spark supports two types of shared variables: Broadcast variables, which can be used to cache a value in memory on all nodes, and Accumulators, which are variables that are only “added” to, such as counters and sums.
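
A minimal sketch of both kinds of shared variables; the lookup map, country codes and accumulator name are assumed for illustration:

    // A broadcast variable caches a read-only value on every node, while a
    // long accumulator is only ever "added" to by tasks and read on the driver.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("SharedVariablesDemo").getOrCreate()
    val sc = spark.sparkContext

    val lookup = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))
    val badRecords = sc.longAccumulator("badRecords")

    val codes = sc.parallelize(Seq("IN", "US", "XX", "IN"))

    val countries = codes.map { code =>
      lookup.value.getOrElse(code, {
        badRecords.add(1)              // tasks can only add to the accumulator
        "Unknown"
      })
    }

    countries.collect().foreach(println)
    println(s"Unrecognised codes: ${badRecords.value}")   // total read on the driver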


Spark Architecture (Class -43)

Apache Spark is an open-source cluster-computing framework for real-time processing that can be up to 100 times faster in memory and 10 times faster on disk when compared to Apache Hadoop MapReduce.

Apache Spark has a well-defined architecture, integrated with various extensions and libraries, in which all the Spark components and layers are loosely coupled.

Spark is a distributed processing engine that follows a master-slave architecture: for every Spark application, it creates one master process and multiple slave processes.

When you run a Spark application, the Spark driver creates a context that serves as the entry point to your application; all operations (transformations and actions) are executed on worker nodes, and the resources are managed by the cluster manager.
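
A minimal sketch of that flow, assuming a local master for simplicity: the driver builds the context, transformations stay lazy, and the action is what triggers tasks on the executors.

    // Transformations (filter) are lazy; the action (count) launches the job
    // whose tasks run on the worker nodes.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ArchitectureDemo")
      .master("local[*]")               // master URL; a real cluster manager URL would go here
      .getOrCreate()
    val sc = spark.sparkContext          // driver-side entry point (context)

    val numbers = sc.parallelize(1 to 1000000)    // data distributed across executors
    val evens   = numbers.filter(_ % 2 == 0)      // transformation: nothing runs yet
    val total   = evens.count()                   // action: tasks execute on worker nodes

    println(s"Even numbers counted across the cluster: $total")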

Features of Apache Spark: