SK DATA SHARE


Thursday, June 16, 2022

SQL RANK() Function (Part -13)

The RANK() function is a window function that assigns a rank to each row within a partition of a result set. The rank of a row is one plus the number of ranks that come before it.

The syntax of the RANK() function is as follows:

       SELECT column_name,
              RANK() OVER (PARTITION BY ... ORDER BY ...) AS rank
       FROM table_name;

In this syntax:

The column_name represents the column that you wish to rank in the table.

The PARTITION BY clause divides the result set's rows into partitions based on one or more columns or expressions.

The ORDER BY clause sorts the rows within each partition before the function is applied.

The RANK() function is used as part of the SELECT statement.

Basically, you add another column to your result set. This column includes the rank of each record based on the order specified after the ORDER BY keyword. This entails specifying (1) the column to use for sorting the rows and (2) whether the order should be ascending or descending.

The first row gets rank 1, and the following rows get higher ranks. If any rows have the same value in the column used for ordering, they are ranked the same, and the RANK() function leaves gaps in the sequence after such ties.
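As a hedged illustration, assume a hypothetical employees table with emp_name, dept, and salary columns; the query below ranks employees by salary within each department, and tied salaries share a rank, leaving a gap after the tie.

       -- Hypothetical employees(emp_name, dept, salary) table, for illustration only
       SELECT emp_name,
              dept,
              salary,
              RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
       FROM employees;
       -- If two rows in a dept share the top salary, both get rank 1 and the next row gets rank 3.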

SQL SUB QUERY (Part -12)

 A subquery is a query within another query. There are two types of subqueries:

1)     Single-row subquery

2)     Multi-row subquery

In other words, a subquery (also called an inner query or nested query) is a query embedded within another SQL query, most commonly in the WHERE clause.

A subquery is used to return data that will be used in the main query as a condition to further restrict the data to be retrieved.

Subqueries can be used with SELECT, INSERT, UPDATE, and DELETE statements, along with operators such as =, <, >, >=, <=, IN, BETWEEN, etc. (see the sketch after the list below).

  • A subquery may occur in:
    • - A SELECT clause
    • - A FROM clause
    • - A WHERE clause
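A minimal sketch, assuming hypothetical employees and departments tables: the subquery in the WHERE clause returns the department IDs located in 'London', and the outer query keeps only the employees in those departments.

       SELECT emp_name, dept_id
       FROM employees
       WHERE dept_id IN (SELECT dept_id              -- inner query runs first
                         FROM departments
                         WHERE location = 'London');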

Friday, June 3, 2022

SQL JOINS (Part -11)

 The SQL JOIN statement is used to combine data or rows from two or more tables based on a common field between them. We use SQL joins to access data from multiple tables; a sample INNER JOIN appears after the list below.

There are 6 types of Joins:

1)     INNER JOIN

2)     FULL OUTER JOIN

3)     LEFT OUTER JOIN

4)     RIGHT OUTER JOIN

5)     CROSS JOIN

6)     SELF JOIN
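As a hedged sketch with hypothetical employees and departments tables, an INNER JOIN returns only the rows that have a matching dept_id in both tables.

       -- Hypothetical tables, for illustration: employees(emp_id, emp_name, dept_id)
       --                                        departments(dept_id, dept_name)
       SELECT e.emp_name, d.dept_name
       FROM employees e
       INNER JOIN departments d
               ON e.dept_id = d.dept_id;   -- keeps only rows with a matching dept_id in both tables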


“WHERE”, “AND”, “OR” Clauses in SQL (Part -10)

 The WHERE clause in SQL is part of data manipulation language statements; it is used in SELECT, UPDATE, and DELETE statements, among others.

The WHERE clause is not a mandatory clause of SQL DML statements, but it can be used to limit the number of rows affected by a DML statement or returned by a query.
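A short hedged sketch, assuming a hypothetical customers table: AND narrows the result to rows matching both conditions, while OR keeps rows matching either one.

       -- Hypothetical customers(cust_name, country, city) table, for illustration only
       SELECT cust_name
       FROM customers
       WHERE country = 'Germany'
         AND (city = 'Berlin' OR city = 'Munich');   -- parentheses group the OR conditions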


SQL Group By Statement (Part -9)

 Grouping similar data together is what the GROUP BY clause does.

The SQL GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups. This GROUP BY clause follows the WHERE clause in a SELECT statement and precedes the ORDER BY clause.

The GROUP BY statement is often used with aggregate functions (COUNT(), MAX(), MIN(), SUM(), AVG()) to group the result set by one or more columns.

Example: display the count of records for each country.

Syntax:
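(A hedged sketch; the customers table and its column names below are assumptions, not from the original post.)

       -- Hypothetical customers(cust_id, country) table, for illustration only
       SELECT country,
              COUNT(cust_id) AS customer_count
       FROM customers
       WHERE cust_id IS NOT NULL        -- WHERE filters rows before grouping
       GROUP BY country                 -- one output row per country
       ORDER BY customer_count DESC;    -- ORDER BY runs after GROUP BY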


Wednesday, May 18, 2022

Spark SQL Introduction (Class -45)

Spark SQL is a powerful Spark library for processing data with SQL queries.

SparkSession is the entry point to Spark SQL; it provides the APIs required to process data through SQL using a data structure called a 'DataFrame', which is essentially a Spark table against which we cannot write SQL queries directly. We need to register (or convert) the DataFrame as a Spark SQL table to process the data through SQL queries.

 In Spark, a DataFrame is typically held in an immutable variable, e.g. 'val df'. While displaying the data, we use 'df.show()' instead of 'println'; by default, 'df.show()' displays a maximum of 20 rows.
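A minimal sketch, assuming a local SparkSession and a hypothetical people.json input file: it reads a DataFrame, registers it as a temporary view, and queries it with SQL.

       import org.apache.spark.sql.SparkSession

       object SparkSqlIntro {
         def main(args: Array[String]): Unit = {
           // SparkSession is the entry point to Spark SQL
           val spark = SparkSession.builder()
             .appName("SparkSqlIntro")
             .master("local[*]")
             .getOrCreate()

           // Hypothetical input file, for illustration only
           val df = spark.read.json("people.json")

           // Register the DataFrame as a temporary SQL view
           df.createOrReplaceTempView("people")

           // Ordinary SQL can now be run against the view
           val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
           adults.show()   // displays at most 20 rows by default

           spark.stop()
         }
       }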


Monday, May 16, 2022

Spark Broadcast Variables & Accumulators (Class -44)

Shared variables are the second abstraction in Spark after RDDs and can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program.

 Spark supports two types of shared variables: Broadcast variables, which can be used to cache a value in memory on all nodes, and Accumulators, which are variables that are only “added” to, such as counters and sums.
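A hedged sketch, assuming an existing SparkContext named 'sc': a broadcast variable caches a lookup map on every node, and a long accumulator counts matching records.

       // Assuming an existing SparkContext named sc, for illustration only
       val lookup  = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))  // cached on all nodes
       val matched = sc.longAccumulator("matched records")                        // add-only counter

       val codes = sc.parallelize(Seq("IN", "US", "UK", "IN"))
       val countries = codes.map { code =>
         if (lookup.value.contains(code)) matched.add(1)   // tasks can only add to the accumulator
         lookup.value.getOrElse(code, "Unknown")
       }

       countries.collect().foreach(println)
       println(s"Matched: ${matched.value}")   // only the driver reads the accumulator's value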


Spark Architecture (Class -43)

 Apache Spark is an open-source cluster-computing framework for real-time processing that is up to 100 times faster in memory and 10 times faster on disk when compared to Apache Hadoop.

Apache Spark has a well-defined architecture integrated with various extensions and libraries where all the spark components and layers are loosely coupled.

Spark is a distributed processing engine that follows a master-slave architecture. So, for every Spark application, it creates one master (driver) process and multiple slave (executor) processes.

When you run a Spark application, Spark Driver creates a context that is an entry point to your application, and all operations (transformations and actions) are executed on worker nodes, and the resources are managed by Cluster Manager.

Features of Apache Spark:

Sunday, May 15, 2022

Spark Deployment modes (Class -42)

Spark applications can be deployed and executed using spark-submit, a shell command, on a cluster. Spark can use any of the cluster managers, such as YARN, Mesos, or its own standalone cluster manager, through a uniform interface, and no extra configuration is needed for each of them separately.

For deploying our Spark application on a cluster, we need to use the spark-submit script of Spark.

Spark submits the application with a command of the general form:

                 spark-submit --class <class-name> --master <mode> <jar-file> <input-file> <output-location>



The spark-submit command is a utility to run or submit a Spark or PySpark application program (or job) to the cluster by specifying options and configurations, the application you are submitting can be written in Scala, Java, or Python (PySpark) code. You can use this utility in order to do the following.

1.    Submitting Spark application on different cluster managers like Yarn, Kubernetes, Mesos, and Stand-alone.

2.    Submitting Spark application on client or cluster deployment modes

A Spark application can be deployed in one of 3 modes:

a)      Local → Spark itself allocates the resources (Standalone).

b)      YARN Client → the driver runs on the edge node. (Dev)

c)       YARN Cluster → the driver runs on one of the data nodes. (Prod)

We need to create a Spark environment for IDEs like Eclipse & IntelliJ IDEA. After creating the environment, we write the code and package it into an executable file (jar) using a build tool (Maven). Finally, we take the jar file, put it onto the cluster, and submit it as sketched below.
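As a hedged sketch (the class name, jar name, and paths are hypothetical), client-mode and cluster-mode submissions might look like this:

       # YARN client mode (Dev) - driver runs on the edge node
       spark-submit \
         --class com.example.WordCount \
         --master yarn \
         --deploy-mode client \
         wordcount.jar /input/data.txt /output/result

       # YARN cluster mode (Prod) - driver runs on one of the data nodes
       spark-submit \
         --class com.example.WordCount \
         --master yarn \
         --deploy-mode cluster \
         wordcount.jar /input/data.txt /output/result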

Sunday, May 1, 2022

Temp Tables in SQL (Part -8)

As their name indicates, temporary tables are used to store data temporarily, and they support CRUD operations (Create, Read, Update, and Delete).

A temp table no longer exists once the application (session) that created it has been closed.

Temporary tables are dropped when the session that created them is closed, or they can be explicitly dropped by users. At the same time, temporary tables can act like physical tables in many ways, which gives us more flexibility: for example, we can create constraints, indexes, or statistics on these tables. SQL Server provides two types of temporary tables according to their scope (a short sketch follows the list):

  • Local Temporary Table
  • Global Temporary Table
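A brief hedged sketch using SQL Server syntax, with a hypothetical #temp_orders table: the # prefix creates a local temporary table, while ## would create a global one.

       -- Local temporary table (visible only to the current session); columns are illustrative
       CREATE TABLE #temp_orders (
           order_id  INT PRIMARY KEY,
           amount    DECIMAL(10, 2)
       );

       INSERT INTO #temp_orders (order_id, amount) VALUES (1, 250.00);
       SELECT * FROM #temp_orders;

       DROP TABLE #temp_orders;   -- explicit drop; otherwise dropped when the session ends
       -- A global temporary table would use a ## prefix, e.g. ##temp_orders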

Saturday, April 30, 2022

Spark Core RDD Operations (Class -41)

Resilient Distributed Datasets (RDDs) are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it.

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. 

Apache Spark RDDs support two types of operations (a short sketch follows the list):

  • Transformations

  • Actions
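A minimal sketch, assuming an existing SparkContext named 'sc': map and filter are lazy transformations that only describe new RDDs, while count and collect are actions that trigger execution.

       // Assuming an existing SparkContext named sc, for illustration only
       val numbers = sc.parallelize(1 to 10)          // create an RDD from a Scala collection

       // Transformations: lazy, they only describe new RDDs
       val doubled = numbers.map(_ * 2)
       val evens   = doubled.filter(_ % 4 == 0)

       // Actions: trigger the actual computation on the cluster
       println(evens.count())                         // 5
       evens.collect().foreach(println)               // brings the results back to the driver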

 


Friday, April 29, 2022

Spark Ecosystem (Class -40)

 Apache Spark is an open-source analytical framework for large-scale, powerful distributed data processing and machine learning applications. Spark has been a top-level Apache project since Feb 2014. Spark is a general-purpose, in-memory, fault-tolerant, distributed processing engine that allows you to process data efficiently, up to 100x faster than traditional systems.

Using Spark we can process data from Hadoop HDFS, AWS S3, Databricks DBFS, Azure Blob Storage, and many other file systems. Spark is also used to process real-time data using Spark Streaming and Kafka.


The Spark ecosystem consists of 5 tightly integrated components, which are:

  • Spark Core

  • Spark SQL

  • Spark Streaming

  • MLlib

  • GraphX

Sunday, April 24, 2022

SPARK Introduction (Class -39)

Spark, introduced through the Apache Software Foundation, is a lightning-fast cluster computing technology designed to speed up Hadoop's computational software process.

Spark is not a modified version of Hadoop and not entirely dependent on Hadoop because it has its own cluster management. The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.

Spark is not a component of the Hadoop ecosystem, as it can run independently. Spark uses Hadoop in two ways: one is storage and the second is processing (MapReduce).

Hadoop frameworks are known for analyzing datasets with a simple programming model (MapReduce); the main concern is maintaining speed when processing large datasets, in terms of both the waiting time between queries and the waiting time to run a program.


Earlier Hadoop Versions:

Hadoop 1.0 was introduced in 2006 and used up to 2012, when Hadoop 2.0 (YARN) came into the picture.

The main drawbacks of Hadoop 1.0 are:

a)    Single Point of Failure

b)    Block Size

c)    Relying on MapReduce (MR) for both resource management and the processing engine.

In 2008, Cloudera became a commercial distribution of Hadoop, offered in both open-source and enterprise editions.

Spark began as one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia, while the team was testing a resource manager called the Mesos cluster, not for processing data.

Spark was open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark has been a top-level Apache project since Feb 2014.

Saturday, April 23, 2022

Aggregate Functions in SQL (Part -7)

 An SQL aggregate function performs a calculation on multiple rows of a single column of a table and returns a single value. It is also used to summarize data.

We often use aggregate functions with the GROUP BY and HAVING clauses of the SELECT statement (aggregates cannot appear directly in a WHERE clause); a combined example follows the list below.


1)     SUM:

The SUM function is used to calculate the sum of all values in the selected column. It works on numeric fields only.


2)  COUNT:

The COUNT function is used to count the number of rows in a database table. It can work on both numeric and non-numeric data types.


3) MAX:

MAX function is used to find the maximum value of a certain column. This function determines the largest value of all selected values of a column.


4) MIN:

MIN function is used to find the minimum value of a certain column. This function determines the smallest value of all selected values of a column.


5)     AVG:

 The AVG function is used to calculate the average value of a numeric column. The AVG function returns the average of all non-NULL values.
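A hedged sketch combining these functions, assuming a hypothetical orders table: the query aggregates per customer and uses HAVING to filter on the aggregated value.

       -- Hypothetical orders(order_id, customer_id, amount) table, for illustration only
       SELECT customer_id,
              COUNT(order_id) AS order_count,
              SUM(amount)     AS total_spent,
              AVG(amount)     AS avg_order,
              MIN(amount)     AS smallest_order,
              MAX(amount)     AS largest_order
       FROM orders
       GROUP BY customer_id
       HAVING SUM(amount) > 1000;   -- HAVING filters on the aggregated value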


Thursday, April 21, 2022

Operators in SQL (Part -6)

 SQL operators are special words or characters used to perform mathematical and logical computations on operands, most often in the WHERE clause of an SQL query or statement.

There are six types of SQL operators that we are going to cover: Arithmetic, Bitwise, Comparison, Compound, Logical and String.

Every database administrator and user uses SQL queries to manipulate and access the data in database tables and views, with the help of these reserved words and characters, which perform arithmetic operations, logical operations, comparison operations, compound operations, and so on.

SQL Operators and Description:

  • Arithmetic: Add (+), Subtract (-), Multiply (*), Divide (/), Modulo (%)

  • Bitwise: AND (&), OR (|), Exclusive OR (^)

  • Comparison: Equal to (=), Greater than (>), Less than (<), Greater than or equal to (>=), Less than or equal to (<=), Not equal to (<>)

  • Compound: Add equals (+=), Subtract equals (-=), Multiply equals (*=), Divide equals (/=), Modulo equals (%=), Bitwise AND equals (&=), Bitwise exclusive OR equals (^=), Bitwise OR equals (|=)
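A brief hedged sketch with a hypothetical products table: arithmetic operators compute a derived column, and comparison operators filter the rows.

       -- Hypothetical products(product_name, price, tax_rate) table, for illustration only
       SELECT product_name,
              price + (price * tax_rate / 100) AS price_with_tax   -- arithmetic operators
       FROM products
       WHERE price >= 500                                          -- comparison operator
         AND product_name <> 'Sample';                             -- not-equal-to operator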


Wednesday, April 20, 2022

SQL Constraints (Part -5)

 SQL constraints are conditions or rules applied to the data columns of a table; they are used to limit the type of data that goes into the table.

Constraints are used to specify the rules concerning the data in the table. They can be applied to single or multiple fields in an SQL table during table creation, or afterwards using the ALTER TABLE command.

There are 3 types of Constraints:

A)     Key Constraint

B)     Domain Constraint

C)     Referential Integrity Constraint



The PRIMARY KEY constraint uniquely identifies each row in a table. It must contain UNIQUE values and has an implicit NOT NULL constraint.

A table in SQL is strictly restricted to have one and only one primary key, which is comprised of single or multiple fields (columns).

A UNIQUE constraint ensures that all values in a column are different. This provides uniqueness for the column(s) and helps identify each row uniquely. Unlike the primary key, there can be multiple UNIQUE constraints defined per table. The syntax for UNIQUE is quite similar to that of PRIMARY KEY.

A FOREIGN KEY consists of a single field or a collection of fields in a table that refers to the PRIMARY KEY in another table. The foreign key constraint ensures referential integrity in the relation between two tables.
   The table with the foreign key constraint is labelled as the child table, and the table containing the candidate key is labelled as the referenced or parent table.
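A hedged sketch with hypothetical departments and employees tables, showing PRIMARY KEY, UNIQUE, NOT NULL, and FOREIGN KEY constraints together:

       -- Hypothetical tables, for illustration only
       CREATE TABLE departments (
           dept_id    INT PRIMARY KEY,           -- unique and implicitly NOT NULL
           dept_name  VARCHAR(50) UNIQUE         -- no two departments may share a name
       );

       CREATE TABLE employees (
           emp_id     INT PRIMARY KEY,
           emp_name   VARCHAR(50) NOT NULL,      -- every row must supply a name
           dept_id    INT,
           FOREIGN KEY (dept_id) REFERENCES departments(dept_id)   -- referential integrity (child -> parent)
       );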

Monday, April 18, 2022

Create a Table in MySQL (Part -4)

 A MySQL table stores and organizes data in columns and rows as defined during table creation.


In the process of creating a table, you need to specify the following information:

  • Column names – we are creating the title, genre, director, and release year columns for our table.
  • VARCHAR for columns containing characters – specifies the maximum number of characters stored in the column.
  • INT for columns containing numbers – defines numeric variables holding whole numbers.
  • NOT NULL rule – indicates that each new record must contain information for the column.
  • Primary key – sets a column that uniquely identifies each record (a sample CREATE TABLE statement follows the list).
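A minimal sketch of the table described above, in MySQL syntax; the table name, column sizes, and the choice of title as the primary key are assumptions for illustration.

       -- Hypothetical movies table matching the columns described above
       CREATE TABLE movies (
           title        VARCHAR(100) NOT NULL,
           genre        VARCHAR(50),
           director     VARCHAR(60),
           release_year INT,
           PRIMARY KEY (title)                -- the column that identifies each record
       );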

SQL in RDBMS (Part -3)

An RDBMS (Relational Database Management System) is a type of DBMS that maintains relations between tables, where data is stored in the form of rows and columns.

In other words, Relational database is a type of database that allows us to identify and access data in relation to another piece of data in the database. It stores data in rows and columns in a series of tables to make processing and querying efficient.


In a table, rows (tuples) run horizontally, while columns (whose headers are called attributes – ID, Name, etc.) run vertically.

Degree of Relation = No. of Columns

Cardinality = No. of Rows

The set of values a column can hold is called its 'domain values'.
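For example, assuming a hypothetical students table with columns ID, Name, and City and two rows of data: its degree of relation is 3 (three columns), its cardinality is 2 (two rows), and the set of values allowed in the City column is that column's domain.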