
Wednesday, May 18, 2022

Spark SQL Introduction (Class - 45)

Spark SQL is a powerful Spark library for processing data with SQL queries.

SparkSession is the entry point to Spark SQL. It provides the APIs required to process data through a data structure called a ‘DataFrame’, which is essentially a Spark table (on which we cannot write SQL queries directly). To process the data through SQL queries, we need to register (convert) the DataFrame as a Spark SQL table.
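For example, a minimal sketch of creating a SparkSession in Scala (the application name and 'local[*]' master here are just placeholders for illustration):

import org.apache.spark.sql.SparkSession

// SparkSession is the entry point to Spark SQL
val spark = SparkSession.builder()
  .appName("SparkSqlIntro")   // hypothetical application name
  .master("local[*]")         // assumption: local mode for illustration
  .getOrCreate()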

 In Spark, a DataFrame is typically held in an immutable variable, e.g. ‘val df’. To display the data we use ‘df.show()’ instead of ‘println’; by default, ‘df.show()’ displays a maximum of 20 rows.
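A minimal sketch, assuming the SparkSession created above and a small in-memory sample dataset:

import spark.implicits._   // enables toDF() on Scala collections

// Hypothetical sample data; any DataFrame behaves the same way
val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

df.show()    // displays at most 20 rows by default
df.show(5)   // explicitly limits the number of rows displayed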



 In order to convert a Spark Table (DataFrame) into a Spark SQL table, we need to give the command below:

                                             df.createOrReplaceTempView("sparkSqlTbl")
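A short sketch of registering the DataFrame and querying it with SQL (the view name, columns, and query are assumptions based on the sample above):

// Register the DataFrame as a Spark SQL table (temporary view)
df.createOrReplaceTempView("sparkSqlTbl")

// SQL queries can now be run against the view
val result = spark.sql("SELECT id, name FROM sparkSqlTbl WHERE id > 1")
result.show()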


Using Spark SQL, we can process different kinds of data:



Spark SQL can read and write data from any file system. After reading data from files such as CSV, JSON, etc., the data is stored in a DataFrame, where it is processed; the final output is saved to Hive, HBase, an RDBMS, etc.

Using Spark SQL, we can process either structured data (RDBMS, Parquet, ORC, Avro) or semi-structured data (JSON, XML, CSV).
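As a hedged sketch of reading these formats and saving the output (the file paths and table name are placeholders, and saveAsTable assumes Hive support is enabled on the SparkSession):

// Semi-structured sources
val csvDf  = spark.read.option("header", "true").csv("/data/input.csv")
val jsonDf = spark.read.json("/data/input.json")

// Structured source
val parquetDf = spark.read.parquet("/data/input.parquet")

// Save the final output, e.g. to a Hive table
csvDf.write.mode("overwrite").saveAsTable("db.output_table")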

Spark SQL architecture consists of 3 main layers:

Language API: Spark SQL is compatible with and supported by languages like Python, HiveQL, Scala, and Java.

SchemaRDD: RDD (resilient distributed dataset) is the special data structure around which Spark Core is designed. Since Spark SQL works on schemas, tables, and records, you can use a SchemaRDD or DataFrame as a temporary table.

Data Sources: For Spark Core, the data source is usually a text file, Avro file, etc. Data sources for Spark SQL are different, such as JSON documents, Parquet files, Hive tables, and the Cassandra database.

Components of Spark SQL Library:


1)      DataFrame → Spark Table (data structure) – high-level representation of data

                                    RDD + Schema = DataFrame


2)      Data Source API (spark.read) → universal API – the I/O module of Spark SQL. It reads data from any file system and writes data to any file system (see the sketch after this list).


3)      Catalyst Optimizer – query optimization (logical and physical query plans)


4)      Tungsten – memory management (JVM memory)
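A combined sketch of components 1 and 2 above, showing how an RDD plus a schema becomes a DataFrame and how the Data Source API reads and writes data (the sample rows, schema, and paths are assumptions):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// 1) RDD + Schema = DataFrame
val rdd = spark.sparkContext.parallelize(Seq(Row(1, "alice"), Row(2, "bob")))
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val peopleDf = spark.createDataFrame(rdd, schema)

// 2) Data Source API: spark.read / df.write work against any supported file system
val input = spark.read.format("csv").option("header", "true").load("/data/input.csv")
input.write.format("parquet").mode("overwrite").save("/data/output.parquet")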

 



Summary of Spark SQL steps:

                                                           Read → Process → Write
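Putting the three steps together as a hedged end-to-end sketch (the paths, view name, and query are placeholders):

// Read
val orders = spark.read.option("header", "true").csv("/data/orders.csv")

// Process (through the DataFrame API or SQL on a temporary view)
orders.createOrReplaceTempView("orders")
val summary = spark.sql("SELECT status, COUNT(*) AS cnt FROM orders GROUP BY status")

// Write
summary.write.mode("overwrite").parquet("/data/orders_summary")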







