Apache Spark is an open-source analytics framework for large-scale distributed data processing and machine learning applications. Spark became a top-level Apache project in February 2014. It is a general-purpose, in-memory, fault-tolerant, distributed processing engine that can process data up to 100x faster than traditional systems.
Using Spark, we can process data from Hadoop HDFS, AWS S3, Databricks DBFS, Azure Blob Storage, and many other file systems. Spark is also used to process real-time data with Spark Streaming and Kafka.
The Spark ecosystem consists of five tightly integrated components:
Spark Core
Spark SQL
Spark Streaming
MLlib
GraphX
| Spark Libraries | Entry / Starting Point | Data Structure | Used for |
| --- | --- | --- | --- |
| Spark Core | SparkContext | Resilient Distributed Dataset (RDD) | Batch data |
| Spark SQL | SparkSession | Spark SQL table (DataFrame) | Batch data |
| Spark Streaming | StreamingContext | DStream | Live / real-time data |
| MLlib | Machine Learning | RDD-based API | Data Science |
| GraphX | Property Graph | Extends the Spark RDD with a Resilient Distributed Property Graph | Graph-parallel operations |
A) Spark Core:
Spark Core is the heart of Spark and provides its core functionality: it holds the components for task scheduling, fault tolerance (resilience), interacting with storage systems, and memory management.
As the table above shows, the RDD is the data structure of Spark Core: it reads data from disk and distributes it across the memory of the data nodes. An RDD is immutable.
In the RDD sketch below: val declares an immutable variable; sc is the SparkContext that the shell provides; pwd gives the file's home path in Cloudera; String is the element type of the RDD; and foreach, a higher-order Spark function, prints the data in the RDD.
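A minimal spark-shell sketch of those points. The file path is an assumption based on the Cloudera home directory used later in this post:

```scala
// spark-shell creates sc (the SparkContext) for you.
// The path below is an assumption: on the Cloudera VM, `pwd` in the
// home directory typically prints /home/cloudera.
val marks = sc.textFile("file:///home/cloudera/mark.csv") // RDD[String]

// `val` makes the reference immutable, and the RDD itself is immutable.
// foreach is a higher-order function: it takes println as an argument
// and applies it to every line of the RDD.
marks.foreach(println)
```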
Operations on an RDD are called 'lazy' because transformations are only evaluated when an action is triggered.
Transformations transform the data (e.g. map, filter, flatMap, reduceByKey).
Actions start evaluating the transformations (e.g. foreach, saveAsTextFile, count).
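A short sketch of lazy evaluation, continuing the same shell session (the `marks` RDD comes from the snippet above):

```scala
// Transformations are lazy: these two lines only build a lineage graph
// and execute nothing.
val fields   = marks.map(line => line.split(","))   // transformation
val nonEmpty = fields.filter(_.nonEmpty)            // transformation

// count is an action, so it triggers evaluation of the whole chain.
println(nonEmpty.count())
```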
Go to spark-shell (Scala) in Cloudera:
We can read a CSV file with the RDD textFile command. Pick a CSV file from the existing data in Cloudera using the ls -lrt command; here it is mark.csv. Type pwd to find the home path in Cloudera. Use foreach to print the data.
We can also use collect() to view the data, similar to foreach, but collect() returns the data as an Array on the driver. collect() should not be used for huge volumes of data, since it pulls the entire data set onto a single machine. A sketch of this session follows.
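A hedged sketch of the foreach vs. collect() distinction, again continuing the same session:

```scala
// collect() ships the entire RDD back to the driver as an Array,
// so use it only on small data sets.
val asArray: Array[String] = marks.collect()
asArray.foreach(println)

// On large data, take(n) retrieves just a few rows instead.
marks.take(5).foreach(println)
```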
B) Spark SQL:
Spark SQL is a module for structured data processing in Spark that integrates relational processing with Spark's functional programming API. It supports querying data either via SQL or via the Hive Query Language (HiveQL).
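A minimal sketch of the DataFrame and SQL entry point, assuming Spark 2.x and the same mark.csv path as above (the header option assumes the file's first row names the columns):

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell 2.x a SparkSession named `spark` already exists;
// the builder call is only needed in a standalone application.
val spark = SparkSession.builder().appName("SparkSQLExample").getOrCreate()

// Read the CSV as a DataFrame rather than a raw RDD[String].
val df = spark.read.option("header", "true").csv("file:///home/cloudera/mark.csv")

// Query it with SQL by registering a temporary view.
df.createOrReplaceTempView("marks")
spark.sql("SELECT * FROM marks LIMIT 10").show()
```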
C) Spark Streaming:
The Spark Streaming component is a useful addition to the core Spark API. It is used to process real-time streaming data and enables high-throughput, fault-tolerant processing of live data streams.
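A sketch of a DStream word count, with the host and port as assumptions for a local test (feed the socket with `nc -lk 9999` in another terminal):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// StreamingContext is the entry point; a DStream is a sequence of
// small RDDs, one per 5-second micro-batch.
val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

// Read lines from a TCP socket and count words per micro-batch.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```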
D) Spark MLlib:
MLlib stands for Machine Learning Library. Spark MLlib provides various machine learning algorithms such as classification, regression, clustering, and collaborative filtering. It also provides tools such as pipelines and model persistence, plus utilities for linear algebra, statistics, and data handling.
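A small clustering sketch with MLlib's DataFrame-based API; the data set is made up purely for illustration, and `spark` is the SparkSession from the Spark SQL example:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// A tiny in-memory data set with two obvious clusters.
val points = spark.createDataFrame(Seq(
  (0.0, 0.1), (0.2, 0.0), (9.0, 9.2), (9.1, 8.8)
)).toDF("x", "y")

// MLlib estimators expect the features in a single vector column.
val assembled = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(points)

// Fit a k-means model with two clusters and show the assignments.
val model = new KMeans().setK(2).setSeed(1L).fit(assembled)
model.transform(assembled).select("x", "y", "prediction").show()
```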
E) Spark GraphX:
GraphX is the Spark API for graphs and graph-parallel
computation. It extends the Spark RDD abstraction by introducing the Resilient
Distributed Property Graph, a directed multigraph with properties attached to
each vertex and edge.
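A minimal property-graph sketch; the vertex and edge data are made up for illustration:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices carry a property (a name); edges carry a relationship label.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

// Build the property graph and run a graph-parallel operation.
val graph = Graph(vertices, edges)
println(graph.numEdges)                    // 2
graph.inDegrees.collect().foreach(println) // (2,1), (3,1)
```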