
Friday, April 29, 2022

Spark Ecosystem (Class -40)

Apache Spark is an open-source analytics framework for large-scale distributed data processing and machine learning applications. Spark has been a top-level Apache project since February 2014. Spark is a general-purpose, in-memory, fault-tolerant, distributed processing engine that can process data up to 100x faster than traditional disk-based systems.

Using Spark we can process data from Hadoop HDFS, AWS S3, Databricks DBFS, Azure Blob Storage, and many other file systems. Spark is also used to process real-time data with Spark Streaming and Kafka.


The Spark ecosystem consists of five tightly integrated components:

Spark Core

Spark SQL

Spark Streaming

MLlib

GraphX

| Spark Library | Entry / Starting Point | Data Structure | Used for |
|---|---|---|---|
| Spark Core | SparkContext | Resilient Distributed Dataset (RDD) | Batch data |
| Spark SQL | SparkSession | Spark SQL table (DataFrame) | Batch data |
| Spark Streaming | StreamingContext | DStream | Live / real-time data |
| MLlib (Machine Learning) | | RDD-based API | Data science |
| GraphX | | Property Graph (extends the Spark RDD with a Resilient Distributed Property Graph) | Graph-parallel operations |


Data engineers working with Spark use structured and semi-structured data (Spark SQL & Spark Streaming), while data scientists use unstructured data (MLlib & GraphX).

A) Spark Core:

The Spark Core is the heart of Spark and provides its core functionality.

It holds the components for task scheduling, fault tolerance (resilience), interacting with storage systems, and memory management.

As we saw, the RDD is the data structure of Spark Core: it loads data from disk and distributes it across the memory of the data nodes. RDDs are immutable.



Here in the RDD example (a sketch is shown below),

val declares an immutable variable.

sc is the SparkContext.

The file's home path is the pwd in Cloudera.

String is the datatype of the RDD.
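A minimal sketch of the spark-shell line this describes (the file name mark.csv is taken from later in this post; the exact home path is an assumption):

```scala
// spark-shell creates sc (the SparkContext) automatically
// path is an assumption: the Cloudera home directory reported by `pwd`
val rdd = sc.textFile("file:///home/cloudera/mark.csv")
// rdd: org.apache.spark.rdd.RDD[String] -- each line of the file is one String element
```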

 

To print the data in the RDD, use foreach, one of Spark's higher-order functions.
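Reusing the rdd from the sketch above:

```scala
// print every element of the RDD
// (on a real cluster the output is written on the executors, not the driver)
rdd.foreach(println)
```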



Operations on an RDD are called 'lazy' because transformations are evaluated only when an action is triggered.

A transformation transforms the data (e.g. map, filter, flatMap, reduceByKey).

An action triggers evaluation of the transformations (e.g. foreach, saveAsTextFile, count).
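A short illustration of this laziness, reusing the rdd from above (the particular transformations chosen here are just examples):

```scala
// transformations: nothing runs yet, Spark only records the lineage
val upper    = rdd.map(_.toUpperCase)     // transformation
val nonBlank = upper.filter(_.nonEmpty)   // transformation

// action: triggers evaluation of the whole chain
println(nonBlank.count())                 // number of non-blank lines
```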


Go to the Scala spark-shell in Cloudera:

We can read a CSV file with the help of the RDD textFile command.



We need to pick a CSV file from the existing data in Cloudera using the ls -lrt command; here the file is mark.csv.

Type ‘pwd’ to know the home path in Cloudera.

Use ‘foreach’ to print the data.

We can also use collect() to view the data, similar to foreach, but collect() returns the data in the form of an Array. collect() should not be used for huge volumes of data, since it brings the entire dataset back to the driver.
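Putting the steps together, a sketch of the whole spark-shell session (the file name is from the post; the path is an assumption):

```scala
// read the CSV file into an RDD of lines
val marks = sc.textFile("file:///home/cloudera/mark.csv")

// print each line
marks.foreach(println)

// collect() returns the data to the driver as an Array[String];
// avoid it on huge datasets, it can exhaust driver memory
val arr = marks.collect()
arr.foreach(println)
```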






B) Spark SQL:

Spark SQL is a module for structured data processing in Spark which integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language.
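A minimal DataFrame sketch, assuming a Spark 2.x+ spark-shell where spark (a SparkSession) is pre-created, and the same mark.csv file as above:

```scala
// read the CSV into a DataFrame; header/inferSchema are assumptions about the file
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("file:///home/cloudera/mark.csv")

df.printSchema()

// querying via SQL: register the DataFrame as a temporary view first
df.createOrReplaceTempView("marks")
spark.sql("SELECT * FROM marks LIMIT 5").show()
```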


C) Spark Streaming:

The Spark Streaming component is a useful addition to the core Spark API. It is used to process real-time streaming data, and it enables high-throughput, fault-tolerant processing of live data streams.
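A minimal DStream sketch, assuming a text source on localhost:9999 (e.g. started with nc -lk 9999):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// StreamingContext is the entry point; one batch is formed every 5 seconds
val ssc = new StreamingContext(sc, Seconds(5))

// each batch of incoming lines arrives as a DStream[String]
val lines = ssc.socketTextStream("localhost", 9999)

// classic streaming word count
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```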


D) Spark MLlib:

MLlib stands for Machine Learning Library. Spark MLlib provides various machine learning algorithms such as classification, regression, clustering, and collaborative filtering. It also provides tools such as pipelines, persistence, and utilities for linear algebra operations, statistics, and data handling.
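A small sketch using one MLlib algorithm (KMeans from the DataFrame-based API; the toy data below is made up for illustration):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// toy two-column numeric dataset, invented for illustration
val data = spark.createDataFrame(Seq(
  (1.0, 1.0), (1.2, 0.8), (9.0, 8.0), (8.5, 9.1)
)).toDF("x", "y")

// MLlib estimators expect a single vector column of features
val features = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(data)

// cluster into k = 2 groups and show the learned centers
val model = new KMeans().setK(2).setFeaturesCol("features").fit(features)
model.clusterCenters.foreach(println)
```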


E) Spark GraphX:

GraphX is the Spark API for graphs and graph-parallel computation. It extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph, a directed multigraph with properties attached to each vertex and edge.
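A tiny property-graph sketch (the vertex and edge data are made up for illustration):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// vertices are (id, property) pairs; edges carry their own property
val users   = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(users, follows)
println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")

// a simple graph-parallel operation: the in-degree of each vertex
graph.inDegrees.collect().foreach(println)
```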
