Apache Spark is an open-source analytics framework for large-scale distributed data processing and machine learning applications. Spark became a top-level Apache project in February 2014. It is a general-purpose, in-memory, fault-tolerant, distributed processing engine that can process data up to 100x faster than traditional systems.
Using Spark, we can process data from Hadoop HDFS, AWS S3, Databricks DBFS, Azure Blob Storage, and many other file systems. Spark is also used to process real-time data with Spark Streaming and Kafka.
The Spark ecosystem consists of five tightly integrated components:
Spark Core
Spark SQL
Spark Streaming
MLlib
GraphX
| Spark Libraries | Entry / Starting Point | Data Structure | Used for |
| --- | --- | --- | --- |
| Spark Core | SparkContext | Resilient Distributed Dataset (RDD) | Batch data |
| Spark SQL | SparkSession | Spark SQL table (DataFrame) | Batch data |
| Spark Streaming | StreamingContext | DStream | Live / real-time data |
| MLlib | Machine Learning | RDD-based API | Data Science |
| GraphX | Property Graph | Extends the Spark RDD with a Resilient Distributed Property Graph | Graph-parallel operations |
A) Spark Core:
Spark Core is the heart of Spark and provides its core functionality: it holds the components for task scheduling, fault tolerance (resilience), interacting with storage systems, and memory management.
As the table above shows, the RDD is the data structure of Spark Core: it reads data from disk and distributes it across the memory of the data nodes. An RDD is immutable.
In the RDD sketch below: val declares an immutable variable; sc is the SparkContext that the shell provides; pwd gives the file's home path in Cloudera; String is the element type of the RDD; and foreach, a higher-order Spark function, prints the data in the RDD.
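A minimal spark-shell sketch of those points. The file path is an assumption based on the Cloudera home directory used later in this post:

```scala
// spark-shell creates sc (the SparkContext) for you.
// The path below is an assumption: on the Cloudera VM, `pwd` in the
// home directory typically prints /home/cloudera.
val marks = sc.textFile("file:///home/cloudera/mark.csv") // RDD[String]

// `val` makes the reference immutable, and the RDD itself is immutable.
// foreach is a higher-order function: it takes println as an argument
// and applies it to every line of the RDD.
marks.foreach(println)
```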
Operations on an RDD are called 'lazy' because transformations are only evaluated when an action is triggered.
Transformations transform the data (e.g. map, filter, flatMap, reduceByKey).
Actions start evaluating the transformations (e.g. foreach, saveAsTextFile, count).
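A short sketch of lazy evaluation, continuing the same shell session (the `marks` RDD comes from the snippet above):

```scala
// Transformations are lazy: these two lines only build a lineage graph
// and execute nothing.
val fields   = marks.map(line => line.split(","))   // transformation
val nonEmpty = fields.filter(_.nonEmpty)            // transformation

// count is an action, so it triggers evaluation of the whole chain.
println(nonEmpty.count())
```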
Go to spark-shell (Scala) in Cloudera:
We can read a CSV file with the RDD textFile command. Pick a CSV file from the existing data in Cloudera using the ls -lrt command; here it is mark.csv. Type pwd to find the home path in Cloudera. Use foreach to print the data.
We can also use collect() to view the data, similar to foreach, but collect() returns the data as an Array on the driver. collect() should not be used for huge volumes of data, since it pulls the entire data set onto a single machine. A sketch of this session follows.
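A hedged sketch of the foreach vs. collect() distinction, again continuing the same session:

```scala
// collect() ships the entire RDD back to the driver as an Array,
// so use it only on small data sets.
val asArray: Array[String] = marks.collect()
asArray.foreach(println)

// On large data, take(n) retrieves just a few rows instead.
marks.take(5).foreach(println)
```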
B) Spark SQL:
Spark SQL is a module for structured data processing in Spark that integrates relational processing with Spark's functional programming API. It supports querying data either via SQL or via the Hive Query Language (HiveQL).
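A minimal sketch of the DataFrame and SQL entry point, assuming Spark 2.x and the same mark.csv path as above (the header option assumes the file's first row names the columns):

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell 2.x a SparkSession named `spark` already exists;
// the builder call is only needed in a standalone application.
val spark = SparkSession.builder().appName("SparkSQLExample").getOrCreate()

// Read the CSV as a DataFrame rather than a raw RDD[String].
val df = spark.read.option("header", "true").csv("file:///home/cloudera/mark.csv")

// Query it with SQL by registering a temporary view.
df.createOrReplaceTempView("marks")
spark.sql("SELECT * FROM marks LIMIT 10").show()
```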
C) Spark Streaming:
The Spark Streaming component is a useful addition to the core Spark API. It is used to process real-time streaming data and enables high-throughput, fault-tolerant processing of live data streams.
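A sketch of a DStream word count, with the host and port as assumptions for a local test (feed the socket with `nc -lk 9999` in another terminal):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// StreamingContext is the entry point; a DStream is a sequence of
// small RDDs, one per 5-second micro-batch.
val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

// Read lines from a TCP socket and count words per micro-batch.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```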
D) Spark MLlib:
MLlib stands for Machine Learning Library. Spark MLlib provides various machine learning algorithms such as classification, regression, clustering, and collaborative filtering. It also provides tools such as pipelines and model persistence, plus utilities for linear algebra, statistics, and data handling.
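A small clustering sketch with MLlib's DataFrame-based API; the data set is made up purely for illustration, and `spark` is the SparkSession from the Spark SQL example:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// A tiny in-memory data set with two obvious clusters.
val points = spark.createDataFrame(Seq(
  (0.0, 0.1), (0.2, 0.0), (9.0, 9.2), (9.1, 8.8)
)).toDF("x", "y")

// MLlib estimators expect the features in a single vector column.
val assembled = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(points)

// Fit a k-means model with two clusters and show the assignments.
val model = new KMeans().setK(2).setSeed(1L).fit(assembled)
model.transform(assembled).select("x", "y", "prediction").show()
```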
E) Spark GraphX:
GraphX is the Spark API for graphs and graph-parallel
computation. It extends the Spark RDD abstraction by introducing the Resilient
Distributed Property Graph, a directed multigraph with properties attached to
each vertex and edge.
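A minimal property-graph sketch; the vertex and edge data are made up for illustration:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices carry a property (a name); edges carry a relationship label.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

// Build the property graph and run a graph-parallel operation.
val graph = Graph(vertices, edges)
println(graph.numEdges)                    // 2
graph.inDegrees.collect().foreach(println) // (2,1), (3,1)
```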