At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system) or an existing Scala collection in the driver program, and transforming it.
Apache Spark RDDs support two types of operations:
· Transformations
· Actions
There are three ways to create RDDs:
a) The parallelize method, by which an already existing collection in the driver program can be used.
b) By referencing a dataset present in an external storage system such as HDFS, HBase, etc.
c) By creating a new RDD from an existing RDD.
Applying transformations builds an RDD lineage, also known as the RDD operator graph or RDD dependency graph. It is a logical execution plan, i.e., a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD.
Transformations are lazy in nature, i.e., they are not executed immediately but only when we call an action. The two most basic transformations are map() and filter().
After a transformation, the resulting RDD is always different from its parent RDD. It can be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()) or the same size (e.g. map()).
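A minimal sketch in the Spark shell (the numbers are hypothetical) showing both the laziness and the size change:
scala> val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
scala> val doubled = nums.map(_ * 2)        // same size; nothing executes yet
scala> val bigOnes = doubled.filter(_ > 4)  // smaller; still lazy
scala> bigOnes.collect()                    // action: triggers execution, returns Array(6, 8, 10)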
There are two types of transformations:
· Narrow transformation – all the elements required to compute the records in a single partition live in a single partition of the parent RDD. A limited subset of partitions is used to calculate the result. Narrow transformations are the result of map() and filter().
· Wide transformation – all the elements required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of groupByKey() and reduceByKey(). A sketch contrasting the two follows this list.
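A minimal sketch with hypothetical data:
scala> val words = sc.parallelize(Seq("a", "b", "a", "c"))
scala> val wordPairs = words.map(w => (w, 1))        // narrow: each output partition depends on a single parent partition
scala> val wordCounts = wordPairs.reduceByKey(_ + _) // wide: values for one key may come from many parent partitions (shuffle)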
Example of RDD functions in Spark Core:
Item sales data in a file of 4 fields, as given below:
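The original rows are not reproduced here; a hypothetical sample with four fields (Transaction_ID, Date, Item_ID, Quantity — field names assumed, with Item_ID in the third position as the split below expects) might look like:
1001,2016-01-05,I101,2
1002,2016-01-05,I102,1
1003,2016-01-06,I101,3
1004,2016-01-07,I103,1
1005,2016-01-07,I101,2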
Now create the file (vi runs in the Linux terminal, not the Scala shell):
$ vi item_sales.csv
Insert the above data in the file.
Now create an RDD to read the data from the Local File System (text file), as in the sketch below:
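A minimal sketch, assuming the file sits in the Cloudera home directory (/home/cloudera):
scala> val sales = sc.textFile("file:///home/cloudera/item_sales.csv")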
Here:
· val declares an immutable variable.
· sc is the SparkContext.
· textFile() reads the data.
· The home path is what ‘pwd’ prints in Cloudera.
· item_sales.csv is the file created to store the data.
To check whether the RDD got created or not, give the ‘collect()’ command:
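Continuing the sketch above:
scala> sales.collect()
Note that collect() pulls the entire RDD to the driver, which is fine for a small sample file like this one.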
To know how many times each item has been sold, we need to convert the data into key-value pairs using a key, which is Item_ID in this case. To convert each and every record into a key-value pair, we need to go to the MapReduce step. First, we need to remove the delimiter, which is a comma (,) in this case, to convert the data into key-value pairs, and then go for the MapReduce (MR) function.
Use the split command wherever the comma (,) delimiter appears, as in the sketch below:
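A minimal sketch, continuing from the sales RDD above:
scala> val pairs = sales.map(a => a.split(",")).map(a => (a(2), 1))  // key = Item_ID, value = 1 per sale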
We mention (a(2), 1) in the above program based on the position of Item_ID (indices 0, 1, 2) in the given data.
Now print the data with foreach(println) to check whether it was converted into key-value pairs or not:
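Continuing the sketch:
scala> pairs.foreach(println)
Each record should print as a pair such as (I101,1), using the hypothetical item IDs from the sample above.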
Use the reduceByKey operation to combine the values for each key into a single output:
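A sketch of the reduce step:
scala> val counts = pairs.reduceByKey((x, y) => x + y)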
Here x and y are 2 values for the same key (two counts being added together).
We can also use the ‘sortBy’ HOF (higher-order function):
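A sketch sorting items by their sale counts in descending order (the sort field and ordering are assumptions):
scala> counts.sortBy(_._2, false).foreach(println)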
‘sortBy’, like the reduce step, involves a shuffle across partitions.
Based on shuffle creation, there are 2 kinds of operations:
a) Narrow (shuffle won’t happen) → map, flatMap & filter.
b) Wide (shuffle will happen) → reduceByKey, groupByKey, count.
Now save the file (write the data) in the Local File System (LFS). Use the ‘saveAsTextFile’ command to write the data into the LFS:
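A sketch, with the output directory name assumed:
scala> counts.saveAsTextFile("file:///home/cloudera/sales_output")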
Why is the data saved in only 2 files? Because the default number of partitions of the RDD is 2 here, and saveAsTextFile writes one part file per partition.
Now save the data in HDFS:
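In Cloudera, HDFS is the default file system, so a path without the file:// scheme goes to HDFS (the directory name is assumed):
scala> counts.saveAsTextFile("/user/cloudera/sales_output")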
Now reduce the number of partitions so the output is a single file, using the repartition command:
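A sketch writing a single part file into the b10_rdd directory referenced below:
scala> counts.repartition(1).saveAsTextFile("file:///home/cloudera/b10_rdd")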
Now see the single-partition file in the LFS (b10_rdd).
Repartition into a specific number of files when the data is large:
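A sketch; the partition count of 4 and the directory name are arbitrary examples:
scala> counts.repartition(4).saveAsTextFile("file:///home/cloudera/sales_4parts")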
Repartition will increase or decrease the number of partitions, which creates shuffling.
Now reduce the number of partitions by giving coalesce:
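A sketch, with the output directory name assumed:
scala> counts.coalesce(1).saveAsTextFile("file:///home/cloudera/sales_coalesced")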
Coalesce will decrease the number of partitions and won’t create shuffling: it merges existing partitions instead of performing a full shuffle.
We go for the repartition process for skewed data (uneven data distribution), since its full shuffle rebalances records evenly across partitions.