Spark, now maintained by the Apache Software Foundation, is a lightning-fast cluster computing technology, designed to speed up the kinds of computations traditionally handled by Hadoop.
Spark is not a modified version of Hadoop, and it is not entirely dependent on
Hadoop, because it has its own cluster management. The
main feature of Spark is its in-memory cluster computing, which
increases the processing speed of an application.
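As a minimal sketch of what in-memory computing looks like in practice (the file path and local master setting below are illustrative assumptions, not part of the original text), the Scala snippet caches a filtered RDD so that the second action is served from executor memory instead of re-reading the file:

import org.apache.spark.sql.SparkSession

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CacheSketch")
      .master("local[*]")   // assumption: local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input file; any large text file will do.
    val errors = sc.textFile("data/events.log")
      .filter(_.contains("ERROR"))
      .cache()   // keep the filtered records in memory after the first action

    println(errors.count())                                // reads from disk, fills the cache
    println(errors.filter(_.contains("timeout")).count())  // reuses the in-memory data

    spark.stop()
  }
}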
Spark is also not confined to the Hadoop
ecosystem, as it can run on its own. When Spark does use Hadoop, it uses it in two ways –
one is storage and the second is processing (MapReduce).
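To make the storage half concrete, here is a small sketch (the HDFS URIs and namenode host are hypothetical) in which Spark uses Hadoop purely as a storage layer, reading from and writing back to HDFS:

import org.apache.spark.sql.SparkSession

object HdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("HdfsSketch").getOrCreate()

    // Read from HDFS (Hadoop as storage) – host and paths are assumptions.
    val lines = spark.read.textFile("hdfs://namenode:9000/input/logs")

    // Write the non-empty lines back to HDFS.
    lines.filter(_.nonEmpty).write.text("hdfs://namenode:9000/output/nonempty")

    spark.stop()
  }
}

Submitting the same job with spark-submit --master yarn additionally hands scheduling to Hadoop's resource manager, covering the processing side as well.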
The Hadoop framework is
known for analyzing datasets with a simple programming model (MapReduce);
its main concern is maintaining speed when processing large datasets, measured both as the
waiting time between queries and as the time to run a program.
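For readers who have not seen the model, word count is the canonical MapReduce example. The sketch below (the input path is a placeholder) expresses the same map and reduce phases with Spark's RDD API:

import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCountSketch").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///books/sample.txt") // placeholder path
      .flatMap(_.split("\\s+"))   // map phase: split lines into words
      .map(word => (word, 1))     // map phase: emit (word, 1) pairs
      .reduceByKey(_ + _)         // reduce phase: sum the counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}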
Earlier Hadoop Versions:
Hadoop 1.0 was introduced
in 2006 and used up to 2012, when Hadoop 2.0 (YARN) came into the picture.
Main drawbacks of Hadoop 1.0 are
a) Single point of failure (a single NameNode)
b) Block size
c) Relying on MapReduce (MR) for both resource management and the processing engine
In 2008, Cloudera began offering a
commercial distribution of Hadoop, in both open-source and enterprise editions.
Spark began as
one of Hadoop's sub-projects. It was developed in 2009 in UC Berkeley's AMPLab by Matei
Zaharia, originally as a workload for testing the Mesos cluster resource manager rather than for
processing data.
Spark was
open-sourced in 2010 under a BSD license. It was donated to the Apache Software
Foundation in 2013, and Apache Spark has been a top-level Apache project
since February 2014.