SK DATA SHARE


Saturday, April 30, 2022

Spark Core RDD Operations (Class -41)

Resilient Distributed Datasets (RDDs) are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system) or an existing Scala collection in the driver program, and transforming it.

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. 

Apache Spark RDDs support two types of operations (a minimal sketch of both follows this list):

·         Transformations – lazily build a new RDD from an existing one (e.g. map, filter); nothing is computed until an action runs

·         Actions – trigger the actual computation and return a value to the driver (e.g. collect, reduce, count)
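Here is a small, self-contained Scala sketch of transformations versus actions, assuming a local SparkContext; the object name RddOpsDemo and the sample numbers are illustrative, not from the original post:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddOpsDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-ops").setMaster("local[*]"))

        // Create an RDD from an existing Scala collection in the driver program.
        val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

        // Transformations are lazy: map and filter only describe new RDDs.
        val squares = nums.map(n => n * n)
        val evens = squares.filter(_ % 2 == 0)

        // Actions run the computation and return results to the driver.
        println(evens.collect().mkString(", "))  // 4, 16
        println(squares.reduce(_ + _))           // 55

        sc.stop()
      }
    }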

 


Friday, April 29, 2022

Spark Ecosystem (Class -40)

Apache Spark is an open-source analytical framework for large-scale distributed data processing and machine-learning applications. Spark has been a top-level Apache project since February 2014. It is a general-purpose, in-memory, fault-tolerant, distributed processing engine that can process data up to 100x faster than traditional MapReduce-based systems.

Using Spark we can process data from Hadoop HDFS, AWS S3, Databricks DBFS, Azure Blob Storage, and many other file systems. Spark is also used to process real-time data with Spark Streaming and Kafka.


The Spark ecosystem consists of 5 tightly integrated components (a brief usage sketch follows this list):

·         Spark Core

·         Spark SQL

·         Spark Streaming

·         MLlib

·         GraphX
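As a minimal sketch of how two of these components fit together, Spark Core provides the execution engine while Spark SQL provides the query layer; the object name EcosystemDemo and the sample data below are assumptions for illustration:

    import org.apache.spark.sql.SparkSession

    object EcosystemDemo {
      def main(args: Array[String]): Unit = {
        // SparkSession is the entry point to Spark SQL, which runs on Spark Core.
        val spark = SparkSession.builder().appName("ecosystem-demo").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical sample data: (category, amount).
        val sales = Seq(("books", 120.0), ("games", 80.0), ("books", 45.5)).toDF("category", "amount")

        // Register the DataFrame as a view and query it with Spark SQL.
        sales.createOrReplaceTempView("sales")
        spark.sql("SELECT category, SUM(amount) AS total FROM sales GROUP BY category").show()

        spark.stop()
      }
    }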

Sunday, April 24, 2022

SPARK Introduction (Class -39)

Spark, introduced by the Apache Software Foundation, is a lightning-fast cluster-computing technology designed to speed up Hadoop's computational processes.

Spark is not a modified version of Hadoop, and it is not entirely dependent on Hadoop because it has its own cluster management. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Spark is not part of the Hadoop ecosystem as such, since it can run on its own. When Spark does use Hadoop, it does so in two ways: for storage (HDFS) and for processing (MapReduce).

Hadoop frameworks are known for analyzing datasets with a simple programming model (MapReduce); the main concern is maintaining speed when processing large datasets, both the waiting time between queries and the waiting time to run a program.


Earlier Hadoop Versions:

Hadoop 1.0 was introduced in 2006 and used up to 2012, when Hadoop 2.0 (YARN) came into the picture.

The main drawbacks of Hadoop 1.0 are:

a)    Single point of failure (the NameNode)

b)    Block size

c)    Relying on MapReduce (MR) for both resource management and the processing engine

In 2008, Cloudera became a commercial Hadoop vendor, offering both open-source and enterprise versions.

Spark was developed in 2009 at UC Berkeley's AMPLab by Matei Zaharia, originally as a workload for testing the Mesos cluster resource manager rather than for processing data.

Spark was open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark has been a top-level Apache project since February 2014.

Saturday, April 23, 2022

Aggregate Functions in SQL (Part -7)

SQL aggregate functions perform calculations on multiple rows of a single column of a table and return a single value. They are also used to summarize data.

We often use aggregate functions with the GROUP BY and HAVING clauses of the SELECT statement; note that an aggregate cannot appear directly in a WHERE clause, which filters rows before aggregation, whereas HAVING filters on the aggregated result. A combined example of all five functions follows the descriptions below.


1)     SUM:

The SUM function calculates the total of all values in a selected column. It works on numeric fields only.


2)  COUNT:

The COUNT function counts the number of rows in a database table. It works on both numeric and non-numeric data types; COUNT(column) ignores NULL values, while COUNT(*) counts every row.


3) MAX:

The MAX function finds the maximum value of a certain column, returning the largest of all selected values.


4) MIN:

The MIN function finds the minimum value of a certain column, returning the smallest of all selected values.


5)     AVG:

The AVG function calculates the average value of a numeric column, returning the average of all non-NULL values.
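A minimal Scala sketch exercising all five aggregates through Spark SQL; the employees view, its columns, and the sample rows are hypothetical, chosen only for illustration:

    import org.apache.spark.sql.SparkSession

    object AggregateDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("aggregate-demo").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical sample data: (name, dept, salary).
        Seq(("Ana", "IT", 5000.0), ("Ben", "IT", 7000.0), ("Carl", "HR", 4000.0))
          .toDF("name", "dept", "salary")
          .createOrReplaceTempView("employees")

        // All five aggregates in one query; HAVING filters on an aggregated value.
        spark.sql("""
          SELECT dept,
                 SUM(salary) AS total,
                 COUNT(*)    AS headcount,
                 MAX(salary) AS highest,
                 MIN(salary) AS lowest,
                 AVG(salary) AS average
          FROM employees
          GROUP BY dept
          HAVING COUNT(*) >= 1
        """).show()

        spark.stop()
      }
    }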


Thursday, April 21, 2022

Operators in SQL (Part -6)

SQL operators are special words or characters used to perform specific tasks, both mathematical and logical computations, on operands, typically within the WHERE clause of a SQL query or statement.

There are six types of SQL operators that we are going to cover: Arithmetic, Bitwise, Comparison, Compound, Logical and String.

Every database administrator and user relies on SQL queries to manipulate and access the data in database tables and views, using these reserved words and characters for arithmetic, logical, comparison, and compound operations (a short worked example follows the table).

SQL Operators    Description

Arithmetic       Add (+), Subtract (-), Multiply (*), Divide (/), Modulo (%)

Bitwise          AND (&), OR (|), Exclusive OR (^)

Comparison       Equal to (=), Greater than (>), Less than (<), Greater than or equal to (>=), Less than or equal to (<=), Not equal to (<>)

Compound         Add equals (+=), Subtract equals (-=), Multiply equals (*=), Divide equals (/=), Modulo equals (%=), Bitwise AND equals (&=), Bitwise exclusive OR equals (^=), Bitwise OR equals (|=)
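A minimal Scala sketch using arithmetic, comparison, and logical operators in a WHERE clause via Spark SQL; the orders view, its columns, and the object name OperatorDemo are assumptions for illustration (the compound assignment operators in the table come from dialects such as T-SQL and are not demonstrated here):

    import org.apache.spark.sql.SparkSession

    object OperatorDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("operator-demo").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical sample data: (id, qty).
        Seq((1, 10), (2, 15), (3, 8)).toDF("id", "qty").createOrReplaceTempView("orders")

        // Arithmetic (*, %), comparison (>=, <>), and a logical AND in the WHERE clause.
        spark.sql("""
          SELECT id, qty, qty * 2 AS doubled, qty % 2 AS remainder
          FROM orders
          WHERE qty >= 10 AND id <> 3
        """).show()

        spark.stop()
      }
    }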