SK DATA SHARE


Sunday, May 1, 2022

Temp Tables in SQL (Part -8)

As the name indicates, temporary tables store data temporarily, and they support the full set of CRUD operations (Create, Read, Update, and Delete).

A temp table no longer exists once the application (or session) that created it has closed.

Temporary tables are dropped when the session that created them closes, or they can be explicitly dropped by users. At the same time, temporary tables can act like physical tables in many ways, which gives us more flexibility: for example, we can create constraints, indexes, or statistics on them. SQL Server provides two types of temporary tables according to their scope (a short sketch follows the list):

  • Local Temporary Table
  • Global Temporary Table
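
A minimal T-SQL sketch of both kinds; the table and column names here are illustrative:

-- Local temp table (# prefix): visible only to the session that creates it,
-- dropped automatically when that session closes.
CREATE TABLE #Customers (Id INT PRIMARY KEY, Name VARCHAR(50));
INSERT INTO #Customers (Id, Name) VALUES (1, 'Asha'), (2, 'Ravi');
SELECT * FROM #Customers;

-- Global temp table (## prefix): visible to all sessions, dropped when the
-- creating session closes and no other session is still referencing it.
CREATE TABLE ##DailyTotals (SaleDate DATE, Total DECIMAL(10, 2));

-- Both kinds can also be dropped explicitly, just like physical tables.
DROP TABLE #Customers;
DROP TABLE ##DailyTotals;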

Saturday, April 30, 2022

Spark Core RDD Operations (Class -41)

Resilient Distributed Datasets (RDDs) are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or with an existing Scala collection in the driver program, and transforming it.

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. 

Apache Spark RDDs support two types of operations (a short sketch follows the list):

  • Transformations
  • Actions
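
Transformations (such as map and filter) lazily define a new RDD from an existing one, while actions (such as collect and count) trigger the computation and return a result to the driver program. Below is a minimal sketch, assuming the interactive spark-shell where sc (the SparkContext) is already defined:

// Build an RDD from an existing Scala collection in the driver program.
val nums = sc.parallelize(1 to 10)

val evens   = nums.filter(_ % 2 == 0)  // transformation: lazy, returns a new RDD
val doubled = evens.map(_ * 2)         // transformation: still nothing has executed

val result = doubled.collect()         // action: triggers the actual computation
println(result.mkString(", "))         // prints: 4, 8, 12, 16, 20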

 


Friday, April 29, 2022

Spark Ecosystem (Class -40)

Apache Spark is an open-source analytics framework for large-scale distributed data processing and machine learning applications. It has been a top-level Apache project since February 2014. Spark is a general-purpose, in-memory, fault-tolerant, distributed processing engine that can process data up to 100x faster than traditional disk-based systems.

Using Spark, we can process data from Hadoop HDFS, AWS S3, Databricks DBFS, Azure Blob Storage, and many other file systems. Spark is also used to process real-time data with Spark Streaming and Kafka.


The Spark ecosystem consists of five tightly integrated components (a usage sketch follows the list):

  • Spark Core
  • Spark SQL
  • Spark Streaming
  • MLlib
  • GraphX
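
As a small illustration of how the layers fit together, here is a sketch that touches the two components most applications start with, Spark Core and Spark SQL. It assumes a build with the spark-sql dependency on the classpath; all names are illustrative:

import org.apache.spark.sql.SparkSession

object EcosystemSketch {
  def main(args: Array[String]): Unit = {
    // The session wraps a SparkContext, the entry point to Spark Core.
    val spark = SparkSession.builder()
      .appName("EcosystemSketch")
      .master("local[*]")  // run locally just for this sketch
      .getOrCreate()
    import spark.implicits._

    // Spark Core: the low-level RDD API.
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))

    // Spark SQL: the DataFrame API layered on top of Core.
    val df = rdd.toDF("n")
    df.selectExpr("sum(n) AS total").show()

    spark.stop()
  }
}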

Sunday, April 24, 2022

SPARK Introduction (Class -39)

Spark, introduced by the Apache Software Foundation, is a lightning-fast cluster computing technology designed to speed up Hadoop's computational process.

Spark is not a modified version of Hadoop, and it is not entirely dependent on Hadoop because it has its own cluster management. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Spark is not merely part of the Hadoop ecosystem, as it can run on its own. When Spark does use Hadoop, it does so in two ways: one for storage (HDFS) and one for processing (MapReduce).

The Hadoop framework is known for analyzing datasets with a simple programming model (MapReduce); its main concern is maintaining speed when processing large datasets, in terms of both the waiting time between queries and the waiting time to run a program.


Earlier Hadoop Versions:

Hadoop 1.0 was introduced in 2006 and used until 2012, when Hadoop 2.0 (YARN) came into the picture.

The main drawbacks of Hadoop 1.0 were:

a) Single point of failure (the NameNode)

b) Block size

c) Relying on MapReduce (MR) as both the resource manager and the processing engine

In 2008, Cloudera launched as a commercial Hadoop vendor, offering both open-source and enterprise distributions.

Spark began as one of Hadoop's subprojects, developed in 2009 at UC Berkeley's AMPLab by Matei Zaharia, originally as a workload to test the Mesos cluster resource manager rather than as a data-processing engine.

Spark was open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and it became a top-level Apache project in February 2014.

Saturday, April 23, 2022

Aggregate Functions in SQL (Part -7)

A SQL aggregate function performs a calculation over multiple rows of a single column of a table and returns a single value. It is also used to summarize data.

We often use aggregate functions with the GROUP BY and HAVING clauses of the SELECT statement; note that an aggregate cannot appear directly in a WHERE clause, which filters rows before grouping. A combined example follows the function list.


1) SUM:

The SUM function calculates the total of all values in the selected column. It works on numeric fields only.


2) COUNT:

The COUNT function counts the number of rows in a table. It works on both numeric and non-numeric data types; COUNT(*) counts every row, while COUNT(column) ignores NULL values.


3) MAX:

The MAX function finds the maximum value of a column, i.e., the largest of all selected values.


4) MIN:

The MIN function finds the minimum value of a column, i.e., the smallest of all selected values.


5) AVG:

The AVG function calculates the average value of a numeric column. It returns the average of all non-NULL values.
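
A minimal T-SQL sketch tying all five functions together with WHERE, GROUP BY, and HAVING; the Sales table and its columns are hypothetical:

-- Hypothetical table: Sales(Region VARCHAR(20), Amount DECIMAL(10, 2))
SELECT Region,
       SUM(Amount) AS TotalAmount,   -- total per region
       COUNT(*)    AS NumSales,      -- number of rows per region
       MAX(Amount) AS LargestSale,
       MIN(Amount) AS SmallestSale,
       AVG(Amount) AS AvgSale        -- average of non-NULL amounts
FROM Sales
WHERE Amount > 0                     -- row filter, applied before grouping
GROUP BY Region
HAVING SUM(Amount) > 1000;           -- group filter, applied after aggregation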