SK DATA SHARE: BIG DATA Introduction (Class -1)

Big Data refers to technologies for handling huge amounts of data, which come in 3 types: Structured, Semi-structured & Unstructured.

Data is processed into information, and its size is measured in Kilobytes (KB) / Megabytes (MB) / Gigabytes (GB) / Terabytes (TB) / Petabytes (PB) / Exabytes (EB) / Zettabytes (ZB) / Yottabytes (YB).
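To get a feel for how these units relate, here is a minimal sketch that formats a raw byte count in the largest fitting unit. It assumes decimal (SI) units, where each unit is 1000 times the previous one; the function name `human_readable` is only illustrative.

```python
# Decimal (SI) byte units, each 1000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes):
    """Format a byte count using the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last known unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1000

print(human_readable(1_500))                # 1.50 KB
print(human_readable(2_500_000_000_000))    # 2.50 TB
```

Some tools use binary units instead (1 KiB = 1024 bytes), so the same file can show slightly different sizes in different programs.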

Estimates of how much data is generated every day vary widely by source, but they are commonly quoted in exabytes or more.

Work in Data Science is usually divided among three roles:

A) Data Engineers

B) Data Analysts

C) Data Scientists

A Data Engineer should know:

a) Data Ingestion Tools (Kafka / Spark / Talend / Sqoop / SOAP Services)

b) Data Warehouse Framework (Hive)

c) Data Processing Framework (Spark)

d) NoSQL DB (HBase / Cassandra / MongoDB / DynamoDB / Cosmos DB)

e) SQL

f) Cloud Platforms (AWS / Azure / GCP)

g) Distributed Storage (HDFS)

Historically, we have stored data in various ways:

1960s - Papers / Notepads

1970s - DBMS (Database Management System). Excel is sometimes given as an informal example, although it is a spreadsheet application rather than a DBMS and first appeared in 1985.

1980s - RDBMS (Relational Database Management System) [Server Box], later used to analyze data with SQL. A traditional single-server RDBMS only partitions data logically; it does not physically divide data across machines.

1990s - Data Warehouse [group of RDBMS servers] for storing bulk information.

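1995 - DFS [Distributed File System]: data is stored across multiple machines instead of a single server. Distributed file systems predate this; Mr. Doug Cutting's contribution came later, with the Nutch Distributed File System (2004) and Hadoop (2006).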

2003 - Google published a paper describing GFS [Google File System], a scalable DFS.

2004 - MapReduce [MR] popularized by Jeffrey Dean and Sanjay Ghemawat of Google.

MR is a generic programming model for processing large data sets with a parallel and distributed algorithm on a cluster.

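2005 - Mr. Doug Cutting and Mike Cafarella implemented MapReduce in the Nutch project, which became Hadoop in 2006. Data Locality, which means moving the computation to the nodes where the data is stored, is the heart of MapReduce.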
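The MapReduce model can be illustrated with the classic word count. The following is a minimal single-process sketch, not how Hadoop itself is implemented: on a real cluster the map and reduce steps run in parallel on many nodes, and the function names here are only illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Each line could be mapped on a different node; here it is a simple loop.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

lines = ["big data is big", "data is everywhere"]
counts = word_count(lines)
print(counts)
```

Because every key is reduced independently of the others, the reduce step can be spread across machines just like the map step.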

DATA TYPES	EXAMPLES
Structured	Tables [Rows & Columns]
Semi-Structured	JSON, XML, CSV [No fixed Schema]
Unstructured	Twitter logs, WhatsApp chats, Audio files, Video files
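The difference between the three types is easier to see with small samples. This is a minimal sketch using made-up data; the names, cities, and message are hypothetical.

```python
import json

# Structured: a fixed set of columns, and every row fills all of them.
columns = ("id", "name", "city")
table = [(1, "Asha", "Pune"), (2, "Ravi", "Delhi")]

# Semi-structured: each record carries its own field names,
# and two records do not have to share the same fields.
json_text = (
    '[{"id": 1, "name": "Asha", "tags": ["new"]},'
    ' {"id": 2, "name": "Ravi", "address": {"city": "Delhi"}}]'
)
records = json.loads(json_text)

# Unstructured: raw content with no fields at all, e.g. a chat message.
chat_message = "hey, are we still meeting at 5?"

print(all(len(row) == len(columns) for row in table))  # every row has every column
print(sorted(records[0]), sorted(records[1]))          # field names differ per record
print(len(chat_message.split()))                       # only generic text operations apply
```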

A company's input data is stored in the Cloud, an RDBMS server, etc. This is called the "Source Layer".

SQL (Structured Query Language) is used to process the data in an RDBMS server [Structured Table Data].

DATA STORAGE	SOURCE LAYER
MySQL RDBMS Server	Company website sign-up registrations
CSV file	Telephone records
JSON	Facebook social media
XML	Twitter

An RDBMS works on Process Locality: the data is brought to the server where the processing runs. It is server-based storage and works well for small amounts of data.

An RDBMS server is used only for structured data. It is used for OLTP [Online Transaction Processing], which allows data to be inserted, updated & deleted [DML operations].
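The DML operations mentioned above can be tried without installing a database server. This is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for an RDBMS server such as MySQL; the `signup` table and its rows are hypothetical.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signup (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# INSERT: add new rows.
conn.executemany(
    "INSERT INTO signup (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi")],
)

# UPDATE: change an existing row.
conn.execute("UPDATE signup SET city = ? WHERE id = ?", ("Mumbai", 1))

# DELETE: remove a row.
conn.execute("DELETE FROM signup WHERE id = ?", (2,))
conn.commit()

# SELECT: read back what is left.
remaining = conn.execute("SELECT id, name, city FROM signup ORDER BY id").fetchall()
print(remaining)  # [(1, 'Asha', 'Mumbai')]
conn.close()
```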

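Hadoop can store structured, semi-structured and unstructured data, because HDFS accepts files of any format. In practice, tools on Hadoop such as Hive are mostly used for structured and semi-structured data, while unstructured data (text, audio, images) is often processed with Spark and its machine learning library (MLlib).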

Frameworks which follow, or run on top of, a Master Slave Architecture for distributing data are:

Hadoop

Hive

Spark

Pig

Data in a Data Warehouse is of 3 types, based on how often it is used:

a) Hot Data (frequently used)

b) Warm Data (less frequently used)

c) Cold Data (historical data)
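The hot / warm / cold split can be sketched as a simple rule on how recently a record was accessed. The thresholds below (30 days and 365 days) are assumptions chosen for illustration only; real systems pick them per workload and storage cost.

```python
from datetime import date

# Hypothetical thresholds, not taken from any specific product.
HOT_MAX_DAYS = 30
WARM_MAX_DAYS = 365

def storage_tier(last_accessed, today):
    """Classify data as hot, warm or cold by days since the last access."""
    age_days = (today - last_accessed).days
    if age_days <= HOT_MAX_DAYS:
        return "hot"
    if age_days <= WARM_MAX_DAYS:
        return "warm"
    return "cold"

today = date(2022, 2, 5)
print(storage_tier(date(2022, 1, 20), today))  # hot
print(storage_tier(date(2021, 8, 1), today))   # warm
print(storage_tier(date(2018, 3, 15), today))  # cold
```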

3 Phases of Big Data:

a) Ingestion Phase [Source Layer, Sqoop]

b) Enrichment Phase [Hadoop]

c) Extraction Phase [HBase]

3 Layers for Big Data process:

a) Source Layer

b) Landing Layer

c) Presentation Layer

The Landing Layer is typically cloud storage such as AWS S3 or Azure ADLS.

There are 5 Vs of Big Data:

a) Volume

b) Velocity

c) Variety

d) Veracity

e) Value

[Figure: Simple Architecture of Big Data]


Saturday, February 5, 2022
