
Saturday, February 5, 2022

BIG DATA Introduction (Class -1)

Big Data is a set of technologies for handling huge amounts of data, which come in 3 types: Structured, Semi-structured & Unstructured.

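Data is raw input that is processed into information. Its size can be measured in Kilobytes (KB) / Megabytes (MB) / Gigabytes (GB) / Terabytes (TB) / Petabytes (PB) / Exabytes (EB) / Zettabytes (ZB) / Yottabytes (YB).

A commonly cited estimate puts the data generated worldwide at roughly 2.5 quintillion bytes (about 2.5 EB) every day, and the figure keeps growing.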
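
To make the size units concrete, here is a minimal sketch that converts a raw byte count into the largest convenient unit. It assumes the decimal (SI) convention, where each unit is 1000 times the previous one, and the standard prefixes up to yottabyte (YB); the function name `human_readable` is made up for this example.

```python
# Decimal (SI) units: each step up is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the last unit we know about.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1000

print(human_readable(1_500))        # 1.50 KB
print(human_readable(5 * 10**12))   # 5.00 TB
```

Some tools use binary units instead (1 KiB = 1024 bytes), so the same byte count can show up with slightly different numbers.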

Data Science work is usually divided among:

A) Data Engineers

B) Data Analysts

C) Data Scientists

A Data Engineer should know:

   a) Data Ingestion Tools (KAFKA / Spark / TALEND / SQOOP / SOAP Services)
   b) Data Warehouse Framework (Hive)
   c) Data Processing Framework (Spark)
   d) NoSQL DB (HBase / Cassandra / MongoDB / DynamoDB / Cosmos DB)
   e) SQL
   f) Cloud Platforms (AWS, AZURE, GCP)
   g) Distributed Storage (HDFS)


Historically, we have stored data in various ways:

1960s - Papers / Notepads

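1970s - DBMS (Database Management System). Spreadsheets such as EXCEL are often given as a simple example of tabular storage, although Excel itself appeared in 1985 and is not a true DBMS.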

1980s - RDBMS (Relational Database Management System) [Server Box], which was later used for analysing data with SQL. On a single RDBMS server the data cannot be physically divided across machines; only logical partitioning is possible.

1990s - Data Warehouse [group of RDBMS servers] for storing bulk information.

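1995 - DFS [Distributed File System], which stores data across several machines rather than on one server, was in use well before Hadoop. Mr. Doug Cutting later built on this idea when he co-created Hadoop.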

2003 - Google published its paper on GFS [Google File System], a scalable DFS.

2004 - MapReduce [MR] was introduced in a paper by Jeffrey Dean and Sanjay Ghemawat of Google.
           MR is a generic programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

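2005 - Data Locality, the heart of MapReduce (move the computation to the node that holds the data), was implemented by Mr. Doug Cutting and Mike Cafarella in the Nutch project, which grew into Hadoop in 2006.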
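
The MapReduce model from the 2004 entry above can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce steps on one machine, not how Hadoop implements them; the function names and the sample sentences are made up for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # On a real cluster, different lines would be mapped on different nodes.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data is everywhere"])
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each map call looks at only one line and each reduce call at only one key, both steps can run in parallel on many machines.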

DATA TYPES          EXAMPLES

Structured          Tables [Rows & Columns]

Semi-Structured     JSON, XML, CSV [No Schema]

Unstructured        Twitter logs, WhatsApp chats, Audio files, Video files
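
As a rough illustration of the semi-structured formats listed above, the sketch below reads the same record from CSV, JSON and XML using only the Python standard library. The record (`id`, `name`, `tags`) is invented sample data.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same (made-up) user record in three semi-structured formats.
csv_text = "id,name\n1,Asha\n"
json_text = '{"id": 1, "name": "Asha", "tags": ["new"]}'
xml_text = "<user><id>1</id><name>Asha</name></user>"

# CSV: the header row names the fields; every value arrives as a string.
csv_row = next(csv.DictReader(io.StringIO(csv_text)))

# JSON: carries basic types (numbers, lists) and allows nesting.
json_row = json.loads(json_text)

# XML: fields are nested tags; values are text until we convert them.
xml_row = {child.tag: child.text for child in ET.fromstring(xml_text)}

print(csv_row["name"], json_row["name"], xml_row["name"])  # Asha Asha Asha
```

None of the three files enforces a schema: a second record could add or drop fields, which is what separates them from a structured table.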

A company's input data is stored in the cloud, on an RDBMS server, etc. This is called the "Source Layer".

We use SQL (Structured Query Language) to process data on an RDBMS server [Structured Table Data].

DATA STORAGE

SOURCE LAYER

MySQL    RDBMS Server

Official Website Company Signup Registration

CSV file

Telephone File

JSON

Facebook social media

XML

Twitter

RDBMS works on Process Locality: the data is brought to the server that processes it. This is server-based storage and works well only for small amounts of data.
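
The difference between process locality and the data locality mentioned in the timeline can be sketched as follows. The three "nodes" and their numbers are hypothetical, and counting values stands in for measuring network traffic.

```python
# Hypothetical cluster: node name -> numbers stored on that node.
blocks = {
    "node1": [4, 8, 15],
    "node2": [16, 23],
    "node3": [42],
}

def process_locality(blocks):
    """Move every value to one server, then compute the total there."""
    moved = [x for data in blocks.values() for x in data]
    return sum(moved), len(moved)        # (result, values sent over the network)

def data_locality(blocks):
    """Compute a partial sum on each node, then move only the partial sums."""
    partials = [sum(data) for data in blocks.values()]
    return sum(partials), len(partials)  # (result, values sent over the network)

print(process_locality(blocks))  # (108, 6)
print(data_locality(blocks))     # (108, 3)
```

Both approaches give the same answer, but with data locality only one small value per node crosses the network, which matters more and more as the data grows.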

An RDBMS server is used only for structured data. It is used for OLTP [Online Transaction Processing], which allows data to be inserted, updated and deleted [DML operations].
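
The three DML operations can be tried out with Python's built-in `sqlite3` module. SQLite is used here only as a stand-in for an RDBMS server such as MySQL, and the `signup` table and its rows are made up for the example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signup (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# INSERT: add new rows.
conn.executemany(
    "INSERT INTO signup (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi")],
)

# UPDATE: change an existing row.
conn.execute("UPDATE signup SET city = ? WHERE id = ?", ("Mumbai", 1))

# DELETE: remove a row.
conn.execute("DELETE FROM signup WHERE id = ?", (2,))
conn.commit()

rows = conn.execute("SELECT id, name, city FROM signup ORDER BY id").fetchall()
print(rows)  # [(1, 'Asha', 'Mumbai')]
conn.close()
```

The `?` placeholders let the database driver handle quoting, which is safer than building SQL strings by hand.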

Hadoop is a framework that can store and process Structured, Semi-structured and Unstructured data. Spark, including its ML library (MLlib), can also work with all three types and is often used for Unstructured data.

Frameworks which follow a Master-Slave Architecture for distributing data are:

     HADOOP

     HIVE

     SPARK

     PIG

Data in a Data Warehouse is of 3 types:

a) Hot Data (frequently used)

b) Warm Data (less frequently used)

c) Cold Data (historical data)
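
One simple way to sort data into these three groups is by how long ago it was last accessed. The sketch below does that; the 30-day and 365-day cut-offs are assumptions chosen for illustration, not a standard, and the function name `temperature` is made up.

```python
from datetime import date

# Hypothetical cut-offs: real systems choose their own thresholds.
HOT_MAX_DAYS = 30
WARM_MAX_DAYS = 365

def temperature(last_accessed, today):
    """Classify data as hot, warm or cold from the age of its last access."""
    age_days = (today - last_accessed).days
    if age_days <= HOT_MAX_DAYS:
        return "hot"
    if age_days <= WARM_MAX_DAYS:
        return "warm"
    return "cold"

today = date(2022, 2, 5)
print(temperature(date(2022, 1, 20), today))  # hot
print(temperature(date(2021, 6, 1), today))   # warm
print(temperature(date(2015, 3, 9), today))   # cold
```

In practice, hot data is usually kept on faster, more expensive storage and cold data on cheaper archival storage.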


3 Phases of Big Data:

a) Ingestion Phase [Source layer, SQOOP]

b) Enrichment Phase [HADOOP]

c) Extraction Phase [HBase]
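
The three phases can be pictured as three functions chained together. This sketch uses plain Python lists in place of SQOOP, HADOOP and HBase, so it shows only the flow of data; the records, field names and the cleaning rules are invented for the example.

```python
def ingest(source_rows):
    """Ingestion: copy raw records from the source layer without changing them."""
    return list(source_rows)

def enrich(raw_rows):
    """Enrichment: drop incomplete records, clean fields and derive new ones."""
    enriched = []
    for row in raw_rows:
        if row.get("amount") is None:   # skip records with no amount
            continue
        enriched.append({
            **row,
            "city": row["city"].strip().title(),   # normalise spelling
            "is_large": row["amount"] >= 1000,     # derived field
        })
    return enriched

def extract(enriched_rows, city):
    """Extraction: serve only the records a consumer asks for."""
    return [row for row in enriched_rows if row["city"] == city]

source = [
    {"id": 1, "city": " pune ", "amount": 1200},
    {"id": 2, "city": "delhi", "amount": None},
    {"id": 3, "city": "PUNE", "amount": 300},
]
result = extract(enrich(ingest(source)), "Pune")
print([row["id"] for row in result])  # [1, 3]
```

Keeping ingestion free of any changes means the raw data can always be re-processed if the enrichment rules change later.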

3 Layers for Big Data process:

a) Source Layer

b) Landing Layer

c) Presentation Layer


The Landing Layer is cloud storage such as AWS S3 or Azure ADLS.

There are 5 Vs of Big Data:

a) Volume

b) Velocity

c) Variety

d) Veracity

e) Value


Simple Architecture of Big Data: 



