
Wednesday, March 23, 2022

SerDe properties of Hive (Class -13)

SerDe stands for Serializer and Deserializer. Hive uses SerDe and FileFormat to read and write table rows. The main use of the SerDe interface is for I/O operations.

The process of compressing the data and serializing it into a binary format before it is transferred to HDFS over the network is called “serialization”.

Data transferred through the serialization process finally reaches HDFS in the form of an Avro or Parquet file.

The binary data stored in HDFS is converted back to a human-readable format through the deserialization process.



Hive supports various file formats such as TEXT, CSV, ORC, and Parquet. We can change a table's file format using the ALTER TABLE ... SET FILEFORMAT statement.
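For example, a minimal sketch of switching an existing table to ORC (the table name orders_tab is only illustrative):

ALTER TABLE orders_tab SET FILEFORMAT ORC;

Note that this changes only the table metadata; existing data files are not rewritten.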

To create an Avro-backed table, we need two things:
a) the location where the Avro data file exists, and
b) the schema location, i.e. the Avro schema (.avsc) file.



Now we have the Avro data file and the Avro schema on HDFS:
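We can confirm this from the Hive CLI with the dfs command; the paths /user/hive/avro_data (directory holding the .avro data file) and /user/hive/schemas (directory holding the .avsc schema file) are assumptions used for illustration:

hive> dfs -ls /user/hive/avro_data;
hive> dfs -ls /user/hive/schemas;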


We create these Hive tables as external tables.
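A minimal sketch of such an external Avro table, reusing the avro_hive_tab name from the session shown later; the HDFS paths and the schema file name employee.avsc are assumptions, and the column definitions are picked up from the Avro schema via the avro.schema.url table property:

CREATE EXTERNAL TABLE avro_hive_tab
STORED AS AVRO
LOCATION '/user/hive/avro_data'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schemas/employee.avsc');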


We need to specify both the input format and the output format as Avro for the Hive table, as in the sketch below.
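On Hive versions without the STORED AS AVRO shortcut, the SerDe and the input/output formats can be spelled out explicitly; a sketch under the same assumed paths:

CREATE EXTERNAL TABLE avro_hive_tab
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/hive/avro_data'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schemas/employee.avsc');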







hive> show tables;
avro_hive_tab
hive_dml

hive> describe formatted hive_dml;

Storing the table in the ORC file format:
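A minimal sketch, assuming an illustrative table name and columns:

CREATE TABLE orc_hive_tab (
  id   INT,
  name STRING
)
STORED AS ORC;

Data can then be copied into it from an existing table with an INSERT ... SELECT, which rewrites the rows in ORC format.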


Similarly, we can create a table stored in the Parquet file format.
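A sketch along the same lines (table name and columns again illustrative):

CREATE TABLE parquet_hive_tab (
  id   INT,
  name STRING
)
STORED AS PARQUET;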

We can also change SerDe properties on an existing table, for example the character encoding used when serializing and deserializing rows:

ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK');

· ThriftSerDe: This SerDe is used to read/write Thrift serialized objects. The class file for the Thrift object must be loaded first.

· DynamicSerDe: This SerDe also reads/writes Thrift serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports many different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data in delimited records).

Other built-in SerDes include Avro, ORC, RegEx, Parquet, CSV, JsonSerDe, etc.
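As one illustration, a table over newline-delimited JSON files can be declared with the HCatalog JsonSerDe shipped with Hive; the table name, columns, and location here are assumptions:

CREATE EXTERNAL TABLE json_hive_tab (
  id   INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/hive/json_data';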
