While working on data analysis and processing, we encounter various file formats; Avro is one of them.
What is Avro?
Avro is a data serialization and RPC library. It is used to improve data interchange, interoperability, and versioning in MapReduce.
Avro was created by Doug Cutting.
Avro uses a compact binary data format with optional compression, resulting in fast serialization times.
Avro's schema concept is similar to that of Protocol Buffers, but Avro improves on Protocol Buffers in that its code generation is optional. It also embeds the schema in its container file format, allowing for dynamic schema discovery and data interaction. In addition, it has mechanisms for working with schema data through generic data types.
Structure of Avro Files
The schema is serialized as part of the file header, ahead of the data. Having the schema in the header makes deserialization simple, and it frees users from having to maintain and access the schema separately from the Avro data files they are working with.
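To see why embedding the schema in the header pays off, here is a minimal, stdlib-only Python sketch of a self-describing file. This is not the real Avro binary format — the framing and field names are invented for illustration — but it shows the idea: the writer stores the schema in the header, so a reader that has never seen the schema can still decode the records.

```python
import io
import json
import struct

def write_container(buf, schema, records):
    """Write a schema header followed by length-prefixed JSON records.
    (Simplified illustration -- real Avro uses a binary encoding.)"""
    header = json.dumps(schema).encode("utf-8")
    buf.write(struct.pack(">I", len(header)))  # header length
    buf.write(header)                          # embedded schema
    for rec in records:
        body = json.dumps(rec).encode("utf-8")
        buf.write(struct.pack(">I", len(body)))
        buf.write(body)

def read_container(buf):
    """Decode the file with no out-of-band schema knowledge."""
    (hlen,) = struct.unpack(">I", buf.read(4))
    schema = json.loads(buf.read(hlen))
    records = []
    while True:
        prefix = buf.read(4)
        if not prefix:  # end of file
            break
        (blen,) = struct.unpack(">I", prefix)
        records.append(json.loads(buf.read(blen)))
    return schema, records

schema = {"type": "record", "name": "Stock",
          "fields": [{"name": "symbol", "type": "string"},
                     {"name": "price", "type": "double"}]}
buf = io.BytesIO()
write_container(buf, schema, [{"symbol": "GOOG", "price": 561.25}])
buf.seek(0)
decoded_schema, decoded = read_container(buf)
print(decoded_schema["name"], decoded)  # reader recovered both schema and data
```

Because the schema travels with the data, the reader needs nothing but the file itself — the same property that makes Avro container files self-describing.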
Each data block contains a number of Avro records and is 16 KB in size by default.
Features of Avro
It supports code generation, versioning, and compression.
It has a high level of integration with MapReduce.
Hadoop SequenceFiles are less appealing because they don’t support a schema or any form of data evolution; Avro, by contrast, supports schema evolution.
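Avro's schema-resolution rules let a newer "reader" schema consume records written with an older "writer" schema — for example, a field added later with a default value is filled in when old data is read. The sketch below illustrates that rule on plain dicts (illustrative only; real Avro resolves schemas against its binary encoding, and the field names here are invented):

```python
# Record written before the "exchange" field existed in the schema.
old_record = {"symbol": "GOOG", "price": 561.25}

# Newer reader schema: adds "exchange" with a default value.
new_schema_fields = [
    {"name": "symbol", "type": "string"},
    {"name": "price", "type": "double"},
    {"name": "exchange", "type": "string", "default": "NASDAQ"},
]

def resolve(record, reader_fields):
    """Fill in defaults for fields the writer's schema didn't have."""
    out = {}
    for field in reader_fields:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return out

upgraded = resolve(old_record, new_schema_fields)
print(upgraded)  # old data readable under the new schema
```

This is why adding a field in Avro is safe as long as the new field carries a default: existing files remain readable without rewriting them.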
Avro’s Schema and code generation
In a code-generated approach, everything starts with a schema, so the first step is to create an Avro schema.
Consider stock data:
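The schema itself did not survive here, but a hypothetical Avro schema for stock data might look like the following (the record name, namespace, and fields are illustrative, not the original's):

```json
{
  "type": "record",
  "name": "Stock",
  "namespace": "hip.ch3.avro.gen",
  "fields": [
    {"name": "symbol", "type": "string"},
    {"name": "date",   "type": "string"},
    {"name": "open",   "type": "double"},
    {"name": "close",  "type": "double"},
    {"name": "volume", "type": "int"}
  ]
}
```

Avro schemas are written in JSON; a `record` type lists its named, typed fields, and the `namespace` becomes the Java package of the generated class.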
To generate Java code for a schema, use the Avro tools JAR as follows:
# create a directory for the sources
mkdir src

# expand the source JAR into the directory
(cd src && jar -xvf ../hip-2.0.0-sources.jar)

# let Avro generate classes for the Avro schema:
# "compile schema" takes the input schema file(s) -- multiple
# schema files are supported -- followed by the output directory
# (the paths below are placeholders)
java -jar $HIP_HOME/lib/avro-tools-1.7.4.jar \
  compile schema \
  path/to/stock.avsc \
  src/
More on working with Avro in an upcoming post.