- Avro Introduction
- File composition
- Header and DataBlock declaration code
- Test code
- Serialization and deserialization
- Resources
Avro Introduction
Avro is a data serialization system created by Doug Cutting (the father of Hadoop), designed to address a shortcoming of Hadoop's Writable types: their lack of language portability. To support cross-language use, Avro's schemas are defined independently of any one language's type system. See the official documentation [1] for more of Avro's features.
Reading and writing Avro files is driven by the schema. The schema is usually written in JSON, while the data itself is stored in a binary encoding; a compression codec can be applied to the data blocks to reduce the amount of data transferred.
Schema
Field types in a schema fall into two categories:
- primitive types: null, boolean, int, long, float, double, bytes, and string
- complex types: record, enum, array, map, union, and fixed
Among the complex types, record is the most commonly used. As an example, opening the twitter.avro file from [2] shows the following file header:
Objavro.codecnullavro.schemaò{"type":"record","name":"twitter_schema","namespace":"com.miguno.avro","fields":[{"name":"username","type":"string","doc":"Name of the user account on Twitter.com"},{"name":"tweet","type":"string","doc":"The content of the user's Twitter message"},{"name":"timestamp","type":"long","doc":"Unix epoch time in milliseconds"}],"doc:":"A basic schema for storing Twitter messages"}
After formatting, the schema looks like this:
{
  "type": "record",
  "name": "twitter_schema",
  "namespace": "com.miguno.avro",
  "fields": [
    {
      "name": "username",
      "type": "string",
      "doc": "Name of the user account on Twitter.com"
    },
    {
      "name": "tweet",
      "type": "string",
      "doc": "The content of the user's Twitter message"
    },
    {
      "name": "timestamp",
      "type": "long",
      "doc": "Unix epoch time in milliseconds"
    }
  ],
  "doc:": "A basic schema for storing Twitter messages"
}
Here name is the field's name, type is the field's data type, and doc is a more detailed description of the field.
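Since all reads and writes are driven by the schema, here is a minimal sketch of loading this schema in Java; the file name twitter.avsc is an assumption for illustration, and Schema.Parser is the standard Avro API for parsing a JSON schema definition.
import java.io.File;
import org.apache.avro.Schema;

public class SchemaParseExample {
    public static void main(String[] args) throws Exception {
        // Parse the JSON schema definition shown above
        Schema schema = new Schema.Parser().parse(new File("twitter.avsc"));
        System.out.println(schema.getName());       // twitter_schema
        System.out.println(schema.getNamespace());  // com.miguno.avro
        System.out.println(schema.getField("username").schema().getType()); // STRING
    }
}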
File composition
The diagram in [3] describes the Avro file format in detail: a file consists of a header followed by one or more data blocks. The header is composed of the metadata and a 16-byte sync marker. The metadata holds the codec and the schema; the codec is the compression method applied to the data blocks, either null (no compression) or deflate. The deflate algorithm is the same compression used by gzip; in my own experience it compresses by roughly a factor of six (I have not studied this rigorously).
Each pair of adjacent data blocks is separated by the sync marker; see [4] for details. The sync marker exists to support file splitting and synchronization in the MapReduce phase, as Avro itself was designed with MapReduce in mind.
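As a minimal sketch of how the codec ends up in the header, the following writes a file with deflate compression; the schema file name, output file name, compression level, and record values are assumptions for illustration, while DataFileWriter.setCodec and CodecFactory.deflateCodec are standard Avro Java API.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class DeflateWriteExample {
    public static void main(String[] args) throws Exception {
        // Assumed: twitter.avsc holds the schema shown earlier
        Schema schema = new Schema.Parser().parse(new File("twitter.avsc"));
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        // Must be set before create(); recorded in the header as avro.codec=deflate
        writer.setCodec(CodecFactory.deflateCodec(6));
        writer.create(schema, new File("twitter-deflate.avro"));

        GenericRecord tweet = new GenericData.Record(schema);
        tweet.put("username", "miguno");
        tweet.put("tweet", "Rock: Nerf paper, scissors is fine.");
        tweet.put("timestamp", 1366150681L);
        writer.append(tweet);
        writer.close();
    }
}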
Header and DataBlock declaration code
// org.apache.avro.file.DataFileStream.java
public static final class Header {
    Schema schema;
    Map<String, byte[]> meta = new HashMap<String, byte[]>();
    private transient List<String> metaKeyList = new ArrayList<String>();
    byte[] sync = new byte[DataFileConstants.SYNC_SIZE]; // byte[16]

    private Header() {}
}

static class DataBlock {
    private byte[] data;
    private long numEntries;
    private int blockSize;
    private int offset = 0;
    private boolean flushOnWrite = true;

    private DataBlock(long numEntries, int blockSize) {
        this.data = new byte[blockSize];
        this.numEntries = numEntries;
        this.blockSize = blockSize;
    }
}
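To see part of this layout directly on disk, the following minimal sketch (the file name is an assumption) reads the first four bytes of an Avro file, which hold the magic value 'O', 'b', 'j' followed by the format version byte 1 (DataFileConstants.MAGIC):
import java.io.FileInputStream;

public class MagicBytesExample {
    public static void main(String[] args) throws Exception {
        FileInputStream in = new FileInputStream("twitter.avro");
        byte[] magic = new byte[4];
        in.read(magic);
        in.close();
        // Expected output: O b j 1
        System.out.println((char) magic[0] + " " + (char) magic[1] + " "
                + (char) magic[2] + " " + magic[3]);
    }
}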
Test code
DataFileReader<Void> reader = new DataFileReader<Void>(
        new FsInput(new Path("twitter.avro"), new Configuration()),
        new GenericDatumReader<Void>());
// print schema
System.out.println(reader.getSchema().toString(true));
// print meta
List<String> metaKeyList = reader.getMetaKeys();
System.out.println(metaKeyList.toString());
System.out.println(reader.getMetaString("avro.codec"));
System.out.println(reader.getMetaString("avro.schema"));
// print block count
System.out.println(reader.getBlockCount());
// print the data in the data block
System.out.println(reader.next());
You can see that meta stores avro.codec and avro.schema.
Serialization and deserialization
The official website gives two serialization approaches: specific and generic.
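Both examples below come from the official getting-started guide and operate on a User record; for reference, the corresponding schema (user.avsc in the official guide) looks like this:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}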
Specific
// Serialize user1, user2 and user3 to disk
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();

// Deserialize users from disk
DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
DataFileReader<User> dataFileReader = new DataFileReader<User>(file, userDatumReader);
User user = null;
while (dataFileReader.hasNext()) {
    // Reuse user object by passing it to next(). This saves us from
    // allocating and garbage collecting many objects for files with
    // many items.
    user = dataFileReader.next(user);
    System.out.println(user);
}
The specific approach parses Avro data using the schema extracted from the generated User class.
Generic
GenericRecord user1 = new GenericData.Record(schema);
user1.put("name", "Alyssa");
user1.put("favorite_number", 256);
// Leave favorite color null

GenericRecord user2 = new GenericData.Record(schema);
user2.put("name", "Ben");
user2.put("favorite_number", 7);
user2.put("favorite_color", "red");

// Serialize user1 and user2 to disk
File file = new File("users.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, file);
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.close();
The generic approach supplies a schema up front and parses the data against it. Because an Avro file writes its schema into the file header, the generic approach is the more common way to parse in practice.
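For completeness, here is the matching generic-mode read, adapted from the official guide; file and schema are the ones created in the write example above.
// Deserialize user1 and user2 from disk
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, datumReader);
GenericRecord user = null;
while (dataFileReader.hasNext()) {
    // Reuse the record object to avoid allocating one per item
    user = dataFileReader.next(user);
    System.out.println(user);
}
dataFileReader.close();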
The avro-tools jar provides extensive operations on Avro files, including cutting Avro files down to produce test data.
Available tools:
    compile        Generates Java code for the given schema.
    concat         Concatenates avro files without re-compressing.
    fragtojson     Renders a binary-encoded Avro datum as JSON.
    fromjson       Reads JSON records and writes an Avro data file.
    fromtext       Imports a text file into an avro data file.
    getmeta        Prints out the metadata of an Avro data file.
    getschema      Prints out schema of an Avro data file.
    idl            Generates a JSON schema from an Avro IDL file
    induce         Induce schema/protocol from Java class/interface via reflection.
    jsontofrag     Renders a JSON-encoded Avro datum as binary.
    recodec        Alters the codec of a data file.
    rpcprotocol    Output the protocol of a RPC service
    rpcreceive     Opens an RPC Server and listens for one message.
    rpcsend        Sends a single RPC message.
    tether         Run a tethered mapreduce job.
    tojson         Dumps an Avro data file as JSON, one record per line.
    totext         Converts an Avro data file to a text file.
    trevni_meta    Dumps a Trevni file's metadata as JSON.
    trevni_random  Create a Trevni file filled with random instances of a schema.
    trevni_tojson  Dumps a Trevni file as JSON.
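For example, dumping a file's schema and records looks like this (the jar version number below is an assumption; use whatever avro-tools jar you have on hand):
java -jar avro-tools-1.7.7.jar getschema twitter.avro
java -jar avro-tools-1.7.7.jar tojson twitter.avro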
Resources
- [1] Apache Avro documentation.
- [2] Miguno, avro-cli-examples.
- [3] Xyw_eliot, Avro introduction.
- [4] Guibin, Avro file structure analysis.