- Avro Introduction
- File composition
- Header and DataBlock declaration code
- Test code
- Serialization and deserialization
- Resources
Avro Introduction
Avro is a data serialization system created by Doug Cutting (the father of Hadoop), designed to address a shortcoming of Hadoop's Writable types: their lack of language portability. To support cross-language use, Avro's schemas are defined independently of any one language's type system. See the official documentation [1] for more of Avro's features.
Reading and writing Avro files is driven by the schema. The schema is usually written in JSON, while the data itself is stored in a binary encoding; a compression codec can be applied to the data blocks to reduce the amount of data transferred.
Schema
Field types in a schema fall into two categories:
- primitive types: null, boolean, int, long, float, double, bytes, and string
- complex types: record, enum, array, map, union, and fixed
Among the complex types, record is the most commonly used. As an example, opening the twitter.avro file from [2] shows the following file header:
Objavro.codecnullavro.schemaò{"type":"record","name":"twitter_schema","namespace":"com.miguno.avro","fields":[{"name":"username","type":"string","doc":"Name of the user account on Twitter.com"},{"name":"tweet","type":"string","doc":"The content of the user's Twitter message"},{"name":"timestamp","type":"long","doc":"Unix epoch time in milliseconds"}],"doc:":"A basic schema for storing Twitter messages"}
After formatting, the schema looks like this:
{
  "type": "record",
  "name": "twitter_schema",
  "namespace": "com.miguno.avro",
  "fields": [
    {
      "name": "username",
      "type": "string",
      "doc": "Name of the user account on Twitter.com"
    },
    {
      "name": "tweet",
      "type": "string",
      "doc": "The content of the user's Twitter message"
    },
    {
      "name": "timestamp",
      "type": "long",
      "doc": "Unix epoch time in milliseconds"
    }
  ],
  "doc:": "A basic schema for storing Twitter messages"
}
Here name is the field's name, type is the field's data type, and doc is a more detailed description of the field.
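Since all reads and writes are driven by the schema, here is a minimal sketch of loading this schema in Java; the file name twitter.avsc is an assumption for illustration, and Schema.Parser is the standard Avro API for parsing a JSON schema definition.
import java.io.File;
import org.apache.avro.Schema;

public class SchemaParseExample {
    public static void main(String[] args) throws Exception {
        // Parse the JSON schema definition shown above
        Schema schema = new Schema.Parser().parse(new File("twitter.avsc"));
        System.out.println(schema.getName());       // twitter_schema
        System.out.println(schema.getNamespace());  // com.miguno.avro
        System.out.println(schema.getField("username").schema().getType()); // STRING
    }
}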
File composition
The diagram in [3] describes the Avro file format in detail: a file consists of a header followed by one or more data blocks. The header is composed of the metadata and a 16-byte sync marker. The metadata holds the codec and the schema; the codec is the compression method applied to the data blocks, either null (no compression) or deflate. The deflate algorithm is the same compression used by gzip; in my own experience it compresses by roughly a factor of six (I have not studied this rigorously).
Each pair of adjacent data blocks is separated by the sync marker; see [4] for details. The sync marker exists to support file splitting and synchronization in the MapReduce phase, as Avro itself was designed with MapReduce in mind.
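As a minimal sketch of how the codec ends up in the header, the following writes a file with deflate compression; the schema file name, output file name, compression level, and record values are assumptions for illustration, while DataFileWriter.setCodec and CodecFactory.deflateCodec are standard Avro Java API.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class DeflateWriteExample {
    public static void main(String[] args) throws Exception {
        // Assumed: twitter.avsc holds the schema shown earlier
        Schema schema = new Schema.Parser().parse(new File("twitter.avsc"));
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        // Must be set before create(); recorded in the header as avro.codec=deflate
        writer.setCodec(CodecFactory.deflateCodec(6));
        writer.create(schema, new File("twitter-deflate.avro"));

        GenericRecord tweet = new GenericData.Record(schema);
        tweet.put("username", "miguno");
        tweet.put("tweet", "Rock: Nerf paper, scissors is fine.");
        tweet.put("timestamp", 1366150681L);
        writer.append(tweet);
        writer.close();
    }
}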
Header and DataBlock declaration code
// org.apache.avro.file.DataFileStream.java
public static final class Header {
    Schema schema;
    Map<String, byte[]> meta = new HashMap<String, byte[]>();
    private transient List<String> metaKeyList = new ArrayList<String>();
    byte[] sync = new byte[DataFileConstants.SYNC_SIZE]; // byte[16]

    private Header() {}
}

static class DataBlock {
    private byte[] data;
    private long numEntries;
    private int blockSize;
    private int offset = 0;
    private boolean flushOnWrite = true;

    private DataBlock(long numEntries, int blockSize) {
        this.data = new byte[blockSize];
        this.numEntries = numEntries;
        this.blockSize = blockSize;
    }
}
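To see part of this layout directly on disk, the following minimal sketch (the file name is an assumption) reads the first four bytes of an Avro file, which hold the magic value 'O', 'b', 'j' followed by the format version byte 1 (DataFileConstants.MAGIC):
import java.io.FileInputStream;

public class MagicBytesExample {
    public static void main(String[] args) throws Exception {
        FileInputStream in = new FileInputStream("twitter.avro");
        byte[] magic = new byte[4];
        in.read(magic);
        in.close();
        // Expected output: O b j 1
        System.out.println((char) magic[0] + " " + (char) magic[1] + " "
                + (char) magic[2] + " " + magic[3]);
    }
}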
Test code
DataFileReader<Void> reader = new DataFileReader<Void>(
        new FsInput(new Path("twitter.avro"), new Configuration()),
        new GenericDatumReader<Void>());
// print schema
System.out.println(reader.getSchema().toString(true));
// print meta
List<String> metaKeyList = reader.getMetaKeys();
System.out.println(metaKeyList.toString());
System.out.println(reader.getMetaString("avro.codec"));
System.out.println(reader.getMetaString("avro.schema"));
// print block count
System.out.println(reader.getBlockCount());
// print the data in the data block
System.out.println(reader.next());
You can see that meta stores avro.codec and avro.schema.
Serialization and deserialization
The official website gives two serialization approaches: specific and generic.
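Both examples below come from the official getting-started guide and operate on a User record; for reference, the corresponding schema (user.avsc in the official guide) looks like this:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}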
Specific
// Serialize user1, user2 and user3 to disk
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();

// Deserialize users from disk
DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
DataFileReader<User> dataFileReader = new DataFileReader<User>(file, userDatumReader);
User user = null;
while (dataFileReader.hasNext()) {
    // Reuse user object by passing it to next(). This saves us from
    // allocating and garbage collecting many objects for files with
    // many items.
    user = dataFileReader.next(user);
    System.out.println(user);
}
The specific approach parses Avro data using the schema extracted from the generated User class.
Generic
GenericRecord user1 = new GenericData.Record(schema);
user1.put("name", "Alyssa");
user1.put("favorite_number", 256);
// Leave favorite color null

GenericRecord user2 = new GenericData.Record(schema);
user2.put("name", "Ben");
user2.put("favorite_number", 7);
user2.put("favorite_color", "red");

// Serialize user1 and user2 to disk
File file = new File("users.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, file);
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.close();
The generic approach supplies a schema up front and parses the data against it. Because an Avro file writes its schema into the file header, the generic approach is the more common way to parse in practice.
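For completeness, here is the matching generic-mode read, adapted from the official guide; file and schema are the ones created in the write example above.
// Deserialize user1 and user2 from disk
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, datumReader);
GenericRecord user = null;
while (dataFileReader.hasNext()) {
    // Reuse the record object to avoid allocating one per item
    user = dataFileReader.next(user);
    System.out.println(user);
}
dataFileReader.close();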
The avro-tools jar provides extensive operations on Avro files, including cutting Avro files down to produce test data.
Available tools:
    compile        Generates Java code for the given schema.
    concat         Concatenates avro files without re-compressing.
    fragtojson     Renders a binary-encoded Avro datum as JSON.
    fromjson       Reads JSON records and writes an Avro data file.
    fromtext       Imports a text file into an avro data file.
    getmeta        Prints out the metadata of an Avro data file.
    getschema      Prints out schema of an Avro data file.
    idl            Generates a JSON schema from an Avro IDL file
    induce         Induce schema/protocol from Java class/interface via reflection.
    jsontofrag     Renders a JSON-encoded Avro datum as binary.
    recodec        Alters the codec of a data file.
    rpcprotocol    Output the protocol of a RPC service
    rpcreceive     Opens an RPC Server and listens for one message.
    rpcsend        Sends a single RPC message.
    tether         Run a tethered mapreduce job.
    tojson         Dumps an Avro data file as JSON, one record per line.
    totext         Converts an Avro data file to a text file.
    trevni_meta    Dumps a Trevni file's metadata as JSON.
    trevni_random  Create a Trevni file filled with random instances of a schema.
    trevni_tojson  Dumps a Trevni file as JSON.
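For example, dumping a file's schema and records looks like this (the jar version number below is an assumption; use whatever avro-tools jar you have on hand):
java -jar avro-tools-1.7.7.jar getschema twitter.avro
java -jar avro-tools-1.7.7.jar tojson twitter.avro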
Resources
- [1] Apache Avro documentation.
- [2] Miguno, avro-cli-examples.
- [3] Xyw_eliot, Avro introduction.
- [4] Guibin, Avro file structure analysis.