"Hadoop" Data serialization system Avro


    • Avro Introduction
      • Schema
    • File composition
      • Header and DataBlock declaration code
      • Test code
    • Serialization and deserialization
      • Specific
      • Generic
    • Resources

Avro Introduction

Avro is a data serialization system created by Doug Cutting (the father of Hadoop). It was designed to address a shortcoming of Hadoop's Writable types: their lack of language portability. To support cross-language use, Avro defines its schemas independently of any particular language. See the official documentation [1] for more of Avro's features.

Reading and writing Avro files is driven by the schema. The schema is usually written in JSON, while the data itself is binary-encoded, and a compression algorithm can be applied to the data to reduce the amount transferred.

Schema

The data field types in a schema fall into two categories (a short example follows the list):

    • primitive types: null, boolean, int, long, float, double, bytes, and string
    • complex types: record, enum, array, map, union, and fixed
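
As a quick illustration (a sketch of my own, not taken from the referenced files), a record schema can combine these types; for instance, an array of strings, and a union of null and string that makes a field optional:

{
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        {"name": "nickname", "type": ["null", "string"]}
    ]
}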

Of the complex types, record is the most commonly used. Take the twitter.avro file from [2] as an example; opening the file shows the following header:

objavro.codecnullavro.schemaò{"type":"record","name":"twitter_schema","namespace":"com.miguno.avro","fields":[{"name":"username","type":"string","doc":"Name of the user account on Twitter.com"},{"name":"tweet","type":"string","doc":"The content of the user's Twitter message"},{"name":"timestamp","type":"long","doc":"Unix epoch time in milliseconds"}],"doc":"A basic schema for storing Twitter messages"}

After formatting, the schema reads:

{    "type":"Record",    "name":"Twitter_schema",    "namespace":"Com.miguno.avro",    " Fields":[        {            "name":"username", "type":"string",            "Doc":"Name         of the user account on Twitter.com"},        {            "name":"tweet", "type":"string",            "Doc":"The content of the user ' s Twitter message"        },        {            "name":"Timestamp", "type":"Long",            "Doc":"Unix epoch Time in milliseconds"        }    ],    "Doc:":"A Basic schema fostoring Twitter messages"}

Here name is the field's name, type is the field's data type, and doc is a more detailed description of the field.

File composition

The diagram in [3] describes the Avro file format in detail: a file consists of a header followed by one or more data blocks. The header is made up of the metadata and a 16-byte sync marker. The metadata holds avro.codec and avro.schema; the codec is the compression method applied to the data blocks, either null (no compression) or deflate. Deflate is the same compression algorithm used by gzip, and in my own experience it compresses by roughly a factor of six (I have not studied this carefully).

In fact, adjacent data blocks are separated by the sync marker; see [4] for details. The sync marker exists so that files can be split and re-synchronized during the MapReduce phase, since Avro itself was designed with MapReduce in mind.
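
As a rough sketch of how that splitting works (my own example, assuming a local twitter.avro; the split boundaries are made up), DataFileReader exposes sync() and pastSync(), which a reader can use to process only the blocks that fall inside a byte-range split:

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SplitRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical byte range of one MapReduce-style split.
        long splitStart = 0;
        long splitEnd = 1024;

        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                new File("twitter.avro"), new GenericDatumReader<GenericRecord>());
        // Jump to the first sync marker at or after the split start...
        reader.sync(splitStart);
        // ...and stop once we cross the first sync marker past the split end.
        while (reader.hasNext() && !reader.pastSync(splitEnd)) {
            System.out.println(reader.next());
        }
        reader.close();
    }
}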

Header and DataBlock declaration code
// org.apache.avro.file.DataFileStream.java
public static final class Header {
    Schema schema;
    Map<String, byte[]> meta = new HashMap<String, byte[]>();
    private transient List<String> metaKeyList = new ArrayList<String>();
    byte[] sync = new byte[DataFileConstants.SYNC_SIZE]; // byte[16]

    private Header() {}
}

static class DataBlock {
    private byte[] data;
    private long numEntries;
    private int blockSize;
    private int offset = 0;
    private boolean flushOnWrite = true;

    private DataBlock(long numEntries, int blockSize) {
        this.data = new byte[blockSize];
        this.numEntries = numEntries;
        this.blockSize = blockSize;
    }
}
Test code
DataFileReader<Void> reader = new DataFileReader<Void>(
        new FsInput(new Path("twitter.avro"), new Configuration()),
        new GenericDatumReader<Void>());
// print the schema
System.out.println(reader.getSchema().toString(true));
// print the metadata
List<String> metaKeyList = reader.getMetaKeys();
System.out.println(metaKeyList.toString());
System.out.println(reader.getMetaString("avro.codec"));
System.out.println(reader.getMetaString("avro.schema"));
// print the number of objects in the current block
System.out.println(reader.getBlockCount());
// print the data in the data block
System.out.println(reader.next());

You can see that both avro.codec and avro.schema are stored in the meta map.

Serialization and deserialization

Two serialization approaches are given on the official website: specific and generic.
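
Both snippets below assume the User schema from the official getting-started guide, reproduced here for reference:

{
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]}
    ]
}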

Specific
// Serialize user1, user2 and user3 to disk
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();

// Deserialize users from disk
DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
DataFileReader<User> dataFileReader =
        new DataFileReader<User>(new File("users.avro"), userDatumReader);
User user = null;
while (dataFileReader.hasNext()) {
    // Reuse user object by passing it to next(). This saves us from
    // allocating and garbage collecting many objects for files with
    // many items.
    user = dataFileReader.next(user);
    System.out.println(user);
}

The specific approach extracts the schema from the generated User class and uses it to parse the Avro file.
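
The User class itself is code-generated from the schema file with avro-tools' compile command (listed further below; the jar version here is just an example):

java -jar avro-tools-1.7.7.jar compile schema user.avsc .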

Generic
GenericRecord user1 = new GenericData.Record(schema);
user1.put("name", "Alyssa");
user1.put("favorite_number", 256);
// Leave favorite color null

GenericRecord user2 = new GenericData.Record(schema);
user2.put("name", "Ben");
user2.put("favorite_number", 7);
user2.put("favorite_color", "red");

// Serialize user1 and user2 to disk
File file = new File("users.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, file);
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.close();

The generic approach builds and parses records against a schema obtained at runtime rather than a generated class. Because an Avro file writes its schema into the file header, the generic approach is the more common way to parse Avro files.
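
As a companion sketch (adapted from the official getting-started guide), deserialization does not need the schema up front at all; the reader takes it from the file header:

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class GenericRead {
    public static void main(String[] args) throws Exception {
        // No schema is passed in: the reader picks up the writer's schema
        // from the header of users.avro.
        DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
        DataFileReader<GenericRecord> dataFileReader =
                new DataFileReader<GenericRecord>(new File("users.avro"), datumReader);
        GenericRecord user = null;
        while (dataFileReader.hasNext()) {
            user = dataFileReader.next(user); // reuse the record object to cut GC churn
            System.out.println(user);
        }
        dataFileReader.close();
    }
}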

The avro-tools jar provides extensive operations on Avro files, including cutting Avro files to produce test data:

Available tools:
       compile  Generates Java code for the given schema.
        concat  Concatenates avro files without re-compressing.
    fragtojson  Renders a binary-encoded Avro datum as JSON.
      fromjson  Reads JSON records and writes an Avro data file.
      fromtext  Imports a text file into an Avro data file.
       getmeta  Prints out the metadata of an Avro data file.
     getschema  Prints out schema of an Avro data file.
           idl  Generates a JSON schema from an Avro IDL file
        induce  Induce schema/protocol from Java class/interface via reflection.
    jsontofrag  Renders a JSON-encoded Avro datum as binary.
       recodec  Alters the codec of a data file.
   rpcprotocol  Output the protocol of a RPC service
    rpcreceive  Opens an RPC Server and listens for one message.
       rpcsend  Sends a single RPC message.
        tether  Run a tethered mapreduce job.
        tojson  Dumps an Avro data file as JSON, one record per line.
        totext  Converts an Avro data file to a text file.
   trevni_meta  Dumps a Trevni file's metadata as JSON.
 trevni_random  Create a Trevni file filled with random instances of a schema.
 trevni_tojson  Dumps a Trevni file as JSON.
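
For example (again, the jar version is just an assumption), dumping a file's schema and its records as JSON:

java -jar avro-tools-1.7.7.jar getschema twitter.avro
java -jar avro-tools-1.7.7.jar tojson twitter.avro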
Resources
    1. Apache Avro documentation.
    2. miguno, avro-cli-examples.
    3. xyw_eliot, Avro introduction.
    4. guibin, Avro file structure analysis.

Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission.

"Hadoop" Data serialization system Avro

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.