Note: this article is a Chinese-language summary of Doug Cutting's blog post (linked in the references below).
Avro is a subproject of Hadoop, developed by Doug Cutting, the founder of Hadoop (and also of Lucene, Nutch, and other projects). Avro is a data serialization system designed for applications that exchange large volumes of data. It supports binary serialization, so large amounts of data can be processed conveniently and quickly, and it is friendly to dynamic languages: Avro provides mechanisms that let dynamic languages process Avro data easily.
There are many similar serialization systems, such as Google's Protocol Buffers and Facebook's Thrift. These systems are well regarded and can fully meet the needs of common applications. Addressing concerns about duplicated effort, Doug Cutting wrote that Hadoop's existing RPC system ran into several problems: performance bottlenecks (it used Java's built-in DataOutputStream and DataInputStream), the server and client had to run the same version of Hadoop, and only Java could be used for development. The existing serialization systems have faults of their own. Taking Protocol Buffers as an example, you must first define a data structure, then generate code from that definition, and then assemble the data. If you need to operate on data sets from multiple sources, you must define multiple data structures and repeat this process each time, so you cannot handle arbitrary data sets uniformly. Second, code generation is a poor fit for scripting systems in the Hadoop ecosystem such as Hive and Pig. In addition, Protocol Buffers adds annotations to the serialized data, because the data definition may not exactly match the data being serialized; this makes the data larger and slows down processing. Other serialization systems have problems similar to Protocol Buffers'. Therefore, for the future of Hadoop, Doug Cutting led the development of a new serialization system: Avro, which was later added to the Hadoop project family.
The comparison with Protocol Buffers above gives a rough sense of Avro's strengths. The rest of this article focuses on the details of Avro.
Avro relies on a schema to define data structures. A schema can be understood like a Java class: it defines the structure of every instance and the attributes each instance can contain, and you can create any number of instance objects from it. Just as you must refer to the class information to make sense of an instance during serialization, an Avro object produced from a schema is analogous to an instance of a class, and you must know the schema's exact structure for every serialization and deserialization. Therefore, wherever Avro data lives, such as file storage or network communication, schema and data must be present together. Avro data is read and written against the schema (from a file or the network), and the written data carries no extra identifiers, so serialization is fast and the output is small. Because a program can process data directly according to the schema, Avro is also well suited to scripting languages.
An Avro schema is represented as a JSON object, with attributes that describe the different forms of each type. Avro supports eight primitive types and six complex types. A primitive type can be written as a plain JSON string, while each complex type is defined by its own set of attributes, some required and some optional; a JSON array can hold multiple JSON object definitions. With these types you can build a rich range of data structures to support complicated user data.
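To make this concrete, here is a small record schema parsed with the Avro Java API. This is a minimal sketch, assuming the Avro Java library (version 1.5 or later, where Schema.Parser exists) is on the classpath; the User record and its fields are invented for illustration. The age field shows a union of null and int, one of the six complex types, with a default value:

```java
import org.apache.avro.Schema;

public class SchemaExample {
  public static void main(String[] args) {
    // A record schema with one required field and one optional (union) field
    String json =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":[\"null\",\"int\"],\"default\":null}]}";
    Schema schema = new Schema.Parser().parse(json);
    System.out.println(schema.toString(true)); // pretty-print the parsed schema
  }
}
```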
Avro supports both binary encoding and JSON encoding. Binary encoding serializes quickly, and the serialized result is relatively small; JSON encoding is generally used for debugging or for web-based applications. For serialization and deserialization, the schema is traversed depth-first, left to right, that is, fields are processed in the order the schema declares them. Primitive types are straightforward to serialize, while each complex type has its own serialization rules. The binary encoding of every primitive and complex type is specified in the documentation, and the bytes are laid out in the order the schema dictates. For JSON encoding, the union type is handled differently from the other complex types.
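As an illustration of the binary encoding, the following sketch (again assuming the Avro Java library; the User record is the invented example from above, here with a required int field) writes one record with the binary encoder. The output contains only the field values in declaration order, with no field names or tags:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class BinaryEncodingExample {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 30);

    // Fields are written depth-first in schema declaration order,
    // with no names or tags in the output bytes.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
    encoder.flush();
    System.out.println("encoded size: " + out.size() + " bytes");
  }
}
```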
Avro defines a container file format to facilitate MapReduce processing. Such a file has exactly one schema, and every object saved to the file must be written in binary encoding according to that schema. Objects are organized into blocks within the file, and these blocks can be compressed. A synchronization marker sits between blocks so that MapReduce can easily split the file for processing. (The original post includes a structure diagram drawn from the specification's description.)
The diagram omits the physical details of each block, so they are worth describing. A container file consists of two parts: a header and one or more data blocks. The header consists of a four-byte prefix (similar to a magic number), the file's metadata, and a randomly generated 16-byte synchronization marker. The metadata may seem puzzling at first: what can it contain besides the file's schema? The specification says Avro recognizes two reserved metadata keys, schema and codec, where codec names the compression method applied to the file's data blocks. Every implementation must support two codecs: null (no compression) and deflate (blocks compressed with the deflate algorithm). Besides the two reserved keys, you can also define your own metadata. The metadata section begins with a long giving the number of key/value pairs, which lets applications store as much metadata as they need; each pair has a string key (reserved keys are prefixed with "avro.") and a binary-encoded value. Each data block after the header has the following structure: a long recording the number of objects in the block, a long recording the block's size in bytes after compression, the serialized objects themselves, and the 16-byte synchronization marker. Because objects are organized into blocks, you can operate on a data block without deserializing it, and the block sizes, object counts, and synchronization markers can also be used to locate damaged blocks and ensure data integrity.
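A hedged example of the container file format in practice, using the Java DataFileWriter and DataFileReader classes (the file name and record are made up; CodecFactory.deflateCodec selects the deflate codec mentioned above). Note that the reader recovers the schema from the file header, so no external copy of the schema is needed:

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ContainerFileExample {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}");
    File file = new File("users.avro");

    // The writer stores the schema in the header and groups records
    // into blocks separated by the 16-byte synchronization marker.
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.setCodec(CodecFactory.deflateCodec(6)); // deflate: one of the two required codecs
    writer.create(schema, file);
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    writer.append(user);
    writer.close();

    // The reader takes the schema from the file header.
    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
    for (GenericRecord r : reader) {
      System.out.println(r);
    }
    reader.close();
  }
}
```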
The above covers serializing Avro objects to a file. Avro is also used as an RPC framework. When a client wants to interact with a server, the two sides need to exchange their communication protocol; a protocol is similar to a schema and must be defined by both parties, and the calls it declares are what Avro calls messages. Both parties must hold this protocol so they can parse the data the other side sends. This exchange is the legendary handshake phase.
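For illustration, here is a tiny protocol parsed with the Java API (a sketch; the Greeter protocol and its hello message are invented). The JSON shape, with request parameters and a response type per message, follows the specification:

```java
import org.apache.avro.Protocol;

public class ProtocolExample {
  public static void main(String[] args) {
    // A protocol declares named messages, each with a request and a response
    String json =
        "{\"protocol\":\"Greeter\",\"namespace\":\"example\","
      + "\"types\":[],"
      + "\"messages\":{\"hello\":{"
      + "\"request\":[{\"name\":\"greeting\",\"type\":\"string\"}],"
      + "\"response\":\"string\"}}}";
    Protocol protocol = Protocol.parse(json);
    System.out.println(protocol.getMessages().keySet()); // prints [hello]
  }
}
```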
A message sent from the client to the server must pass through the transport layer, which sends the message and receives the server's response. The data that reaches the transport layer is binary. HTTP is commonly used as the transport, with data sent to the other side via POST. Within Avro, a message is encapsulated into a group of buffers, following a simple framing model:
Each buffer starts with a four-byte count, followed by that many bytes of buffered data, and the series ends with an empty buffer. The advantage of this mechanism is that the sender can easily assemble data from different sources when sending, and the receiver can store the data into different regions. Also, when writing data into buffers, a large object can occupy a buffer exclusively rather than being mixed in with small objects, which lets the receiver read large objects conveniently.
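The framing itself is simple enough to sketch by hand. The following is not Avro's own transport code, just an illustration of the model described above: each buffer is preceded by a four-byte big-endian length, and a zero-length buffer terminates the message:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FramingSketch {
  // Write one length-prefixed buffer: a four-byte big-endian count, then the bytes.
  static void writeBuffer(DataOutputStream out, byte[] data) throws IOException {
    out.writeInt(data.length);
    out.write(data);
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    writeBuffer(out, "first chunk".getBytes("UTF-8"));
    writeBuffer(out, "second chunk".getBytes("UTF-8"));
    writeBuffer(out, new byte[0]); // empty buffer marks the end of the message
    out.flush();
    System.out.println("framed message: " + bytes.size() + " bytes");
  }
}
```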
Next, some other details about Avro. Doug Cutting was quoted earlier as saying that when Protocol Buffers transfers data, it adds annotations to the data to cope with mismatches between the data structure and the data, at the direct cost of larger payloads and harder parsing. How does Avro handle differences between schema and data? To preserve efficiency, Avro assumes that the schemas match at least for the most part, and defines resolution rules for the rest; data is checked against these rules, and an error is signaled if the schemas cannot be resolved. When reading data with a compatible schema, if a field is missing from the data, the default value declared in the schema is used; if the data contains values the schema does not know about, they are ignored.
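The Java API exposes this resolution by letting a reader take both the writer's schema and the reader's schema. A minimal sketch (the schemas are invented; the reader's schema adds an age field with a default, which gets filled in because the encoded data lacks it):

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ResolutionExample {
  public static void main(String[] args) throws Exception {
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}");
    // The reader's schema adds an "age" field with a default value.
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}");

    // Encode with the writer's schema...
    GenericRecord user = new GenericData.Record(writerSchema);
    user.put("name", "alice");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(user, encoder);
    encoder.flush();

    // ...decode with both schemas: the missing "age" field takes its default.
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    System.out.println(reader.read(null, decoder)); // {"name": "alice", "age": -1}
  }
}
```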
One of Avro's advertised advantages is sortability: data serialized by an Avro program in one language can be sorted, without being deserialized, by an Avro program in another language. I don't know exactly what mechanism it uses, but it looks quite useful.
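For what it's worth, the Java implementation exposes such a comparison through org.apache.avro.io.BinaryData.compare, which walks two binary-encoded values according to the schema without deserializing them. A small sketch (the record and values are invented):

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryData;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class SortExample {
  // Serialize one record to bytes with the binary encoder.
  static byte[] encode(Schema schema, GenericRecord r) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(r, enc);
    enc.flush();
    return out.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}");
    GenericRecord a = new GenericData.Record(schema);
    a.put("name", "alice");
    GenericRecord b = new GenericData.Record(schema);
    b.put("name", "bob");

    // Compare the raw bytes, guided only by the schema; no deserialization.
    int cmp = BinaryData.compare(encode(schema, a), 0, encode(schema, b), 0, schema);
    System.out.println(cmp < 0); // true: "alice" sorts before "bob"
  }
}
```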
At present there is very little material about Avro; the article above is distilled from the official documentation and Doug Cutting's post. There are surely mistakes in it, and some parts may be outright wrong. I am publishing this summary so it can be continuously revised and supplemented, and to share what I have learned over the past two days. I hope it helps people who want to learn about Avro, and I hope readers will point out where my understanding is wrong, which will help improve it.
Other materials:
Avro specification: http://avro.apache.org/docs/current/spec.html
Doug Cutting's blog post: http://www.cloudera.com/blog/2009/11/avro-a-new-format-for-data-interchange/
Performance comparison of serialization systems: http://wiki.github.com/eishay/jvm-serializers/
Original post: http://langyu.iteye.com/blog/708568