Schema evolution in Avro, Protocol Buffers and Thrift


Source: http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

When you want to store data, such as objects, in a file, or send it over the network, you run into the serialization problem.
For serialization, each language provides its own packages, such as Java serialization, Ruby's marshal, or Python's pickle.

That is all fine until you need to work across platforms and languages, at which point you reach for JSON or XML.
And if you cannot stand the verbosity and parsing cost of JSON or XML, the real problem begins. You could, of course, try to invent a binary encoding for JSON.

But there is no need to reinvent the wheel: Thrift, Protocol Buffers and Avro all provide efficient, cross-language serialization of data using a schema, and code generation for the Java folks.

So you have some data that you want to store in a file or send over the network. You may find yourself going through several phases of evolution:

  1. Using your programming language's built-in serialization, such as Java serialization, Ruby's marshal, or Python's pickle. Or maybe you even invent your own format.
  2. Then you realize that being locked into one programming language sucks, so you move to using a widely supported, language-agnostic format like JSON (or XML if you like to party like it's 1999).
  3. Then you decide that JSON is too verbose and too slow to parse, you're annoyed that it doesn't differentiate integers from floating point, and think that you'd quite like binary strings as well as Unicode strings. So you invent some sort of binary format that's kinda like JSON, but binary (1, 2, 3, 4, 5, 6).
  4. Then you find that people are stuffing all sorts of random fields into their objects, using inconsistent types, and you'd quite like a schema and some documentation, thank you very much. Perhaps you're also using a statically typed programming language and want to generate model classes from a schema. Also you realize that your binary JSON-lookalike actually isn't all that compact, because you're still storing field names over and over again; hey, if you had a schema, you could avoid storing objects' field names, and you could save some more bytes!

Once you get to the fourth stage, your options are typically Thrift, Protocol Buffers or Avro. All three provide efficient, cross-language serialization of data using a schema, and code generation for the Java folks.

 

In actual use, data is constantly changing, so schemas are always evolving. Thrift, protobuf and Avro all support this, ensuring that service continues uninterrupted when the schema changes on the client or the server.

In real life, data is always in flux. The moment you think you have finalized a schema, someone will come up with a use case that wasn't anticipated, and wants to "just quickly add a field". Fortunately Thrift, protobuf and Avro all support schema evolution: you can change the schema, you can have producers and consumers with different versions of the schema at the same time, and it all continues to work. That is an extremely valuable feature when you're dealing with a big production system, because it allows you to update different components of the system independently, at different times, without worrying about compatibility.

 

The focus of this article is to compare how Thrift, protobuf and Avro serialize data to binary, and how each supports schema evolution.

The example I will use is a little object describing a person. In JSON I would write it like this:

{
    "userName": "Martin",
    "favouriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}

This JSON encoding can be our baseline. If I remove all the whitespace it consumes 82 bytes.

Protocol Buffers

The Protocol Buffers schema for the person object might look something like this:

message Person {
    required string user_name        = 1;
    optional int64  favourite_number = 2;
    repeated string interests        = 3;
}
First, protobuf uses an IDL to describe the schema of Person.
Each field has a unique tag as its identifier, so = 1, = 2, = 3 are not value assignments; they declare each field's tag.
Each field can also be marked optional, required or repeated.

When we encode the data above using this schema, it takes just 33 bytes.

Clearly, going from 82 bytes of JSON to 33 bytes of binary is a substantial saving.

First, only the tag is recorded during serialization, never the field name. A field's name can therefore be changed freely without affecting the encoding, but its tag must never change.

The first byte of each field records its tag and wire type, followed by the data itself; strings additionally carry a length prefix.
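To make the byte layout concrete, here is a minimal Python sketch (not part of the original post) that hand-assembles the record according to the protobuf wire format: each field starts with a key byte holding (tag << 3) | wire_type, where wire type 0 is a varint and wire type 2 is length-delimited. It reproduces the 33 bytes without any protobuf library.

def varint(n):
    # Protobuf varint: 7 bits per byte, most significant bit flags continuation.
    out = bytearray()
    while True:
        n, low = n >> 7, n & 0x7F
        out.append(low | (0x80 if n else 0))
        if not n:
            return bytes(out)

def key(tag, wire_type):
    # The key byte packs the field tag and the wire type together.
    return varint((tag << 3) | wire_type)

def length_delimited(tag, data):
    # Wire type 2: key, then a varint length prefix, then the raw bytes.
    return key(tag, 2) + varint(len(data)) + data

record = (
    length_delimited(1, b"Martin")         # user_name, tag 1
    + key(2, 0) + varint(1337)             # favourite_number, tag 2, varint
    + length_delimited(3, b"daydreaming")  # interests, tag 3; a repeated field
    + length_delimited(3, b"hacking")      #   is just one key/value per element
)
assert len(record) == 33                   # matches the figure in the text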

Notice that the encoding records nothing about whether a field is optional, required or repeated.

During decoding, required fields are checked by generated validation code; an optional or repeated field that is absent simply leaves no trace in the encoded data.
Therefore optional and repeated fields can simply be deleted from the schema, for example on the client side. Note, however, that the tag of a deleted field must never be used again.
Changes to required fields are risky: for example, if you delete a required field on the client, the server-side validation check will fail.

Adding a new field causes no problems, provided it is given a fresh tag; a hypothetical second version of the schema is sketched below.
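For illustration only: this second version retires the optional favourite_number and adds a new field under a fresh tag. (The twitter_id field is invented for this example, and the reserved statement, which makes protoc reject any reuse of a retired tag, was added in later protobuf releases.)

message Person {
    reserved 2;                        // favourite_number removed; tag 2 must never be reused
    required string user_name  = 1;
    repeated string interests  = 3;
    optional string twitter_id = 4;    // new field under a fresh tag
}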

 

Thrift

Thrift is a much bigger project than Avro or Protocol Buffers, as it's not just a data serialization library, but an entire RPC framework.

It also has a somewhat different culture: whereas Avro and protobuf standardize a single binary encoding, Thrift embraces a whole variety of different serialization formats (which it calls "protocols").

Thrift is thus the more powerful system: it is not only a data serialization library but a complete RPC framework, supporting a full RPC protocol stack.

Its protocol abstraction covers not just binary encodings but other encodings as well.

Thrift's IDL is actually very similar to protobuf's. The differences are that the field tag is written as a prefix (1: rather than = 1), and that instead of repeated fields Thrift has explicit container types such as list<string>.

All the encodings share the same schema definition, in Thrift IDL:

struct Person {
  1: string       userName,
  2: optional i64 favouriteNumber,
  3: list<string> interests
}

The BinaryProtocol encoding is very straightforward, but also fairly wasteful: it takes 59 bytes to encode our example record.

The CompactProtocol encoding is semantically equivalent, but uses variable-length integers and bit packing to reduce the size to 34 bytes.

As mentioned above, Thrift hides the encoding behind its protocol abstraction, and for binary encoding there are two options.

The first, BinaryProtocol, is a simple binary encoding with no space optimization; a lot of space is wasted, and the record takes 59 bytes.

The second, CompactProtocol, is similar to protobuf's encoding. One difference is that Thrift is more flexible here than protobuf: it defines an explicit list type, whereas protobuf can only model simple collections through repeated fields. A sketch of choosing between the two protocols in code follows.
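As a rough sketch of how the protocol choice looks in application code, assuming the IDL above was saved as person.thrift and compiled with thrift --gen py (so the generated module name below follows the IDL file name):

from thrift.protocol import TBinaryProtocol, TCompactProtocol
from thrift.transport import TTransport
from person.ttypes import Person  # generated code; module name is an assumption

person = Person(userName="Martin", favouriteNumber=1337,
                interests=["daydreaming", "hacking"])

def serialize(obj, protocol_factory):
    # The same generated write() code runs against whichever protocol you pick.
    buf = TTransport.TMemoryBuffer()
    obj.write(protocol_factory.getProtocol(buf))
    return buf.getvalue()

binary_bytes  = serialize(person, TBinaryProtocol.TBinaryProtocolFactory())    # 59 bytes
compact_bytes = serialize(person, TCompactProtocol.TCompactProtocolFactory())  # 34 bytes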

Avro

Avro schemas can be written in two ways, either in a JSON format:

{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",        "type": "string"},
        {"name": "favouriteNumber", "type": ["null", "long"]},
        {"name": "interests",       "type": {"type": "array", "items": "string"}}
    ]
}

... Or in an IDL:

record Person {
    string               userName;
    union { null, long } favouriteNumber;
    array<string>        interests;
}

Notice that there are no tag numbers in the schema! So how does it work?

The same example data encodes to just 32 bytes.
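To see why it is so small, here is a hand-rolled Python sketch (again, not from the original post) following the Avro binary encoding: longs are zigzag-mapped varints, strings carry a length prefix, and arrays are written as blocks terminated by a zero count. No field names, tags or types appear in the output.

def zigzag_varint(n):
    # Avro longs: zigzag-map (0, -1, 1, -2, ... -> 0, 1, 2, 3, ...), then varint.
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        n, low = n >> 7, n & 0x7F
        out.append(low | (0x80 if n else 0))
        if not n:
            return bytes(out)

def string(s):
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data  # length prefix, then UTF-8 bytes

record = (
    string("Martin")        # userName: just the value, no tag or type
    + zigzag_varint(1)      # favouriteNumber: union branch 1, i.e. "long"
    + zigzag_varint(1337)   #   ...followed by the long itself
    + zigzag_varint(2)      # interests: an array block holding 2 items
    + string("daydreaming")
    + string("hacking")
    + zigzag_varint(0)      # a zero count terminates the array
)
assert len(record) == 32    # matches the figure in the text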

Avro is the newest of the three solutions and is used by comparatively few people so far, mainly in the Hadoop world. Its design is also distinctive compared with Thrift and protobuf.

The schema can be defined in either an IDL or JSON. Note that the binary encoding stores neither field tags nor field types.

This means:

1. The reader must have the exact schema that was used to write the data in order to parse it.

2. With no field tags, the field name is the only identifier. Avro does support renaming a field, but all readers must be updated first, as explained below:

Because fields are matched by name, changing the name of a field is tricky. You need to first update all readers of the data to use the new field name, while keeping the old name as an alias (since the name matching uses aliases from the reader's schema). Then you can update the writer's schema to use the new field name.
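Hypothetically, if userName were being renamed to fullName, the reader's schema would carry the old name as an alias while the rename rolls out:

{"name": "fullName", "aliases": ["userName"], "type": "string"}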

3. Data is read in the order the fields are defined in the schema, so optional fields need special treatment, for example union { null, long }.

If you want to be able to leave out a value, you can use a union type, like union { null, long } above. This is encoded as a byte to tell the parser which of the possible union types to use, followed by the value itself. By making a union with the null type (which is simply encoded as zero bytes) you can make a field optional.
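In terms of the hand-rolled encoder sketched above, a null favouriteNumber would contribute exactly one byte:

zigzag_varint(0)  # union branch 0 = "null"; the null value itself adds no bytes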

4. The schema can be expressed in JSON, whereas with Thrift or protobuf the schema only becomes usable once the IDL is compiled into code. Avro therefore makes it possible to build generic clients and servers: when the schema changes, you only need to swap in the new JSON, with no recompilation.

When the schema changes, Avro handles it more easily: you only need to distribute the new schema to all readers.

With Thrift or protobuf, client and server code must be regenerated and recompiled when the schema changes, although both do tolerate a version mismatch between the two sides.

5. The writer's schema does not have to match the reader's schema: the Avro parser can use resolution rules to translate the data between them.

So how does Avro support schema evolution?

Well, although you need to know the exact schema with which the data was written (the writer's schema), that doesn't have to be the same as the schema the consumer is expecting (the reader's schema). You can actually give two different schemas to the Avro parser, and it uses resolution rules to translate data from the writer schema into the reader schema.
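A minimal sketch of this using the third-party fastavro library (its schemaless_writer/schemaless_reader functions take explicit schemas; here a hypothetical reader has dropped favouriteNumber, and the resolution rules skip it on read):

import io
import fastavro

writer_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName",        "type": "string"},
        {"name": "favouriteNumber", "type": ["null", "long"]},
        {"name": "interests",       "type": {"type": "array", "items": "string"}},
    ],
}
reader_schema = {  # this reader no longer cares about favouriteNumber
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName",  "type": "string"},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema,
    {"userName": "Martin", "favouriteNumber": 1337,
     "interests": ["daydreaming", "hacking"]})
buf.seek(0)
print(fastavro.schemaless_reader(buf, writer_schema, reader_schema))
# {'userName': 'Martin', 'interests': ['daydreaming', 'hacking']}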

6. Fields can be added and removed in a straightforward way.

You can add a field to a record, provided that you also give it a default value (e.g. null if the field's type is a union with null). The default is necessary so that when a reader using the new schema parses a record written with the old schema (and hence lacking the field), it can fill in the default instead.

Conversely, you can remove a field from a record, provided that it previously had a default value. (This is a good reason to give all your fields default values if possible.) This is so that when a reader using the old schema parses a record written with the new schema, it can fall back to the default.
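Continuing the fastavro sketch above (reusing buf and writer_schema from it), a hypothetical reader schema that adds a twitterId field with a default shows the fill-in behaviour:

reader_v2 = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName",  "type": "string"},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
        # New field the writer never knew about; null is a valid default
        # because it is the first branch of the union.
        {"name": "twitterId", "type": ["null", "string"], "default": None},
    ],
}
buf.seek(0)
print(fastavro.schemaless_reader(buf, writer_schema, reader_v2))
# {'userName': 'Martin', 'interests': [...], 'twitterId': None}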

One important question has not yet been discussed: since Avro depends on having the writer's schema, when and how is the schema transferred between client and server?

The answer is that different scenarios use different methods: in a file header, during a connection handshake, or via a registry, as described below.

This leaves us with the problem of knowing the exact schema with which a given record was written.

The best solution depends on the context in which your data is being used:

  • In Hadoop you typically have large files containing millions of records, all encoded with the same schema. Object container files handle this case: they just include the schema once at the beginning of the file, and the rest of the file can be decoded with that schema.
  • In an RPC context, it's probably too much overhead to send the schema with every request and response. But if your RPC framework uses long-lived connections, it can negotiate the schema once at the start of the connection, and amortize that overhead over many requests.
  • If you're storing records in a database one-by-one, you may end up with different schema versions written at different times, and so you have to annotate each record with its schema version. If storing the schema itself is too much overhead, you can use a hash of the schema, or a sequential schema version number. You then need a schema registry where you can look up the exact schema definition for a given version number; a toy sketch of this follows.
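Here is that toy sketch, with an in-memory dict standing in for a real schema registry service and a 4-byte version number prefixed to every record (all names are invented for the example):

import io
import struct
import fastavro

# version number -> full schema definition (a stand-in for a registry service)
REGISTRY = {
    1: {"type": "record", "name": "Person",
        "fields": [{"name": "userName", "type": "string"}]},
}

def write_record(version, record):
    buf = io.BytesIO()
    buf.write(struct.pack(">I", version))           # 4-byte schema version prefix
    fastavro.schemaless_writer(buf, REGISTRY[version], record)
    return buf.getvalue()

def read_record(data):
    buf = io.BytesIO(data)
    version = struct.unpack(">I", buf.read(4))[0]   # recover the writer's schema
    return fastavro.schemaless_reader(buf, REGISTRY[version])

blob = write_record(1, {"userName": "Martin"})
assert read_record(blob) == {"userName": "Martin"}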

Compared with Thrift and protobuf, Avro may look more complex and harder to use; but consider the following.

At first glance it may seem that Avro's approach suffers from greater complexity, because you need to go to the additional effort of distributing schemas.

However, I am beginning to think that Avro's approach also has some distinct advantages:

  • Object container files are wonderfully self-describing: the writer's schema embedded in the file contains all the field names and types, and even documentation strings (if the author of the schema bothered to write some). This means you can load these files directly into interactive tools like Pig, and it just works without any configuration.
  • As Avro schemas are JSON, you can add your own metadata to them, e.g. describing application-level semantics for a field. And as you distribute schemas, that metadata automatically gets distributed too.
  • A schema registry is probably a good thing in any case, serving as documentation and helping you to find and reuse data. And because you simply can't parse Avro data without the schema, the schema registry is guaranteed to be up-to-date. Of course you can set up a protobuf schema registry too, but since it's not required for operation, it'll end up being on a best-effort basis.
