This article covers the fourth chapter of the book. Its main topics are encoding (that is, serialization) and schema-upgrade scenarios, and it walks through what happens in memory during encoding and decoding. The mainstream codecs today are Apache Avro, Facebook's Thrift, and Google's Protocol Buffers; this article also reviews the advantages and pain points of each.
1. Non-binary encoding format
A program typically processes data in at least two different representations:
1. In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are laid out so that the CPU can access and manipulate them efficiently (usually the language runtime takes care of this, and the programmer does not have to worry about it).
2. When you want to write data to a file or send it over the network, you have to encode it into some kind of self-contained byte sequence (for example, a JSON document).
Therefore, we need some way to translate between the two representations. Converting the in-memory representation into a byte sequence is called encoding (also known as serialization); the reverse is called decoding (deserialization).
Encodings usually fall into the following categories:
- Language-specific formats
Many programming languages have built-in support for encoding in-memory objects into byte sequences, for example Java's java.io.Serializable, Ruby's Marshal, and Python's pickle. But these built-in libraries have some deep-seated problems (a small sketch follows this list):
- The encoding is tied to one particular programming language, and reading the data from another language is very difficult.
- In order to restore data as objects of the same types, the decoder needs to be able to instantiate arbitrary classes. If an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which is a frequent source of security problems.
- Efficiency (CPU time spent on encoding and decoding, and the size of the encoded structure) is often an afterthought; Java's built-in serialization, for example, is notorious for its poor performance and bloated encoding.
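As an illustration of the language-specific approach, here is a minimal sketch using Python's built-in pickle (the record and field names are made up for the example):

```python
import pickle

record = {"userName": "Martin", "favoriteNumber": 1337}

data = pickle.dumps(record)      # encode the in-memory object to a byte sequence
restored = pickle.loads(data)    # decode it back into an object
print(restored == record)        # True

# Caveat: pickle.loads can instantiate arbitrary classes (and thereby run
# arbitrary code) described in the byte stream, so never unpickle data
# received from an untrusted source.
```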
- JSON, XML, and CSV
These textual formats are also commonly used for encoding:
- XML can describe data very precisely, but it is far too verbose.
- JSON's popularity is largely due to its built-in support in web browsers (it is a subset of JavaScript) and its simplicity relative to XML.
- CSV is another popular language-independent format, although it is much less expressive.
JSON, XML, and CSV are all textual formats, which gives them some human readability. But they also have some subtle problems:
- There is a lot of ambiguity around the encoding of numbers. In XML and CSV you cannot distinguish a number from a string that happens to consist of digits (except by referring to an external schema). JSON distinguishes strings from numbers, but it does not distinguish integers from floating-point numbers, and it says nothing about precision.
- JSON and XML support Unicode character strings well, but they do not support binary strings (sequences of bytes without a character encoding).
- Both XML and JSON have optional schema support. These schema languages are quite powerful, and therefore complicated to learn and implement. CSV has no schema at all, so the application has to define the meaning of every row and column, and whenever the application adds a new row or column you have to handle the change manually. CSV is also a rather vague format (think of values that contain the delimiter character). A small sketch of the JSON limitations follows this list.
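A minimal Python sketch of these JSON limitations (the values are arbitrary examples):

```python
import json

# JSON has a single "number" type: 1 and 1.0 are not distinguished by the format
# itself, and consumers that parse numbers as IEEE-754 doubles (for example,
# JavaScript engines) silently lose precision above 2**53.
print(json.loads("1"), json.loads("1.0"))   # 1 1.0 -- the distinction depends on the parser
print(2 ** 53 + 1)                          # 9007199254740993
print(json.loads(json.dumps(2 ** 53 + 1)))  # Python preserves it; a double-based parser would not

# Byte sequences have no representation in JSON at all:
try:
    json.dumps({"avatar": b"\x89PNG"})
except TypeError as err:
    print(err)                              # Object of type bytes is not JSON serializable
```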
2. Binary encoding format
Binary formats are usually the most compact. For small datasets the saving is negligible, but once you reach terabyte-scale datasets the choice of data format has a big impact. In what follows we use the book's example record, a user record with a name, a favorite number, and a list of interests, first written out as JSON.
- MessagePack
Let's look at the record after binary encoding with MessagePack (see the sketch below):
The binary encoding is 66 bytes long, only a little smaller than the 81-byte textual JSON encoding. That small space saving comes at the cost of human readability, so let's see whether the other formats can do better.
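A minimal sketch of the comparison, using the book's example record (assumes the third-party msgpack package is installed):

```python
import json
import msgpack  # pip install msgpack

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)

print(len(as_json))     # 81 bytes of textual JSON
print(len(as_msgpack))  # 66 bytes of MessagePack -- smaller, but no longer human-readable
```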
- Thrift
Encoding data with Thrift requires a schema, declared up front in the Thrift Interface Definition Language (IDL).
Thrift has two different binary encoding formats: BinaryProtocol and the more compressed CompactProtocol. Let's look at the difference between the two.
BinaryProtocol encodes the record in 59 bytes. Each field has a type annotation indicating whether it is a string, integer, list, and so on, plus a length indication where required (the length of a string, the number of items in a list). Compared to MessagePack, field names are omitted; instead each field carries a field tag (1, 2, 3, ...), the numbers that appear in the schema definition. Field tags act like compact aliases for fields, a terse way to say which field we are talking about without spelling out its name, and this is what shrinks the encoding.
CompactProtocol packs the same information into only 34 bytes. It does this by folding the field type and tag number into a single byte and by using variable-length integers: instead of spending a full eight bytes on the number 1337, it is encoded in two bytes, with the top bit of each byte indicating whether more bytes follow. As a result, numbers between -64 and 63 are encoded in one byte, numbers between -8192 and 8191 in two bytes, and larger numbers need more bytes (see the sketch below).
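A minimal sketch of such a variable-length integer scheme. The zigzag step maps signed numbers onto unsigned ones so that small negative numbers also stay small; this mirrors what Thrift's CompactProtocol and protobuf do for signed integers, but it is not the full wire format:

```python
def zigzag(n: int) -> int:
    # 0, -1, 1, -2, 2, ...  ->  0, 1, 2, 3, 4, ...
    return (n << 1) ^ (n >> 63)

def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # top bit set: more bytes follow
        else:
            out.append(byte)          # top bit clear: last byte
            return bytes(out)

print(encode_varint(zigzag(1337)).hex())                                    # 'f214' -- two bytes, not eight
print(len(encode_varint(zigzag(63))), len(encode_varint(zigzag(-64))))      # 1 1
print(len(encode_varint(zigzag(8191))), len(encode_varint(zigzag(-8192))))  # 2 2
```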
- Protocol Buffers
Protocol Buffers (which has only a single binary encoding format) encodes the same data in a very similar way. Its bit packing is slightly different, but otherwise it closely resembles Thrift's CompactProtocol. Protobuf fits the same record into 33 bytes.
- Avro
Avro is a binary encoding format that originated in the Hadoop open source project as an alternative to Thrift, so let's take a look at the record after Avro encoding.
There are no tag numbers in the Avro schema. Encoding the same data, Avro produces a binary encoding that is 32 bytes long, the most compact of all the encodings above. If you inspect the byte sequence, there are no field tags or data types at all: the encoding simply consists of the field values concatenated together. When parsing the binary data, the schema is used to determine the data type of each field. This means the binary data can only be decoded correctly if the code reading it uses the exact same schema as the code that wrote it (a sketch follows below).
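A minimal sketch using the third-party fastavro package (an assumption; any Avro implementation would do), with an Avro schema matching the example record:

```python
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader  # pip install fastavro

schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
})

record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

buf = io.BytesIO()
schemaless_writer(buf, schema, record)   # just the values concatenated, no tags or type bytes
print(len(buf.getvalue()))               # 32 bytes

buf.seek(0)
# Decoding requires the writer's schema; a reader's schema may differ, as long as it is compatible.
print(schemaless_reader(buf, schema))
```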
3. Schema Upgrade and Evolution
As the application evolves, the schema inevitably has to change over time. How do these binary encodings stay backward and forward compatible through such changes?
- Field tags
- As you can see from the examples, an encoded record is just the concatenation of its encoded fields. Each field is identified by its tag number and annotated with a data type (such as string or integer). If a field value is not set, it is simply omitted from the encoded record. Field tags are therefore critical to the meaning of the encoded data. You can change a field's name in the schema, because the encoded data never refers to field names, but you cannot change a field's tag, because that would invalidate all existing encoded data.
- You can add new fields to the schema, provided each gets a fresh tag number. If old code (which knows nothing about the new tag numbers) reads data written by new code, including a new field whose tag number it does not recognize, it can simply ignore that field; the data type annotation tells the parser how many bytes to skip. Because every field keeps a unique tag number, new code can also read old data seamlessly, since the tag numbers still carry the same meaning. However, a newly added field cannot be made required: if it were, the check would fail whenever new code reads data written by old code, because the old code never wrote the new field. So, to maintain backward compatibility, every field added after the initial deployment of the schema must be optional or have a default value.
- Removing a field is just like adding one, with the compatibility concerns reversed: you can only remove an optional field (a required field can never be removed), and you can never reuse the same tag number (data may still exist that contains the old tag number, and new code must keep ignoring it). The sketch below shows how an old reader skips fields with unknown tags.
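A sketch of why this works, using a hypothetical, much-simplified tag/type wire format (not the real Thrift or protobuf encoding): the type annotation tells an old reader how many bytes to skip for a tag it does not know.

```python
import struct

# Hypothetical toy format: each field is [tag byte][type byte][payload].
# type 0 = 8-byte big-endian integer, type 1 = 1-byte length + UTF-8 string.
def encode_field(tag: int, value) -> bytes:
    if isinstance(value, int):
        return bytes([tag, 0]) + struct.pack(">q", value)
    data = value.encode("utf-8")
    return bytes([tag, 1, len(data)]) + data

OLD_READERS_TAGS = {1: "userName", 2: "favoriteNumber"}  # old code's view of the schema

def decode(buf: bytes) -> dict:
    record, i = {}, 0
    while i < len(buf):
        tag, ftype = buf[i], buf[i + 1]
        i += 2
        if ftype == 0:
            value, size = struct.unpack_from(">q", buf, i)[0], 8
        else:
            length = buf[i]
            value, size = buf[i + 1:i + 1 + length].decode("utf-8"), 1 + length
        i += size
        if tag in OLD_READERS_TAGS:          # unknown tag: the field is skipped, not an error
            record[OLD_READERS_TAGS[tag]] = value
    return record

# New code writes a record that also contains field tag 3, which old code has never seen.
new_data = encode_field(1, "Martin") + encode_field(2, 1337) + encode_field(3, "added later")
print(decode(new_data))   # {'userName': 'Martin', 'favoriteNumber': 1337}
```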
- Data types
What about changing the data type of a field, for example from a 32-bit integer to a 64-bit integer? New code can easily read data written by old code, because the parser can fill the missing bits with zeros. But if old code reads data written by new code, it still uses a 32-bit variable to hold the value, and if the decoded 64-bit value does not fit in 32 bits it gets truncated (see the sketch below).
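A tiny sketch of the truncation problem (the value is an arbitrary example):

```python
value = 5_000_000_000           # needs more than 32 bits
truncated = value & 0xFFFFFFFF  # what a reader with a 32-bit variable ends up keeping
print(value, truncated)         # 5000000000 705032704 -- silently wrong
```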
Protobuf does not have a dedicated list or array data type; instead it has a repeated marker for fields. This makes it possible to change an optional (single-valued) field into a repeated (multi-valued) field: new code reading old data sees a list with zero or one element (depending on whether the field was present), while old code reading new data sees only the last element of the list. Thrift, by contrast, has a dedicated list data type, parameterized with the data type of the list elements. This does not allow the same single-valued-to-multi-valued evolution as protobuf, but it has the advantage of supporting nested lists.
- Dynamically generated schemas
Avro's biggest strength is its support for dynamically generated schemas. Its core idea is that the writer's schema and the reader's schema do not have to be identical; they only have to be compatible. Unlike protobuf and Thrift, Avro schemas contain no tag numbers. With Thrift or protobuf, every time the database schema changed, an administrator would have to manually update the mapping from database column names to field tags; with Avro, a new schema can simply be generated from the database schema each time, and schema resolution happens at runtime. Any program that reads a newer data file will then see that the record's fields have changed and resolve them against its own schema (a sketch follows below).
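A sketch of what such dynamic generation might look like (the column metadata and the type mapping here are hypothetical, not a real database API):

```python
SQL_TO_AVRO = {"varchar": "string", "bigint": "long", "double": "double", "boolean": "boolean"}

def avro_schema_for(table_name, columns):
    """Build an Avro record schema from a table's (column_name, sql_type) pairs."""
    return {
        "type": "record",
        "name": table_name,
        "fields": [
            # Every field is nullable with a default, so readers holding an older
            # or newer generated schema can still resolve the record.
            {"name": name, "type": ["null", SQL_TO_AVRO[sql_type]], "default": None}
            for name, sql_type in columns
        ],
    }

# Regenerate the schema whenever the database schema changes -- no hand-maintained tag numbers.
print(avro_schema_for("users", [("user_name", "varchar"), ("favorite_number", "bigint")]))
```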
4. Summary
Encoding details affect not only efficiency but, more importantly, the architecture of applications and the systems around them. Protocol Buffers, Thrift, and Avro all use a schema to describe a binary encoding format. Their schema languages are much simpler than XML Schema or JSON Schema, which support far more detailed validation rules, yet they allow the schema to evolve more gracefully and deliver better performance.
Encoding and schemas: "Designing Data-Intensive Applications" reading notes, part 5