Protocol buffer (data encoding)

Source: Internet
Author: User

I have already published three technical posts about Protocol Buffers. The first introduces the Protocol Buffers language specification; the next two give some relatively practical, simple examples in C++ and Java respectively. Work has been very busy recently, and for a few days I struggled to keep writing this post. Still, something that started well deserves to be finished well, so I decided to do my best and complete the series rather than leave any regrets.
The content of this post is taken entirely from the official Google documentation; I have only added annotations for some technical points that are relatively hard to understand. My own technical ability is limited, so if any explanation is wrong, please correct me.
This document takes you deeper into Protocol Buffers. Even without knowing these technical details and processing mechanisms, you can still use Protocol Buffers normally in your applications. However, I believe that understanding them in depth not only lets you use and control Protocol Buffers better, but also lets you appreciate the ingenuity and programming skill of Google's engineers. In my opinion, studying these details is of great benefit to improving our programming ability and broadening our thinking. After all, a journey of a thousand miles is built from small steps.

I. Simple message encoding layout:
Let's start with the following simple message definition:
message test1 {
  required int32 a = 1;
}
Suppose the application sets field a to 150 (decimal) and then serializes the object to a binary file. The data in the file is:
08 96 01
What do these three bytes mean, and what encoding rules do they follow? Let's find out.

II. Base 128 varints:
Before looking at the Protocol Buffers encoding rules, you first need to understand varints. A varint is a way of representing an integer using one or more bytes: the smaller the value, the fewer bytes it occupies.
In a varint, every byte except the last has its MSB (most significant bit, the highest bit) set, which indicates that the following byte is part of the same integer. The other seven bits of each byte store the data. This also explains the name base 128: normally an integer is stored in whole bytes of 8 bits each, that is, base 256, but in a varint the highest bit of each byte is a continuation flag and only the lower seven bits hold actual data, so each byte is one base 128 digit (128 is the 7th power of 2).
For example, the number 1 occupies only one byte, so its MSB is not set:
0000 0001
For example, the decimal number 300 is encoded in the following format:
1010 1100 0000 0010
How does Protocol Buffers turn these bytes back into 300? The first step is to drop the MSB of each byte. In the example above, the MSB of the first byte (1010 1100) is 1, which means the next byte belongs to the same value. The MSB of the second byte (0000 0010) is 0, so it is the last byte of this value; any bytes after it belong to other data.
1010 1100 0000 0010
-> 010 1100 000 0010
The second row above is the first row with the MSB of each byte removed. Because a varint stores its least significant 7-bit group first, we need to swap the two groups here.
 010 1100 000 0010
-> 000 0010 010 1100 // Swap the two 7-bit groups of the first line
-> 100101100 // Concatenate the two groups and drop the leading zeros
-> 256 + 32 + 8 + 4 = 300 // Convert the binary value of the previous line to decimal: 300
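
The steps above can be sketched in a few lines of Python. This is only an illustration of the varint rules described in this section, not the code used by the Protocol Buffers library; the function names encode_varint and decode_varint are hypothetical helpers, and the sketch handles non-negative integers only.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base 128 varint (illustrative helper)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F              # the lowest 7 bits are written first
        value >>= 7
        if value:
            out.append(low7 | 0x80)      # more bytes follow: set the MSB
        else:
            out.append(low7)             # last byte: the MSB stays clear
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # drop the MSB, place the 7-bit group
        if not byte & 0x80:               # MSB clear: this was the last byte
            return result, pos
        shift += 7


print(encode_varint(1).hex(" "))             # 01
print(encode_varint(300).hex(" "))           # ac 02
print(decode_varint(bytes([0xAC, 0x02])))    # (300, 2)
```

Note that the decoder never physically swaps bytes: shifting each 7-bit group by 0, 7, 14, ... bits has the same effect as the swap shown above.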

III. Message structure:
A Protocol Buffers message is a series of key-value pairs. The binary version of a message uses only the field's tag number as the key; the name and declared type of each field can only be determined at decoding time from the target type (the object type being deserialized into). When a message is encoded, the keys and values are concatenated into a byte stream. When decoding, the parser can simply skip fields it does not recognize. This keeps new and old versions of a message definition, and new and old programs, compatible: an older program that uses its older message definition to parse a newer message sent by a newer program avoids parsing and object initialization errors. Next, we look at how the field number and field type are encoded. The following table lists the field types supported by Protocol Buffers.

Type  Meaning           Used for
0     Varint            int32, int64, uint32, uint64, sint32, sint64, bool, enum
1     64-bit            fixed64, sfixed64, double
2     Length-delimited  string, bytes, embedded messages, packed repeated fields
3     Start group       groups (deprecated)
4     End group         groups (deprecated)
5     32-bit            fixed32, sfixed32, float

The key of every encoded field is itself a varint, whose value combines the field number and the field type as follows:
(field_number << 3) | field_type
The lowest three bits of the key store the field's type, so this encoding can distinguish at most 8 field types. We can also work out from this that the largest field number Protocol Buffers supports in a message is 2^29 - 1. Now let's go back to the first byte, 08, of the serialized test1 message.
0000 1000
-> 000 1000 // Drop MSB (highest bit)
The lowest three bits give the field type: 0, that is, varint. Shifting the result right by three bits (>> 3) gives 1, the tag number of field a in message test1. From this the Protocol Buffers decoder knows that the current field's tag number is 1 and that the data following it is a varint. With that knowledge we can go on to analyze the remaining two bytes (96 01).
96 01 = 1001 0110 0000 0001
-> 001 0110 000 0001 // Drop the MSB of each byte
-> 000 0001 001 0110 // Swap the two 7-bit groups
-> 10010110 // Concatenate and drop the leading zeros
-> 128 + 16 + 4 + 2 = 150
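
As a rough sketch, the whole walk-through for the bytes 08 96 01 can be reproduced in Python. The helper names (decode_varint, make_key, split_key) are hypothetical and exist only for this illustration; a real parser would also handle the other field types from the table above.

```python
def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def make_key(field_number, field_type):
    """Combine a field number and a field type into a key value."""
    return (field_number << 3) | field_type


def split_key(key):
    """Split a key value into (field number, field type)."""
    return key >> 3, key & 0x07


data = bytes.fromhex("08 96 01")           # the serialized test1 message
key, pos = decode_varint(data)             # the key is itself a varint
field_number, field_type = split_key(key)
value, pos = decode_varint(data, pos)      # type 0: the value is a varint

print(field_number, field_type, value)     # 1 0 150
print(hex(make_key(1, 0)))                 # 0x8
```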

IV. More value types:
1. Signed integers
As mentioned above, type 0 is varint, which covers int32, int64, uint32, uint64, sint32, sint64, bool and enum. In practice, if a field may hold negative values, int32/int64 and sint32/sint64 encode it very differently. With int32/int64, a negative number, whether -1 or -2147483648, always takes 10 bytes, because it is treated like a very large unsigned integer. With sint32/sint64, Protocol Buffers uses ZigZag encoding, which is much more efficient.
Briefly, ZigZag encoding maps signed integers to unsigned integers so that a number with a small absolute value, such as -1, still gets a small varint encoding. The following table shows the mapping:

Signed original  Encoded as
0                0
-1               1
1                2
-2               3
2147483647       4294967294
-2147483648      4294967295

The formulas are as follows:
(n << 1) ^ (n >> 31) // sint32
(n << 1) ^ (n >> 63) // sint64
Note that Protocol Buffers uses an arithmetic shift for the right shifts above. Therefore (n >> 31) and (n >> 63) evaluate to -1 (all bits set) if n is negative, and to 0 otherwise.
Note: a brief explanation of arithmetic and logical shifts in C. Their left shifts are the same: the low-order bits are filled with 0 and the high-order bits are discarded. The difference lies in the right shift. A logical right shift simply fills the high-order bits with 0. An arithmetic right shift fills them with copies of the sign bit, so a positive number is padded with 0s and a negative number with 1s; in other words, the sign is preserved. In C, right-shifting an unsigned int is a logical shift; right-shifting a negative signed int is implementation-defined, but common compilers perform an arithmetic shift.
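
The mapping in the table, and the size difference between the two encodings of -1, can be checked with a short Python sketch. The names zigzag32, unzigzag and encode_varint are hypothetical helpers for this illustration. Python integers are unbounded and its >> is an arithmetic shift, so the sketch masks the result to 32 bits; sign-extending -1 to 64 bits before varint encoding is how the 10-byte int32 case is modelled here.

```python
def zigzag32(n):
    """Map a signed 32-bit integer to an unsigned one: (n << 1) ^ (n >> 31)."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF   # mask to 32 bits


def unzigzag(z):
    """Inverse mapping: recover the signed value from its ZigZag form."""
    return (z >> 1) ^ -(z & 1)


def encode_varint(value):
    """Encode a non-negative integer as a base 128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


for n in (0, -1, 1, -2, 2147483647, -2147483648):
    print(n, zigzag32(n))                  # reproduces the table above

# int32 field: -1 is encoded as a 64-bit two's complement value
as_int32 = encode_varint(-1 & 0xFFFFFFFFFFFFFFFF)
# sint32 field: -1 is ZigZag-mapped to 1 first
as_sint32 = encode_varint(zigzag32(-1))
print(len(as_int32), len(as_sint32))       # 10 1
```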
2. Non-varint numeric types
double, fixed64 and sfixed64 always occupy 8 bytes; float, fixed32 and sfixed32 always occupy 4 bytes. In both cases the value is stored in little-endian byte order.
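
As a small illustration, Python's standard struct module can produce the same fixed-width, little-endian layouts. The sample values below are arbitrary and chosen only for this sketch.

```python
import struct

# "<" selects little-endian; I = unsigned 32-bit, Q = unsigned 64-bit,
# f = 32-bit float, d = 64-bit double
fixed32_bytes = struct.pack("<I", 1)
fixed64_bytes = struct.pack("<Q", 1)
float_bytes = struct.pack("<f", 1.0)
double_bytes = struct.pack("<d", 1.0)

print(fixed32_bytes.hex(" "))   # 01 00 00 00
print(fixed64_bytes.hex(" "))   # 01 00 00 00 00 00 00 00
print(float_bytes.hex(" "))     # 00 00 80 3f
print(double_bytes.hex(" "))    # 00 00 00 00 00 00 f0 3f
```

Unlike a varint, the size does not depend on the value: 1 takes as many bytes as the largest representable number.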
3. Strings
Strings use type 2 (length-delimited): the key is followed by a varint giving the length of the byte array, which is followed by exactly that many bytes of data. For example:
message Test2 {
  required string b = 2;
}
Now we set the value of b to "testing". The encoded data is as follows:
12 07 74 65 73 74 69 6e 67
The first byte, 0x12, is the key; decoding it gives field type 2 and field number 2. The second byte, 07, is the length of "testing" in bytes. The remaining seven bytes (74 65 73 74 69 6e 67) are the UTF-8 encoding of "testing".
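
The layout of key, length and data can be sketched as follows. encode_varint and encode_string_field are hypothetical helper names used only for this illustration; they are not part of the Protocol Buffers API.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base 128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_string_field(field_number, text):
    """Encode a string field as key + length + UTF-8 bytes."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | 2          # type 2: length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


encoded = encode_string_field(2, "testing")
print(encoded.hex(" "))                    # 12 07 74 65 73 74 69 6e 67
```

The length counts bytes, not characters, so a string containing non-ASCII characters has a length prefix larger than its character count.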

V. Embedded messages:
Here is a message definition that contains embedded messages.
message test3 {
  required test1 c = 3;
}
Here we again set field a of the embedded test1 message to 150. The encoded result is as follows:
1a 03 08 96 01
As you can see, the last three bytes, 08 96 01, are exactly the encoding of test1 from the first example; they are simply preceded by the key (field type and tag number) and the length. The new bytes are decoded and interpreted exactly as for the string above.
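
A minimal sketch of this nesting, assuming the inner message has already been serialized to the bytes 08 96 01. encode_varint and encode_length_delimited are hypothetical helper names for this illustration.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base 128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_length_delimited(field_number, payload):
    """Wrap already-encoded bytes as a type 2 (length-delimited) field."""
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


inner = bytes.fromhex("08 96 01")            # test1 with a = 150
outer = encode_length_delimited(3, inner)    # field c = 3 of test3
print(outer.hex(" "))                        # 1a 03 08 96 01
```

On the wire an embedded message and a string look identical; only the message definition tells the decoder whether to interpret the payload as text or to parse it as a nested message.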

VI. Packed repeated fields:
Starting with version 2.1.0, Protocol Buffers provides the field-level option [packed=true]. When this option is set, a repeated field with zero elements is not encoded at all; otherwise, all elements of the array are encoded into a single key/value pair of type 2 (length-delimited), since every element has the same field type and tag number anyway. This saves space for repeated integer fields, particularly when the values are small. For example:
message test4 {
  repeated int32 d = 4 [packed=true];
}
Suppose field d contains three elements with the values 3, 270 and 86942. The encoded result is as follows:
22        // Key (field number 4, type 2)
06        // Total number of bytes occupied by all the elements
03        // The first element (varint 3)
8e 02     // The second element (varint 270)
9e a7 05  // The third element (varint 86942)
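
The packed layout above can be reproduced, and read back, with the following sketch. The helper names are hypothetical, and the decoder assumes the key and the length each fit in a single varint at the start of the buffer, as in this example.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base 128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_packed_varints(field_number, values):
    """Encode all elements after one key and one total length."""
    if not values:
        return b""                         # zero elements: nothing is written
    payload = b"".join(encode_varint(v) for v in values)
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


def decode_packed_varints(data):
    """Read key and length, then varints until the payload is consumed."""
    key, pos = decode_varint(data)
    length, pos = decode_varint(data, pos)
    end = pos + length
    values = []
    while pos < end:
        value, pos = decode_varint(data, pos)
        values.append(value)
    return key >> 3, values


encoded = encode_packed_varints(4, [3, 270, 86942])
print(encoded.hex(" "))                    # 22 06 03 8e 02 9e a7 05
print(decode_packed_varints(encoded))      # (4, [3, 270, 86942])
```

Without packing, each of the three elements would carry its own key byte, so the saving grows with the number of elements.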

VII. Field order:
Field numbers can be used in any order in a .proto file, and they need not be contiguous. When a message is serialized, however, its known fields are written sequentially by field number, which allows parsing code to use optimizations that rely on field numbers being in sequence.

Conclusion:
This is the last post in the Protocol Buffers technical details series. The series is also the first in my open-source learning journey; I hope that in the future we can use this platform for more technical exchange and improve together. If you have any comments or questions, please leave a message.
