The principles of Protobuf serialization

Last Update:2018-07-26 Source: Internet

Author: User


Varints

Before looking at how Protobuf encodes messages, you first need to understand varints.
A varint is a way of serializing an integer using one or more bytes. The smaller the integer, the fewer bytes it occupies.

The most significant bit (MSB) of each byte in a varint indicates whether another byte follows. It can be viewed as the next pointer of a singly linked list: 1 points to the next node, and 0 means NULL. The lower 7 bits of each byte store the two's-complement representation of the integer in groups of 7 bits, least significant group first (little-endian).
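The scheme described above can be sketched in a few lines of Python. This is only an illustration for non-negative integers; the function names encode_varint and decode_varint are chosen here for the example and are not part of any Protobuf library.

```python
def encode_varint(value):
    """Encode a non-negative integer as a varint (7 bits per byte, low group first)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        group = value & 0x7F          # lowest 7 bits
        value >>= 7
        if value:
            out.append(group | 0x80)  # MSB = 1: another byte follows
        else:
            out.append(group)         # MSB = 0: this is the last byte
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # drop the MSB, place the 7-bit group
        if not byte & 0x80:
            return result, pos
        shift += 7


print(encode_varint(300).hex())            # ac02
print(decode_varint(bytes([0xAC, 0x02])))  # (300, 2)
```

Small values such as 1 take a single byte, while 300 needs two, which is the size behavior described above.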

For example:
Consider the integer represented by the following varint:
1010 1100 0000 0010

To decode the original value, first drop the MSB of each byte (the MSB only signals whether the integer continues in the next byte; it is not part of the value). This leaves:
010 1100 000 0010

Then, because varints are little-endian, reverse the order of the 7-bit groups and concatenate them:
000 0010 010 1100
→100101100
→256 + 32 + 8 + 4 = 300

Message

A Protobuf message is a series of key-value pairs. The binary form of a message uses the field's tag number as the key; the field's name and declared type can only be determined on the decoding end, by referring to the message's type definition.

When a message is encoded, the keys and values are concatenated into a byte stream. When a message is decoded, the parser must be able to skip fields that it does not recognize, so that new fields can be added to a message without breaking old programs that do not know about them.

To make this possible, each key in the byte stream is actually made up of two values: the tag number of the field, and the wire type of the field, which provides just enough information to determine the length of the value that follows.

The available wire types are:

Type	Meaning	Used for
0	Varint	int32, int64, uint32, uint64, sint32, sint64, bool, enum
1	64-bit	fixed64, sfixed64, double
2	Length-delimited	string, bytes, embedded messages, packed repeated fields
3	Start group	groups (deprecated)
4	End group	groups (deprecated)
5	32-bit	fixed32, sfixed32, float

Each key in the byte stream is a varint whose value is (field tag number << 3) | wire_type; that is, the lowest 3 bits store the field's wire type.
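As a rough illustration of this formula, the following Python sketch packs and unpacks a key. The helper names make_key and split_key and the WIRE_* constants are invented for this example. The sketch works on the integer value of the key; on the wire that value is itself written as a varint, which fits in one byte for tag numbers up to 15.

```python
# Wire type numbers, as listed in the table above.
WIRE_VARINT = 0
WIRE_64BIT = 1
WIRE_LENGTH_DELIMITED = 2
WIRE_32BIT = 5


def make_key(field_number, wire_type):
    """Combine a tag number and a wire type into a key value."""
    return (field_number << 3) | wire_type


def split_key(key):
    """Return (field_number, wire_type) for a key value."""
    return key >> 3, key & 0x07


print(hex(make_key(1, WIRE_VARINT)))  # 0x8
print(split_key(0x12))                # (2, 2)
```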

For example, take the following message definition:
message Test1 {
    required int32 a = 1;
}

In an application, you create a Test1 message and set a to 150. Serializing this message to binary yields 3 bytes:
08 96 01

The key is 08 in hexadecimal, i.e. in binary:
0000 1000

① The last 3 bits (000) give the field's wire type: 0 (varint).
② Shifting the key right by 3 bits gives 0000 0001, so the tag number of this field is 1.

The value stored in the following two bytes is parsed as a varint:

1001 0110 0000 0001
        →000 0001 ++ 001 0110 (drop the MSB of each byte and reverse the order of the 7-bit groups)
        →10010110
        →128 + 16 + 4 + 2 = 150
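The decoding steps above, together with the idea that the wire type tells a parser how far to skip, can be sketched as a small field walker in Python. This is a simplified illustration, not how a real Protobuf parser is structured: the names read_varint and iter_fields are made up for the example, and the deprecated group wire types (3 and 4) are not handled.

```python
def read_varint(data, pos):
    """Decode one varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def iter_fields(data):
    """Yield (field_number, wire_type, value) for each field in data.

    Varint fields yield an int; the other supported wire types yield raw bytes.
    """
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                      # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 1:                    # 64-bit: always 8 bytes
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:                    # length-delimited: varint n, then n bytes
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:                    # 32-bit: always 4 bytes
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


print(list(iter_fields(bytes.fromhex("089601"))))  # [(1, 0, 150)]
```

Because every branch advances pos by the right amount without knowing what the field means, a caller that only cares about tag number 1 can simply ignore every other tuple the walker yields.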

More data types

Signed integers

The signed integer types (sint32, sint64) and the "standard" integer types (int32, int64) are all encoded as varints, but they differ significantly when the value is negative.

When int32 or int64 is used to store a negative number, the resulting varint is always 10 bytes long, because the value is treated as a very large unsigned integer.

If one of the signed types (sint32, sint64) is used, the varint is produced with ZigZag encoding, which is much more efficient for negative numbers.

ZigZag encoding maps signed integers to unsigned integers by zig-zagging back and forth between positive and negative numbers, so that -1 is encoded as 1, 1 as 2, -2 as 3, and so on:

Signed original	Encoded as
0	0
-1	1
1	2
-2	3
2147483647	4294967294
-2147483648	4294967295

In other words, each value n is encoded according to the following formula:

For sint32:
(n << 1) ^ (n >> 31)

For sint64:
(n << 1) ^ (n >> 63)

Note that the second shift, n >> 31, is an arithmetic shift: if n is non-negative, every bit of the result is 0, and if n is negative, every bit is 1.

When a sint32 or sint64 is parsed, its value is decoded back into the original signed form.

As an example:

a = -1      →   0xFFFFFFFF (the computer represents negative numbers in two's complement)
a << 1      →   0xFFFFFFFE  →   -2
a >> 31     →   0xFFFFFFFF  →   -1

(a << 1) ^ (a >> 31)    →   0xFFFFFFFE ^ 0xFFFFFFFF   →   0x00000001   →   1
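The ZigZag formula and its effect on encoded size can be tried out with a short Python sketch. The names zigzag_encode32, zigzag_decode and varint_size are illustrative only. Python integers are unbounded and >> is an arithmetic shift on negative values, so the formula can be written as-is, assuming the input is within the 32-bit signed range.

```python
def zigzag_encode32(n):
    """Map a signed 32-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 31)


def zigzag_decode(z):
    """Inverse mapping: recover the signed value from its ZigZag form."""
    return (z >> 1) ^ -(z & 1)


def varint_size(value):
    """Number of bytes a non-negative integer occupies as a varint."""
    size = 1
    while value > 0x7F:
        value >>= 7
        size += 1
    return size


# Round-trip the values from the table above.
for n in (0, -1, 1, -2, 2147483647, -2147483648):
    assert zigzag_decode(zigzag_encode32(n)) == n

print(zigzag_encode32(-1))                   # 1
print(varint_size(-1 & 0xFFFFFFFFFFFFFFFF))  # 10 (int32/int64: -1 seen as a 64-bit unsigned value)
print(varint_size(zigzag_encode32(-1)))      # 1  (sint32: ZigZag first)
```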

Strings

Wire type 2 (length-delimited) means that the value consists of a varint n followed by n bytes of data.
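A minimal Python sketch of this layout is shown here, assuming a string field with tag number 2 whose value is "testing". The helper names encode_varint and encode_length_delimited are invented for the illustration.

```python
def encode_varint(value):
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_length_delimited(field_number, payload):
    """Key with wire type 2, then the payload length as a varint, then the payload."""
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


encoded = encode_length_delimited(2, "testing".encode("utf-8"))
print(encoded.hex())  # 120774657374696e67
```

The output reads as key 0x12, length 0x07, and then the seven UTF-8 bytes of the string.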

message Test2 {
    required string b = 2;
}

Setting b to "testing" and serializing gives:

12 07 74 65 73 74 69 6e 67

The last seven bytes (74 65 73 74 69 6e 67) are the UTF-8 encoding of "testing". The key is 0x12, which gives tag = 2 and type = 2. The length is a varint, here 0x07 = 7, so the 7 bytes that follow are the string "testing".

Embedded messages

When serialized, an embedded message is handled in exactly the same way as a string.
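To illustrate, the following Python sketch serializes an inner message first and then wraps the resulting bytes as a length-delimited field, just as it would wrap a string. It assumes the inner message is the earlier Test1 with a set to 150 and that the outer field has tag number 3; the helper names are hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_varint_field(field_number, value):
    """Key with wire type 0, followed by the value as a varint."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_length_delimited(field_number, payload):
    """Key with wire type 2, then the payload length, then the payload bytes."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


inner = encode_varint_field(1, 150)         # the inner message: 08 96 01
outer = encode_length_delimited(3, inner)   # the inner bytes become the payload
print(inner.hex())  # 089601
print(outer.hex())  # 1a03089601
```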

message Test3 {
    required Test1 c = 3;
}

Setting the a field of c (a Test1) to 150 and serializing gives:

1a 03 08 96 01

As you can see, the last 3 bytes are exactly the same as in the earlier example (08 96 01). They are preceded by the key 0x1a (tag number 3, wire type 2) and the length 03.

Optional and repeated elements

Normally, a serialized message does not contain the same non-repeated field more than once. However, parsers are expected to handle this situation.

For numeric and string types, if the same field appears more than once, the parser takes the last value it sees as the final result.

For embedded message fields, the parser merges the multiple occurrences of the same field, just as the Message::MergeFrom method would.

These rules mean that parsing the concatenation of two serialized messages gives exactly the same result as parsing the two messages separately and then merging them. That is, this:

MyMessage message;
message.ParseFromString(str1 + str2);

is the same as this:

MyMessage message, message2;
message.ParseFromString(str1);
message2.ParseFromString(str2);
message.MergeFrom(message2);
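The equivalence can be checked on a toy model in Python. This is only a sketch under strong assumptions: it handles varint fields only, represents a message as a plain dict, and the functions parse_scalars and merge_from are hypothetical stand-ins for ParseFromString and MergeFrom, not the real API. Merging of embedded messages is not modeled.

```python
def read_varint(data, pos):
    """Decode one varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_scalars(data):
    """Parse varint fields into {tag number: value}; a later value replaces an earlier one."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        if key & 0x07 != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = read_varint(data, pos)
        fields[key >> 3] = value          # last occurrence wins
    return fields


def merge_from(target, other):
    """For scalar fields, merging overwrites target with the fields set in other."""
    target.update(other)


str1 = bytes.fromhex("089601" "1005")  # field 1 = 150, field 2 = 5
str2 = bytes.fromhex("0803")           # field 1 = 3

combined = parse_scalars(str1 + str2)
merged = parse_scalars(str1)
merge_from(merged, parse_scalars(str2))

print(combined)            # {1: 3, 2: 5}
print(combined == merged)  # True
```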

Packed repeated fields

In a proto2 message, if a repeated field is not declared with [packed=true], its serialized form contains zero or more key-value pairs, each with its own key, just like the other field types described above. These key-value pairs do not have to be contiguous and may be interleaved with other fields, but the order in which the elements appear relative to each other is the order they have in the field.

Starting with version 2.1.0, Protobuf introduced packed repeated fields. They are declared in the same way as repeated fields, with the [packed=true] option added.

In proto3, repeated fields of scalar numeric types are packed by default.

A packed repeated field differs from an ordinary repeated field in that, no matter how many values it holds, serialization produces only one key-value pair. Each value is serialized according to its type, but without a key. All the values are then concatenated and wrapped into a single value of wire type 2 (length-delimited).

For example:

message Test4 {
    repeated int32 d = 4 [packed=true];
}

Create a Test4 with d set to {3, 270, 86942}. Serializing the message gives:

22        // tag number 4, wire type 2
06        // length: 6 bytes
03        // first element is varint 3
8E 02     // second element is varint 270
9E A7 05  // third element is varint 86942
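The difference between the two layouts can be reproduced with a short Python sketch for a repeated int32 field with tag number 4 holding 3, 270 and 86942. The helper names are made up for this example, and only non-negative varint elements are handled.

```python
def encode_varint(value):
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_packed_varints(field_number, values):
    """One key (wire type 2), one length, then all elements back to back without keys."""
    payload = b"".join(encode_varint(v) for v in values)
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


def encode_unpacked_varints(field_number, values):
    """One key-value pair per element, each with wire type 0."""
    key = encode_varint((field_number << 3) | 0)
    return b"".join(key + encode_varint(v) for v in values)


values = [3, 270, 86942]
print(encode_packed_varints(4, values).hex())    # 2206038e029ea705
print(encode_unpacked_varints(4, values).hex())  # 2003208e02209ea705
```

With three elements the packed form is 8 bytes and the unpacked form is 9; the saving grows with the number of elements, since the unpacked form repeats the key every time.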

Only repeated fields of primitive numeric types (those that use the varint, 32-bit, or 64-bit wire types) can be declared "packed".

