Protocol Buffers Encoding

Last Update:2018-12-03 Source: Internet

Author: User

This document describes the binary wire format of protocol buffer messages. You do not need to understand it to use protocol buffers in your application, but it is useful to know how different protocol buffer formats affect the size of your encoded messages.

A simple message

Suppose you have the following simple message definition:

message Test1 {
  required int32 a = 1;
}

In an application, you create a Test1 message and set a to 150. You then serialize the message to an output stream. If you were to examine the encoded message, you would see the following three bytes:

08 96 01

Base 128 varints

To understand protocol buffer encoding, you first need to understand varints. Varints are a method of serializing integers using one or more bytes. Smaller numbers take fewer bytes.

Each byte in a varint, except the last byte, has its most significant bit (MSB) set, which indicates that further bytes follow. The lower seven bits of each byte store the two's complement representation of the number in groups of seven bits, least significant group first.

For example, the number 1 fits in a single byte, so the MSB is not set:

0000 0001

The number 300 is a bit more complicated:

1010 1100 0000 0010

How do you work out that this is 300? First, drop the MSB from each byte, because it only tells us whether we have reached the end of the number:

1010 1100 0000 0010
→ 010 1100  000 0010

Then reverse the two groups of seven bits, because varints store numbers with the least significant group first. Finally, concatenate the groups to get the value:

000 0010  010 1100
→  000 0010 ++ 010 1100
→  100101100
→  256 + 32 + 8 + 4 = 300
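The varint rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used by any protocol buffer library; the function names encode_varint and decode_varint are made up for this example, and the sketch handles only non-negative integers.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        group = value & 0x7F          # take the lowest seven bits
        value >>= 7
        if value:
            out.append(group | 0x80)  # more bytes follow, so set the MSB
        else:
            out.append(group)         # last byte, MSB stays clear
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]              # raises IndexError on truncated input
        pos += 1
        result |= (byte & 0x7F) << shift  # groups arrive least significant first
        if not byte & 0x80:           # MSB clear means this was the last byte
            return result, pos
        shift += 7


print(encode_varint(300).hex())               # ac02
print(decode_varint(bytes([0xAC, 0x02])))     # (300, 2)
```

The decoder returns the next read position as well as the value, which makes it easy to read several varints back to back from one buffer.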

Message structure

A protocol buffer message is a series of key-value pairs. The binary version of a message uses only the field's number as the key. The name and declared type of each field can be determined only on the decoding end, by referencing the message type's definition (that is, the .proto file).

When a message is encoded, the keys and values are concatenated into a byte stream. When a message is decoded, the parser needs to be able to skip fields that it does not recognize. In this way, new fields can be added to a message without breaking old programs that do not know about them. To make this possible, the key of each pair actually contains two values: the field number from the .proto file, and a wire type that provides just enough information to find the length of the value that follows.

The available wire types are as follows:

Type	Meaning	Used For
0	Varint	int32, int64, uint32, uint64, sint32, sint64, bool, enum
1	64-bit	fixed64, sfixed64, double
2	Length-delimited	string, bytes, embedded messages, packed repeated fields
3	Start group	groups (deprecated)
4	End group	groups (deprecated)
5	32-bit	fixed32, sfixed32, float

Each key in the streamed message is a varint with the value (field_number << 3) | wire_type. In other words, the lowest three bits of the key store the wire type.

Now let's look at the simple example again. You now know that the first number in the stream is always a varint key. Here it is 08, or (dropping the MSB):

000 1000

Take the lowest three bits to get the wire type (0), then shift right by three bits to get the field number (1). So you now know that the field number is 1 and that the value that follows is a varint. Using the varint decoding described earlier, you can see that the next two bytes store the value 150:

96 01 = 1001 0110  0000 0001
       → 000 0001  ++  001 0110 (drop the MSB and reverse the groups of 7 bits)
       → 10010110
       → 2 + 4 + 16 + 128 = 150
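As a rough sketch, the key arithmetic and the decoding of the three-byte Test1 message might look like this in Python. The helper names make_key, split_key, and read_varint are hypothetical, chosen only for this illustration.

```python
def make_key(field_number, wire_type):
    """Combine a field number and a wire type into a key value."""
    return (field_number << 3) | wire_type


def split_key(key):
    """Split a key value into (field_number, wire_type)."""
    return key >> 3, key & 0x07


def read_varint(data, pos):
    """Read one varint at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


message = bytes([0x08, 0x96, 0x01])           # the encoded Test1 from the text
key, pos = read_varint(message, 0)            # the key itself is a varint
field_number, wire_type = split_key(key)
value, pos = read_varint(message, pos)        # wire type 0: value is a varint
print(field_number, wire_type, value)         # 1 0 150
```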

More value types

Signed integers

All protocol buffer types associated with wire type 0 are encoded as varints. However, there is an important difference between the signed integer types (sint32 and sint64) and the standard integer types (int32 and int64) when encoding negative numbers. If you use int32 or int64 for a negative number, the resulting varint is always ten bytes long, because the value is effectively treated as a very large unsigned integer. If you use one of the signed types (sint32 or sint64), the resulting varint uses ZigZag encoding, which is much more efficient.

ZigZag encoding maps signed integers to unsigned integers so that numbers with a small absolute value (such as -1) also have a small varint-encoded value. It does this by zigzagging back and forth between positive and negative integers, so that -1 is encoded as 1, 1 is encoded as 2, -2 is encoded as 3, and so on, as shown in the following table:
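To see where the ten bytes come from, here is a small Python sketch. It assumes, as the paragraph above states, that a negative int32 or int64 is written as if it were an unsigned 64-bit integer; the function names are invented for this example.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_plain_int(n):
    """Encode an int32/int64 value the way the text describes for negatives."""
    # Reinterpret the two's complement bit pattern as an unsigned 64-bit value.
    return encode_varint(n & 0xFFFFFFFFFFFFFFFF)


print(encode_plain_int(-1).hex())    # ffffffffffffffffff01
print(len(encode_plain_int(-1)))     # 10
print(len(encode_plain_int(1)))      # 1
```

A 64-bit pattern with the top bit set needs ten groups of seven bits, which is why every negative value costs ten bytes no matter how small its magnitude.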

Signed original	Encoded
0	0
-1	1
1	2
-2	3
2147483647	4294967294
-2147483648	4294967295

In other words, each value n is encoded as follows for sint32:

(n << 1) ^ (n >> 31)

For sint64, the encoding is:

(n << 1) ^ (n >> 63)

Note that the second shift, the (n >> 31) part, is an arithmetic shift. In other words, the result of the shift is either all zero bits (if n is non-negative) or all one bits (if n is negative).

When a sint32 or sint64 is parsed, its value is decoded back to the original signed value.
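The two formulas translate almost directly into Python, whose >> on integers is already an arithmetic shift. The sketch below is illustrative only. The function names are made up, and the decoding expression is a common way to invert the mapping, not something taken from the text above.

```python
def zigzag_encode32(n):
    """Map a signed 32-bit integer to an unsigned one: (n << 1) ^ (n >> 31)."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def zigzag_encode64(n):
    """Map a signed 64-bit integer to an unsigned one: (n << 1) ^ (n >> 63)."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def zigzag_decode(z):
    """Invert the ZigZag mapping (works for both widths)."""
    return (z >> 1) ^ -(z & 1)


for n in (0, -1, 1, -2, 2147483647, -2147483648):
    print(n, zigzag_encode32(n))
# 0 0
# -1 1
# 1 2
# -2 3
# 2147483647 4294967294
# -2147483648 4294967295
```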

Non-varint numbers

Non-varint numeric types are simple. double and fixed64 have wire type 1, which tells the parser to expect a fixed 64-bit block of data. Similarly, float and fixed32 have wire type 5, which tells the parser to expect 32 bits. In both cases, the values are stored in little-endian byte order.

Strings
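Python's standard struct module can illustrate these fixed-width layouts. The "<" prefix selects little-endian byte order; the wrapper function names are hypothetical and exist only for this sketch.

```python
import struct


def encode_fixed32(value):
    """Unsigned 32-bit integer, four bytes, little-endian (wire type 5)."""
    return struct.pack("<I", value)


def encode_fixed64(value):
    """Unsigned 64-bit integer, eight bytes, little-endian (wire type 1)."""
    return struct.pack("<Q", value)


def encode_float(value):
    """IEEE 754 single precision, four bytes, little-endian (wire type 5)."""
    return struct.pack("<f", value)


def encode_double(value):
    """IEEE 754 double precision, eight bytes, little-endian (wire type 1)."""
    return struct.pack("<d", value)


print(encode_fixed32(1).hex())    # 01000000
print(encode_float(1.0).hex())    # 0000803f
print(encode_double(1.0).hex())   # 000000000000f03f
```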

A wire type of 2 (length-delimited) means that the value is a varint-encoded length followed by the specified number of bytes of data.

message Test2 {
  required string b = 2;
}

Set the value of b to "testing" and you will get:

12 07 74 65 73 74 69 6e 67

The last seven bytes (74 65 73 74 69 6e 67) are the UTF-8 encoding of "testing". The key is 0x12, which gives field number 2 and wire type 2. The length varint is 7, and the seven bytes that follow it are our string.
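A minimal Python sketch of how such a length-delimited string field could be assembled is shown below. The helper encode_string_field is hypothetical and is not part of any protocol buffer API.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_string_field(field_number, text):
    """Key (wire type 2), then the byte length as a varint, then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


print(encode_string_field(2, "testing").hex())   # 120774657374696e67
```

Note that the length counts bytes, not characters, so non-ASCII text produces a length larger than its character count.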

Nested messages

The following example contains a nested message:

message Test3 {
  required Test1 c = 3;
}

The encoded result is as follows, with field a of the embedded Test1 set to 150:

 1a 03 08 96 01

As you can see, the last three bytes are exactly the same as in the first example (08 96 01), which encodes the number 150. They are preceded by the length 3, because embedded messages are treated in exactly the same way as strings (wire type = 2).

Optional and repeated elements
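The nesting can be sketched in Python by encoding the inner message first and then wrapping its bytes as a length-delimited field. The functions encode_test1 and encode_test3 are hand-written stand-ins for this example, not generated protocol buffer code.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_test1(a):
    """Test1: field 1, wire type 0 (varint)."""
    return encode_varint((1 << 3) | 0) + encode_varint(a)


def encode_test3(inner):
    """Test3: field 3, wire type 2, wrapping an already-encoded Test1."""
    return encode_varint((3 << 3) | 2) + encode_varint(len(inner)) + inner


print(encode_test3(encode_test1(150)).hex())   # 1a03089601
```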

If your message definition has repeated elements (without the [packed=true] option), the encoded message has zero or more key-value pairs with the same field number. These repeated values do not have to appear consecutively; they may be interleaved with other fields. The order of the elements relative to each other is preserved when parsing.

If an element is optional, the encoded message may or may not contain a key-value pair with that field number.

Normally, an encoded message never has more than one instance of an optional or required field. However, parsers are expected to handle the case in which it does. For numeric types and strings, if the same field appears multiple times, the parser accepts the last value it sees. For embedded message fields, the parser merges multiple instances of the same field, as if with the Message::MergeFrom method: singular scalar fields in the later instance replace those in the earlier one, singular embedded messages are merged, and repeated fields are concatenated. The effect of these rules is that parsing the concatenation of two encoded messages produces exactly the same result as parsing the two messages separately and merging the resulting objects. Example:

MyMessage message;
message.ParseFromString(str1 + str2);

is equivalent to:

MyMessage message, message2;
message.ParseFromString(str1);
message2.ParseFromString(str2);
message.MergeFrom(message2);
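The "last value wins" rule for scalar fields can be illustrated with a toy parser in Python. This is only a sketch under strong assumptions: it understands nothing but wire type 0, treats every field as a singular scalar, and models a message as a plain dict. The names read_varint and parse_varint_fields are invented for the example.

```python
def read_varint(data, pos):
    """Read one varint at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_varint_fields(data):
    """Parse a message made only of varint fields into {field_number: value}."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only handles wire type 0")
        value, pos = read_varint(data, pos)
        fields[field_number] = value   # a later occurrence replaces an earlier one
    return fields


str1 = bytes([0x08, 0x96, 0x01])       # field 1 = 150
str2 = bytes([0x08, 0x01])             # field 1 = 1

# Parse separately, then merge (dict.update stands in for MergeFrom here).
merged = parse_varint_fields(str1)
merged.update(parse_varint_fields(str2))

print(parse_varint_fields(str1 + str2))            # {1: 1}
print(parse_varint_fields(str1 + str2) == merged)  # True
```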

Packed repeated fields

Packed repeated fields were introduced in version 2.1.0. They are declared like repeated fields but with the special option [packed=true]. A packed repeated field containing zero elements does not appear in the encoded message. Otherwise, all of the elements of the field are packed into a single key-value pair with wire type 2 (length-delimited). Each element is encoded the same way it would be normally, except that it is not preceded by a key.

For example, assume that you have the following message types:

message Test4 {
  repeated int32 d = 4 [packed=true];
}

Now construct a Test4 whose repeated field d holds the values 3, 270, and 86942. The encoded result is as follows:

22        // tag (field number 4, wire type 2)
06        // payload size (6 bytes)
03        // first element (varint 3)
8E 02     // second element (varint 270)
9E A7 05  // third element (varint 86942)

Only repeated fields of primitive numeric types (types that use the varint, 32-bit, or 64-bit wire types) can be declared "packed".
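The packed layout can be reproduced with a short Python sketch. It assumes non-negative varint elements only, and the helper name encode_packed_varints is hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_packed_varints(field_number, values):
    """One key (wire type 2), one byte length, then the elements back to back."""
    payload = b"".join(encode_varint(v) for v in values)
    if not payload:
        return b""                     # an empty packed field is not written at all
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


print(encode_packed_varints(4, [3, 270, 86942]).hex())   # 2206038e029ea705
```

Written unpacked, the same three values would each carry their own key byte, so packing saves one byte per element here at the cost of a single length prefix.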

Field order

You can use field numbers in any order in a .proto file. When a message is serialized, its known fields are written sequentially in field-number order. This allows parsing code to use optimizations that rely on field numbers being in sequence. However, protocol buffer parsers must be able to parse fields in any order, because not all messages are created by simply serializing an object.

If a message has unknown fields, the current Java and C++ implementations write them in arbitrary order after the sequentially ordered known fields. The current Python implementation does not track unknown fields.
