Google protocol buffers encoding (encoding)

Source: Internet
Author: User
1. Overview

The first three articles in this series (Google protocol buffers overview, Google protocol buffers getting started, and Protocol buffers syntax guide) introduced protocol buffers step by step. By now we can use protocol buffers to generate code and to serialize, parse, and read message data. This article describes the underlying binary format of a protobuf (Pb) message. You can use protocol buffers in a project without understanding this material, but it is worth knowing how the Pb format keeps messages so small. The binary messages produced by protobuf serialization are very compact, thanks to protobuf's encoding scheme.

2. A simple example

The .proto file defines a simple message:

    message Test1 {
      required int32 a = 1;
    }

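Use the .proto file to generate the corresponding class, then write a message to a file. Here I set the field a to 150 and write the message to abc.txt:

    import java.io.FileOutputStream;
    import java.io.IOException;

    // Place this inside any class; Test1 is the class generated from the .proto file above.
    public static void main(String[] args) throws IOException {
        Test1 message = Test1.newBuilder().setA(150).build();
        FileOutputStream output = new FileOutputStream("abc.txt");
        message.writeTo(output);  // serialize the message in the binary wire format
        output.close();
    }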

Open the file with a hex editor such as UltraEdit and look at the binary content. The message occupies only three bytes:

08 96 01

The entire message takes only three bytes, which is less than the four bytes of a fixed-size 32-bit integer. How is this achieved? The answer lies in the way protobuf encodes keys and values.

3. varint

Before looking at the Pb encoding, we need to understand varints. A varint is a compact way to represent an integer using one or more bytes. The smaller the value, the fewer bytes it needs.

The most significant bit (msb) of each byte in a varint has a special meaning. If this bit is 1, the next byte is also part of the number; if it is 0, this is the last byte. The remaining 7 bits of each byte hold the number itself, least significant group first. Therefore, numbers smaller than 128 fit in one byte, and numbers of 128 or more need two or more bytes.

For example, the integer 1 requires only one byte:

0000 0001

The value 300 requires two bytes:

1010 1100 0000 0010

With varints, a small int32 value can be encoded in a single byte. There is a trade-off, of course: a large int32 value needs five bytes, one more than a fixed 4-byte encoding. Statistically, however, most numbers that appear in messages are small, so in most cases varints represent numeric data in fewer bytes.
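The varint rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the protobuf library's implementation; the function names `encode_varint` and `decode_varint` are made up for this example, and only non-negative integers are handled.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the high bit
        else:
            out.append(low7)         # last byte: high bit stays 0
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low-order groups come first
        if not byte & 0x80:               # high bit 0 marks the last byte
            return result, pos
        shift += 7


print(encode_varint(1).hex())              # 01
print(encode_varint(300).hex())            # ac02
print(decode_varint(bytes([0xAC, 0x02])))  # (300, 2)
```

The bytes ac 02 are exactly the bit pattern 1010 1100 0000 0010 shown for 300.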

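The following steps show how Google protocol buffers decodes these two bytes into 300. Note that the two 7-bit groups are swapped before the final calculation, because a varint stores the least significant group first:

1010 1100 0000 0010
→ 010 1100  000 0010 (drop the msb of each byte)
→ 000 0010 ++ 010 1100 (reverse the groups of 7 bits)
→ 100101100
→ 256 + 32 + 8 + 4 = 300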

4. Message format

After a message is serialized, it becomes a binary data stream. The data in the stream is a sequence of key-value pairs, one pair for each field that is present.

With this key-value structure, no separators are needed between fields. If an optional field is not set, it simply does not appear in the encoded message. These properties help keep messages small.

The binary format uses numeric tags as keys to identify specific fields. When decoding, protocol buffers uses the key to determine which field of the message a value belongs to.

After the message is encoded, the key-value pairs are stored as a byte stream. During decoding, the Pb parser skips fields it does not recognize. Therefore, a new field can be added to a message without breaking old programs, which simply ignore the fields they do not know about. To make this possible, the key must be specially designed.

As stated above, the binary message uses numeric tags as keys. The key is not just the field number, though: it combines the field number with a wire type, and the wire type tells the parser how to determine the length of the value.

Key definition:

(field_number << 3) | wire_type

The key consists of two parts: field_number, and wire_type, which is the wire type of the value. In other words, the lowest three bits of the key hold the wire type. For more information about shift operations, see Java bit operation basics.
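As a small illustration of the key formula, the following Python sketch builds a key and splits it again. The helper names `make_key` and `split_key` are hypothetical. On the wire the key is itself written as a varint; for field numbers below 16 it fits in a single byte, which is the case in every example in this article.

```python
def make_key(field_number: int, wire_type: int) -> int:
    """Combine a field number and a wire type into a key."""
    return (field_number << 3) | wire_type


def split_key(key: int):
    """Return (field_number, wire_type) for a key."""
    return key >> 3, key & 0x07  # the lowest three bits are the wire type


print(hex(make_key(1, 0)))  # 0x8
print(hex(make_key(2, 2)))  # 0x12
print(split_key(0x1A))      # (3, 2)
print(split_key(0x22))      # (4, 2)
```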

The possible wire types are shown in the following table:

Type  Meaning           Used for
0     Varint            int32, int64, uint32, uint64, sint32, sint64, bool, enum
1     64-bit            fixed64, sfixed64, double
2     Length-delimited  string, bytes, embedded messages, packed repeated fields
3     Start group       groups (deprecated)
4     End group         groups (deprecated)
5     32-bit            fixed32, sfixed32, float
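
Because the wire type tells a parser how long each value is, a parser can step over fields it does not recognize. The Python sketch below walks through a byte stream field by field using only the table above. It is a simplified illustration under stated assumptions: `read_varint` and `iter_fields` are hypothetical names, and the deprecated group wire types (3 and 4) are not handled.

```python
def read_varint(data: bytes, pos: int):
    """Read one varint at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def iter_fields(data: bytes):
    """Yield (field_number, wire_type, raw_value) for each key-value pair."""
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 1:  # fixed 64-bit block
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited: a varint length, then the payload
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit block
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


# Two fields back to back: field 1 (varint 150) and field 2 (string "testing").
stream = bytes.fromhex("08 96 01 12 07 74 65 73 74 69 6e 67")

# A reader that only knows field 1 can still skip field 2 safely.
known = {1}
print([(n, v) for n, t, v in iter_fields(stream) if n in known])  # [(1, 150)]
```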
5. Analyzing the generated data

In the simple example in section 2, the output file contains three bytes after the message is written: 08 96 01. How are they interpreted?

The first byte, 08, is the key: 0000 1000. The lowest three bits (000) give wire type 0 (varint), and shifting right by three bits gives field number 1. So we know that the field number is 1 and that the value is a varint. Applying the varint rules from section 3, 96 01 decodes to 150:

96 01 = 1001 0110  0000 0001
      → 000 0001  ++  001 0110 (drop the msb and reverse the groups of 7 bits)
      → 10010110
      → 2 + 4 + 16 + 128 = 150

Note: the low-order group comes first and the high-order group comes last.
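To check this analysis, the three bytes can be rebuilt and decoded again by hand in Python. This is only a sketch: `encode_varint` and `encode_int32_field` are hypothetical helpers, and only non-negative values are handled.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # 7 payload bits, continuation bit set
        value >>= 7
    out.append(value)                      # final byte, continuation bit clear
    return bytes(out)


def encode_int32_field(field_number: int, value: int) -> bytes:
    """Key with wire type 0, followed by the value as a varint."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


message = encode_int32_field(1, 150)
print(message.hex(" "))  # 08 96 01

# Decode by hand, following the steps in the text.
key, payload = message[0], message[1:]
groups = [b & 0x7F for b in payload]                      # drop the msb of each byte
value = sum(g << (7 * i) for i, g in enumerate(groups))   # first group is the lowest
print(key >> 3, key & 0x07, value)  # 1 0 150
```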

6. Other value types

6.1 Signed integers

Readers may have noticed that int32 and sint32, both of which use wire type 0, look very similar. Google protocol buffers distinguishes them in order to reduce the number of bytes after encoding, and the difference only matters for negative numbers.

In a computer, a negative number is represented as a very large unsigned integer, because its highest bit (the sign bit) is set. If a negative number is encoded as an int32 or int64 varint, the result is always 10 bytes long. Google protocol buffers therefore defines the sint32 and sint64 types, which use ZigZag encoding: signed integers are first mapped to unsigned integers and then encoded as varints. In this way, an integer with a small absolute value also has a small varint encoding.

The ZigZag mapping function is:

ZigZag(n) = (n << 1) ^ (n >> 31), when n is sint32

ZigZag(n) = (n << 1) ^ (n >> 63), when n is sint64

Here >> is an arithmetic (sign-extending) right shift. In this way, -1 is encoded as 1, 1 is encoded as 2, -2 is encoded as 3, and so on, as shown in the following table:

Signed original  Encoded as
0                0
-1               1
1                2
-2               3
2                4
-3               5
...              ...
2147483647       4294967294
-2147483648      4294967295
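
The mapping in the table can be reproduced with a short Python sketch. The names `zigzag32` and `unzigzag` are made up for this example. Python integers are unbounded and its >> is already an arithmetic shift, so the result is masked to 32 bits to mimic sint32.

```python
def zigzag32(n: int) -> int:
    """Map a signed 32-bit integer to an unsigned one (ZigZag)."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def unzigzag(z: int) -> int:
    """Inverse mapping: recover the signed value."""
    return (z >> 1) ^ -(z & 1)


for n in (0, -1, 1, -2, 2, -3, 2147483647, -2147483648):
    print(n, zigzag32(n))
```

The encoded value is then written as an ordinary varint, so -1 takes one byte as sint32 instead of ten bytes as int32.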
6.2 Non-varint numbers

Non-varint numeric types are simple. double and fixed64 have wire type 1, which tells the parser to expect a fixed 64-bit block of data. Similarly, float and fixed32 have wire type 5, which tells it to expect a 32-bit block. In both cases the value is stored in little-endian byte order, with the low-order byte first.
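
A fixed-width field can be sketched in Python with the standard `struct` module, where the `<` prefix selects little-endian byte order. The helper names and the field numbers 1 and 2 are assumptions chosen for illustration; they do not come from a message defined in this article.

```python
import struct


def encode_fixed32_field(field_number: int, value: int) -> bytes:
    """Key with wire type 5, then 4 little-endian bytes."""
    key = (field_number << 3) | 5
    return bytes([key]) + struct.pack("<I", value)  # single-byte key: field_number < 16


def encode_double_field(field_number: int, value: float) -> bytes:
    """Key with wire type 1, then 8 little-endian bytes."""
    key = (field_number << 3) | 1
    return bytes([key]) + struct.pack("<d", value)  # single-byte key: field_number < 16


print(encode_fixed32_field(1, 1).hex(" "))   # 0d 01 00 00 00
print(encode_double_field(2, 1.0).hex(" "))  # 11 00 00 00 00 00 00 f0 3f
```

Unlike a varint, the value 1 still occupies all four bytes here, which is why the fixed types only pay off for values that are usually large.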

6.3 Strings

Data with wire type 2 is length-delimited and is encoded as key + length + content. The key is encoded in the usual way, the length is a varint, and the content is exactly that many bytes. Define the following message:

    message Test2 {
      required string b = 2;
    }

Set b to "testing" and view the binary format:

12 07 74 65 73 74 69 6e 67

The last seven bytes, 74 65 73 74 69 6e 67, are the UTF-8 encoding of "testing".

The key is shown in hexadecimal, so it expands as follows:

12 → 0001 0010. The lowest three bits, 010, give wire type 2. Shifting right by three bits leaves 0000 0010, so the field number (tag) is 2.

The length is 7, so the next 7 bytes make up the string "testing".
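The same steps can be followed in a short Python sketch that takes the nine bytes apart. It is a simplified illustration: the function name `decode_short_string_field` is hypothetical, and it assumes that both the key and the length fit in one byte each, which is true here because the field number is below 16 and the length is below 128.

```python
def decode_short_string_field(raw: bytes):
    """Return (field_number, wire_type, length, text) for one string field."""
    key = raw[0]                               # single-byte key
    field_number, wire_type = key >> 3, key & 0x07
    length = raw[1]                            # single-byte varint length
    text = raw[2:2 + length].decode("utf-8")   # the content is UTF-8
    return field_number, wire_type, length, text


raw = bytes.fromhex("12 07 74 65 73 74 69 6e 67")
print(decode_short_string_field(raw))  # (2, 2, 7, 'testing')
```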

6.4 Nested messages

Define the following message with an embedded message:

    message Test3 {
      required Test1 c = 3;
    }

As in section 2, set the field a of the embedded Test1 to 150. The encoded bytes are:

1a 03 08 96 01

The last three bytes are the same as in the first example (08 96 01). They are preceded by the length 03: an embedded message is encoded exactly like a string, with wire type 2.
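The nesting can be reproduced with a short Python sketch: encode the inner message first, then wrap its bytes as a length-delimited field. The helper names `varint_field` and `length_delimited_field` are hypothetical, and only non-negative values are handled.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def varint_field(field_number: int, value: int) -> bytes:
    """Key with wire type 0, then the value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def length_delimited_field(field_number: int, payload: bytes) -> bytes:
    """Key with wire type 2, then the payload length, then the payload."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


inner = varint_field(1, 150)              # Test1 with a = 150
outer = length_delimited_field(3, inner)  # Test3 with c = that Test1
print(inner.hex(" "))  # 08 96 01
print(outer.hex(" "))  # 1a 03 08 96 01
```

The same `length_delimited_field` helper would also encode a string field if it were given the UTF-8 bytes of the text as the payload.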

6.5 Wire types 3 and 4

These two wire types (start group and end group) are deprecated, so they can be ignored.

7. Optional and repeated fields

If a message definition contains a repeated field that does not use the [packed = true] option, the encoded message has zero or more key-value pairs with the same tag number. These repeated values do not have to appear consecutively; they may be interleaved with other fields. The order of the elements relative to each other is preserved when parsing, although their order relative to other fields is lost.

For an optional field, the encoded message may or may not contain a key-value pair with that tag number.

Normally, an encoded message has at most one instance of each required or optional field. However, parsers are expected to handle the case in which there is more than one. For numeric types and strings, if the same field appears multiple times, the parser accepts the last value it sees. For embedded message fields, the parser merges multiple instances of the same field, as if with the MergeFrom method: all singular scalar fields in the later instance replace those in the earlier one, singular embedded messages are merged, and repeated fields are concatenated. As a result of these rules, parsing the concatenation of two encoded messages gives exactly the same result as parsing the two messages separately and merging the resulting objects. For example:

    MyMessage message;
    message.ParseFromString(str1 + str2);

is equivalent to:

    MyMessage message, message2;
    message.ParseFromString(str1);
    message2.ParseFromString(str2);
    message.MergeFrom(message2);

This property is sometimes very useful. For example, it lets you merge two messages even if you do not know their types.

7.1 Repeated fields with [packed = true]

Packed repeated fields were introduced in Pb version 2.1.0. They are declared like ordinary repeated fields, but with the option [packed = true] at the end, and they are encoded differently. A packed repeated field with no elements does not appear in the encoded message. Otherwise, all the elements of the field are packed into a single key-value pair with wire type 2 (length-delimited). Each element is encoded in the usual way, except that it is not preceded by a key. For example, given the following message type:

    message Test4 {
      repeated int32 d = 4 [packed=true];
    }

Construct a Test4 message and give the repeated field d the values 3, 270, and 86942. The encoded form is:

    22        // tag 0010 0010 (field number 0100 = 4, wire type 010 = 2)
    06        // payload size (6 bytes)
    03        // first element (varint 3)
    8E 02     // second element (varint 270)
    9E A7 05  // third element (varint 86942)

Only repeated fields of primitive numeric types (those that use the varint, 32-bit, or 64-bit wire types) can be declared "packed".

Note that although there is usually no reason to encode a packed repeated field as more than one key-value pair, parsers must be prepared to accept multiple key-value pairs. In that case the payloads are concatenated, and each pair must contain a whole number of elements.
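The packed layout can be checked with a short Python sketch that encodes the three values and then reads them back out of the payload. The function names are hypothetical, only non-negative values are handled, and the decoder receives just the payload (the bytes after the key and the length).

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_packed_int32(field_number: int, values) -> bytes:
    """One key (wire type 2), one length, then the elements without keys."""
    if not values:
        return b""  # an empty packed field does not appear at all
    payload = b"".join(encode_varint(v) for v in values)
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


def decode_packed_payload(payload: bytes) -> list:
    """Read varints back to back until the payload is exhausted."""
    values, pos = [], 0
    while pos < len(payload):
        result = shift = 0
        while True:
            byte = payload[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                break
            shift += 7
        values.append(result)
    return values


encoded = encode_packed_int32(4, [3, 270, 86942])
print(encoded.hex(" "))                    # 22 06 03 8e 02 9e a7 05
print(decode_packed_payload(encoded[2:]))  # [3, 270, 86942]
```

If the same field arrives as several key-value pairs, applying `decode_packed_payload` to each payload in turn and appending the results gives the concatenated list described above.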

8. Field order

In short, there are only two points:

  1. Encoding and decoding do not depend on field order; the key-value mechanism guarantees this.
  2. Unknown fields are written after the known fields when a message is serialized.

Recommended reading order for this series; I hope it is of some help:

Google protocol buffers Overview

Google protocol buffers getting started

Protocol buffers syntax Guide

Google protocol buffers encoding (encoding)
