Varints
Before looking at how Protobuf encodes messages, you first need to understand varints.
A varint is a way of serializing an integer using one or more bytes. The smaller the integer, the fewer bytes it occupies.
The most significant bit (MSB) of each byte in a varint indicates whether another byte follows. It can be viewed as the next pointer of a singly linked list: 1 points to the next node, and 0 is NULL. The lower 7 bits of each byte hold the integer's two's complement representation in groups of 7 bits, least significant group first (little-endian).
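The scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Protobuf library's implementation; the function names encode_varint and decode_varint are made up for this example, and the encoder assumes a non-negative input.

```python
def encode_varint(value):
    """Encode a non-negative integer as a varint (hypothetical helper)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # lowest 7 bits of what remains
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # MSB = 1: another byte follows
        else:
            out.append(low7)         # MSB = 0: this is the last byte
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # drop the MSB, place the 7-bit group
        if not byte & 0x80:               # MSB = 0 marks the final byte
            return result, pos
        shift += 7
```

Because each later byte is shifted 7 bits further left, the decoder never needs an explicit "reverse" step: the least significant group simply arrives first.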
For example, consider an integer represented by the following varint:
1010 1100 0000 0010
To recover the original value, first drop the MSB of each byte (the MSB only signals whether the integer continues; it is not part of the value), which leaves:
010 1100 000 0010
Then, because varints store the least significant group first, reverse the order of the two 7-bit groups:
000 0010 010 1100
→ 100101100
→ 256 + 32 + 8 + 4 = 300
Message
A Protobuf message is a series of key-value pairs. The binary form of a message uses only the field's tag number as the key; the field's name and declared type can be determined only on the decoding side, by consulting the message's type definition.
When a message is encoded, the keys and values are concatenated into a byte stream. When a message is decoded, the parser must be able to skip fields it does not recognize, so that new fields can be added to a message without breaking old programs that do not know about them.
To make this possible, each key in the byte stream is actually made up of two values: the tag number of the field, and the wire type of the field, which provides just enough information to determine the length of the field's value.
The available wire types are:
Type | Meaning | Used for
0 | Varint | int32, int64, uint32, uint64, sint32, sint64, bool, enum
1 | 64-bit | fixed64, sfixed64, double
2 | Length-delimited | string, bytes, embedded messages, packed repeated fields
3 | Start group | groups (deprecated)
4 | End group | groups (deprecated)
5 | 32-bit | fixed32, sfixed32, float
Each key in the byte stream is a varint whose value is (field_number << 3) | wire_type; that is, the lowest 3 bits of the key store the field's wire type.
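As a rough illustration of the key layout, the following Python sketch packs and unpacks a key value. The names make_key and split_key are hypothetical helpers introduced only for this example; they operate on the integer value of the key, before it is written out as a varint.

```python
def make_key(field_number, wire_type):
    """Combine a field's tag number and wire type into a key value."""
    if not 0 <= wire_type <= 7:
        raise ValueError("the wire type must fit in 3 bits")
    return (field_number << 3) | wire_type


def split_key(key):
    """Split a key value back into (field_number, wire_type)."""
    return key >> 3, key & 0x07  # upper bits: tag number; low 3 bits: wire type
```

For instance, make_key(1, 0) is 0x08 and make_key(2, 2) is 0x12. Since the key is itself a varint, tag numbers 1 through 15 fit in a single key byte, while larger tag numbers need two or more.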
For example, given the following message definition:
message Test1 {
  required int32 a = 1;
}
In an application, you create a Test1 message and set a to 150. Serializing this message to binary yields 3 bytes:
08 96 01
The key is 08, which in binary is:
0000 1000
① The last 3 bits (000) give the field's wire type: 0 (varint).
② Shifting the key right by 3 bits gives 0000 0001, so the field's tag number is 1.
The value stored in the following two bytes is then parsed as a varint:
1001 0110 0000 0001
→ 000 0001 ++ 001 0110 (drop the MSB and reverse the 7-bit groups)
→ 10010110
→ 128 + 16 + 4 + 2 = 150
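The decoding steps above, together with the idea that the wire type tells a parser how far to skip, can be sketched as a small field walker in Python. This is a simplified illustration rather than the real Protobuf parser: read_varint and iter_fields are hypothetical names, groups (wire types 3 and 4) are not handled, and fixed-width and length-delimited values are returned as raw bytes.

```python
def read_varint(data, pos):
    """Decode one varint at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def iter_fields(data):
    """Yield (field_number, wire_type, raw_value) for each top-level field."""
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                      # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 1:                    # 64-bit: always 8 bytes
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:                    # length-delimited
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:                    # 32-bit: always 4 bytes
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value
```

Calling list(iter_fields(bytes.fromhex("089601"))) gives [(1, 0, 150)]. A caller that does not recognize a field number can simply ignore the yielded value, because the wire type alone was enough to step past it.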
More Data Types
Signed Integers
The signed integer types (sint32, sint64) and the "standard" integer types (int32, int64) are all encoded as varints, but they differ significantly when the value is negative.
When int32 or int64 holds a negative number, the resulting varint is always 10 bytes long, because the value is treated as a very large unsigned integer.
If one of the signed types (sint32, sint64) is used instead, the varint is produced with ZigZag encoding, which is far more efficient for negative numbers.
ZigZag encoding maps signed integers to unsigned integers by "zig-zagging" back and forth between negative and positive values, so that -1 is encoded as 1, 1 as 2, -2 as 3, and so on:
Signed original | Encoded as
0 | 0
-1 | 1
1 | 2
-2 | 3
2147483647 | 4294967294
-2147483648 | 4294967295
In other words, each value n is encoded according to the following formula:
For sint32:
(n << 1) ^ (n >> 31)
For sint64:
(n << 1) ^ (n >> 63)
Note that the second shift, n >> 31, is an arithmetic shift: if n is non-negative, the result has all bits 0, and if n is negative, the result has all bits 1.
When a sint32 or sint64 is parsed, its value is decoded back into the original signed form.
As an example:
a = -1 → 0xFFFFFFFF (negative numbers are stored in two's complement)
a << 1 → 0xFFFFFFFE → -2
a >> 31 → 0xFFFFFFFF → -1
(a << 1) ^ (a >> 31) → 0xFFFFFFFE ^ 0xFFFFFFFF → 0x00000001 → 1
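The ZigZag formula can be tried out with a short Python sketch. The helper names zigzag_encode, zigzag_decode, and varint_size are hypothetical. Python integers are unbounded, so the encoder masks the result to the chosen width to mimic fixed-width arithmetic; Python's >> on a negative integer is already an arithmetic shift.

```python
def zigzag_encode(n, bits=32):
    """Map a signed integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    mask = (1 << bits) - 1
    # n >> (bits - 1) is 0 for non-negative n and -1 (all ones) for negative n.
    return ((n << 1) ^ (n >> (bits - 1))) & mask


def zigzag_decode(z):
    """Invert zigzag_encode: even values are non-negative, odd are negative."""
    return (z >> 1) ^ -(z & 1)


def varint_size(value):
    """Number of bytes a non-negative integer occupies as a varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size
```

Under these assumptions, -1 stored as an int64 is the unsigned value 0xFFFFFFFFFFFFFFFF, for which varint_size returns 10, whereas varint_size(zigzag_encode(-1)) is 1.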
Strings
Wire type 2 (length-delimited) means that the field's value consists of a varint n followed by n bytes of data.
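A minimal Python sketch of this layout follows, assuming the payload is a UTF-8 string. The names encode_varint and encode_string_field are hypothetical helpers for illustration, not part of any Protobuf API.

```python
def encode_varint(value):
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # more bytes follow
        value >>= 7
    out.append(value)                      # final byte, MSB = 0
    return bytes(out)


def encode_string_field(field_number, text):
    """Encode a string as key + length varint + UTF-8 bytes (wire type 2)."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | 2
    # The length counts bytes, not characters, so it is taken after encoding.
    return encode_varint(key) + encode_varint(len(payload)) + payload
```

With field number 2 and the text "testing", encode_string_field(2, "testing").hex() evaluates to "120774657374696e67": one key byte, one length byte, and seven payload bytes.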
message Test2 {
  required string b = 2;
}
Setting b to "testing" and serializing yields:
12 07 74 65 73 74 69 6e 67
The key is 0x12 → tag = 2, type = 2. The length is a varint, 0x07 → 7, so the 7 bytes that follow (74 65 73 74 69 6e 67) are the UTF-8 encoding of the string "testing".
Embedded Messages
When serialized, an embedded message is handled exactly the same way as a string (wire type 2).
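To make the nesting concrete, here is a small Python sketch that serializes an inner message first and then wraps its bytes as a length-delimited field of an outer message. The helpers varint_field and length_delimited_field are hypothetical, and the field numbers 1 and 3 are chosen purely for illustration.

```python
def encode_varint(value):
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def varint_field(field_number, value):
    """Encode a field with wire type 0: key followed by a varint value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def length_delimited_field(field_number, payload):
    """Encode a field with wire type 2: key, payload length, payload bytes."""
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


inner = varint_field(1, 150)              # the embedded message's bytes
outer = length_delimited_field(3, inner)  # the same bytes, wrapped as field 3
```

Here inner.hex() is "089601" and outer.hex() is "1a03089601". The outer encoder never looks inside the payload, which is why a string and an embedded message are indistinguishable on the wire without the message definition.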
message Test3 {
  required Test1 c = 3;
}
Setting the a field of the Test1 member c to 150 and serializing yields:
1a 03 08 96 01
As you can see, the last 3 bytes are exactly the same as in our earlier example (08 96 01); they are preceded by the key 1a (tag = 3, type = 2) and the length 03.
Optional and Repeated Elements
Normally, a serialized message does not contain more than one instance of the same non-repeated field. However, parsers are expected to handle the case where it does.
For numeric and string types, if the same field appears more than once, the parser takes the last value it sees.
For embedded message fields, the parser merges the multiple instances of the same field, as if with the Message::MergeFrom method.
These rules mean that parsing the concatenation of two serialized messages produces exactly the same result as parsing the two messages separately and merging the resulting objects. That is, this:
MyMessage message;
message.ParseFromString(str1 + str2);
is equivalent to this:
MyMessage message, message2;
message.ParseFromString(str1);
message2.ParseFromString(str2);
message.MergeFrom(message2);
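The "last value wins" rule for scalar fields can be illustrated without the Protobuf library. The Python sketch below uses a hand-rolled, hypothetical parser (parse_scalar_fields) that understands only varint fields and stores them in a dictionary, plus a hypothetical merge_scalar_fields that imitates merging for such fields. It does not model embedded messages or repeated fields.

```python
def read_varint(data, pos):
    """Decode one varint at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_scalar_fields(data):
    """Parse a message made only of varint fields into {field_number: value}."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        if key & 0x07 != 0:
            raise ValueError("this sketch only handles wire type 0")
        value, pos = read_varint(data, pos)
        fields[key >> 3] = value  # a later occurrence overwrites an earlier one
    return fields


def merge_scalar_fields(base, other):
    """Imitate merging for scalar fields: values set in other replace base's."""
    merged = dict(base)
    merged.update(other)
    return merged


str1 = bytes.fromhex("0896011007")  # field 1 = 150, field 2 = 7
str2 = bytes.fromhex("0803")        # field 1 = 3
combined = parse_scalar_fields(str1 + str2)
merged = merge_scalar_fields(parse_scalar_fields(str1), parse_scalar_fields(str2))
```

Both combined and merged come out as {1: 3, 2: 7}: field 1 takes the later value, and field 2 survives from the first message.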
Packed Repeated Fields
In a proto2 message, if a repeated field does not have [packed=true] set, the serialized message contains zero or more key-value pairs for that field, each with its own tag, like the fields of other types described above. These key-value pairs need not be contiguous and may be interleaved with other fields, but their order relative to one another is preserved when the message is parsed.
Starting with release 2.1.0, Protobuf introduced packed repeated fields, which are declared like ordinary repeated fields but with the [packed=true] option set.
In proto3, repeated fields of scalar numeric types are packed by default.
A packed repeated field differs from an ordinary repeated field in that, however many values it holds, serialization produces only a single key-value pair. Each value is serialized according to its type (but without a tag), and all the values are then concatenated and wrapped in a single wire type 2 (length-delimited) value.
For example:
message Test4 {
  repeated int32 d = 4 [packed=true];
}
Create a Test4 whose d field holds {3, 270, 86942}. Serializing the message yields:
22 // tag number 4, wire type 2
06 // length: 6 bytes
03 // first element: varint 3
8E 02 // second element: varint 270
9E A7 05 // third element: varint 86942
Only repeated fields of primitive numeric types (those that use the varint, 32-bit, or 64-bit wire types) can be declared "packed".
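The packed layout can be sketched in Python as follows. This is an illustrative sketch limited to non-negative varint elements; encode_packed_varints and decode_packed_varints are hypothetical names, and the decoder expects only the payload bytes, after the key and length have been stripped.

```python
def encode_varint(value):
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_packed_varints(field_number, values):
    """Encode all values under one key as a single length-delimited field."""
    payload = b"".join(encode_varint(v) for v in values)  # no per-element tags
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


def decode_packed_varints(payload):
    """Split a packed payload back into its varint elements."""
    values = []
    pos = 0
    while pos < len(payload):
        result = shift = 0
        while True:
            byte = payload[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                break
            shift += 7
        values.append(result)
    return values
```

For field number 4 and the values [3, 270, 86942], encode_packed_varints returns the eight bytes 22 06 03 8e 02 9e a7 05. Encoding the same three values unpacked would repeat the key before each element and take nine bytes.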