Protocol Buffer Technical Explanation (language specification)
The content of this blog series comes mainly from the official Protocol Buffer documentation, while the code samples are drawn from a demo of an internal company project currently under development. The aim is to keep the good style and structure of Google's documentation while adding more practical, common use cases, so that the series can serve both internal company training and technical exchange with the wider community. Note that these posts are not a line-by-line translation: they include some summaries drawn from experience, and some rarely used features are not explained. Interested developers can consult Google's official documentation directly.
First, why use Protocol Buffer?
Before answering this question, consider a scenario that comes up often in real-world development. Our client programs are written in Java and may run on different platforms, such as Linux, Windows, or Android, while our server programs usually run on Linux and are written in C++. There are several ways to design the message format for communication between the two, for example:
1. Transfer one-byte-aligned C++ struct data directly. As long as the struct is declared with a fixed-length layout, this is very convenient for C++ programs: the receiver only has to cast the received data to the struct type, and even variable-length structures are not much trouble. To send data, you define a struct variable, set the values of its members, and send the binary data to the remote end as a char*. For Java developers, however, this approach is very cumbersome. The received data first has to be stored in a ByteBuffer, each field is then read in the agreed byte order, and each value is assigned to a field of a separate value object so that the rest of the code logic is easy to write. With this kind of design, joint debugging can only begin once both the client and the server have finished writing their message-building code, so the design directly slows the progress of the Java side. Even in the debugging phase, it is common to hit small errors in the Java code that assembles messages field by field.
2. Use SOAP (web services) as the message format. The generated messages are text based and carry a large amount of descriptive XML, which greatly increases the network I/O load. XML parsing is also complex, which considerably reduces message-parsing performance. In short, this design significantly lowers the overall performance of the system.
Protocol Buffer solves the problems of both approaches well. It also has another very important advantage: it can preserve compatibility between old and new versions of the same message. The details are given later in this series.
Second, define the first Protocol Buffer message.
Create a file with a .proto extension, such as MyMessage.proto, and save the following content in it.
message LogonReqMessage {
    required int64 acctid = 1;
    required string passwd = 2;
}
Key notes on the message definition above:
1. message is the keyword that defines a message, equivalent to struct/class in C++ or class in Java.
2. LogonReqMessage is the name of the message, equivalent to a struct name or class name.
3. The required prefix indicates that the field is mandatory: it must be set before serialization and must be present at deserialization. Protocol Buffer has two other similar keywords, optional and repeated; fields with either of these qualifiers are not subject to that constraint. Compared with optional, repeated is used mainly to represent array-like fields. Specific usage is shown in the examples that follow.
4. int64 and string are the field types, here a 64-bit integer and a string. Protocol Buffer has a type comparison table that maps each .proto type to the corresponding type in other languages (C++/Java) and notes which type is more efficient for which kind of data. The table is given later in this post.
5. acctid and passwd are the field names, equivalent to member variable names in C++ or field names in Java.
6. The tag numbers 1 and 2 identify the fields in the serialized binary data. The official serializers normally write fields in tag order, so in this example the encoded passwd data follows acctid, although parsers accept fields in any order. A tag number must be unique within a message. Tags 1 to 15 are encoded compactly: the tag and the type information together occupy a single byte. Tags 16 to 2047 occupy two bytes, and the largest tag Protocol Buffer supports is 2^29 - 1. For this reason, when designing a message, try to give frequently occurring fields, repeated fields in particular, tags between 1 and 15 to save bytes in the encoded data.
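The byte counts in note 6 can be checked by hand. The following is a minimal Python sketch, not the official library: it assumes the documented wire format in which each field is preceded by a key equal to (tag << 3) | wire_type, written as a base-128 varint. The helper names (encode_varint, encode_key, encode_logon_req) and the sample values are made up for illustration.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (negatives are not handled here)."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)

WIRE_VARINT = 0            # int32, int64, bool, enum, ...
WIRE_LENGTH_DELIMITED = 2  # string, bytes, nested messages

def encode_key(tag: int, wire_type: int) -> bytes:
    """The key written before each field value: (tag << 3) | wire_type."""
    return encode_varint((tag << 3) | wire_type)

def encode_logon_req(acctid: int, passwd: str) -> bytes:
    """Hand-encode the two fields of LogonReqMessage (illustration only)."""
    raw = passwd.encode("utf-8")
    return (encode_key(1, WIRE_VARINT) + encode_varint(acctid)
            + encode_key(2, WIRE_LENGTH_DELIMITED) + encode_varint(len(raw)) + raw)

# Tags 1 and 15 fit in a one-byte key; 16 and 2047 need two bytes; 2048 needs three.
for tag in (1, 15, 16, 2047, 2048):
    print(tag, len(encode_key(tag, WIRE_VARINT)))

print(encode_logon_req(150, "secret").hex())
```

The three spare bits taken by the wire type are the reason the one-byte range ends at tag 15 rather than at 127.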
Third, define the second Protocol Buffer message (containing an enumeration field).
// When defining a message in Protocol Buffer, you can add comments in the same way as in C++/Java code.
enum UserStatus {
    OFFLINE = 0;  // Represents a user in the offline state
    ONLINE = 1;   // Represents a user in the online state
}
message UserInfo {
    required int64 acctid = 1;
    required string name = 2;
    required UserStatus status = 3;
}
Key notes on the message definition above (only points not covered in the previous section):
1. enum is the keyword that defines an enumeration type, equivalent to enum in C++/Java.
2. UserStatus is the name of the enumeration.
3. Unlike enumerations in C++/Java, the delimiter between enumeration values is a semicolon, not a comma.
4. OFFLINE and ONLINE are the enumeration values.
5. 0 and 1 are the actual integer values corresponding to the enumeration values. As in C++, you can assign arbitrary integer values to the enumeration values; they do not have to start from 0. For example:
enum OperationCode {
    LOGON_REQ_CODE = 101;
    LOGOUT_REQ_CODE = 102;
    RETRIEVE_BUDDIES_REQ_CODE = 103;
    LOGON_RESP_CODE = 1001;
    LOGOUT_RESP_CODE = 1002;
    RETRIEVE_BUDDIES_RESP_CODE = 1003;
}
Fourth, define the third Protocol Buffer message (containing a nested message field).
We can define multiple messages in the same .proto file, which makes it easy to define nested messages. For example:
enum UserStatus {
    OFFLINE = 0;
    ONLINE = 1;
}
message UserInfo {
    required int64 acctid = 1;
    required string name = 2;
    required UserStatus status = 3;
}
message LogonRespMessage {
    // LoginResult is not defined in this excerpt; it is assumed to be declared elsewhere.
    required LoginResult logonResult = 1;
    required UserInfo userInfo = 2;
}
Key notes on the message definition above (only points not covered in the previous two sections):
1. The definition of LogonRespMessage contains a field whose type is another message, here UserInfo userInfo.
2. In the example above, UserInfo and LogonRespMessage are defined in the same .proto file. Can we also use messages defined in other .proto files? Protocol Buffer provides the import keyword for this. We can define many common messages in one .proto file, and other message definition files can then pull in those messages with import, for example:
import "myproject/CommonMessages.proto";
Fifth, the basic rules of the qualifiers (required/optional/repeated).
1. A required field must be set in every message instance. A message may declare any number of required fields; the language itself does not force a message to contain one.
2. Each message can contain zero or more optional fields.
3. repeated marks a field that can hold zero or more values. Note that, unlike a statically declared C++ array, which must have at least one element, a repeated field may be empty.
4. If you add a new field to an existing message format and old versions of the program must still be able to read and write messages normally, the new field must be optional or repeated. The reason is simple: an old program never sets the new field, so its messages would always be missing a newly required field.
Sixth, the type comparison table.
.proto Type | Notes | C++ Type | Java Type
double | | double | double
float | | float | float
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint32 instead. | int32 | int
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint64 instead. | int64 | long
uint32 | Uses variable-length encoding. | uint32 | int
uint64 | Uses variable-length encoding. | uint64 | long
sint32 | Uses variable-length encoding. Signed int value. Encodes negative numbers more efficiently than a regular int32. | int32 | int
sint64 | Uses variable-length encoding. Signed int value. Encodes negative numbers more efficiently than a regular int64. | int64 | long
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long
sfixed32 | Always four bytes. | int32 | int
sfixed64 | Always eight bytes. | int64 | long
bool | | bool | boolean
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String
bytes | May contain any arbitrary sequence of bytes. | string | ByteString
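The efficiency notes in the table follow from how many bytes a varint needs. The sketch below is a rough Python illustration, assuming the documented encodings: int32/int64 write negative numbers as 64-bit two's complement, while sint32/sint64 apply ZigZag mapping first. The function names are hypothetical helpers, not part of any Protocol Buffer API.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def varint_size(value: int) -> int:
    """Bytes needed to varint-encode an unsigned 64-bit value (7 payload bits per byte)."""
    size = 1
    while value > 0x7F:
        value >>= 7
        size += 1
    return size

def int64_size(value: int) -> int:
    """int32/int64: a negative number is encoded as its 64-bit two's complement."""
    return varint_size(value & MASK64)

def zigzag64(value: int) -> int:
    """sint64 mapping: 0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ..."""
    return ((value << 1) ^ (value >> 63)) & MASK64

def sint64_size(value: int) -> int:
    return varint_size(zigzag64(value))

print(int64_size(-1), sint64_size(-1))               # int64 vs sint64 for a small negative
print(varint_size(2**28 - 1), varint_size(2**28))    # crossing the fixed32 threshold
print(varint_size(2**56 - 1), varint_size(2**56))    # crossing the fixed64 threshold
```

A varint carries 7 bits per byte, so a value needs more than four bytes from 2^28 upward and more than eight bytes from 2^56 upward, which is where the fixed-width types become the smaller choice.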
Seventh, rules for upgrading Protocol Buffer messages.
In real development, a message format sometimes has to be upgraded because requirements change, while some applications that use the original format cannot be upgraded immediately. We therefore need to follow certain rules when upgrading a message format, so that old and new programs, using the old and new formats, can run side by side. The rules are as follows:
1. Do not change the tag number of any existing field.
2. Any newly added field must use the optional or repeated qualifier; otherwise message compatibility between old and new programs cannot be guaranteed when they exchange messages.
3. Existing required fields cannot be removed from the original message. Fields of the optional and repeated kinds can be removed, but the tag numbers they used must be retired and must not be reused by new fields.
4. int32, uint32, int64, uint64, and bool are all compatible with one another; sint32 and sint64 are compatible with each other; string and bytes are compatible as long as the bytes are valid UTF-8; fixed32 is compatible with sfixed32, and fixed64 with sfixed64. If you need to change the type of an existing field, change it only to a type compatible with the original; otherwise compatibility between the old and new message formats is broken.
5. The optional and repeated qualifiers are also compatible with each other.
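Rules 1 and 2 work because a parser can skip any field whose tag it does not recognize. The following Python sketch is a simplified stand-in for a real parser: it handles only wire types 0 (varint) and 2 (length-delimited), and the tag 3 field is a hypothetical later addition to the logon request from the second section.

```python
def read_varint(buf: bytes, pos: int):
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:      # high bit clear: this was the last byte
            return result, pos
        shift += 7

def parse_known_fields(buf: bytes, known_tags: set) -> dict:
    """Decode varint and length-delimited fields, keeping only the tags the reader knows."""
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        if tag in known_tags:
            fields[tag] = value  # fields with unknown tags are consumed and dropped
    return fields

# acctid = 150 (tag 1) and passwd = "secret" (tag 2), followed by a hypothetical
# new optional varint field (tag 3, value 1) written by a newer program.
new_message = bytes.fromhex("0896011206736563726574") + bytes([0x18, 0x01])

old_view = parse_known_fields(new_message, {1, 2})     # old program: ignores tag 3
new_view = parse_known_fields(new_message, {1, 2, 3})  # new program: sees all fields
print(old_view)
print(new_view)
```

The old reader still recovers acctid and passwd unchanged. This only holds while tag numbers keep their original meaning, which is why rules 1 and 3 forbid renumbering or reusing them.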
Eighth, packages.
We can define a package name in the .proto file, for example:
package ourproject.lyphone;
In the generated C++ code the package name becomes nested namespaces, that is, namespace ourproject { namespace lyphone { ... } }. In the generated Java code it becomes the Java package name.
Ninth, options.
Protocol Buffer lets us define some common options in the .proto file. These instruct the Protocol Buffer compiler to generate code that better fits the target language. The built-in options fall into three levels:
1. File level: such options affect all messages and enumerations defined in the current file.
2. Message level: such options affect only one message and all the fields it contains.
3. Field level: such options affect only the field they are attached to.
Some commonly used Protocol Buffer options are described below.
1. option java_package = "com.companyname.projectname";
java_package is a file-level option. It sets the package name of the generated Java code to the option value, so in the example above the Java package is com.companyname.projectname. The generated Java files are also placed automatically in the com/companyname/projectname subdirectory of the specified output directory. If this option is not specified, the Java package name is the name given by the package keyword. This option has no effect on generated C++ code.
2. option java_outer_classname = "LYPhoneMessage";
java_outer_classname is a file-level option whose main purpose is to explicitly specify the outer class name of the generated Java code. If this option is not specified, the outer class name is derived from the file name of the .proto file, converted to camel case. For example, for my_project.proto the default outer class name is MyProject. This option has no effect on generated C++ code.
Note: This option exists mainly because Java allows only one public top-level class or interface per .java file, while C++ has no such restriction. The messages defined in the .proto file therefore become inner classes of the specified outer class, so that they can all be generated into the same Java file. In practice, to avoid typing the outer class qualifier every time, you can statically import the outer class into the current Java file, for example: import static com.company.project.LYPhoneMessage.*;
3. option optimize_for = LITE_RUNTIME;
optimize_for is a file-level option. Protocol Buffer defines three optimization levels: SPEED, CODE_SIZE, and LITE_RUNTIME. SPEED is the default.
SPEED: The generated code runs efficiently, but takes up more space once compiled.
CODE_SIZE: The opposite of SPEED. The code runs less efficiently but takes up less space, and is typically used on resource-constrained platforms such as mobile devices.
LITE_RUNTIME: The generated code runs efficiently and also takes up very little space once compiled, at the cost of the reflection functionality that Protocol Buffer provides. When linking the Protocol Buffer library in C++, we therefore only need libprotobuf-lite rather than libprotobuf. In Java, we only need to include protobuf-java-2.4.1-lite.jar rather than protobuf-java-2.4.1.jar.
Note: With the LITE_RUNTIME option, the generated classes inherit from MessageLite rather than Message.
4. [packed = true]: For historical reasons, repeated fields of numeric types such as int32 and int64 are not encoded very efficiently by default. In more recent versions of Protocol Buffer, you can add the [packed=true] field option to tell Protocol Buffer to encode such a field more compactly. For example:
repeated int32 samples = 4 [packed=true];
Note: Use this option only with Protocol Buffer 2.3.0 or later.
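To see what [packed=true] saves, the two encodings can be compared by hand. This is a minimal Python sketch under the documented wire format, not generated code: an unpacked repeated field writes its key before every element, while a packed field writes one length-delimited key, a byte length, and then the bare values. The helper names and sample values are illustrative.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (negatives are not handled here)."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)

def encode_unpacked(tag: int, values) -> bytes:
    """One varint key (wire type 0) in front of every element."""
    key = encode_varint((tag << 3) | 0)
    return b"".join(key + encode_varint(v) for v in values)

def encode_packed(tag: int, values) -> bytes:
    """A single length-delimited key (wire type 2), the payload length, then all elements."""
    payload = b"".join(encode_varint(v) for v in values)
    return encode_varint((tag << 3) | 2) + encode_varint(len(payload)) + payload

samples = [3, 270, 86942]
unpacked = encode_unpacked(4, samples)
packed = encode_packed(4, samples)
print(len(unpacked), unpacked.hex())
print(len(packed), packed.hex())
```

With only three elements the saving is a single byte, but it grows by one key per element as the list gets longer, and by more than that for tags above 15.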
5. [default = default_value]: If an optional field is not set at serialization time, or does not exist at all in an older version of the message, then when a message of that type is deserialized the field is given a type-dependent default value. For example, a bool defaults to false and an int32 defaults to 0. Protocol Buffer also supports custom default values, for example:
optional int32 result_per_page = 3 [default = 10];
Tenth, the command-line compilation tool.
protoc --proto_path=IMPORT_PATH --cpp_out=DST_DIR --java_out=DST_DIR --python_out=DST_DIR path/to/file.proto
The parameters of the command above are explained here.
1. protoc is the command-line compiler that ships with Protocol Buffer.
2. --proto_path is equivalent to the -I option. It specifies a directory in which to look for .proto message definition files, and it can be given more than once.
3. --cpp_out generates C++ code, --java_out generates Java code, and --python_out generates Python code. The directory that follows each option is where the generated code is written.
4. path/to/file.proto is the message definition file to be compiled.
Note: For C++, the Protocol Buffer compiler generates a pair of .h and .cc files for each .proto file, and the generated files can be added directly to the application's project. For example, MyMessage.proto produces MyMessage.pb.h and MyMessage.pb.cc.