Introduction
What is Google Protocol Buffer? If you search the Internet, you will find an introduction similar to this:
Google Protocol Buffer (protobuf for short) is Google's internal standard for mixed-language data. Currently, more than 48,162 message types and more than 12,183 .proto files are in use. They are used in RPC systems and persistent data storage systems.
Protocol Buffers is a lightweight and efficient structured data storage format that can be used to serialize structured data. It is well suited as a data storage format or an RPC data exchange format. It is a language-independent, platform-independent, and extensible serialized structured data format for communication protocols, data storage, and other fields. Currently, APIs are provided for C++, Java, and Python.
Perhaps you are like me: after reading these introductions for the first time, you still do not understand what protobuf is. So I think a simple example should help.
A simple example
Installing Google Protocol Buffer
Download the source code of protobuf from http://code.google.com/p/protobuf/downloads/list. Decompress, compile, and install the tool to use it.
The installation steps are as follows:
tar -xzf protobuf-2.1.0.tar.gz
cd protobuf-2.1.0
./configure --prefix=$INSTALL_DIR
make
make check
make install
About the simple example
I plan to use protobuf and C++ to develop a very simple example program.
The program consists of two parts. The first part is called the writer, and the second part is called the reader.
The writer writes some structured data to a disk file, and the reader reads the structured data from that file and prints it to the screen.
The structured data used for the demonstration is helloworld, which contains two basic fields:
- id, which is an integer
- str, which is a string
Writing a .proto file
First, we need to write a .proto file that defines the structured data our program will process. In protobuf terminology, a piece of structured data is called a message. A .proto file looks very much like a data definition in Java or C. Listing 1 shows the content of the .proto file in the example application.
Listing 1. The .proto file
package lm;
message helloworld
{
    required int32 id = 1;   // ID
    required string str = 2; // str
    optional int32 opt = 3;  // optional field
}
A good habit is to take the file name of the .proto file seriously. For example, adopt the following naming rule:
packageName.MessageName.proto
In the preceding example, the package name is lm, and it defines a message named helloworld. The message has three members: id, of type int32; str, of type string; and opt, an optional member, which means a message may not contain it.
Compiling the .proto file
After writing the .proto file, you can use the protobuf compiler to compile it into the target language. In this example, we use C++.
If your .proto file is stored in $SRC_DIR and you want the generated files to go to $DST_DIR (which may be the same directory), run the following command:
protoc -I=$SRC_DIR --cpp_out=$DST_DIR $SRC_DIR/lm.helloworld.proto
The command generates two files:
lm.helloworld.pb.h, the header file that declares the C++ class
lm.helloworld.pb.cc, the implementation file of the C++ class
The generated header file defines a C++ class, helloworld, which the writer and the reader use to operate on messages, for example to assign values to message members and to serialize messages.
Writing the writer and the reader
As mentioned above, the writer writes structured data to disk for others to read. Without protobuf, there are still many options. One possible method is to convert the data to a string and then write the string to disk. You can use sprintf() to do the conversion, which is very simple: the number 123 becomes the string "123".
There seems to be nothing wrong with this, but on closer inspection it places high demands on whoever writes the reader, because the reader's author must know the details of the writer. For example, "123" can be the single number 123, but it can also be the three numbers 1, 2, and 3. To resolve this, the writer and the reader must agree on a separator character so that the reader can parse the data correctly. However, the separator may cause problems of its own. In the end, even a simple helloworld requires a lot of code just to handle the message format.
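The ambiguity is easy to reproduce. The following is a minimal sketch, not part of the article's sample program: write_naive is a hypothetical helper that formats each number with snprintf and concatenates the results without a separator, as a naive writer might.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not from the sample code): formats each integer
// with snprintf and appends it to the output with no separator.
std::string write_naive(const std::vector<int>& values) {
    std::string out;
    char buf[16];  // large enough for any 32-bit int plus the terminator
    for (int v : values) {
        std::snprintf(buf, sizeof(buf), "%d", v);
        out += buf;
    }
    return out;
}
```

Calling write_naive with the single value 123 and with the three values 1, 2, 3 yields the same string, so a reader cannot tell the two inputs apart without extra conventions.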
If you use protobuf, you do not need to consider these details.
With protobuf, the writer's job is simple. The structured data to be processed is described by the .proto file. After the compilation step in the previous section, the data structure corresponds to a C++ class defined in lm.helloworld.pb.h. In this example, the class name is lm::helloworld.
The writer needs to include this header file, and then it can use the class.
In the writer code, the structured data to be stored on disk is represented by an object of the lm::helloworld class, which provides a series of get/set functions to read and modify the data members, or fields, of the structured data.
When we need to save the structured data to disk, lm::helloworld provides methods that convert a complex data structure into a byte sequence, which we can then write to disk.
A program that wants to read this data only needs to use the corresponding deserialization method of lm::helloworld to convert the byte sequence back into structured data. This is similar to the "123" idea at the beginning, but protobuf is far more thorough than our crude string conversion, so we can confidently leave this kind of work to protobuf.
Listing 2 shows the main code of the writer. Quite simple, isn't it?
Listing 2. Main writer code
#include "lm.helloworld.pb.h"
…
int main(void)
{
    lm::helloworld msg1;
    msg1.set_id(101);
    msg1.set_str("hello");

    // Write the message to disk.
    fstream output("./log", ios::out | ios::trunc | ios::binary);

    if (!msg1.SerializeToOstream(&output)) {
        cerr << "Failed to write msg." << endl;
        return -1;
    }
    return 0;
}
msg1 is an object of the helloworld class, and set_id() sets the value of id. SerializeToOstream serializes the object and writes it to an fstream.
Listing 3 shows the main code of the reader.
Listing 3. Reader
#include "lm.helloworld.pb.h"
…
void ListMsg(const lm::helloworld & msg)
{
    cout << msg.id() << endl;
    cout << msg.str() << endl;
}

int main(int argc, char* argv[])
{
    lm::helloworld msg1;

    {
        fstream input("./log", ios::in | ios::binary);
        if (!msg1.ParseFromIstream(&input)) {
            cerr << "Failed to parse message." << endl;
            return -1;
        }
    }

    ListMsg(msg1);
    …
}
Similarly, the reader declares an object msg1 of the helloworld class and then uses ParseFromIstream to read data from an fstream and deserialize it. ListMsg then uses the get methods to read the message's fields and print them.
Running result
The result of running writer and reader is as follows:
>writer
>reader
101
hello
The reader reads the serialized data from the file log and prints it to the screen. All the sample code in this article can be downloaded from the attachment, so you can try it yourself.
This example does nothing useful in itself, but with a slight modification you can turn it into a more useful program. For example, if you replace the disk file with a network socket, you have a network-based data exchange task. Storage and exchange are exactly the areas where protobuf is most effective.
Comparison with other similar technologies
After reading this simple example, I hope you have an idea of what protobuf can do. You may say that there are many similar technologies in the world, such as XML, JSON, and Thrift. How does protobuf differ from them?
Put simply, protobuf's main advantages are simplicity and speed.
There are tests to support this: the thrift-protobuf-compare project compares these similar technologies. Figure 1 shows one of the project's test results, total time.
Figure 1. Performance Test Results
Total time refers to the time taken by one complete object operation, including creating the object, serializing it into a byte sequence in memory, and then deserializing it. The test results show that protobuf performs well. Interested readers can visit http://code.google.com/p/thrift-protobuf-compare/wiki/benchmarking for more detailed test results.
Advantages of protobuf
Protobuf is like XML, but smaller, faster, and simpler. You define your own data structure and then use the code produced by the code generator to read and write it. You can even update the data structure without redeploying your programs. By describing the data structure once with protobuf, you can easily read and write your structured data in different languages and from different data streams.
Another very good feature is its backward compatibility: you can upgrade the data structure without breaking deployed programs that rely on the old data format. Your programs therefore do not face large-scale refactoring or migration when the message structure changes, because adding a field to a message does not require any change to programs that have already been released.
Protobuf has clearer semantics and needs nothing like an XML parser, because the protobuf compiler compiles the .proto file into the data access classes that serialize and deserialize protobuf data.
Using protobuf does not require learning a complex document object model. Its programming model is friendly and easy to learn, and it has good documentation and examples. For people who like simple things, protobuf is more attractive than other technologies.
Shortcomings of protobuf
Protobuf also has shortcomings compared with XML. Its feature set is simple, so it cannot express complicated concepts.
XML has become a tool for writing many industry standards, whereas protobuf is only a tool used inside Google and is still far behind in terms of general adoption.
Because text is not well suited to describing data structures, protobuf is not suitable for modeling text-based markup documents such as HTML. In addition, because XML is self-describing to some extent, it can be read and edited directly. Protobuf cannot do this: it is stored in binary form, and unless you have the .proto definition, you cannot directly read any of the content [2].
Advanced application topics
More complex messages
So far, we have only given a simple example of little practical use. In real applications, people often need to define more complex messages. By "complex" we mean not only more fields or more types of fields, but also more complex data structures:
Nested message
Nesting is a magical concept. Once messages can be nested, their expressive power becomes very strong.
Listing 4 gives an example of a nested message.
Listing 4. Example of a nested message
message Person {
  required string name = 1;
  required int32 id = 2;        // Unique ID number for this person.
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }
  repeated PhoneNumber phone = 4;
}
In the message Person, the nested message PhoneNumber is defined and used to define the phone field of the Person message. This lets people define more complex data structures.
Import message
In a .proto file, you can also use the import keyword to bring in messages defined in other .proto files. Such a message can be called an import message or a dependency message.
For example:
Listing 5. Code
// import takes the quoted path of a .proto file; the file name here is assumed.
import "common.header.proto";

message youMsg {
  required common.info_header header = 1;
  required string youPrivateData = 2;
}
Here, common.info_header is defined in the common.header package.
Import messages mainly provide a convenient code management mechanism, similar to header files in C. You can define some common messages in one package, import that package into other .proto files, and then use its message definitions.
Google Protocol Buffer supports both nested messages and imported messages, which makes defining complex data structures easy and pleasant.
Dynamic compilation
Generally, people who use protobuf write the .proto file first and then use the protobuf compiler to generate the source files for the target language. The generated code is then compiled together with the application.
However, in some cases the .proto file cannot be known in advance, and the program needs to handle unknown .proto files dynamically. For example, a general-purpose message forwarding middleware cannot predict which messages it will need to process. This requires compiling the .proto file dynamically and then using the messages in it.
Protobuf provides the google::protobuf::compiler package for dynamic compilation. The main class is Importer, which is defined in importer.h. Using Importer is very simple. Figure 2 shows the relationship between Importer and several other important classes.
Figure 2. Importer class
An Importer object works with several main objects, including the MultiFileErrorCollector class, which handles errors, and the SourceTree class, which defines the source directory of the .proto files.
The following example shows how these classes relate to each other and how they are used.
For a given .proto file, such as lm.helloworld.proto, only a small amount of code is needed to compile it dynamically in a program, as shown in Listing 6.
Listing 6. Code
// MultiFileErrorCollector is abstract, so a concrete subclass is needed.
// MyErrorCollector is a hypothetical subclass that you would write yourself.
MyErrorCollector errorCollector;
google::protobuf::compiler::DiskSourceTree sourceTree;

google::protobuf::compiler::Importer importer(&sourceTree, &errorCollector);
sourceTree.MapPath("", protosrc);

importer.Import("lm.helloworld.proto");
First, construct an Importer object. The constructor takes two parameters. The first is a source tree object, which specifies the source directory that holds the .proto files. The second is an error collector object, which has an AddError method for handling syntax errors encountered while parsing the .proto file.
To compile a .proto file dynamically, you only need to call the Import method of the Importer object. Very simple.
So how do we use the dynamically compiled message? We first need to understand several other classes.
The google::protobuf package provides the following classes to represent the messages defined in a .proto file and the fields in those messages. Figure 3 shows the relationship between them.
Figure 3. Relationship between various compiler classes
The FileDescriptor class represents a compiled .proto file; the Descriptor class represents a message in that file; the FieldDescriptor class describes a specific field in a message.
For example, after compiling lm.helloworld.proto, you can use the following code to get the definition of lm.helloworld.id:
Listing 7. Getting the definition of lm.helloworld.id
const protobuf::Descriptor *desc =
    importer_.pool()->FindMessageTypeByName("lm.helloworld");
// Look up the field on the message descriptor itself.
const protobuf::FieldDescriptor* field =
    desc->FindFieldByName("id");
Through the methods and attributes of Descriptor and FieldDescriptor, an application can obtain all kinds of information about a message definition. For example, field->name() returns the field name. In this way, you can use a dynamically defined message.
Writing a new proto compiler
The protoc compiler released with the Google Protocol Buffer source code supports three programming languages: C++, Java, and Python. However, with the compiler package of Google Protocol Buffer, you can develop a new compiler that supports other languages.
The CommandLineInterface class encapsulates the front end of the protoc compiler, including parsing command-line parameters and compiling .proto files. What you need to do is implement a class derived from CodeGenerator to do the back-end work such as code generation.
The general framework of the program is shown in Figure 4:
Figure 4. xml compiler Diagram
In the main() function, create a CommandLineInterface object, cli, and call its RegisterGenerator() method to register the new language's back-end code generator object, yourG, with cli. Then call the Run() method of cli.
A compiler built this way is used in the same way as protoc and accepts the same command-line parameters. cli performs lexical and syntax analysis on the .proto file supplied by the user and finally produces a syntax tree. Figure 5 shows the structure of the tree.
Figure 5. syntax tree
Its root node is a FileDescriptor object (see the "Dynamic compilation" section), which is passed to yourG's Generate() method as an input parameter. In this method, you can traverse the syntax tree and generate the corresponding code. In short, to implement a new compiler, you only need to write a main function and a derived class that implements the Generate() method.
The downloadable attachment of this article includes a reference example, a compiler that compiles .proto files into XML.
More details about protobuf
It has been emphasized repeatedly that protobuf performs better than XML. It stores data efficiently in binary form and is 3 to 10 times smaller and 20 to 100 times faster than XML.
Serious programmers need an explanation for "3 to 10 times" and "20 to 100 times". So at the end of this article, let's go a little deeper into the internal implementation of protobuf.
Two techniques allow protobuf-based programs to greatly outperform XML.
First, we can examine protobuf's serialized data. You will see that the Protocol Buffer representation is very compact, which means smaller messages and fewer resources. For example, fewer bytes are transmitted over the network and less I/O is required, which improves performance.
Second, we need to understand the general process of protobuf unpacking to see why it is much faster than XML.
Encoding of Google Protocol Buffer
The binary messages produced by protobuf serialization are very compact, thanks to protobuf's clever encoding method.
Before examining the message structure, let me first introduce a term: varint.
Varint is a compact way of representing numbers. It uses one or more bytes to represent a number, and the smaller the value, the fewer bytes it uses. This reduces the number of bytes needed to represent numbers.
For example, an int32 number generally requires four bytes. With varint, however, a small int32 number can be represented in one byte. Of course, everything has its downside: with varint, a large number needs five bytes. Statistically, though, most messages do not contain many large numbers, so in most cases varint represents numeric data in fewer bytes. Varint is described in detail below.
The highest bit of each byte in a varint has a special meaning. If this bit is 1, the next byte is also part of the number. If this bit is 0, the number ends with this byte. The other 7 bits carry the value. Therefore, numbers smaller than 128 can be represented in one byte. A number of 128 or more, such as 300, needs two bytes: 1010 1100 0000 0010
Figure 6 demonstrates how Google Protocol Buffer parses these two bytes. Note that the two 7-bit groups are swapped before the final calculation, because Google Protocol Buffer stores the least significant group first (little-endian order).
Figure 6. varint Encoding
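The varint rule described above can be sketched in a few lines of C++. This is an illustration under stated assumptions, not protobuf's own implementation: encode_varint and decode_varint are hypothetical names, and the decoder assumes well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encodes an unsigned value as a varint: 7 payload bits per byte, least
// significant group first, high bit set on every byte except the last.
std::vector<std::uint8_t> encode_varint(std::uint64_t value) {
    std::vector<std::uint8_t> out;
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
    return out;
}

// Decodes a varint starting at bytes[pos] and advances pos past it.
// Assumes the input is well formed (no bounds or overflow checks).
std::uint64_t decode_varint(const std::vector<std::uint8_t>& bytes,
                            std::size_t& pos) {
    std::uint64_t result = 0;
    int shift = 0;
    while (true) {
        const std::uint8_t byte = bytes[pos++];
        result |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;  // high bit clear: last byte
        shift += 7;
    }
    return result;
}
```

For the value 300, encode_varint produces the two bytes 0xAC 0x02, which is the bit pattern 1010 1100 0000 0010 given in the text.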
After a message is serialized, it becomes a binary data stream. The data in the stream is a series of key-value pairs, as shown in Figure 7:
Figure 7. Message Buffer
With this key-value structure, no separators are needed between fields. For an optional field, if the field is not set in the message, it does not appear in the final message buffer at all. These features help keep the message small.
Take the message in Listing 1 as an example. Suppose we build the following message, test1:
test1.id = 10; test1.str = "hello";
The final message buffer contains two key-value pairs, one for the id field of the message and the other for str.
The key identifies a specific field. When unpacking, Protocol Buffer uses the key to determine which field of the message a value belongs to.
The key is defined as follows:
(field_number << 3) | wire_type
The key consists of two parts. The first part is field_number; for example, the field_number of the field id in message lm.helloworld is 1. The second part is wire_type, which indicates how the value is encoded for transmission.
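The key formula can be tried out directly. The following sketch uses hypothetical helper names (make_key, key_field_number, key_wire_type) and relies only on the formula above, with wire type 0 for varint values and 2 for length-delimited values.

```cpp
#include <cstdint>

// Builds a key from a field number and a wire type:
// (field_number << 3) | wire_type.
std::uint32_t make_key(std::uint32_t field_number, std::uint32_t wire_type) {
    return (field_number << 3) | wire_type;
}

// The upper bits of the key hold the field number.
std::uint32_t key_field_number(std::uint32_t key) {
    return key >> 3;
}

// The lowest three bits of the key hold the wire type.
std::uint32_t key_wire_type(std::uint32_t key) {
    return key & 0x7;
}
```

With these helpers, field 1 with wire type 0 gives the key 0x08, and field 2 with wire type 2 gives 0x12.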
The possible wire types are shown in the following table:
Table 1. Wire types
Type | Meaning | Used for
0 | Varint | int32, int64, uint32, uint64, sint32, sint64, bool, enum
1 | 64-bit | fixed64, sfixed64, double
2 | Length-delimited | string, bytes, embedded messages, packed repeated fields
3 | Start group | groups (deprecated)
4 | End group | groups (deprecated)
5 | 32-bit | fixed32, sfixed32, float
In our example, the field id uses the data type int32, so the corresponding wire type is 0. Careful readers may notice that type 0 can represent both int32 and sint32, two very similar data types. Google Protocol Buffer distinguishes them mainly to reduce the number of bytes after encoding.
In a computer, a negative number is generally represented as a very large integer, because the sign is stored in the highest bit. If a negative int32 is encoded as a plain varint, it is sign-extended and always takes ten bytes. For this reason, Google Protocol Buffer defines the sint32 type, which uses zigzag encoding.
Zigzag encoding uses unsigned numbers to represent signed numbers, with positive and negative numbers interleaved. This is what the word zigzag means, as shown in Figure 8:
Figure 8. Zigzag Encoding
With zigzag encoding, numbers with a small absolute value can be represented in fewer bytes whether they are positive or negative, which makes full use of the varint technique.
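The interleaving can be written as a pair of small functions. This is a minimal sketch for 32-bit values with hypothetical names (zigzag_encode32, zigzag_decode32); it maps 0, -1, 1, -2, 2 to 0, 1, 2, 3, 4, and so on.

```cpp
#include <cstdint>

// Maps a signed value to an unsigned one so that values with a small
// absolute value become small unsigned numbers: 0->0, -1->1, 1->2, -2->3.
std::uint32_t zigzag_encode32(std::int32_t n) {
    const std::uint32_t sign_mask = (n < 0) ? 0xFFFFFFFFu : 0u;
    return (static_cast<std::uint32_t>(n) << 1) ^ sign_mask;
}

// Inverse mapping: the lowest bit carries the sign, the rest the magnitude.
std::int32_t zigzag_decode32(std::uint32_t z) {
    return static_cast<std::int32_t>((z >> 1) ^ (0u - (z & 1u)));
}
```

After this mapping, -1 becomes 1, which fits in a single varint byte.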
Other data types, such as strings, use a representation similar to varchar in databases: a varint gives the length, and the content itself follows the length.
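As an illustration of this length-prefixed layout, the sketch below encodes a string field by hand: the key, then the length as a varint, then the raw bytes. The helper names append_varint and encode_string_field are hypothetical and not part of the protobuf API.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Appends value to out as a varint (7 bits per byte, low group first).
void append_varint(std::vector<std::uint8_t>& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Encodes one string field: key with wire type 2 (length-delimited),
// then the length as a varint, then the string's bytes.
std::vector<std::uint8_t> encode_string_field(std::uint32_t field_number,
                                              const std::string& value) {
    std::vector<std::uint8_t> out;
    append_varint(out, (field_number << 3) | 2u);
    append_varint(out, value.size());
    out.insert(out.end(), value.begin(), value.end());
    return out;
}
```

For field number 2 and the string "hello", this yields 12 05 68 65 6C 6C 6F: one key byte, one length byte, and the five ASCII characters.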
From this introduction to the protobuf encoding, you have surely noticed that protobuf messages are small and well suited to network transmission. If you have no patience for or interest in the technical details, the following simple and intuitive comparison should leave a deeper impression.
For the message in Listing 1, with id set to 101 and str set to "hello" as in Listing 2, the byte sequence after protobuf serialization is:
08 65 12 05 68 65 6C 6C 6F
If XML is used, the result is similar to the following:
3C 68 65 6C 6C 6F 77 6F 72 6C 64 3E 3C 69 64 3E 31 30 31 3C 2F 69 64 3E 3C 6E 61 6D 65 3E 68 65 6C 6C 6F 3C 2F 6E 61 6D 65 3E 3C 2F 68 65 6C 6C 6F 77 6F 72 6C 64 3E
That is 55 bytes in total. Interpreted as ASCII, these strange numbers mean:
<helloworld><id>101</id><name>hello</name></helloworld>
Speed of unpacking
First, let's look at the XML unpacking process. The XML must be read from the file as a string and converted into an XML document object model. Then the string of the specified node is read from the document object model and converted into a variable of the specified type. This process is very complicated. In particular, converting an XML file into a document object model usually requires complex computation such as lexical and syntax analysis and consumes a lot of CPU.
In contrast, protobuf simply reads a binary sequence, in the specified format, into the corresponding C++ structure. As the previous section showed, decoding a message can be done with expressions made up of a few shift operations, so it is very fast.
To show that this is not a casual remark, let's analyze the code path of protobuf unpacking.
Taking the reader in Listing 3 as an example, the program first calls the ParseFromIstream method of msg1, which parses the binary data stream read from the file and assigns the parsed data to the corresponding data members of the helloworld class.
This process is shown in Figure 9:
Figure 9. unpacking Flowchart
The entire parsing process relies on protobuf's own framework code together with the code generated by the protobuf compiler. As the general framework, protobuf provides the base classes Message and MessageLite, along with the CodedInputStream class, the WireFormatLite class, and others, which provide the decoding functions for binary data. As the analysis in the encoding section showed, protobuf decoding needs only a few simple mathematical operations and no complex lexical or syntax analysis, so methods such as ReadTag() are very fast. The other classes and methods on this call path are also very simple, and interested readers can read them for themselves. Compared with the XML parsing process, the flowchart above is really simple, isn't it? This is the second reason for protobuf's high efficiency.
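To make the "few simple operations" concrete, here is a hand-written decoder for the helloworld message. It is a minimal sketch under stated assumptions, not the code protobuf generates: HelloWorld is a hypothetical plain struct standing in for lm::helloworld, and read_varint and parse_helloworld are hypothetical helpers. Fields other than 1 (id) and 2 (str) are skipped.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical plain struct standing in for the generated lm::helloworld.
struct HelloWorld {
    std::int32_t id = 0;
    std::string str;
};

// Reads one varint at pos. Returns false if the buffer ends early or the
// varint is longer than ten bytes.
bool read_varint(const std::vector<std::uint8_t>& buf, std::size_t& pos,
                 std::uint64_t& value) {
    value = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        if (pos >= buf.size()) return false;
        const std::uint8_t byte = buf[pos++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return true;
    }
    return false;
}

// Walks the key-value pairs: read a key, split it with a shift and a mask,
// then read or skip the value according to its wire type.
bool parse_helloworld(const std::vector<std::uint8_t>& buf, HelloWorld& msg) {
    std::size_t pos = 0;
    while (pos < buf.size()) {
        std::uint64_t key = 0;
        if (!read_varint(buf, pos, key)) return false;
        const std::uint64_t field_number = key >> 3;
        const unsigned wire_type = static_cast<unsigned>(key & 0x7);

        if (wire_type == 0) {  // varint
            std::uint64_t v = 0;
            if (!read_varint(buf, pos, v)) return false;
            if (field_number == 1) msg.id = static_cast<std::int32_t>(v);
        } else if (wire_type == 2) {  // length-delimited
            std::uint64_t len = 0;
            if (!read_varint(buf, pos, len)) return false;
            if (len > buf.size() - pos) return false;
            const auto first = buf.begin() + static_cast<std::ptrdiff_t>(pos);
            const auto last = first + static_cast<std::ptrdiff_t>(len);
            if (field_number == 2) msg.str.assign(first, last);
            pos += static_cast<std::size_t>(len);
        } else if (wire_type == 1) {  // 64-bit: skip eight bytes
            if (buf.size() - pos < 8) return false;
            pos += 8;
        } else if (wire_type == 5) {  // 32-bit: skip four bytes
            if (buf.size() - pos < 4) return false;
            pos += 4;
        } else {
            return false;  // groups are not handled in this sketch
        }
    }
    return true;
}
```

Because unknown fields are skipped rather than rejected, a reader written this way keeps working when newer writers add fields, which is the basis of the compatibility described earlier.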
Conclusion
The more you learn, the more ignorant you feel. I am somewhat alarmed to find that I have written an article about serialization. It surely contains things I took for granted or was too sure of. I hope readers will sift out what is wrong and keep what is right, and I hope that real experts will not hesitate to write to me with corrections. Thank you.