The use and principle of Google Protocol Buffer

Last Update:2016-02-05 Source: Internet

Author: User

This article is reproduced from IBM developerWorks: http://www.ibm.com/developerworks/cn/linux/l-cn-gpb/

Brief introduction

What is Google Protocol Buffer? If you search online, you will find text like this:

Google Protocol Buffer (protobuf for short) is Google's internal language-neutral data standard, with more than 48,162 message format definitions and more than 12,183 .proto files already in use. They are used in RPC systems and persistent data storage systems.

Protocol Buffers is a lightweight, efficient format for serializing structured data. It is well suited to data storage and to RPC data interchange. It is a language-independent, platform-independent, extensible serialization format that can be used in communication protocols, data storage, and other fields. APIs are currently available for three languages: C++, Java, and Python.

Perhaps you, like me, still do not understand what protobuf is after reading these introductions for the first time. I think a simple example will help more.

A simple example

Installing Google Protocol Buffer

You can download the protobuf source code from http://code.google.com/p/protobuf/downloads/list. Then unpack it and install it.

The installation steps are as follows:

tar -xzf protobuf-2.1.0.tar.gz
cd protobuf-2.1.0
./configure --prefix=$INSTALL_DIR
make
make check
make install

Description of the simple example

I am going to use protobuf and C++ to develop a very simple example program.

The program is made up of two parts. The first part is called Writer, the second part is called Reader.

Writer is responsible for writing some structured data to a disk file, and Reader is responsible for reading that structured data back from the disk file and printing it to the screen.

The structured data used for this demonstration is helloworld, which contains two basic fields:

id, an integer
str, a string

Writing a .proto file

First we need to write a .proto file that defines the structured data our program will process. In protobuf terminology, a unit of structured data is called a Message. A .proto file looks very much like a data definition in Java or C. Listing 1 shows the contents of the .proto file for the example application.

Listing 1. The .proto file

package lm;
message helloworld
{
    required int32     id = 1;   // ID
    required string    str = 2;  // str
    optional int32     opt = 3;  // optional field
}

It is a good habit to name .proto files carefully. For example, you might adopt the following naming convention:

packageName.MessageName.proto

In the example above, the package name is lm, and it defines a message helloworld with three members: id, of type int32; str, of type string; and opt, an optional member, which means a message may omit it.

Compiling the .proto file

Once you have written the .proto file, you can compile it into the target language with the protobuf compiler. In this example we will use C++.

Suppose your .proto file is stored under $SRC_DIR and you want the generated files placed in $DST_DIR (which may be the same directory). You can use the following command:

protoc -I=$SRC_DIR --cpp_out=$DST_DIR $SRC_DIR/lm.helloworld.proto

The command generates two files:

lm.helloworld.pb.h, the header file that defines the C++ class

lm.helloworld.pb.cc, the implementation file for the C++ class

The generated header file defines a C++ class named helloworld, which Writer and Reader will later use to manipulate the message. Operations such as assigning a value to a member of the message or serializing the message each have a corresponding method.

Writing Writer and Reader

As mentioned earlier, Writer writes a piece of structured data to disk for others to read. If we did not use protobuf, there would still be many options. One possible way is to convert the data to a string and then write the string to disk. The conversion can use sprintf(), which is very simple: the number 123 becomes the string "123".

There seems to be nothing wrong with this, but on closer thought it places a heavy burden on whoever writes Reader, because Reader's author must know the details of Writer. For example, "123" can be the single number 123, but it can also be the three numbers 1, 2, and 3, and so on. So Writer must also define some delimiter-like character so that Reader can read the data correctly. But a delimiter may cause other problems in turn. In the end we find that even a simple helloworld needs a lot of code just to handle the message format.
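
To make the ambiguity concrete, here is a minimal C++ sketch of the naive approach described above. The helper append_number is hypothetical and exists only for this illustration: it appends a number to a text buffer as decimal digits, with no delimiter.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Naive writer: append each number as decimal text, with no delimiter.
// (Hypothetical helper, for illustration only.)
std::string append_number(const std::string& buffer, int value) {
    char text[32];
    int written = std::snprintf(text, sizeof(text), "%d", value);
    assert(written > 0 && written < static_cast<int>(sizeof(text)));
    (void)written;  // avoids an unused-variable warning when NDEBUG is defined
    return buffer + text;
}
```

Writing the single number 123 and writing 12 followed by 3 both produce the text "123", so a reader that sees only the text cannot tell which numbers were written.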

If you use protobuf, these details are taken care of for you and the application does not have to deal with them.

With protobuf, Writer becomes simple. The structured data to be handled is described by the .proto file, which, after the compilation step in the previous section, corresponds to a C++ class defined in lm.helloworld.pb.h. For this example, the class is named lm::helloworld.

Writer needs to include the header file, and then the class can be used.

In the Writer code, the structured data to be stored on disk is represented by an object of the lm::helloworld class, which provides a series of get/set functions for reading and modifying the data members, or fields, of the structured data.

When we need to save this structured data to disk, class lm::helloworld already provides a method that turns a complex piece of data into a sequence of bytes, which we can then write to disk.

A program that wants to read this data only needs to use the corresponding deserialization method of class lm::helloworld to convert the byte sequence back into structured data. This is the same idea as the "123" approach we started with, but protobuf is far more thorough than our crude string conversion, so we might as well leave this sort of work to protobuf.

Listing 2 shows the main code of Writer; you will find it very simple.

Listing 2. Writer's main code

#include <fstream>
#include <iostream>
#include "lm.helloworld.pb.h"
...
using namespace std;

int main(void)
{
    lm::helloworld msg1;
    msg1.set_id(101);
    msg1.set_str("Hello");

    // Write the message to disk.
    fstream output("./log", ios::out | ios::trunc | ios::binary);

    if (!msg1.SerializeToOstream(&output)) {
        cerr << "Failed to write msg." << endl;
        return -1;
    }
    return 0;
}

msg1 is an object of the helloworld class, and set_id() sets the value of id. SerializeToOstream serializes the object and writes it to an fstream.

Listing 3 shows the main code of Reader.

Listing 3. Reader

#include <fstream>
#include <iostream>
#include "lm.helloworld.pb.h"
...
using namespace std;

// Print the fields of a message.
void ListMsg(const lm::helloworld& msg) {
    cout << msg.id() << endl;
    cout << msg.str() << endl;
}

int main(int argc, char* argv[]) {
    lm::helloworld msg1;

    {
        // Read the serialized message back from disk.
        fstream input("./log", ios::in | ios::binary);
        if (!msg1.ParseFromIstream(&input)) {
            cerr << "Failed to parse msg." << endl;
            return -1;
        }
    }

    ListMsg(msg1);
    ...
}

Similarly, Reader declares msg1, an object of class helloworld, and then uses ParseFromIstream to read the data from an fstream and deserialize it. After that, ListMsg uses the get methods to read the message's fields and print them.

Run results

The results of running Writer and Reader are as follows:

>writer
>reader
101
Hello

Reader reads the serialized data from the file log and prints it to the screen. All of the sample code in this article can be downloaded from the attachment, so you can try it out for yourself.

The example itself is of no practical use, but with slight modifications you can turn it into a more useful program. For example, if you replace the disk with a network socket, you have a network-based data exchange task. Storage and exchange are exactly the areas where protobuf is most effective.

Comparison with other similar technologies

After reading this simple example, I hope you understand what protobuf can do. You may say that there are many other similar technologies in the world, such as XML, JSON, and Thrift. How does protobuf differ from them?

In short, the main advantages of protobuf are that it is simple and fast.

This has been tested: the thrift-protobuf-compare project compares these similar technologies, and Figure 1 shows one of its results, Total Time.

Figure 1. Performance test results

Total Time is the time for an entire object operation, including creating the object, serializing it into an in-memory byte sequence, and then deserializing it. The results show that protobuf performs well. Interested readers can visit http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking for more detailed test results.

Advantages of protobuf

Protobuf is like XML, but smaller, faster, and simpler. You define your own data structure and then use the code produced by the code generator to read and write it. You can even update the data structure without redeploying your programs. Describe your data structure once with protobuf, and you can easily read and write your structured data in a variety of languages and from a variety of data streams.

It also has a very good feature: backward compatibility. You can upgrade a data structure without breaking deployed programs that rely on the old data format. Your programs therefore need not worry about large-scale refactoring or migration caused by a change in the message structure, because adding a field to a message does not require any change to programs that have already been released.

Protobuf semantics are clearer, and nothing like an XML parser is needed, because the protobuf compiler compiles the .proto file into data access classes that serialize and deserialize protobuf data.

Using protobuf does not require learning a complex document object model. Its programming model is friendly and easy to learn, and it has good documentation and examples. For people who like simple things, protobuf is more appealing than other technologies.

Shortcomings of protobuf

Protobuf also has shortcomings compared with XML. Its functionality is simple, and it cannot be used to represent complex concepts.

XML has become a standard authoring tool across many industries, whereas protobuf is just a tool used within Google, and it is far less general.

Because text is not well suited to describing data structures, protobuf in turn is not well suited to modeling text-based markup documents such as HTML. In addition, XML is to some degree self-describing and can be read directly in an editor. Protobuf is not: it is stored in binary form, and unless you have the .proto definition you cannot directly read any of the content [2].

Advanced application topics

More complex Messages

So far we have only given a simple example of no practical use. Real applications often need to define more complex Messages. By "complex" we mean not just more fields or more field types, but more complex data structures:

Nested Message

Nesting is a magical concept: once messages can be nested, their expressive power becomes very strong.

Listing 4 gives an example of a nested Message.

Listing 4. Example of a nested Message

message Person {
    required string name = 1;
    required int32 id = 2;        // Unique ID number for this person.
    optional string email = 3;

    enum PhoneType {
        MOBILE = 0;
        HOME = 1;
        WORK = 2;
    }

    message PhoneNumber {
        required string number = 1;
        optional PhoneType type = 2 [default = HOME];
    }
    repeated PhoneNumber phone = 4;
}

In message Person, a nested message PhoneNumber is defined and used to define the phone field of the Person message. This lets people define more complex data structures.

Import Message

In a .proto file you can also use the import keyword to bring in messages defined in other .proto files. This can be called an Import Message, or a Dependency Message.

For example:

Listing 5. Code

// import takes a quoted file path; the file name used here is assumed.
import "common.header.proto";

message youMsg {
    required common.info_header header = 1;
    required string youPrivateData = 2;
}

Here, common.info_header is defined in the common.header package.

The main use of Import Message is to provide a convenient code management mechanism, similar to header files in C. You can define common Messages in one package and then import that package in other .proto files to use its message definitions.

Google Protocol Buffer supports both nested Messages and imported Messages well, which makes defining complex data structures very easy.

Dynamic compilation

In general, people who use protobuf first write the .proto file and then use the protobuf compiler to generate the source files needed for the target language. The generated code is then compiled together with the application.

In some cases, however, the .proto files cannot be known in advance, and unknown .proto files must be processed dynamically. A generic message-forwarding middleware, for example, cannot predict what messages it will have to handle. This calls for compiling .proto files dynamically and using the Messages they define.

Protobuf provides the google::protobuf::compiler package for dynamic compilation. The main class is Importer, defined in importer.h. Using Importer is very simple. Figure 2 shows the relationship between Importer and several other important classes.

Figure 2. Importer class

An Importer object involves three primary objects, among them the MultiFileErrorCollector class, which handles errors, and the SourceTree class, which defines the source directory of the .proto files.

The following example shows how these classes relate to each other and how they are used.

For a given .proto file, such as lm.helloworld.proto, very little code is needed to compile it dynamically within a program, as shown in Listing 6.

Listing 6. Code

// MultiFileErrorCollector is an abstract class, so errorCollector must be an instance
// of a subclass that implements AddError(). MyErrorCollector is a hypothetical name.
MyErrorCollector errorCollector;
google::protobuf::compiler::DiskSourceTree sourceTree;

google::protobuf::compiler::Importer importer(&sourceTree, &errorCollector);
sourceTree.MapPath("", protosrc);  // protosrc: directory that holds the .proto files

importer.Import("lm.helloworld.proto");

First, construct an Importer object. The constructor takes two parameters. The first is a source tree object, which specifies the source directory where the .proto files are stored. The second is an error collector object, which has an AddError method that handles syntax errors encountered while parsing .proto files.

After that, whenever you need to compile a .proto file dynamically, simply call the Import method of the Importer object. It is very simple.

So how do we use a dynamically compiled Message? We first need to understand a few other classes.

The google::protobuf package provides the following classes to represent the messages defined in a .proto file and the fields within those messages.

Figure 3. The relationship between the various Compiler classes

Class FileDescriptor represents a compiled .proto file; class Descriptor corresponds to a Message in that file; class FieldDescriptor describes a specific Field of a Message.

For example, after compiling lm.helloworld.proto, you can obtain the definition of lm.helloworld.id with the following code:

Listing 7. Code that gets the definition of lm.helloworld.id

const google::protobuf::Descriptor* desc =
    importer.pool()->FindMessageTypeByName("lm.helloworld");
// Look the field up on the message descriptor.
const google::protobuf::FieldDescriptor* field =
    desc->FindFieldByName("id");

Through the methods and properties of Descriptor and FieldDescriptor, the application can obtain all kinds of information about the Message definition. For example, field->name() returns the name of the field. This is how you can work with a dynamically defined Message.

Writing a new proto compiler

The protoc compiler that ships with the Google Protocol Buffer source code supports three programming languages: C++, Java, and Python. With the compiler package of Google Protocol Buffer, however, you can develop new compilers that support other languages.

Class CommandLineInterface encapsulates the front end of the protoc compiler, including command-line argument parsing, compilation of .proto files, and so on. All you need to do is derive a class from CodeGenerator and implement the back-end work, such as code generation.

The broad framework of the program is shown in Figure 4:

Figure 4. XML compiler block diagram

In the main() function, create a CommandLineInterface object cli and call its RegisterGenerator() method to register yourG, the back-end code generator object for the new language, with cli. Then call cli's Run() method.

The resulting compiler is used in the same way as protoc and accepts the same command-line arguments. cli performs lexical and syntax analysis on the .proto file supplied by the user and finally produces a syntax tree. The structure of the tree is shown in Figure 5.

Figure 5. Syntax tree

Its root node is a FileDescriptor object (see the "Dynamic compilation" section), which is passed as an input parameter to yourG's Generate() method. Within this method you can traverse the syntax tree and generate whatever code you need. Simply put, to implement a new compiler you only need to write a main function and a derived class that implements the Generate() method.

The download attachment for this article includes a reference example: a compiler that compiles .proto files into XML.

More details on protobuf

As emphasized throughout, the main advantage of protobuf over XML is its high performance. It is stored in an efficient binary form that is 3 to 10 times smaller than XML and 20 to 100 times faster.

Serious programmers will want an explanation for "3 to 10 times smaller" and "20 to 100 times faster". So, to end this article, let us go a little deeper into protobuf's internal implementation.

Two things ensure that a program using protobuf can achieve significantly higher performance than one using XML.

First, we can examine what a protobuf message contains after serialization. The representation is very compact, which means that messages are smaller and naturally need fewer resources. For example, fewer bytes are transmitted over the network and less I/O is needed, which improves performance.

Second, we need to understand the general process by which protobuf packs and unpacks messages, and from that why it is much faster than XML.

Google Protocol Buffer Encoding

The binary messages produced by protobuf serialization are very compact, thanks to the ingenious encoding method protobuf uses.

Before examining the message structure, let me first introduce a term: varint.

Varint is a compact way of representing numbers. It uses one or more bytes to represent a number, and the smaller the number, the fewer bytes it uses. This reduces the number of bytes needed to represent numbers.

For example, an int32 number normally takes 4 bytes. With varint, however, a small int32 number can be represented in 1 byte. Of course everything has a good side and a bad side: in varint notation a large number needs 5 bytes. Statistically, though, the numbers in a message are generally not all large, so in most cases varint represents numeric data in fewer bytes. Varint is described in detail below.

The highest bit of each byte in a varint has a special meaning: if it is 1, the following byte is also part of the number; if it is 0, this is the last byte. The other 7 bits represent the number. A number less than 128 can therefore be represented in one byte. A number of 128 or more, such as 300, is represented in two bytes: 1010 1100 0000 0010

Figure 6 demonstrates how Google Protocol Buffer parses these two bytes. Note that the two bytes are swapped before the final calculation, because Google Protocol Buffer uses little-endian byte order.

Figure 6. Varint encoding
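
The encoding described above can be sketched in a few lines of C++. This is a minimal illustration, not protobuf's own implementation; the function names encode_varint and decode_varint are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode an unsigned value as a varint: 7 payload bits per byte,
// least-significant group first; the high bit means "more bytes follow".
std::vector<std::uint8_t> encode_varint(std::uint64_t value) {
    std::vector<std::uint8_t> out;
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
    return out;
}

// Decode a varint starting at bytes[pos]; pos is advanced past the varint.
std::uint64_t decode_varint(const std::vector<std::uint8_t>& bytes, std::size_t& pos) {
    std::uint64_t result = 0;
    int shift = 0;
    while (true) {
        assert(pos < bytes.size() && shift < 64);  // guard against malformed input
        std::uint8_t byte = bytes[pos++];
        result |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;  // high bit clear: last byte
        shift += 7;
    }
    return result;
}
```

With this sketch, encode_varint(300) produces the two bytes AC 02 (binary 1010 1100 0000 0010), matching the example above, and any value below 128 encodes to a single byte.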

When a message is serialized, it becomes a binary data stream in which the data is a series of key-value pairs, as shown in Figure 7:

Figure 7. Message Buffer

With this key-value structure, no separators are needed to divide the fields. For an optional field, if the message does not contain the field, then the field simply does not appear in the resulting Message Buffer. These features all help to reduce the size of the message.

Take the message in Listing 1 as an example. Suppose we create the following message Test1:

Test1.id = 10;
Test1.str = "Hello";

Then the final Message Buffer contains two key-value pairs, one for id in the message and the other for str.

The key identifies the specific field. When unpacking, Protocol Buffer uses the key to determine which field of the message the corresponding value belongs to.

The key is defined as follows:

(field_number << 3) | wire_type

As you can see, the key consists of two parts. The first part is field_number; for example, the field_number of field id in message lm.helloworld is 1. The second part is wire_type, which indicates the transport type of the value.
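
As a rough sketch of this formula in C++ (the names make_key, key_field_number and key_wire_type are hypothetical; wire type 0 is varint and wire type 2 is length-delimited):

```cpp
#include <cassert>
#include <cstdint>

// Two of the wire types, as plain constants for this sketch.
const std::uint32_t kWireVarint = 0;
const std::uint32_t kWireLengthDelimited = 2;

// key = (field_number << 3) | wire_type
std::uint32_t make_key(std::uint32_t field_number, std::uint32_t wire_type) {
    assert(wire_type <= 5);  // wire types 0 to 5 are defined
    return (field_number << 3) | wire_type;
}

// The two parts can be recovered from a key with a shift and a mask.
std::uint32_t key_field_number(std::uint32_t key) { return key >> 3; }
std::uint32_t key_wire_type(std::uint32_t key) { return key & 0x7; }
```

For the message in Listing 1, make_key(1, kWireVarint) gives 0x08 for id, and make_key(2, kWireLengthDelimited) gives 0x12 for str.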

The possible wire types are shown in the following table:

Table 1. Wire Type

Type	Meaning	Used for
0	Varint	int32, int64, uint32, uint64, sint32, sint64, bool, enum
1	64-bit	fixed64, sfixed64, double
2	Length-delimited	string, bytes, embedded messages, packed repeated fields
3	Start group	Groups (deprecated)
4	End group	Groups (deprecated)
5	32-bit	fixed32, sfixed32, float

In our example, field id has the data type int32, so its wire type is 0. Careful readers may have noticed two very similar data types, int32 and sint32, among the types that wire type 0 can represent. Google Protocol Buffer distinguishes them mainly in order to reduce the number of bytes after encoding.

Inside a computer, a negative number is generally represented as a very large integer, because the computer uses the highest bit as the sign bit. If a negative number is encoded as a plain varint, it therefore always takes the maximum number of bytes (ten, because protobuf sign-extends a negative int32 to 64 bits before encoding). For this reason Google Protocol Buffer defines the sint32 type, which uses ZigZag encoding.

ZigZag encoding uses unsigned numbers to represent signed numbers, with positive and negative numbers interleaved, which is what the word "zigzag" means, as shown in Figure 8:

Figure 8. ZigZag encoding

With ZigZag encoding, a number with a small absolute value, whether positive or negative, can be represented in few bytes, making full use of the varint technique.
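
One common way to write the ZigZag mapping is sketched below in C++. The function names are assumptions for this example, and the code illustrates the mapping only (0 to 0, -1 to 1, 1 to 2, -2 to 3, and so on), not protobuf's own source.

```cpp
#include <cstdint>

// Map a signed 32-bit value to an unsigned one so that values with a small
// absolute value get small codes: 0, -1, 1, -2, 2 become 0, 1, 2, 3, 4.
constexpr std::uint32_t zigzag_encode32(std::int32_t n) {
    // Shift the magnitude left by one; flip all bits when n is negative.
    return (static_cast<std::uint32_t>(n) << 1) ^ ((n < 0) ? 0xFFFFFFFFu : 0u);
}

// Inverse mapping: the lowest bit carries the sign.
constexpr std::int32_t zigzag_decode32(std::uint32_t z) {
    return static_cast<std::int32_t>((z >> 1) ^ (0u - (z & 1u)));
}
```

Because -1 maps to 1 and -2 maps to 3, the varint that follows needs only one byte for these values instead of the maximum.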

Other data types, such as strings, are represented in a way similar to varchar in a database: a varint gives the length, and the content itself follows immediately after the length.
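
A minimal C++ sketch of this layout follows. The helpers append_varint and encode_string_field are hypothetical names for this example; the field key uses wire type 2 (length-delimited).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Append value to out as a varint (7 bits per byte, low group first).
void append_varint(std::vector<std::uint8_t>& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Encode one string field: key (wire type 2), varint length, then the raw bytes.
std::vector<std::uint8_t> encode_string_field(std::uint32_t field_number,
                                              const std::string& text) {
    assert(field_number >= 1);  // field numbers start at 1
    std::vector<std::uint8_t> out;
    append_varint(out, (static_cast<std::uint64_t>(field_number) << 3) | 2u);
    append_varint(out, text.size());
    out.insert(out.end(), text.begin(), text.end());
    return out;
}
```

Under these assumptions, encode_string_field(2, "Hello") yields 12 05 48 65 6C 6C 6F: the key, the length 5, and then the five characters.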

From the above introduction to protobuf's encoding method, you have probably already seen that protobuf messages are small and well suited to network transmission. If you lack the patience for or interest in technical details, the following simple and direct comparison should leave a deeper impression.

For the message in Listing 1, the byte sequence produced by protobuf serialization is:

08 65 12 06 48 65 6C 6C 6F 77
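
To show how such a buffer is read back, the sketch below walks the key-value pairs by hand. It assumes the ten bytes 08 65 12 06 48 65 6C 6C 6F 77 and handles only wire types 0 and 2; DecodedHello, read_varint and decode_hello are hypothetical names for this illustration, not part of the generated lm::helloworld class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decoded form of the example message (hypothetical struct, not the generated class).
struct DecodedHello {
    std::int32_t id;
    std::string str;
};

// Read one varint starting at buf[pos]; pos is advanced past it.
static std::uint64_t read_varint(const std::vector<std::uint8_t>& buf, std::size_t& pos) {
    std::uint64_t result = 0;
    for (int shift = 0; ; shift += 7) {
        assert(pos < buf.size() && shift < 64);  // guard against malformed input
        std::uint8_t byte = buf[pos++];
        result |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return result;
    }
}

// Walk the key-value pairs; only wire types 0 and 2 are handled in this sketch.
DecodedHello decode_hello(const std::vector<std::uint8_t>& buf) {
    DecodedHello msg = {0, ""};
    std::size_t pos = 0;
    while (pos < buf.size()) {
        std::uint64_t key = read_varint(buf, pos);
        std::uint64_t field = key >> 3;
        std::uint64_t wire_type = key & 0x7;
        if (wire_type == 0) {            // varint value
            std::uint64_t value = read_varint(buf, pos);
            if (field == 1) msg.id = static_cast<std::int32_t>(value);
        } else if (wire_type == 2) {     // length-delimited value
            std::size_t length = static_cast<std::size_t>(read_varint(buf, pos));
            assert(length <= buf.size() - pos);
            if (field == 2) msg.str.assign(buf.begin() + pos, buf.begin() + pos + length);
            pos += length;
        } else {
            assert(false && "wire type not handled in this sketch");
            break;
        }
    }
    return msg;
}
```

For those ten bytes the sketch yields id = 101 (0x65) and str = "Hellow": 08 is the key of field 1 with wire type 0, and 12 is the key of field 2 with wire type 2, followed by the length 06.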

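If you use XML instead, it looks like this:

3C 68 65 6C 6C 6F 77 6F 72 6C 64 3E 3C 69 64 3E 31 30 31 3C 2F 69 64 3E 3C 6E 61 6D 65 3E 68 65 6C 6C 6F 3C 2F 6E 61 6D 65 3E 3C 2F 68 65 6C 6C 6F 77 6F 72 6C 64 3E

That is 55 bytes in total. These strange numbers need a little explanation; expressed in ASCII, they read:

<helloworld><id>101</id><name>hello</name></helloworld>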

Packing and unpacking speed

First, let us look at the XML unpacking process. XML must read a string from the file and convert it into an XML document object model. The string of the specified node is then read from that model and finally converted into a variable of the specified type. This process is very complex; converting an XML file into a document object model usually involves CPU-intensive computation such as lexical and syntax analysis.

Protobuf, by contrast, only needs to read a binary sequence, in the specified format, into the corresponding C++ struct. As the previous section showed, decoding a message can be done with expressions made up of a few shift operations. It is very fast.

To show that this is not just my own impression, let us briefly analyze the code flow of protobuf unpacking.

In Listing 3, for example, the program first calls msg1's ParseFromIstream method, which parses the binary data stream read from the file and assigns the parsed data to the corresponding data members of the helloworld class.

The process is shown in Figure 9:

Figure 9. Unpacking flowchart

The whole parsing process is carried out by protobuf's own framework code together with the code generated by the protobuf compiler. Protobuf provides the base classes Message and MessageLite as a generic framework, while the CodedInputStream and WireFormatLite classes provide the decoding of binary data. As the analysis in the encoding section showed, protobuf decoding takes only a few simple arithmetic operations and no complex lexical or syntax analysis, so ReadTag() and similar methods are very fast. The other classes and methods on this call path are very simple, and interested readers can read them for themselves. Compared with the XML parsing process, isn't the flow above very simple? This is the second reason for protobuf's high efficiency.

Conclusion

The more one learns, the more ignorant one often feels. I am rather alarmed to find that I have written an article about serialization; it surely contains things I have taken for granted or been overconfident about. I hope readers will sift the true from the false, and I hope even more that real experts will be generous enough to write to me with corrections. Thank you.
This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
