Protocol Buffers for C

Source: Internet
Author: User

I have long been unhappy with the default design of Google Protocol Buffers. Generating a large lump of C++ code for each message type makes me uncomfortable. There is no official C version, and the third-party C versions do not satisfy me either.

This design makes it hard to write bindings for dynamic languages. Most dynamic languages have no strong type checking, so generating code brings little benefit, and it costs a good deal of performance compared with simply writing a binding library. The official Python library is an example: it could build these functions at runtime from the protocol description, without generating code with an offline tool.

I wrote a Lua version of the library last year. To be independent of the official version, I even wrote a parser for .proto files with LPeg. Fewer than 100 lines of Lua code are enough to parse the protocol definitions in a .proto file, so the Lua library can load the protocol description text directly. (That work helped me a lot this time.)

This time I am starting a new project and ran into the protobuf parsing problem again, so I wanted to solve it from scratch. At the beginning of last month I set out to write a pure Lua version on LuaJIT, guessing that LuaJIT with FFI could reach good performance. When it was done there was still a gap to the C++ version (it reached about 25% to 33% of the C++ version), which is worse than the C + Lua binding I wrote last year. On the other hand, the C code and Lua code I wrote last year are too tightly coupled. So I had the idea of writing a new C implementation.

Halfway through, a reader pointed out that a Googler has recently been doing similar work: the μpb project. He wrote a long article explaining why he is doing it, and in general his reasons match my own. However, I think his API design is not very good and is too hard to use, so that project did not stop me from finishing my own.

Designing the API for a C version is hard, because C lacks the necessary data structures, has no garbage collection, and carries no meta-information about data types.

After thinking it over, I decided to provide two sets of APIs to meet different needs.

When performance requirements are not too high and you just want convenient development in C, there is an easy-to-use API for manipulating protobuf messages. I call it the message API.

The message API itself comes in two halves:

To read an encoded protobuf message, use the rmessage API:

struct pbc_rmessage * pbc_rmessage_new(struct pbc_env * env, const char * typename, struct pbc_slice * slice);
void pbc_rmessage_delete(struct pbc_rmessage *);
uint32_t pbc_rmessage_integer(struct pbc_rmessage *, const char *key, int index, uint32_t *hi);
double pbc_rmessage_real(struct pbc_rmessage *, const char *key, int index);
const char * pbc_rmessage_string(struct pbc_rmessage *, const char *key, int index, int *sz);
struct pbc_rmessage * pbc_rmessage_message(struct pbc_rmessage *, const char *key, int index);
int pbc_rmessage_size(struct pbc_rmessage *, const char *key);

To encode a message, use the wmessage API:

struct pbc_wmessage * pbc_wmessage_new(struct pbc_env * env, const char *typename);
void pbc_wmessage_delete(struct pbc_wmessage *);
void pbc_wmessage_integer(struct pbc_wmessage *, const char *key, uint32_t low, uint32_t hi);
void pbc_wmessage_real(struct pbc_wmessage *, const char *key, double v);
void pbc_wmessage_string(struct pbc_wmessage *, const char *key, const char * v, int len);
struct pbc_wmessage * pbc_wmessage_message(struct pbc_wmessage *, const char *key);
void * pbc_wmessage_buffer(struct pbc_wmessage *, struct pbc_slice * slice);

pbc_rmessage_new and pbc_rmessage_delete construct and release a pbc_rmessage structure. Sub-messages and strings taken out of the structure are guaranteed to stay valid for its lifetime, so the user does not have to do complicated object construction and destruction.

For repeated data, no new data type is introduced. Instead, every field in a message is treated as repeated. This design greatly reduces the number of APIs needed.

pbc_rmessage_size queries how many times a field is repeated in the message. If the field was not encoded into the message, it returns 0.

I have unified all the basic data types into three kinds: integer, string, and real. The bool type is treated as an integer. An enum can be read either as a string or as an integer: pbc_rmessage_string returns the enum's name, and pbc_rmessage_integer returns its id.

pbc_rmessage_message returns a sub-message. The returned object does not have to be destroyed explicitly; its lifetime is tied to the parent. This API returns a valid object even if the sub-message was not encoded into the message, and fields read from it yield their default values.
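The ownership rule described above, where a sub-message lives exactly as long as its parent, can be sketched in plain C. This is only an illustration of the idea, not pbc's implementation; struct msg, msg_new, msg_sub and msg_delete are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical message node: the parent keeps a list of every
   sub-message it has handed out, so the caller never frees them. */
struct msg {
    struct msg *children;   /* sub-messages owned by this message */
    struct msg *next;       /* sibling link in the parent's list */
};

static struct msg *msg_new(void) {
    return calloc(1, sizeof(struct msg));
}

/* Returns a sub-message owned by parent (NULL on allocation failure). */
static struct msg *msg_sub(struct msg *parent) {
    assert(parent != NULL);
    struct msg *m = msg_new();
    if (m != NULL) {
        m->next = parent->children;
        parent->children = m;
    }
    return m;
}

/* Frees the message and everything it owns; returns the node count freed. */
static int msg_delete(struct msg *m) {
    assert(m != NULL);
    int freed = 1;
    struct msg *c = m->children;
    while (c != NULL) {
        struct msg *next = c->next;   /* read before c is freed */
        freed += msg_delete(c);
        c = next;
    }
    free(m);
    return freed;
}
```

With this shape, the caller only ever deletes the root message, which matches the single new/delete pair in the rmessage API.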

Integers are not split into 32-bit and 64-bit variants. When you are sure the integer you need fits in 32 bits, pass NULL as the last parameter of pbc_rmessage_integer to ignore the high 32 bits.

Using wmessage is like pushing data, piece by piece, into a message packet that has not been closed yet. Once the whole message is filled in, pbc_wmessage_buffer returns a slice that contains the buffer pointer and its length.

Note that when you push a negative number with pbc_wmessage_integer, you must pass -1 as the high part, because the interface treats the incoming parameters as unsigned integers.

Considering the performance of some internal implementation details, and the convenience of the pattern API described later (if you do complete client/server communication with this library), it is recommended that every string be encoded with a trailing \0. Then, when decoding, a string pointer can point directly into the packet without an extra copy.

pbc_wmessage_string can push a string that is not \0-terminated, because the length of the data is given by the len parameter. You also do not have to compute the length yourself: if len <= 0, the library calls strlen for you and then subtracts len from the result. That is, if you pass -1, it also pushes the trailing \0 byte.
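The length rule is easy to get wrong, so here is a minimal sketch of it. effective_len is a hypothetical helper that mirrors the rule as described in this post; it is not code taken from pbc.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: number of bytes that would be written for
   string v when the caller passes len, following the rule above. */
static int effective_len(const char *v, int len) {
    assert(v != NULL);
    if (len <= 0) {
        /* strlen(v) minus a non-positive len: 0 keeps the plain length,
           -1 adds one byte so the trailing '\0' is included. */
        return (int)strlen(v) - len;
    }
    return len;   /* explicit length, the data need not be terminated */
}
```

So effective_len("abc", 0) is 3, effective_len("abc", -1) is 4, and effective_len("abc", 2) is 2.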

The pattern API achieves higher performance: it is faster and uses less memory. More importantly, for small messages, if used properly, the pattern API does not trigger any heap allocation at all. All the temporary memory the API needs lives on the stack.

The relevant APIs are as follows:

struct pbc_pattern * pbc_pattern_new(struct pbc_env *, const char * message, const char *format, ...);
void pbc_pattern_delete(struct pbc_pattern *);
int pbc_pattern_pack(struct pbc_pattern *, void *input, struct pbc_slice * s);
int pbc_pattern_unpack(struct pbc_pattern *, struct pbc_slice * s, void * output);

We first need to create a pattern for encoding and decoding. A simple example is this:

message person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

For such a message, the corresponding struct in C might look like this:

struct person {
    struct pbc_slice name;
    int32_t id;
    struct pbc_slice email;
};

A pbc_slice represents a string, because a string inside a message carries a length and does not necessarily end with \0. A slice can also represent a sub-message that has not been unpacked yet.

We use pbc_pattern_new to tell pbc the memory layout of this structure.

struct pbc_pattern * person_p = pbc_pattern_new(env, "person",
    "name %s id %d email %s",
    offsetof(struct person, name),
    offsetof(struct person, id),
    offsetof(struct person, email));
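To see why a list of offsets is enough to describe a struct layout, the following sketch writes into struct fields using nothing but a base pointer and a byte offset. It is an illustration under assumptions, not pbc's internals: struct slice stands in for pbc_slice, and field_desc, set_int and set_slice are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for pbc_slice: a pointer plus a length. */
struct slice { void *buffer; int len; };

struct person {
    struct slice name;
    int32_t id;
    struct slice email;
};

/* Hypothetical field descriptor: a type tag ('s' or 'd') and the
   byte offset of the field inside the user's struct. */
struct field_desc { char type; size_t offset; };

/* Store a 32-bit integer at the described offset. */
static void set_int(void *base, const struct field_desc *f, int32_t v) {
    assert(base != NULL && f->type == 'd');
    memcpy((char *)base + f->offset, &v, sizeof v);
}

/* Store a slice that points at s without copying the bytes. */
static void set_slice(void *base, const struct field_desc *f,
                      const char *s, int len) {
    assert(base != NULL && f->type == 's');
    struct slice sl = { (void *)s, len };
    memcpy((char *)base + f->offset, &sl, sizeof sl);
}
```

A decoder built this way never needs to know the struct type at compile time; the offsets captured once at pattern creation carry all the layout information it needs.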

Then you can encode and decode with pbc_pattern_pack and pbc_pattern_unpack. Pattern definitions are lengthy and error-prone (you may also consider generating them by machine). But I believe that where performance is extremely sensitive this is worth it; if you feel that writing them is not worth the effort, consider going back to the message API above.

For repeated data, the pattern API treats it as an array, pbc_array.

There is a set of APIs that you can use to manipulate it:

int pbc_array_size(pbc_array);
uint32_t pbc_array_integer(pbc_array array, int index, uint32_t *hi);
double pbc_array_real(pbc_array array, int index);
struct pbc_slice * pbc_array_slice(pbc_array array, int index);
void pbc_array_push_integer(pbc_array array, uint32_t low, uint32_t hi);
void pbc_array_push_slice(pbc_array array, struct pbc_slice *);
void pbc_array_push_real(pbc_array array, double v);

The array is a slightly more complex data structure, but if it holds little data, it involves no extra heap allocation. However, because calling these APIs may allocate additional memory, you must release that memory manually. And before first use, the data structure must be initialized (a memset to 0 is sufficient).

void pbc_pattern_set_default(struct pbc_pattern *, void *data);
void pbc_pattern_close_arrays(struct pbc_pattern *, void *data);

pbc_pattern_set_default initializes all the fields in a block of memory according to a pattern, including the arrays.

After you are done with a block of data, you need to call pbc_pattern_close_arrays manually to close the arrays in that block.
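The behaviour described for arrays (valid after a memset to 0, no heap use while small, an explicit close afterwards) is typical of a small-buffer design. The sketch below shows one way such a structure can work. It is an assumption about the general technique, not pbc_array's real layout; small_array, INLINE_CAP and the three functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_CAP 4   /* hypothetical inline capacity */

/* All-zero bytes form a valid empty array, so memset(&a, 0, sizeof a)
   is a sufficient initializer. */
struct small_array {
    int n;                          /* number of stored elements */
    int cap;                        /* heap capacity, 0 while inline */
    uint32_t *heap;                 /* NULL while inline storage is used */
    uint32_t inline_buf[INLINE_CAP];
};

/* Appends v; returns 0 on success, -1 if allocation fails. */
static int small_array_push(struct small_array *a, uint32_t v) {
    if (a->heap == NULL && a->n < INLINE_CAP) {
        a->inline_buf[a->n++] = v;              /* no allocation needed */
        return 0;
    }
    if (a->heap == NULL) {                      /* first spill to the heap */
        int cap = INLINE_CAP * 2;
        uint32_t *p = malloc((size_t)cap * sizeof *p);
        if (p == NULL) return -1;
        memcpy(p, a->inline_buf, sizeof a->inline_buf);
        a->heap = p;
        a->cap = cap;
    } else if (a->n == a->cap) {                /* heap buffer is full */
        int cap = a->cap * 2;
        uint32_t *p = realloc(a->heap, (size_t)cap * sizeof *p);
        if (p == NULL) return -1;
        a->heap = p;
        a->cap = cap;
    }
    a->heap[a->n++] = v;
    return 0;
}

static uint32_t small_array_get(const struct small_array *a, int i) {
    assert(i >= 0 && i < a->n);
    return a->heap ? a->heap[i] : a->inline_buf[i];
}

/* Releases any heap memory and returns the array to the empty state. */
static void small_array_close(struct small_array *a) {
    free(a->heap);
    memset(a, 0, sizeof *a);
}
```

The close step is what makes the manual cleanup call necessary: the structure itself may live on the stack, but once it has spilled, only an explicit call can return the heap buffer.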

As for extensions, I finally gave up on supporting them directly; no get-extension style API is provided. That is because extensions can be handled more simply: I add a prefix to every extension field name, and when needed you can build the name by string concatenation to access an extension field inside a message.

Finally, a word about the pbc environment.

struct pbc_env * pbc_new(void);
void pbc_delete(struct pbc_env *);
int pbc_register(struct pbc_env *, struct pbc_slice * slice);

The pbc library is designed to have no global variables, so it is safer to use in a multithreaded environment. Although the library itself does not address thread safety, it is perfectly fine to use different environments in different threads.

Each environment must register the message types it needs separately. What you pass in is a .pb data block generated by the official protobuf tool, in the form of a slice; this memory can be released once registration has finished.

This data block is actually an encoded google.protobuf.FileDescriptorSet. That type is very complex, which makes the bootstrap process extremely hard to write; more on this later.

I have open-sourced all the code on GitHub; you can fetch it from https://github.com/cloudwu/pbc. Detailed usage can be found in the test files there.

This was very hard to write, so the code is messy, and as I write this post I have not yet started cleaning up its structure. If you want to use it, go ahead, but please be kind about the bugs.

It really hurts that protobuf uses a complex protobuf message to describe the protocol itself. We cannot understand any protobuf definition until we have a working protocol parsing library. It is a chicken-and-egg problem: it is hard to write pbc_register out of thin air, because it needs a google.protobuf.FileDescriptorSet type registered before it can start parsing the input packet.

Parsing the definition of google.protobuf.FileDescriptorSet without relying on the library itself is very troublesome. Of course I could use Google's official tools to generate a C++ parsing class for google.protobuf.FileDescriptorSet to get started, but I do not want this project to carry that dependency.

At first I wanted to define a simpler custom format to describe the protocol itself, without many levels of nesting, just a flat array, so that it could be parsed by hand. I originally planned to write a protoc plugin to generate this custom format. I later dropped that plan because I wanted the library to be simpler to use.

But the scheme is still partly in use; that is why bootstrap.c exists in the source. It reads a simplified description of google.protobuf.FileDescriptorSet. This data is generated in advance and placed in descriptor.pbc.h. It was generated with the Lua library I finished last year, and the related Lua code is not released. Now, of course, pbc itself is complete enough that a C version of that generator could be written with pbc. Interested readers can start by modifying test_pbc.c.

It's finally done. I wrote 5000 lines of code, and I need a break.

Original address: <protocol buffers for c>
