Serialization of communication protocols

Source: Internet
Author: User

Translated from: http://blog.chinaunix.net/uid-27105712-id-3266286.html

A communication protocol can be understood as the set of rules and conventions that two nodes negotiate in order to cooperate and exchange information, such as byte order, the type of each field, and which compression or encryption algorithm to use. Common examples include TCP, UDP, HTTP, and SIP. A protocol has a procedure specification and an encoding specification. The procedure covers things like call flows and signaling flows; the encoding specification defines how all signaling and data are packed and unpacked.

The encoding specification is what we usually call the codec, or serialization. It is used not only in communication but also in storage: whenever we want to put in-memory objects on disk, we need to serialize them.

This article starts from an example and then repeatedly raises a problem, solves it, and refines the result. Through this iterative evolution it shows how a protocol gradually develops and matures, and it closes with a summary. After reading it, you should find it easier to design or choose an encoding protocol in your own work.

1. Compact mode

The example in this article is communication between A and B to get or set a user's basic information. The first step for a typical developer is to define a protocol structure:

struct userbase
{
    unsigned short cmd;    // 1-get, 2-set; a short leaves room to add more commands later
    unsigned char gender;  // 1-man, 2-woman, 3-??
    char name[8];          // could also be a string or a len + value pair; fixed length keeps the example simple
};

With this, A needs almost no encoding: copy the structure straight out of memory, convert cmd to network byte order, and send it to B. B can parse it just as easily, and everyone is happy.

The encoded result can be drawn as follows (one cell per byte):

cmd (2 bytes) | gender (1 byte) | name (8 bytes)

This encoding, which I call compact mode, carries no redundant information beyond the data itself; it is essentially raw data. It was very common in the DOS era, when memory and network capacity were measured in kilobytes and CPUs had not yet reached 1 GHz. Any extra information would have strained not only the CPU but memory and bandwidth as well.
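As a rough illustration of compact mode, here is a minimal C sketch of packing and unpacking the structure. The function names and the USERBASE_WIRE_SIZE constant are assumptions made for this example, not part of the original protocol. One deliberate difference from a raw memory copy: the fields are written one by one, because on most compilers this struct has a padding byte after name and sizeof would be 12 rather than 11.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical wire size: 2 (cmd) + 1 (gender) + 8 (name) = 11 bytes. */
enum { USERBASE_WIRE_SIZE = 11 };

struct userbase {
    uint16_t cmd;      /* 1-get, 2-set */
    uint8_t  gender;
    char     name[8];
};

/* Write the fields one by one so struct padding never reaches the wire. */
size_t userbase_pack(const struct userbase *u, uint8_t *out)
{
    out[0] = (uint8_t)(u->cmd >> 8);    /* network byte order: high byte first */
    out[1] = (uint8_t)(u->cmd & 0xFF);
    out[2] = u->gender;
    memcpy(out + 3, u->name, sizeof u->name);
    return USERBASE_WIRE_SIZE;
}

/* Inverse of userbase_pack; the caller guarantees 11 readable bytes. */
void userbase_unpack(const uint8_t *in, struct userbase *u)
{
    u->cmd = (uint16_t)((in[0] << 8) | in[1]);
    u->gender = in[2];
    memcpy(u->name, in + 3, sizeof u->name);
}
```

Nothing in the 11 bytes says what they contain; the receiver must already know the exact field order, which is what the next section runs into.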

2. Extensibility

One day, A added a birthday field to the basic information and told B:

struct userbase
{
    unsigned short cmd;
    unsigned char gender;
    unsigned int birthday;
    char name[8];
};

Now B has a problem: on receiving a packet, B cannot tell whether the third field is name from the old protocol or birthday from the new one. From this lesson A and B finally learned an important property of a protocol: compatibility and extensibility.

So A and B decided to scrap the old protocol and start from scratch with one that stays compatible across all later versions. The method is simple: add a version field.

struct userbase
{
    unsigned short version;
    unsigned short cmd;
    unsigned char gender;
    unsigned int birthday;
    char name[8];
};

Now A and B could relax: extending the protocol later, including adding fields, would be easy. This method is probably still used by many people today.
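A receiver for a versioned protocol might look like the sketch below. The two layouts are assumptions chosen for illustration: a hypothetical version 1 without birthday and a version 2 that inserts it before name. The helper names are likewise invented here.

```c
#include <stdint.h>
#include <string.h>

struct userbase {
    uint16_t version;
    uint16_t cmd;
    uint8_t  gender;
    uint32_t birthday;   /* stays 0 when the sender's version has no birthday */
    char     name[8];
};

/* Read big-endian (network byte order) integers from a byte buffer. */
uint16_t get_u16(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }
uint32_t get_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 0 on success, -1 on a short buffer or an unknown version. */
int userbase_decode(const uint8_t *in, size_t len, struct userbase *u)
{
    if (len < 2) return -1;
    memset(u, 0, sizeof *u);
    u->version = get_u16(in);

    switch (u->version) {
    case 1:   /* assumed layout: version, cmd, gender, name = 13 bytes */
        if (len < 13) return -1;
        u->cmd = get_u16(in + 2);
        u->gender = in[4];
        memcpy(u->name, in + 5, 8);
        return 0;
    case 2:   /* assumed layout: version, cmd, gender, birthday, name = 17 bytes */
        if (len < 17) return -1;
        u->cmd = get_u16(in + 2);
        u->gender = in[4];
        u->birthday = get_u32(in + 5);
        memcpy(u->name, in + 9, 8);
        return 0;
    default:
        return -1;
    }
}
```

Every new version adds another case with its own offsets, so the switch grows with the protocol's history.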

3. Better extensibility

After a long time, A and B found a new problem: every added field means a new version number. That in itself is not the point. The point is that the code becomes tedious to maintain: each version needs its own case branch, and eventually the code has dozens of case branches, looks ugly, and is expensive to maintain.

Thinking it over, A and B felt that a single version number for the whole protocol was too coarse, so they decided to attach an extra piece of information, a tag, to each field. This costs more memory and bandwidth, but times have changed and that redundancy is now affordable in exchange for ease of use.

struct userbase
{
    1 unsigned short version;
    2 unsigned short cmd;
    3 unsigned char gender;
    4 unsigned int birthday;
    5 char name[8];
};

With this in place, A and B were very pleased with the protocol: they could add and remove fields freely and extend it at will.

Reality is always cruel. Soon a new requirement arrived: 8 bytes is not enough for name, whose maximum length may reach 100 bytes. A and B were dismayed: they could hardly pack 100 bytes every time even for someone called "Steven". Money was not the issue, but waste is still waste.

So A and B searched around and found the ASN.1 encoding rules, which turned out to be just what they needed. ASN.1 is an ISO/ITU-T standard. One of its encodings, BER (Basic Encoding Rules), is simple and easy to use: it encodes each field as a <Tag, Length, Value> triple, TLV for short.

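Each encoded field is laid out in memory as follows:

Tag | Length | Value

A field's value can itself be a struct, so fields can be nested:

Tag | Length | (Tag | Length | Value) (Tag | Length | Value) ...

After A and B adopt TLV, the packed data is laid out in memory as a sequence of such triples, one per field:

Tag | Length | Value | Tag | Length | Value | ...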

TLV is highly extensible and easy to learn. It also has drawbacks: it adds two pieces of redundant information, the tag and the length, so a protocol made up mostly of basic types such as int, short, and byte can take several times the storage. In addition, what each value means still has to be agreed on in advance by both parties; TLV is not structured or self-describing.
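The TLV idea can be sketched in a few lines of C. This is a simplified layout assumed for illustration (a 1-byte tag and a 2-byte big-endian length), not BER's actual tag and length encoding, which is more elaborate. The function names are invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Append one TLV field at out; returns bytes written, or 0 if it does not fit. */
size_t tlv_put(uint8_t *out, size_t cap, uint8_t tag,
               const void *value, uint16_t len)
{
    if (cap < 3u + len) return 0;
    out[0] = tag;
    out[1] = (uint8_t)(len >> 8);
    out[2] = (uint8_t)(len & 0xFF);
    memcpy(out + 3, value, len);
    return 3u + len;
}

/* Find the first field with the given tag, skipping all others.
   Returns a pointer to the value and stores its length, or NULL if absent. */
const uint8_t *tlv_find(const uint8_t *buf, size_t size, uint8_t tag,
                        uint16_t *len_out)
{
    size_t pos = 0;
    while (size - pos >= 3) {
        uint16_t len = (uint16_t)((buf[pos + 1] << 8) | buf[pos + 2]);
        if (size - pos - 3 < len) return NULL;   /* truncated field */
        if (buf[pos] == tag) {
            *len_out = len;
            return buf + pos + 3;
        }
        pos += 3u + len;                         /* not the tag we want: skip it */
    }
    return NULL;
}
```

Because the reader skips tags it does not recognize and treats a missing tag as absent, old and new peers can coexist without a version switch, and name costs only as many bytes as it actually has. The overhead is also visible: a 1-byte gender takes 4 bytes on the wire.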

4. Self-describing format

With TLV adopted, the problem seemed solved. But A and B felt it was still not perfect and decided to make the format self-describing, so that anyone looking at a captured packet can tell the type of every field without consulting the protocol document. The improved format is TT[L]V (Tag, Type, Length, Value). For fixed-length basic types such as int, short, long, and byte the length is already known, so L is omitted.

Some type values are then defined as follows:

Type      Type value   Type description
bool      1            Boolean value
int8      2            8-bit signed integer (signed char)
uint8     3            8-bit unsigned integer (unsigned char)
int16     4            16-bit signed integer
uint16    5            16-bit unsigned integer
int32     6            32-bit signed integer
uint32    7            32-bit unsigned integer
...
string    12           String or binary sequence
struct    13           Custom structure, can be nested
list      14           Ordered list
map       15           Unordered key-value map

After serialization with TTLV, each field is laid out in memory as follows:

Tag | Type | [Length] | Value

After the change, A and B found that it really did bring many benefits: not only can fields be added and removed freely, data types can be changed as well. For example, changing short cmd to int cmd stays seamlessly compatible. Very powerful.
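The short-to-int compatibility can be sketched as follows. The type values 4 (int16) and 6 (int32) come from the table above; the function names and the exact byte layout (1-byte tag, 1-byte type, big-endian value, no length for fixed-size types) are assumptions for this example.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Type values taken from the type table above. */
enum { T_INT16 = 4, T_INT32 = 6 };

/* Encode a signed integer as tag, type, value (no length: the type implies it).
   Uses int16 when the value fits, int32 otherwise. Returns bytes written. */
size_t ttlv_put_int(uint8_t *out, uint8_t tag, int32_t v)
{
    uint32_t u = (uint32_t)v;   /* two's complement bit pattern of v */
    out[0] = tag;
    if (v >= INT16_MIN && v <= INT16_MAX) {
        out[1] = T_INT16;
        out[2] = (uint8_t)(u >> 8);
        out[3] = (uint8_t)u;
        return 4;
    }
    out[1] = T_INT32;
    out[2] = (uint8_t)(u >> 24);
    out[3] = (uint8_t)(u >> 16);
    out[4] = (uint8_t)(u >> 8);
    out[5] = (uint8_t)u;
    return 6;
}

/* Decode an integer field whichever width the sender chose.
   Returns bytes consumed, or 0 if the input is short or not an integer type. */
size_t ttlv_get_int(const uint8_t *in, size_t size, uint8_t *tag, int32_t *v)
{
    if (size < 2) return 0;
    *tag = in[0];
    if (in[1] == T_INT16 && size >= 4) {
        uint16_t raw = (uint16_t)((in[2] << 8) | in[3]);
        int16_t s;
        memcpy(&s, &raw, sizeof s);   /* reinterpret the bits as signed */
        *v = s;
        return 4;
    }
    if (in[1] == T_INT32 && size >= 6) {
        uint32_t raw = ((uint32_t)in[2] << 24) | ((uint32_t)in[3] << 16) |
                       ((uint32_t)in[4] << 8)  |  (uint32_t)in[5];
        memcpy(v, &raw, sizeof raw);
        return 6;
    }
    return 0;
}
```

The receiver always decodes into the widest type it supports and lets the type byte say how many bytes to read, so a sender can widen cmd without the receiver changing.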

5. Cross-language support

One day a new colleague, C, wrote a new service that needed to communicate with A. But C worked in Java or PHP, which have no unsigned types, so parsing failed when values came out negative. To solve this, A re-planned the protocol's types: language-specific features were stripped out, a common subset was defined, and the usable types were made a mandatory constraint. The constraint brought generality, simplicity, and cross-language support, so everyone agreed, and a type specification was born.

Type      Type value   Type description
bool      1            Boolean value
int8      2            8-bit signed integer (signed char)
int16     3            16-bit signed integer
int32     4            32-bit signed integer
...
string    12           String or binary sequence
struct    13           Custom structure, can be nested
list      14           Ordered list
map       15           Unordered key-value map

6. Code automation: the creation of IDL

Then A and B found a new annoyance: every new protocol needs its own encoder and decoder, written and debugged from scratch. TLV is simple, but writing codecs is dull manual labor with no technical content. An obvious problem is that, with so much copy/paste, novices and veterans alike make mistakes easily, and each mistake is time-consuming to track down. So A thought of using a tool to generate the code automatically.

IDL (Interface Description Language) is a description language and also an intermediate language. One of its missions is to standardize and constrain: as mentioned earlier, it restricts the usable types, which provides cross-language support. A tool parses the IDL file and generates code for each language:

Gencpp.exe sample.idl     (outputs Sample.cpp and sample.h)
Genphp.exe sample.idl     (outputs sample.php)
Genjava.exe sample.idl    (outputs Sample.java)

Simple and efficient, isn't it? :)

7. Summary

Does this look familiar by now? Yes: the protocol they ended up with is essentially the same as Facebook's Thrift and Google's Protocol Buffers, as well as the JCE protocol used by the company's wireless division. Looking at the IDL files of these protocols, I find them almost identical, with only minor differences.

These protocols add some refinements in the details:

1. Compression. This does not mean general-purpose compression such as gzip but integer compression. An int value is in many cases less than 127 (0 is especially common) and does not need all 4 bytes, so these protocols refine the encoding to use only 1, 2, 3, or 4 bytes depending on the value. In effect this is still TTLV.

2. required/optional attributes. This feature serves two purposes. First, it is another form of compression: a protocol often has many fields, some of which may or may not be present. Packing a default value for every unassigned field is wasteful; if a field is optional and has not been assigned, it simply is not packed. Second, it acts as a logical constraint: specifying which fields must be present strengthens validation.
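To make the integer compression in point 1 concrete, here is a sketch of one common scheme, the base-128 varint used by Protocol Buffers: each byte carries 7 bits of the value, and the high bit says whether another byte follows. It differs slightly from the 1/2/3/4-byte description above in that a full 32-bit value takes 5 bytes. The function names are invented for this example.

```c
#include <stdint.h>
#include <stddef.h>

/* Encode v as a base-128 varint; returns bytes written (1 to 5). */
size_t varint_put(uint8_t *out, uint32_t v)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);   /* low 7 bits plus continuation bit */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;                /* last byte: continuation bit clear */
    return n;
}

/* Decode a varint; returns bytes consumed, or 0 on truncated or overlong input. */
size_t varint_get(const uint8_t *in, size_t size, uint32_t *v)
{
    uint32_t result = 0;
    for (size_t i = 0; i < size && i < 5; i++) {
        result |= (uint32_t)(in[i] & 0x7F) << (7 * i);
        if ((in[i] & 0x80) == 0) {
            *v = result;
            return i + 1;
        }
    }
    return 0;
}
```

With this encoding any value up to 127, including the very common 0, occupies a single byte, and 300 occupies two (0xAC 0x02).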

Serialization is the foundation of a communication protocol; signaling channels, data channels, and RPC all need it. Considering extensibility and cross-language support early in protocol design saves a lot of trouble later.

P.S.

This article covers serialization for binary communication protocols only and does not discuss text protocols. In a sense, text protocols are inherently compatible and extensible and do not raise as many issues as binary ones. Their strongest advantages are that they are easy to debug (captured packets are visible characters, you can debug with telnet, and packets can be crafted by hand without special tools) and easy to learn.

The advantages of binary protocols are performance and security, but they are troublesome to debug.

Each has its merits; choose according to your needs. (Stevenrao)
