Notes on Serialization and Deserialization

#Abstract
Serialization and deserialization are things engineers deal with almost every day, yet the two concepts are not easy to grasp precisely. On the one hand, they are often hidden inside a framework as one of its parts; on the other, they tend to show up disguised as more familiar concepts such as encryption and persistence. Even so, choosing a serialization and deserialization scheme is an important step in system design or refactoring, especially for distributed systems and systems that handle large volumes of data. A suitable serialization protocol improves a system's generality, robustness, and security, optimizes its performance, and makes it easier to debug and extend. This article analyzes and explains serialization and deserialization from several angles and compares several currently popular serialization protocols, in the hope of helping readers with their own selection.

Brief introduction
The author works on Meituan's Recommendation and Personalization team, which provides high-quality personalized recommendation and ranking services to Meituan users at a scale on the order of a billion per day. From terabyte-level user behavior data to gigabyte-level Deal/POI data, and from user geolocation data that changes in real time to periodic background jobs, the recommendation and re-ranking systems depend on many types of data services. Their clients include various internal services, the Meituan client apps, and the Meituan website. To provide high-quality data services and to integrate well with upstream and downstream systems, the choice of serialization and deserialization is often an important consideration in our system design.

The contents of this article are organized as follows:

The first part defines serialization and deserialization and shows where they sit in a communication protocol.
The second part discusses the characteristics of a serialization protocol from the user's point of view.
The third part describes the components of a typical serialization implementation and draws an analogy with database components.
The fourth part explains the characteristics and application scenarios of several serialization protocols, with examples of the related components.
The last part gives the author's recommendations for technology selection, based on the characteristics of each protocol and on related benchmark data.

#1. Definitions and related concepts

The rise of the Internet created the need for machines to communicate with each other, and the two parties to a connection must follow an agreed protocol; the serialization protocol is part of that communication protocol. Communication protocols usually adopt a layered model, and different models define the function and granularity of each layer differently: TCP/IP is a four-layer model, whereas the OSI model has seven layers. In the OSI seven-layer model, the main job of the presentation layer is to convert application-layer objects into a contiguous binary string and, in the other direction, to convert a binary string back into application-layer objects. These two functions are serialization and deserialization. Roughly speaking, the application layer of TCP/IP corresponds to the application, presentation, and session layers of the OSI model, so the serialization protocol belongs to the application layer of TCP/IP. The discussion of serialization protocols in this article is based on the OSI seven-layer model.

Serialization: The process of converting a data structure or object into a binary string
Deserialization: The process of converting a binary string generated during serialization into a data structure or object
Data structures, objects, and binary strings
In different computer languages, data structures, objects, and binary strings are not represented in the same way.

Data structures and objects: In a fully object-oriented language such as Java, everything an engineer works with is an object, created by instantiating a class. The closest thing to a data structure in Java is the POJO (Plain Old Java Object), or JavaBean: a class that has only getter and setter methods. In C++, a semi-object-oriented language, a data structure corresponds to a struct and an object corresponds to a class.

Binary string: The binary string produced by serialization is a piece of data stored in memory. C++ can manipulate memory directly, so the concept is easy to grasp there; for example, a C-style string in C++ can be handed straight to the transport layer, because it is essentially a binary string in memory terminated by '\0'. In Java, the concept of a binary string is easily confused with String. String is in fact a first-class citizen of Java and a special kind of object. For cross-language communication, serialized data obviously cannot be a data type specific to one language. In Java, a binary string means byte[], where byte is one of Java's 8 primitive data types.
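
The two definitions above can be sketched in Java as a round trip between an object and a byte[]. This is a minimal illustration, not any of the protocols discussed later: the class name `UserInfoCodec` and the byte layout (a 4-byte int followed by a length-prefixed string) are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical codec for illustration; the byte layout is an assumption, not a standard.
final class UserInfoCodec {

    static final class UserInfo {
        final int userid;
        final String name;

        UserInfo(int userid, String name) {
            this.userid = userid;
            this.name = name;
        }
    }

    // Serialization: object -> byte[] (the "binary string" in Java).
    static byte[] serialize(UserInfo user) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(user.userid); // 4 bytes, big-endian
            out.writeUTF(user.name);   // 2-byte length prefix, then modified UTF-8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Deserialization: byte[] -> object, reading fields in the order they were written.
    static UserInfo deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int userid = in.readInt();
            String name = in.readUTF();
            return new UserInfo(userid, name);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new UserInfo(1, "messi"));
        UserInfo copy = deserialize(bytes);
        System.out.println(bytes.length + " bytes, userid=" + copy.userid + ", name=" + copy.name);
    }
}
```

For the sample record the array holds 11 bytes: 4 for the int, 2 for the length prefix, and 5 for the name. Nothing in those bytes is specific to Java objects, which is what makes a byte[] suitable for cross-language exchange once both sides agree on the layout.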

#2. Characteristics of serialization protocols

Each serialization protocol has its strengths and weaknesses, and each was designed with its own application scenarios in mind. During system design, you need to consider every aspect of your serialization requirements, compare the characteristics of the candidate protocols, and settle on a compromise.

Versatility
Versatility has two aspects:
First, the technical aspect: whether the serialization protocol supports multiple platforms and languages. If it does not, its versatility at the technical level is greatly reduced.
Second, popularity: serialization and deserialization involve several parties, and a rarely used protocol usually means a high learning cost; moreover, a protocol with low adoption often lacks stable, mature cross-language and cross-platform public libraries.

Robustness
The following two factors can make a protocol insufficiently robust:
First, immaturity. A protocol usually takes a long time to go from design to implementation to maturity. Its robustness depends on extensive and thorough testing, so adopting a protocol that is still in the testing stage carries high risk for a system committed to delivering a high-quality service.
Second, unequal treatment of languages and platforms. To support multiple languages and platforms, the designers of a serialization protocol must do a great deal of work. When the supported languages or platforms have features that cannot be reconciled, the designers face a hard choice: favor the language or platform that more people use, or support more languages and platforms by dropping a feature. When the designers decide to favor a particular language or platform, the protocol's robustness is sacrificed from the point of view of its other users.

Debuggability/Readability
Debugging the data correctness and business correctness of serialization and deserialization often takes a long time, so a good debugging mechanism greatly improves development efficiency. Serialized binary strings are usually not human-readable. To verify that the serialized result is correct, the writer has to write a deserialization program as well, or provide a query platform, both of which cost time. On the other side, if the reader fails to deserialize the data, troubleshooting becomes a major challenge: it is hard to tell whether the cause is a bug in the reader's own deserialization program or incorrect data produced by the writer. For debugging across companies, the problem is more serious for the following reasons:
First, insufficient support: when a problem arises in cross-company debugging, timely support may not be available, which greatly lengthens the debugging cycle.
Second, access restrictions: the query platform used during debugging may not be open to outsiders, which makes verification harder for the reader.

If the serialized data is human-readable, debugging efficiency improves greatly; XML and JSON have this advantage.

Performance
Performance covers two aspects, space overhead and time overhead:
First, space overhead (verbosity): serialization has to add descriptive fields to the original data so that it can be parsed during deserialization. If serialization introduces too much extra overhead, it puts heavy pressure on the network, disks, and so on. For massive distributed storage systems, data volumes are often measured in terabytes, so large extra space overhead means high cost.
Second, time overhead (complexity): a complex serialization protocol takes longer to parse, which can make serialization and deserialization the bottleneck of the whole system.
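
The space overhead of descriptive fields can be made concrete with a small sketch that encodes the same record twice: once as self-describing text and once as a compact binary layout. Both layouts and the class name `SpaceOverheadSketch` are assumptions for illustration only; the XML form does no escaping and neither form is a real protocol.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: neither layout is a real protocol, and the XML form skips escaping.
final class SpaceOverheadSketch {

    // Self-describing text: the field names travel with every record.
    static byte[] asXml(int userid, String name) {
        String xml = "<UserInfo><userid>" + userid + "</userid>"
                + "<name>" + name + "</name></UserInfo>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    // Compact binary: field order is agreed in advance, so only values are sent.
    static byte[] asBinary(int userid, String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(4 + 4 + nameBytes.length);
        buffer.putInt(userid);           // 4-byte value
        buffer.putInt(nameBytes.length); // 4-byte length prefix
        buffer.put(nameBytes);           // raw UTF-8 bytes
        return buffer.array();
    }

    public static void main(String[] args) {
        System.out.println("XML:    " + asXml(1, "messi").length + " bytes");
        System.out.println("binary: " + asBinary(1, "messi").length + " bytes");
    }
}
```

For this one small record the text form takes 57 bytes and the binary form 13. The ratio depends heavily on the data, so the numbers only illustrate where the overhead comes from: tag names repeated in every record.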

Extensibility/Compatibility
In the mobile Internet era, business requirements change faster and faster: new requirements keep emerging while old systems still have to be maintained. If a serialization protocol is extensible, that is, if new business fields can be added without affecting existing services, the flexibility of the system improves greatly.
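
One common way to get this kind of extensibility is to tag every field and let readers skip tags they do not recognize. The sketch below assumes a made-up wire format of [1-byte tag][4-byte length][payload]; the class `TaggedFieldReader` and the tag numbers are hypothetical, chosen only to show an old reader surviving a field added by a newer writer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical wire format: [1-byte tag][4-byte length][payload], repeated per field.
final class TaggedFieldReader {

    static final int TAG_NAME = 1;     // known to the old reader
    static final int TAG_NICKNAME = 2; // added later by a newer writer

    // Encodes one string field in the tagged format.
    static byte[] field(int tag, String value) {
        byte[] payload = value.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(1 + 4 + payload.length)
                .put((byte) tag).putInt(payload.length).put(payload).array();
    }

    // An "old" reader: it understands only TAG_NAME and skips every other field.
    static Map<Integer, String> readKnownFields(byte[] message) {
        Map<Integer, String> known = new HashMap<>();
        ByteBuffer buffer = ByteBuffer.wrap(message);
        while (buffer.hasRemaining()) {
            int tag = buffer.get();
            int length = buffer.getInt();
            if (tag == TAG_NAME) {
                byte[] payload = new byte[length];
                buffer.get(payload);
                known.put(tag, new String(payload, StandardCharsets.UTF_8));
            } else {
                buffer.position(buffer.position() + length); // unknown field: skip its payload
            }
        }
        return known;
    }

    public static void main(String[] args) {
        byte[] name = field(TAG_NAME, "messi");
        byte[] nickname = field(TAG_NICKNAME, "leo");
        byte[] message = ByteBuffer.allocate(name.length + nickname.length)
                .put(name).put(nickname).array();
        System.out.println(readKnownFields(message)); // the nickname field is skipped
    }
}
```

Because each field carries its own length, the old reader can step over the new field without understanding it, so the writer can be upgraded first.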

Security/Access restrictions
When choosing a serialization protocol, security considerations usually arise in scenarios that involve access across local networks. When communication takes place between companies or across data centers, cross-network access is often restricted, for security reasons, to HTTP/HTTPS on ports 80 and 443. If the serialization protocol you use is not supported by a compatible and mature HTTP transport framework, one of the following three outcomes may follow:
First, service availability drops because of the access restrictions.
Second, you are forced to reimplement the security protocol, which greatly increases implementation cost.
Third, more firewall ports and protocols are opened, at the expense of security.

#3. Serialization and deserialization components

Typical serialization and deserialization processes often require the following components:

IDL (Interface Description Language) file: The parties to a communication need to agree on a specification for what is communicated. For this agreement to be independent of language and platform, it must be written in a language that is tied to no particular development language or platform. Such a language is called an Interface Description Language (IDL), and an agreement written in IDL is called an IDL file.
IDL compiler: For the agreement in an IDL file to be usable from each language and platform, a compiler is needed to convert the IDL file into a dynamic library for each language.
Stub/Skeleton lib: The working code that performs serialization and deserialization. The stub is code deployed on the client side of a distributed system. In one direction it receives parameters from the application layer, serializes them, and sends them to the server through the underlying protocol stack; in the other it receives serialized data from the server and deserializes it for the client's application layer. The skeleton is deployed on the server and does the reverse: it receives serialized parameters from the transport layer, deserializes them for the server's application layer, then serializes the application layer's result and sends it back to the client stub.
Client/Server: The application-layer code, which works with the language-specific classes or structs generated from the IDL.
Underlying protocol stack and the Internet: The serialized data passes through the underlying transport, network, link, and physical layers, is converted into digital signals, and travels over the Internet.
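
The division of labor between stub and skeleton can be sketched in a single process. In this illustration the "network" is just a `Function<byte[], byte[]>`, the service interface `GreetingService` is hypothetical, and the code is written by hand, whereas a real IDL compiler would generate the stub and skeleton from the IDL file.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// In-process sketch: the transport is a plain function instead of a real protocol stack.
final class StubSkeletonSketch {

    // Application-layer service contract (hypothetical).
    interface GreetingService {
        String greet(String name);
    }

    // Skeleton (server side): bytes in -> deserialize -> call service -> serialize -> bytes out.
    static Function<byte[], byte[]> skeleton(GreetingService service) {
        return request -> {
            String name = new String(request, StandardCharsets.UTF_8);
            String result = service.greet(name);
            return result.getBytes(StandardCharsets.UTF_8);
        };
    }

    // Stub (client side): looks like the service, but serializes and forwards each call.
    static GreetingService stub(Function<byte[], byte[]> transport) {
        return name -> {
            byte[] request = name.getBytes(StandardCharsets.UTF_8);
            byte[] response = transport.apply(request);
            return new String(response, StandardCharsets.UTF_8);
        };
    }

    public static void main(String[] args) {
        GreetingService server = name -> "hello, " + name;
        GreetingService client = stub(skeleton(server));
        System.out.println(client.greet("messi"));
    }
}
```

The client and server code see only `GreetingService`; every byte[] conversion lives in the stub and the skeleton, which is why those two pieces can be generated mechanically.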

Comparison of serialization components and database access components
Database access is familiar to many engineers, and the components involved are relatively easy to understand. The following table maps some of the components used in serialization to their counterparts in database access, to make the serialization components easier to grasp.

Serialization component | Database component | Description
IDL | DDL | Language used to define a table or model
IDL file | DB schema | Table-creation file or model file
Stub/Skeleton lib | O/R mapping | Maps a class to a table or data model

#4. Several common serialization and deserialization protocols

The early serialization protocols of the Internet mainly include COM and CORBA.

COM was used mainly on Windows and never truly became cross-platform. In addition, COM serialization relies on the virtual table produced by the compiler, which makes it very costly to learn (imagine an engineer who needs only a simple serialization protocol but first has to master a language compiler). Because the serialized data is tightly coupled to the compiler, extending it with new properties is cumbersome.

CORBA was a relatively good early implementation of a cross-platform, cross-language serialization protocol. Its main problems were that too many participating vendors produced too many versions, compatibility between versions was poor, and it was complex and obscure to use. These political, economic, technical, and early design problems eventually led to CORBA's gradual decline. J2SE versions after 1.3 provide RMI-IIOP, a technology based on the CORBA protocol that lets Java developers do CORBA development in pure Java.

Below we introduce and compare several popular serialization protocols: XML, JSON, Protobuf, Thrift, and Avro.

An example
As mentioned earlier, serialization and deserialization tend to be obscure and easy to overlook, and are often bundled with other concepts. To make their concrete implementation in each protocol easier to understand, we use one running example throughout the protocol descriptions. In this example we want to pass a user's information between several systems. At the application layer, if you are using Java, the classes you work with look like this:

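import java.util.List;

// One postal address belonging to a user.
class Address
{
    private String city;
    private String postcode;
    private String street;
}

// The user record passed between systems in the running example.
public class UserInfo
{
    private Integer userid;
    private String name;
    private List<Address> address;
}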

XML & SOAP
XML is a widely used serialization and deserialization protocol that works across machines and across languages. It has a long history: version 1.0 was finalized in 1998 and is still in wide use today. XML was originally designed to mark up Internet documents, so its design aims at readability for both humans and machines. When a format designed for marking up documents is used to serialize objects, however, it turns out verbose and complex. XML is essentially a description language and is self-describing, so XML itself serves as the IDL for XML serialization. There are two standard ways to describe XML: DTD (Document Type Definition) and XSD (XML Schema Definition). As a human-readable description language, XML is widely used in configuration files, for example O/R mapping files and Spring bean configuration files.

SOAP (Simple Object Access Protocol) is a widely used structured messaging protocol that uses XML for serialization and deserialization. SOAP has been so influential on the Internet that SOAP-based solutions were given a name of their own: Web services. Although SOAP can run over several transport protocols, it is most commonly used as XML over HTTP. The main interface description language (IDL) for SOAP is WSDL (Web Services Description Language). SOAP is secure, extensible, cross-language, and cross-platform, and it supports multiple transport protocols. When cross-platform and cross-language support is not required, some languages offer very simple and easy-to-use ways to serialize to XML that need neither an IDL file nor a third-party compiler, for example Java with XStream.

Self-description and recursion
SOAP is a protocol that uses XML for serialization and deserialization, and its IDL is WSDL. WSDL's own description file is an XSD, and an XSD is itself an XML file. This is an interesting case of what mathematics calls "recursion", a phenomenon that often appears in things that are self-describing.

IDL file example
The following example uses WSDL to describe the basic user information above:

<xsd:complexType name='Address'>
    <xsd:attribute name='city' type='xsd:string'/>
    <xsd:attribute name='postcode' type='xsd:string'/>
    <xsd:attribute name='street' type='xsd:string'/>
</xsd:complexType>
<xsd:complexType name='UserInfo'>
    <xsd:sequence>
        <xsd:element name='address' type='tns:Address'/>
        <xsd:element name='address1' type='tns:Address'/>
    </xsd:sequence>
    <xsd:attribute name='userid' type='xsd:int'/>
    <xsd:attribute name='name' type='xsd:string'/>
</xsd:complexType>

Typical application scenarios and non-application scenarios
SOAP has a broad user base. Because it runs over HTTP, it passes through firewalls well and keeps good security properties, and the human-readable nature of XML gives it excellent debuggability. Growing Internet bandwidth has also largely offset its large space overhead (verbosity). SOAP is a good choice for services between companies that transfer relatively small amounts of data or have relatively loose real-time requirements (for example, responses at the level of seconds).

Because XML adds so much space overhead, the serialized data grows sharply; persisting large volumes of data as XML means huge memory and disk overhead, so XML is not suitable for that scenario. In addition, XML serialization and deserialization are relatively expensive in both space and time, so XML is not recommended for services with millisecond-level performance requirements. Although WSDL can describe objects and the S in SOAP stands for "simple", SOAP is by no means simple to use. For users accustomed to object-oriented programming, WSDL files are not intuitive.

JSON (JavaScript Object Notation)
JSON originated in JavaScript, a weakly typed language, and grew out of a concept called the associative array, which in essence describes an object as attribute-value pairs. In fact, in weakly typed languages such as JavaScript and PHP, a class is described as an associative array. The following advantages have quickly made JSON one of the most widely used serialization protocols:
1. The associative array format matches the way engineers think about objects.
2. It keeps XML's advantage of being human-readable.
3. Its serialized data is more concise than XML. A study at the following link shows that a file serialized as XML is approximately twice the size of the same data serialized as JSON: http://www.codeproject.com/Articles/604720/JSON-vs-XML-Some-hard-numbers-about-verbosity
4. JavaScript supports it natively, so it is widely used in web browser applications and is the de facto standard protocol for Ajax.
5. Its protocol is simpler than XML's, so it parses faster.
6. The loose associative array structure gives it good extensibility and compatibility.

The IDL paradox
JSON is so simple, or rather so similar to classes in all kinds of languages, that serializing with JSON requires no IDL. This seems almost magical: a naturally occurring serialization protocol that is cross-language and cross-platform all by itself. The reality is less magical, and the impression arises for two reasons:
First, in weakly typed languages the associative array is the concept of a class; in PHP and JavaScript an associative array is how a class is actually implemented, so JSON is very well supported in these languages.
Second, the purpose of an IDL is to write IDL files, which an IDL compiler turns into code (stub/skeleton); that code is what actually performs serialization and deserialization. Because an associative array is so similar to a class in ordinary languages that the two correspond one to one, a single standard body of code can perform the conversion. Weakly typed languages that support associative arrays can manipulate JSON-serialized data natively; strongly typed languages such as Java can handle it in a uniform way with reflection, for example with Google's Gson.
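
The one-to-one correspondence between a class and an associative array is what lets a single generic routine stand in for generated stub code. The sketch below is a hand-rolled illustration of the reflection approach, not Gson's implementation or API; the class `ReflectiveJsonSketch` is hypothetical, and it assumes non-null String and Number fields with no escaping and no nested objects.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.StringJoiner;

// Hand-rolled stand-in for what a JSON library does far more completely.
final class ReflectiveJsonSketch {

    static final class UserInfo {
        Integer userid;
        String name;

        UserInfo(Integer userid, String name) {
            this.userid = userid;
            this.name = name;
        }
    }

    // Emits one "field": value pair per instance field, discovered by reflection.
    // Assumptions: non-null String and Number fields only, no escaping, no nesting.
    static String toJson(Object obj) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Field field : obj.getClass().getDeclaredFields()) {
            if (field.isSynthetic() || Modifier.isStatic(field.getModifiers())) {
                continue; // skip compiler- or tool-generated fields and static fields
            }
            field.setAccessible(true);
            Object value;
            try {
                value = field.get(obj);
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
            String rendered = (value instanceof Number) ? value.toString() : "\"" + value + "\"";
            json.add("\"" + field.getName() + "\":" + rendered);
        }
        return json.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson(new UserInfo(1, "messi")));
    }
}
```

The same `toJson` works for any class of this shape without an IDL file or a compiler step. The cost is a reflective field lookup on every call, which is the overhead that makes reflection-based JSON a weaker fit where latency is tight.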

Typical application scenarios and non-application scenarios
JSON can replace XML in many scenarios, being more concise and faster to parse. Typical application scenarios include:
1. Services between companies that transfer relatively small amounts of data and have relatively loose real-time requirements (for example, at the level of seconds).
2. Ajax requests from web browsers.
3. Scenarios where interfaces change often and debuggability matters, such as communication between mobile apps and servers, because JSON has very good backward and forward compatibility.
4. Access across firewalls, because the typical way to use JSON is JSON over HTTP.

In general, JSON serialization carries a large extra space overhead. For services with large data volumes, or for persistence, this means huge memory and disk overhead, so JSON is not suitable for those scenarios. With no unified IDL, JSON places weaker constraints on the participating parties; in practice the agreement can only be recorded in documentation, which can make debugging inconvenient and lengthen the development cycle. Because JSON serialization and deserialization rely on reflection in some languages, JSON is not recommended where millisecond-level performance is required.

Example of serialized data
The following is an example of UserInfo after serialization:

{"userid": 1, "name": "Messi", "address": [{"city": "Beijing", "postcode": "1000000", "street": "Wangjingdonglu"}]}

Thrift
Thrift is a high-performance, lightweight RPC framework open-sourced by Facebook, created to meet today's needs for large-volume, distributed, cross-language, and cross-platform data communication. Thrift is not just a serialization protocol but a full RPC framework. Compared with JSON and XML, Thrift is a great improvement in both space overhead and parsing performance, which makes it an excellent RPC solution for distributed systems with high performance requirements. However, because Thrift's serialization is embedded in the Thrift framework, and the framework itself does not expose serialization and deserialization interfaces, it is hard to combine with other transport protocols such as HTTP.

Typical application scenarios and non-application scenarios
Thrift is an excellent solution for high-performance, distributed RPC services. It supports many languages and a rich set of data types, and it tolerates the addition and removal of data fields well. It is therefore well suited as the standard RPC framework for building a service-oriented architecture (SOA) within a company.

However, Thrift's documentation is relatively sparse, and its current user base is relatively small. Moreover, because its server is built on its own socket service, security is a concern for access across firewalls, so it should be used with caution for communication between companies. Thrift-serialized data is also a binary array that is not human-readable, which makes debugging relatively difficult. Finally, because Thrift's serialization is tightly coupled to the framework and cannot read or write data directly to a persistence layer, it is not suitable as a serialization protocol for data persistence.

IDL file example
struct Address
{
    1: required string city;
    2: optional string postcode;
    3: optional string street;
}
struct UserInfo
{
    1: required i32 userid;
    2: required string name;
    3: optional list<Address> address;
}

Protobuf
Protobuf has many of the typical characteristics of an excellent serialization protocol:
1. A standard IDL and IDL compiler, which makes it very engineer-friendly.
2. Very concise, compact serialized data: compared with XML, its serialized size is about 1/3 to 1/10.
3. Very fast parsing, about 20 to 100 times faster than the corresponding XML.
4. Very friendly libraries that are simple to use; deserialization takes only one line of code.
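
One reason Protobuf's output is compact is that it writes integers as base-128 varints, so small values take fewer bytes than a fixed 4-byte int. The sketch below illustrates that encoding for non-negative ints only; the class `VarintSketch` is hypothetical and is not the Protobuf library's API.

```java
import java.io.ByteArrayOutputStream;

// Illustration of base-128 varint encoding for non-negative ints; not the Protobuf API.
final class VarintSketch {

    // 7 payload bits per byte, least significant group first;
    // the high bit is set on every byte except the last.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Reassembles the 7-bit groups until a byte without the continuation bit is seen.
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("1   -> " + encode(1).length + " byte(s)");
        System.out.println("300 -> " + encode(300).length + " byte(s)");
        System.out.println("round trip: " + decode(encode(300)));
    }
}
```

A userid of 1 fits in a single byte and 300 in two bytes (0xAC 0x02), whereas the textual form in XML also carries the surrounding tag names.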

Protobuf is purely a presentation-layer protocol that can be used with a variety of transport protocols, and its documentation is very thorough. However, because Protobuf comes from Google, at the time of writing it supported only three languages: Java, C++, and Python. It also supports relatively few data types and has no constant type. And because it was designed purely as a presentation-layer protocol, at the time of writing there was no RPC framework dedicated to Protobuf.

Typical application scenarios and non-application scenarios
Protobuf has a broad user base, and its small space overhead and high parsing performance are its highlights, which makes it ideal for RPC calls within a company that have high performance requirements. Because Protobuf provides a standard IDL and a matching compiler, its IDL file acts as a strong business contract among all participating parties. In addition, Protobuf is independent of the transport layer, and HTTP passes through firewalls well, so Protobuf is also suitable for scenarios between companies that have high performance requirements. Thanks to its high parsing performance and relatively small serialized size, it is very well suited to persisting application-layer objects.

Its main problems are that it supports relatively few languages and that, because it is not bound to a standard underlying transport protocol, debugging the transport protocol between companies is relatively troublesome.

IDL file example
message Address
{
    required string city = 1;
    optional string postcode = 2;
    optional string street = 3;
}
message UserInfo
{
    required string userid = 1;
    required string name = 2;
    repeated Address address = 3;
}

Avro
Avro was created to address JSON's verbosity and its lack of an IDL. It began as a subproject of Apache Hadoop. Avro provides two serialization formats: JSON and binary. The binary format is comparable to Protobuf in space overhead and parsing performance, while the JSON format makes debugging easier during testing. Avro supports a very rich set of data types, including a union type like the one in C++. Avro supports an IDL in JSON format as well as an IDL similar to those of Thrift and Protobuf (experimental), and the two can be converted into each other. The schema can be sent together with the data, which, combined with the self-describing nature of JSON, makes Avro a very good fit for dynamically typed languages. When Avro data is persisted to a file, the schema is usually stored along with it, so an Avro file is self-describing, which makes it a very good persistence format for Hive, Pig, and MapReduce. When the two sides of an RPC call use different schema versions, the server and client can agree on the schema during the handshake, which greatly improves the speed of the subsequent data parsing.

Typical application scenarios and non-application scenarios
Avro parses quickly and its serialized data is very concise, so it is well suited to high-performance serialization services.

Avro's non-JSON IDL is still experimental, and the JSON-format IDL is not intuitive for engineers accustomed to statically typed languages.

IDL file example
protocol UserService {
    record Address {
        string city;
        string postcode;
        string street;
    }
    record UserInfo {
        string name;
        int userid;
        array<Address> address = [];
    }
}
The corresponding JSON schema format is as follows:

{
    "protocol": "UserService",
    "namespace": "org.apache.avro.ipc.specific",
    "version": "1.0.5",
    "types": [{
        "type": "record",
        "name": "Address",
        "fields": [{
            "name": "city",
            "type": "string"
        }, {
            "name": "postcode",
            "type": "string"
        }, {
            "name": "street",
            "type": "string"
        }]
    }, {
        "type": "record",
        "name": "UserInfo",
        "fields": [{
            "name": "name",
            "type": "string"
        }, {
            "name": "userid",
            "type": "int"
        }, {
            "name": "address",
            "type": {
                "type": "array",
                "items": "Address"
            },
            "default": []
        }]
    }],
    "messages": {}
}

#5. Benchmark and selection suggestions

##Benchmark
The following data is from https://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking

[Figure: parsing performance]

[Figure: space overhead of serialization]

The following conclusions can be drawn:
1. XML serialization (XStream) performs poorly in both speed and compactness.
2. Thrift is at some disadvantage compared with Protobuf in both time and space overhead.
3. Protobuf and Avro both perform very well on both measures.

##Selection suggestions

The five serialization and deserialization protocols described above each have their own characteristics and suit different scenarios:
1. For system calls between companies, if the service's latency requirement is looser than 100 ms, the XML-based SOAP protocol is worth considering.
2. For browser-based Ajax and for communication between mobile apps and servers, JSON is the first choice. JSON is also a very good choice where performance requirements are modest, where dynamically typed languages dominate, or where the payloads transferred are small.
3. Where the debugging environment is poor, using JSON or XML can greatly improve debugging efficiency and reduce development cost.
4. Where both performance and compactness are critical, Protobuf, Thrift, and Avro compete with one another.
5. For persisting terabyte-level data, Protobuf and Avro are the first choices. If the persisted data is stored in Hadoop subprojects, Avro is the better choice.
6. Because Avro's design leans toward dynamically typed languages, Avro is the better choice where dynamic languages dominate.
7. If the persistence layer is not a Hadoop project and statically typed languages dominate, Protobuf better fits the habits of engineers who work in statically typed languages.
8. If you need a complete RPC solution, Thrift is a good choice.
9. If the serialized data must travel over different transport protocols, or for high-performance scenarios that require access across firewalls, Protobuf can be considered first.

http://tech.meituan.com/serialization_vs_deserialization.html
