Weakness of Hessian serialization

Source: Internet
Author: User

Hessian was originally designed as a binary web service protocol for Java. The official description is as follows:

 

The Hessian Binary Web Service Protocol makes Web Services usable without requiring a large framework, and without learning yet another alphabet soup of protocols. Because it is a binary protocol, it is well-suited to sending binary data without any need to extend the protocol with attachments.

 

Later, it was widely implemented in other programming languages such as Python, C++, C#, PHP, Ruby, and Erlang.

 

In fact, besides remote web method calls, Hessian is also commonly used as a cross-language serialization format, and it is popular because its serialization is efficient and compact.

 

The Hessian serialization protocol is well developed in both the 1.0.2 specification and the 2.0 draft, but its biggest problem is the serialization of strings (and of XML, since in Hessian XML is serialized in basically the same way as strings).

The definition of Hessian string serialization is as follows:

 

1.0:

string ::= (s b16 b8 utf-8-data)* S b16 b8 utf-8-data

2.0:

# UTF-8 encoded character string split into 64K chunks
string ::= x52 b1 b0 <utf8-data> string   # non-final chunk
       ::= 'S' b1 b0 <utf8-data>          # string of length 0-65535
       ::= [x00-x1f] <utf8-data>          # string of length 0-31
       ::= [x30-x34] <utf8-data>          # string of length 0-1023
We can see that both 1.0 and 2.0 transmit strings in segments. The difference between the two versions is that 2.0 adds special optimizations for short strings with a length of 0-31 and a length of 0-1023.
The so-called "segmentation" means that a single segment can hold at most 65535 (2^16 - 1) characters, the largest value that fits in the two-byte length field. A longer string must be split into segments of at most 65535 characters each.
For example, if a string is 65537 characters long, its Hessian 1.0 serialization looks roughly like this:
s xFF xFF <utf-8-data> S x00 x02 <utf-8-data>
The leading lower-case s marks a non-final segment of the string, and the upper-case S marks the last segment. The length that follows s or S occupies two bytes. Note that this is not the byte length of the segment, but the number of characters in it! The length is followed by the UTF-8 encoded string data.
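The segmented layout described above can be illustrated with a minimal Python sketch of a Hessian 1.0 style string encoder. The function name `encode_string_v1` and the `chunk_chars` parameter are hypothetical names chosen for illustration, and the sketch assumes the text contains only Basic Multilingual Plane characters, so that one Python character corresponds to one counted character.

```python
import struct

MAX_CHUNK_CHARS = 0xFFFF  # largest value a two-byte length field can hold


def encode_string_v1(text, chunk_chars=MAX_CHUNK_CHARS):
    """Encode text as Hessian 1.0 style string segments.

    Sketch only: assumes BMP characters, so len(piece) is the
    character count written into the two-byte length field.
    """
    out = bytearray()
    pos = 0
    # Non-final segments are tagged with a lower-case 's'.
    while len(text) - pos > chunk_chars:
        piece = text[pos:pos + chunk_chars]
        out += b"s" + struct.pack(">H", len(piece)) + piece.encode("utf-8")
        pos += chunk_chars
    # The last segment is tagged with an upper-case 'S'.
    piece = text[pos:]
    out += b"S" + struct.pack(">H", len(piece)) + piece.encode("utf-8")
    return bytes(out)


# A tiny segment size makes the layout easy to inspect.
print(encode_string_v1("h\u00e9llo", chunk_chars=4))
```

With a segment size of 4, the first segment records a length of 4 characters even though its UTF-8 data occupies 5 bytes, which is exactly the gap between character count and byte count that the rest of this article is about.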
In principle this definition should not cause much trouble, and it includes optimizations for internationalized text and for both long and short strings.
In practice, however, when the string is large (why would it be large? Think of returning a big XML result set of more than 2 MB), this scheme consumes a lot of resources and is extremely inefficient.
(1) Character encoding problems.
Hessian strings use the English-friendly UTF-8 encoding to reduce the transmission size. However, UTF-8 is a variable-length multi-byte encoding, which increases the transmission overhead for non-English characters. More importantly, UTF-8 encoding and decoding require slow byte-by-byte scanning. When there are many characters, this becomes the bottleneck of the whole serialization.
In terms of efficiency, I personally think UTF-16 would have been the better choice for Hessian. Although UTF-8 output is a little smaller, its processing overhead is considerably larger.
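As a small illustration of the size trade-off, the following Python sketch compares the encoded size of an English and a Chinese sample in UTF-8 and UTF-16. The helper name `encoded_sizes` and the sample strings are made up for this example. It only measures sizes and says nothing about processing speed.

```python
def encoded_sizes(text):
    """Return (characters, UTF-8 bytes, UTF-16 bytes) for text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-be")),  # big-endian, no byte order mark
    )


for label, sample in [("English", "serialization"), ("Chinese", "序列化协议")]:
    chars, utf8_bytes, utf16_bytes = encoded_sizes(sample)
    print(f"{label}: {chars} chars, {utf8_bytes} UTF-8 bytes, {utf16_bytes} UTF-16 bytes")
# English: 13 chars, 13 UTF-8 bytes, 26 UTF-16 bytes
# Chinese: 5 chars, 15 UTF-8 bytes, 10 UTF-16 bytes
```

For these samples, UTF-8 is half the size of UTF-16 for English text but 50% larger for Chinese text, and in UTF-8 the number of bytes per character varies, so a character count alone does not tell a reader how many bytes to consume.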

(2) Segmentation issues.
Segmentation brings some advantages for transmission performance and for checking each segment separately. However, splitting and recombining a huge string, especially when the string must be scanned byte by byte, adds a great deal of overhead.
(3) Character length issues. (Important)
Hessian stores the number of characters in each segment, but not the byte length of the segment. As a result, deserialization can only proceed by scanning byte by byte, and there is no way to optimize the read. Validating and reading UTF-8 byte by byte is not a big problem in efficient languages such as C++, C#, or Java. However, in slower languages without native Unicode strings, such as PHP, Python, Ruby, and Erlang, it is painfully slow.
In a simple test, PHP took more than 30 seconds to parse a 3 MB Hessian string (XML) and half an hour to parse a 30 MB one, while C# took less than 1 second!
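To make the cost concrete, here is a minimal Python sketch of reading one Hessian 1.0 style segment when only the character count is known. The names `utf8_byte_length` and `read_chunk` are hypothetical, and the sketch assumes each counted character is one UTF-8 encoded code point. The reader has to inspect the lead byte of every character just to find out where the segment ends.

```python
def utf8_byte_length(buf, start, char_count):
    """Return how many bytes the next char_count UTF-8 characters occupy."""
    pos = start
    for _ in range(char_count):
        lead = buf[pos]
        if lead < 0x80:              # 1-byte sequence (ASCII)
            pos += 1
        elif lead >> 5 == 0b110:     # 2-byte sequence
            pos += 2
        elif lead >> 4 == 0b1110:    # 3-byte sequence
            pos += 3
        elif lead >> 3 == 0b11110:   # 4-byte sequence
            pos += 4
        else:
            raise ValueError(f"invalid UTF-8 lead byte at offset {pos}")
    return pos - start


def read_chunk(buf, start):
    """Read one segment; return (tag, text, offset of the next segment)."""
    tag = buf[start:start + 1]
    char_count = int.from_bytes(buf[start + 1:start + 3], "big")
    data_start = start + 3
    # The byte length is unknown, so every character must be scanned.
    byte_count = utf8_byte_length(buf, data_start, char_count)
    end = data_start + byte_count
    return tag, buf[data_start:end].decode("utf-8"), end


data = b"s\x00\x04h\xc3\xa9ll" + b"S\x00\x01o"
print(read_chunk(data, 0))
```

The per-character loop in `utf8_byte_length` runs in interpreted code for every character of the payload, which is the kind of work that scales poorly for multi-megabyte strings in scripting languages.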
It seems that to make PHP and similar languages faster, each segment must be marked with its actual byte size in addition to its number of characters, although this breaks the standard:
string ::= (s b16 b8 C16 C8 utf-8-data)* S b16 b8 C16 C8 utf-8-data
C16 and C8 hold the number of bytes that the string data actually occupies in this segment. With this mark, PHP and similar languages can process large Hessian strings efficiently: 30 MB in 1 second!
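The proposed non-standard layout can be sketched in Python as follows. The names `encode_chunk_ext` and `read_chunk_ext` are hypothetical, and the sketch assumes one counted character per code point. One consequence of a two-byte byte count is that a segment's data cannot exceed 65535 bytes, so this sketch rejects segments that would not fit.

```python
import struct


def encode_chunk_ext(text, final=True):
    """Encode one segment as: tag, char count (2 bytes), byte count (2 bytes), data."""
    data = text.encode("utf-8")
    if len(text) > 0xFFFF or len(data) > 0xFFFF:
        # Both counts are two-byte fields in the proposed layout.
        raise ValueError("segment too large for two-byte length fields")
    tag = b"S" if final else b"s"
    return tag + struct.pack(">HH", len(text), len(data)) + data


def read_chunk_ext(buf, start):
    """Read one extended segment; return (tag, text, offset of the next segment)."""
    tag = buf[start:start + 1]
    _char_count, byte_count = struct.unpack_from(">HH", buf, start + 1)
    data_start = start + 5
    end = data_start + byte_count
    # The byte count allows one slice and one decode call, with no per-byte scan.
    return tag, buf[data_start:end].decode("utf-8"), end


segment = encode_chunk_ext("h\u00e9llo")
print(segment)
print(read_chunk_ext(segment, 0))
```

Because the reader knows the byte length up front, the whole segment can be handed to the runtime's built-in decoder in a single call instead of being walked character by character in interpreted code.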
If you have a better solution, please kindly advise!
