Garbled characters in JSON data

Source: Internet
Author: User

http://ifeve.com/json-code-problem/

*****************************

Background
Programmers are no strangers to character encodings: GBK, UTF-8, ASCII and the like come up all the time. Even so, garbled text still appears now and then. The usual response is to search Google or Baidu, dig a scrap of information out of some corner of the web, and copy the steps mechanically. The problem may well go away, the fix is quickly forgotten, and the next time something similar happens the same process repeats. Few people take the time to work out the root cause and why the fix works. This article uses a real case to explain what an encoding is, how garbled text arises, and how to fix it. The case starts from the lua_cjson.c library. You do not need to be familiar with it; it only serves as a vehicle for explaining the problem, so just follow the reasoning of the article.

Some time ago a colleague used the lua_cjson.c library (hereinafter cjson) in a new project to convert JSON strings into native Lua data structures, and ran into garbled Chinese text. Strangely, only a few characters were garbled, among them the character "hidden"; all the others were fine. It turned out that the JSON string was GBK-encoded. That raises two questions: why does GBK encoding cause this, and how do we solve it?

To explain this, let's start with what the JSON specification requires of a string.

JSON specification
JSON (JavaScript Object Notation) is a text format for serializing structured data. It can represent four primitive types (strings, numbers, booleans and null) and two structured types (objects and arrays).

RFC 4627 contains this passage:

A string is a sequence of zero or more Unicode characters.

Here is a brief explanation of what a Unicode character is. ASCII contains letters, digits and so on, but it has only 128 characters. Chinese characters are not part of ASCII, but Unicode does include them, so a Chinese character can be a Unicode character. Note that a Unicode character is an abstract symbol, not a particular sequence of bytes.

That raises another question: how are these characters represented in JSON text?
The encoding section of the specification says:

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Note the keyword SHALL [RFC2119]: characters must be encoded in a Unicode encoding before they can appear in JSON text, and UTF-8 is the default.
How do you determine which Unicode encoding is in use?

Since the first two characters of a JSON text will always be ASCII characters [RFC0020],
it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or
UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.

In other words, because the first two characters of a JSON text must be ASCII, the first four bytes of the stream are enough to tell whether it is encoded as UTF-8, UTF-16 (BE or LE) or UTF-32 (BE or LE). Note the units: two characters, but four bytes.

00 00 00 xx  UTF-32BE (UTF-32, big-endian)
xx 00 00 00  UTF-32LE (UTF-32, little-endian)
00 xx 00 xx  UTF-16BE (UTF-16, big-endian)
xx 00 xx 00  UTF-16LE (UTF-16, little-endian)
xx xx xx xx  UTF-8
Notes:
UTF-32 represents each character as one 32-bit (4-byte) integer.

UTF-16 represents a character as one 16-bit (2-byte) integer. If 2 bytes are not enough, it uses two consecutive 16-bit integers, so a UTF-16 character can occupy 4 bytes, just like UTF-32. Byte order applies within each 2-byte unit, not between the first two bytes and the last two.

UTF-8 represents a character as a sequence of one or more 8-bit (1-byte) units, so byte order is not an issue.
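The detection rule from RFC 4627 can be sketched in a few lines of C. This is only an illustration: the names json_encoding and detect_json_encoding are made up for this example and do not come from cjson or any other library, and treating inputs shorter than four bytes as UTF-8 is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Encodings that the RFC 4627 null-byte pattern can tell apart. */
typedef enum {
    ENC_UTF8,
    ENC_UTF16BE,
    ENC_UTF16LE,
    ENC_UTF32BE,
    ENC_UTF32LE
} json_encoding;

/* Hypothetical helper: guess the encoding of a JSON text from the
 * pattern of zero bytes in its first four octets. Inputs shorter
 * than four bytes are treated as UTF-8 (an assumption of this sketch). */
static json_encoding detect_json_encoding(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < 4)
        return ENC_UTF8;

    if (buf[0] == 0 && buf[1] == 0 && buf[2] == 0 && buf[3] != 0)
        return ENC_UTF32BE;     /* 00 00 00 xx */
    if (buf[0] != 0 && buf[1] == 0 && buf[2] == 0 && buf[3] == 0)
        return ENC_UTF32LE;     /* xx 00 00 00 */
    if (buf[0] == 0 && buf[1] != 0 && buf[2] == 0 && buf[3] != 0)
        return ENC_UTF16BE;     /* 00 xx 00 xx */
    if (buf[0] != 0 && buf[1] == 0 && buf[2] != 0 && buf[3] == 0)
        return ENC_UTF16LE;     /* xx 00 xx 00 */
    return ENC_UTF8;            /* xx xx xx xx */
}
```

For example, the four bytes '{', 0x00, '"', 0x00 match the pattern xx 00 xx 00, so the function reports UTF-16LE. Note that GBK text has no zero bytes and therefore falls into the UTF-8 case: the pattern of nulls cannot tell GBK from UTF-8.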

So far nothing mentions GBK. Does that mean JSON text cannot be GBK-encoded? And if it really cannot, why does cjson not garble all GBK text, rather than just a few characters?
The specification describes a JSON parser as follows:

A JSON parser transforms a JSON text into another representation.
A JSON parser MUST accept all texts that conform to the JSON grammar.
A JSON parser MAY accept non-JSON forms or extensions.

The cause of the garbled characters
As this description shows, the specification does not require a parser to validate the encoding of the text, and a parser may also choose to accept text in non-JSON forms.

Now let's look at what the cjson parser does. The comments at the top of the cjson source say:

Invalid UTF-8 characters are not detected and will be passed untouched.
If required, UTF-8 error checking should be done outside this library.

That is clear enough: input that is not valid UTF-8 is simply let through without any checking. So although GBK does not conform to the specification, GBK text can still be parsed. Then what is going on with "hidden" and the other garbled characters? Let's first look at how cjson handles the other two encodings in the specification, UTF-16 and UTF-32, and then return to the garbled text.

cjson does this at the start of its parsing function:

/* Detect Unicode other than UTF-8 (see RFC 4627, Sec 3)
 *
 * CJSON can support any simple data type, hence only the first
 * character is guaranteed to be ASCII (at worst: '"'). This is
 * still enough to detect whether the wrong encoding is in use. */
if (json_len >= 2 && (!json.data[0] || !json.data[1]))
    luaL_error(l, "JSON parser does not support UTF-16 or UTF-32");

As we said before, the first two characters of a JSON string must be ASCII characters, which means that a JSON string has at least two bytes. So this code first checks that the JSON string is at least 2 bytes long, then checks whether either of the first two bytes is zero to decide whether the text uses an encoding other than UTF-8. The result is plain to see: cjson does not support UTF-16 or UTF-32, even though the specification allows them.

Now let's see how "hidden" becomes garbled. Analysis of the cjson source shows that when cjson meets a backslash '\' in the byte stream, it assumes the next byte is an escape character, as in \b or \r. If it is, the escape is processed. If it is not, cjson decides that the JSON is malformed and throws the byte away, so a character represented by two bytes is broken apart.
What does the character "hidden" have to do with the backslash '\'? Looking up the encoded values of the two characters gives:
"hidden" 0x965C
"\" 0x5C

So the low byte of "hidden" is the same as the "\" character: both are 0x5C. If the byte that follows "hidden" is not one of the escapable ASCII characters such as b or r, cjson drops this byte and the one right after it, and the result is garbled text.
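The collision can be seen with a small sketch in C. The function below is hypothetical and is not taken from cjson; it only counts the bytes that a byte-by-byte scanner would take for the start of an escape sequence, that is, every byte equal to 0x5C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of cjson: count the bytes that a
 * byte-wise scanner would treat as the start of an escape sequence,
 * i.e. every byte equal to 0x5C ('\\'), with no knowledge of GBK. */
static size_t count_backslash_bytes(const unsigned char *buf, size_t len)
{
    size_t i, n = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (buf[i] == 0x5C)
            n++;
    }
    return n;
}
```

Given the four bytes '"', 0x96, 0x5C, '"' (the GBK bytes of "hidden" inside a JSON string), the function returns 1 even though the text contains no backslash character at all. A parser that works this way then goes on to interpret the closing quote as an escape code.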

So how do we solve this and let cjson handle GBK smoothly? First let's look at how GBK encodes characters and why its low byte can collide with ASCII.

The GB encoding family
Let's look at the code ranges of the GB family:
GB2312 (1980) contains 7,445 characters: 6,763 Chinese characters and 682 other characters.
Each character or symbol is represented by two bytes. To stay compatible with ASCII, it is stored using the EUC scheme.
Encoding range of Chinese characters
High byte: 0xb0–0xf7,
Low byte: 0xa1–0xfe,

This gives 72*94=6768 code points, of which 0xd7fa–0xd7fe are unused, leaving 6768-5=6763 Chinese characters.

GBK contains 21,886 characters and uses 1-byte and 2-byte encodings.
Single-byte range
8-bit: 0x0–0x7f
Double-byte range
High byte: 0x81–0xfe
Low byte: 0x40–0x7e, 0x80–0xfe

GB18030 contains 70,244 characters and uses 1-, 2- and 4-byte encodings.
Single-byte range
8-bit: 0x0–0x7f
Double-byte range
High byte: 0x81–0xfe
Low byte: 0x40–0x7e, 0x80–0xfe

Four-byte range
First byte: 0x81–0xfe
Second byte: 0x30–0x39
Third byte: 0x81–0xfe
Fourth byte: 0x30–0x39

The GB encodings are backward compatible with one another, and that creates a problem. GB2312 sets the highest bit of both bytes to 1, so at most 128*128=16384 code points can satisfy this condition. GBK and GB18030 both contain more characters than that, so, as the ranges above show, both allow the highest bit of the low byte to be 0.

The final conclusion: in GBK, any two-byte character whose low byte is 0x5c will be garbled by cjson.
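The byte ranges listed above can be written down as a short C sketch to see how many two-byte GBK codes are affected. All function names here are made up for this illustration, and the count covers every lead/trail combination that falls inside the ranges, whether or not a character is actually assigned to it.

```c
#include <assert.h>
#include <stdbool.h>

/* GBK double-byte ranges as listed in the article. */
static bool gbk_is_lead(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

static bool gbk_is_trail(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFE);
}

/* GB2312 low-byte range: the highest bit is always 1. */
static bool gb2312_is_trail(unsigned char b)
{
    return b >= 0xA1 && b <= 0xFE;
}

/* Hypothetical helper: count the two-byte GBK codes whose low byte
 * equals `trail`. Returns 0 if `trail` is not a valid GBK low byte. */
static int gbk_codes_with_trail(unsigned char trail)
{
    int lead, n = 0;

    if (!gbk_is_trail(trail))
        return 0;
    for (lead = 0; lead <= 0xFF; lead++) {
        if (gbk_is_lead((unsigned char)lead))
            n++;
    }
    assert(n == 0xFE - 0x81 + 1);   /* one code per lead byte */
    return n;
}
```

Calling gbk_codes_with_trail(0x5C) gives 126, one code for each high byte from 0x81 to 0xfe. By contrast, gb2312_is_trail(0x5C) is false, which is why text limited to the two-byte GB2312 range cannot contain this collision.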

Solution:
1) Do not use GBK; convert your strings to UTF-8.
2) Make a small change to the cjson source: before processing each byte, check whether it is greater than 127. If it is, pass that byte and the byte after it through untouched; otherwise let cjson handle it as usual.
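The idea behind the second solution can be sketched as follows. This is not the actual cjson patch: count_backslashes_gbk is a hypothetical stand-alone function that only shows the scanning rule, and it assumes the input is well-formed GBK in which every byte greater than 127 starts a two-byte character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper illustrating solution 2; not actual cjson code.
 * Counts the backslashes that are real ASCII characters, skipping the
 * low byte of every two-byte GBK character. */
static size_t count_backslashes_gbk(const unsigned char *buf, size_t len)
{
    size_t i = 0, n = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        if (buf[i] > 127) {
            /* High byte of a two-byte character: let it and the
             * following byte through without inspecting them. */
            i += 2;
            continue;
        }
        if (buf[i] == 0x5C)
            n++;
        i++;
    }
    return n;
}
```

With this rule the bytes '"', 0x96, 0x5C, '"' contain no backslash, while a genuine escape such as \n placed after the character is still found. In a real patch the same test would sit in the parser's string-scanning loop, copying both bytes to the output instead of merely skipping them.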
