Analysis of Garbled Text When GBK Is Decoded as UTF-8

Last Update:2018-12-04 Source: Internet

Author: User


Reference posts: http://tieba.baidu.com/f?kz=859774972 and http://topic.csdn.net/u/20090822/14/7abb7acf-e7c3-4ecd-979d-c141cd55b452.html

The garbled string "锟斤拷" ultimately arises because text is encoded with one scheme and decoded with another, or because the conversion encounters symbols that Unicode cannot represent.
In layman's terms, it is like encrypting a message with key A and then decrypting it with key B: the result is bound to be scrambled and wrong.

The following is an example.

Chinese Windows systems use GBK as the default encoding. In GBK, the word "郁闷" ("depressed") is encoded as the hexadecimal bytes D3 F4 C3 C6: D3 F4 corresponds to the character "郁" (yu) and C3 C6 to the character "闷" (men). If D3 F4 C3 C6 is decoded as GBK, the word "郁闷" is recovered correctly.
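As a quick check of the byte values above, the following minimal Python sketch encodes the word with the built-in `gbk` codec and decodes it again. The variable names are illustrative only.

```python
# Encode the two-character word with Python's built-in GBK codec.
word = "郁闷"
gbk_bytes = word.encode("gbk")

# Render the bytes as uppercase hex pairs for comparison with the text.
hex_view = " ".join(f"{b:02X}" for b in gbk_bytes)
print(hex_view)

# Decoding with the same codec recovers the original word.
round_trip = gbk_bytes.decode("gbk")
print(round_trip)
```

Encoding and decoding with the same codec is the "same key" case from the analogy above, so nothing is lost.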

Now suppose the system mistakenly treats D3 F4 C3 C6 as UTF-8 data and has to convert it to GBK for display. The result is garbled text.

-------------------------------------
UTF-8 encodes text in 8-bit units. The mapping from UCS-2 (the 2-byte Unicode character set) to UTF-8 is as follows:

UCS-2 code point (hex)    UTF-8 byte stream (binary)
0000-007F                 0xxxxxxx
0080-07FF                 110xxxxx 10xxxxxx
0800-FFFF                 1110xxxx 10xxxxxx 10xxxxxx
--------------------------------------
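The table can be turned into a small encoder. This is a sketch only: `ucs2_to_utf8` is a hypothetical helper name, it covers just the three rows of the table (code points up to FFFF), and it rejects surrogate code points, which have no UTF-8 form.

```python
def ucs2_to_utf8(code_point: int) -> bytes:
    """Encode one code point in 0000-FFFF using the three-row table."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("the table only covers 0000-FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# U+FFFD is the Unicode replacement character.
replacement_bytes = ucs2_to_utf8(0xFFFD)
print(replacement_bytes.hex(" ").upper())
```

For U+FFFD the third row applies, which gives the three bytes EF BF BD.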

Because the system believes D3 F4 C3 C6 is UTF-8, it first converts the bytes to Unicode and then decodes the result through the GBK encoding table to obtain Chinese characters.
The conversion applies the UTF-8 and Unicode rules given above in reverse:

1, first analyze the byte D3, D3 binary represents 11010011, view the table above, starting with 110, must be two bytes of UTF-8 characters, so as to take D3 F4 as a whole analysis.

2. the binary representation of the dual-byte D3 F4 and D3 F4 is 110110011 11110100. In the preceding table, the binary value starting with 110 must start with 10, and F4 binary represents the beginning of 11, so D3 F4 can not find the corresponding encoding in the UTF-8.

3, because cannot match to the correct UTF-8 code, so discard D3, fill as UTF-8 missing characters ef bf bd, that is, Unicode placeholder U + fffd, symbol?

4. Analysis of F4, C3, C6 in turn can not match the correct UTF-8 code, is also filled with EF BF BD

5. The final converted byte stream is ef bf bd.

6. In the GBK encoding table, find the corresponding encoding and decode it into Chinese characters. Because ef bf corresponds to Baidu, and bd ef corresponds to bf bd, copy the corresponding data to get the result"
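The steps above can be sketched as a simplified decoder. The names `expected_length` and `decode_with_replacement` are hypothetical, and the sketch deliberately follows only the three-row table: it ignores four-byte sequences, overlong forms, and surrogates, so it is not a conforming UTF-8 decoder.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character


def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if lead & 0x80 == 0x00:   # 0xxxxxxx
        return 1
    if lead & 0xE0 == 0xC0:   # 110xxxxx
        return 2
    if lead & 0xF0 == 0xE0:   # 1110xxxx
        return 3
    return 0


def decode_with_replacement(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        tail = data[i + 1:i + n]
        # Every trailing byte must look like 10xxxxxx.
        if n == 0 or len(tail) != n - 1 or any(b & 0xC0 != 0x80 for b in tail):
            out.append(REPLACEMENT)  # discard one byte, emit U+FFFD
            i += 1
            continue
        # Keep the payload bits of the lead byte, then append 6 bits per tail byte.
        cp = data[i] if n == 1 else data[i] & (0xFF >> (n + 1))
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        out.append(chr(cp))
        i += n
    return "".join(out)


decoded = decode_with_replacement(bytes.fromhex("D3F4C3C6"))
reencoded = decoded.encode("utf-8")
print(len(decoded), reencoded.hex(" ").upper())
```

Each of the four input bytes fails the check in its own way (bad trailing byte, not a lead byte in the table, or end of input), so each one becomes a separate replacement character.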

In short, a GBK byte stream decoded as UTF-8 fails to match any valid sequence and is converted into a stream of Unicode replacement-character bytes. Reading those bytes as GBK yields the classic garbled text "锟斤拷".
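The whole round trip can be reproduced with Python's built-in codecs. This is a sketch assuming the lenient `errors="replace"` handler stands in for whatever the misbehaving system does. A strict decode would raise an error instead of producing garbled text.

```python
gbk_bytes = "郁闷".encode("gbk")

# Strict UTF-8 decoding refuses the GBK bytes outright.
try:
    gbk_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The "replace" handler substitutes U+FFFD for each undecodable piece.
as_text = gbk_bytes.decode("utf-8", errors="replace")

# Re-encoding as UTF-8 turns every U+FFFD into EF BF BD.
as_utf8 = as_text.encode("utf-8")

# Reading those bytes as GBK produces the familiar garbled characters.
garbled = as_utf8.decode("gbk")
print(strict_failed, garbled)
```

The original bytes are gone after the replacement step, so the garbled output cannot be converted back to the original word.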

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
