Analysis of GBK-to-UTF-8 garbled text

Source: Internet
Author: User
Reference posts:
http://tieba.baidu.com/f?kz=859774972
http://topic.csdn.net/u/20090822/14/7abb7acf-e7c3-4ecd-979d-c141cd55b452.html

The garbled text "锟斤拷" arises either because a symbol is encoded with one method and decoded with another, or because some symbols cannot be represented in Unicode during conversion.
In layman's terms, this is like encrypting a message with key A and decrypting it with key B. Naturally, the result is nonsense.

The following is an example.

Chinese Windows systems use the GBK encoding by default. In GBK, the Chinese word "郁闷" (depressed) is encoded as the hexadecimal bytes D3 F4 C3 C6: D3 F4 corresponds to the character "郁" and C3 C6 corresponds to the character "闷". If D3 F4 C3 C6 is decoded using GBK, the word "郁闷" is obtained correctly.
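As a quick check of the byte values quoted above, here is a minimal sketch using Python's built-in `gbk` codec. The variable names are illustrative only.

```python
# Encode the example word in GBK and inspect the resulting bytes.
text = "郁闷"
gbk_bytes = text.encode("gbk")
print(gbk_bytes.hex(" ").upper())   # D3 F4 C3 C6

# Each of the two characters occupies two bytes in GBK.
print(gbk_bytes[:2].decode("gbk"))  # 郁
print(gbk_bytes[2:].decode("gbk"))  # 闷

# Decoding with the same codec that did the encoding restores the text.
round_trip = gbk_bytes.decode("gbk")
print(round_trip == text)           # True
```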

Now suppose the Windows system believes that D3 F4 C3 C6 is UTF-8 encoded data that has to be converted to GBK for display. The result is garbled text.

-------------------------------------
UTF-8 encodes text in 8-bit units. The mapping from UCS-2 (the 2-byte Unicode character set) to UTF-8 is as follows:

UCS-2 code (hex)    UTF-8 byte stream (binary)
0000-007F           0xxxxxxx
0080-07FF           110xxxxx 10xxxxxx
0800-FFFF           1110xxxx 10xxxxxx 10xxxxxx
--------------------------------------
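The table can be read as a small algorithm. The sketch below is only an illustration of it, not a full UTF-8 encoder: the function name `ucs2_to_utf8` is made up for this example, and it does not reject the surrogate range D800-DFFF as a conforming encoder would.

```python
def ucs2_to_utf8(code_point: int) -> bytes:
    """Encode one UCS-2 code point (0x0000-0xFFFF) following the table."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("code point outside the UCS-2 range")
    if code_point <= 0x7F:
        # 0000-007F: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 0080-07FF: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b00111111)])
    # 0800-FFFF: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0b11100000 | (code_point >> 12),
                  0b10000000 | ((code_point >> 6) & 0b00111111),
                  0b10000000 | (code_point & 0b00111111)])

# The Unicode replacement character U+FFFD falls in the third row.
print(ucs2_to_utf8(0xFFFD).hex(" ").upper())            # EF BF BD
# Cross-check against Python's built-in encoder.
print(ucs2_to_utf8(ord("郁")) == "郁".encode("utf-8"))  # True
```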

Because the system believes that D3 F4 C3 C6 is UTF-8 encoded, it first converts the bytes to Unicode and then looks up the corresponding codes in the GBK encoding table to produce the Chinese characters.
To do this, it applies the UTF-8/Unicode conversion rules given above in reverse.

1. First analyze the byte D3. In binary, D3 is 11010011. According to the table above, a byte starting with 110 must begin a two-byte UTF-8 character, so D3 F4 is analyzed as a unit.

2. The binary representation of the two bytes D3 F4 is 11010011 11110100. According to the table above, a byte starting with 110 must be followed by a byte starting with 10, but F4 starts with 11, so D3 F4 does not correspond to any UTF-8 encoding.

3, because cannot match to the correct UTF-8 code, so discard D3, fill as UTF-8 missing characters ef bf bd, that is, Unicode placeholder U + fffd, symbol?

4. F4, C3, and C6 are then analyzed in turn. None of them matches a valid UTF-8 code either, so each is also replaced with EF BF BD.

5. The final converted byte stream is EF BF BD EF BF BD EF BF BD EF BF BD, that is, one EF BF BD for each of the four original bytes.

6. Look up these bytes in the GBK encoding table, two at a time, and decode them into Chinese characters. EF BF corresponds to "锟", BD EF corresponds to "斤", and BF BD corresponds to "拷", so the result reads "锟斤拷".
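The steps above can be reproduced with Python's built-in codecs. This is a sketch under two assumptions: that the system behaves like Python's `errors="replace"` handler, and that the table's prefixes are enough to classify a byte. The helper `utf8_byte_kind` is a name made up for this example. It covers only the three rows of the table, so F4 (which starts with 11110) is reported as outside it.

```python
def utf8_byte_kind(byte: int) -> str:
    """Classify a byte by its leading bits, following the UTF-8 table."""
    if byte >> 7 == 0b0:
        return "single-byte character"
    if byte >> 6 == 0b10:
        return "continuation byte"
    if byte >> 5 == 0b110:
        return "lead byte of a 2-byte sequence"
    if byte >> 4 == 0b1110:
        return "lead byte of a 3-byte sequence"
    return "outside the table"

raw = bytes.fromhex("D3 F4 C3 C6")   # GBK bytes of the example word

# Steps 1-2: no byte starts with 10, so no lead byte gets a continuation.
for b in raw:
    print(f"{b:02X} {b:08b} {utf8_byte_kind(b)}")

# Steps 3-4: each unmatched byte becomes the replacement character U+FFFD.
as_unicode = raw.decode("utf-8", errors="replace")
print(len(as_unicode), all(ch == "\ufffd" for ch in as_unicode))  # 4 True

# Step 5: written back out as UTF-8, each U+FFFD is EF BF BD.
reencoded = as_unicode.encode("utf-8")
print(reencoded.hex(" ").upper())

# Step 6: reading those bytes as GBK pairs them up into Chinese characters.
garbled = reencoded.decode("gbk")
print(garbled)                       # 锟斤拷锟斤拷
```

Four input bytes produce twelve output bytes, which GBK reads as six characters, so the familiar three-character pattern appears twice.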

Therefore, when a GBK-encoded byte stream is decoded as UTF-8, its bytes cannot be matched and are converted to the byte stream of the Unicode replacement character. Reading that byte stream as GBK then produces the classic garbled text "锟斤拷".
