Reference posts: http://tieba.baidu.com/f?kz=859774972 and http://topic.csdn.net/u/20090822/14/7abb7acf-e7c3-4ecd-979d-c141cd55b452.html
The garbled text "锟斤拷" appears because a string is encoded with one scheme and decoded with another, or because some symbol met during the conversion cannot be represented in Unicode.
In layman's terms, this is like encrypting a message with key A and decrypting it with key B. Naturally, the result is scrambled and wrong.
The following is an example.
On Chinese Windows systems, GBK is the default encoding. In GBK, the word "郁闷" ("depressed") is encoded as the hexadecimal bytes D3 F4 C3 C6: D3 F4 corresponds to the character "郁" (yu) and C3 C6 corresponds to the character "闷" (men). If D3 F4 C3 C6 is decoded as GBK, the word "郁闷" is recovered correctly.
Now suppose the Windows system believes that D3 F4 C3 C6 is UTF-8 encoded text and has to convert it to GBK for display. This is where the display goes wrong.
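The matching encode and decode described above can be checked with a few lines of Python. This is only a minimal sketch that relies on Python's built-in "gbk" codec; the variable names are illustrative.

```python
# Encode the word with GBK and decode it with the same codec.
text = "郁闷"
encoded = text.encode("gbk")

# Show the bytes as upper-case hexadecimal pairs.
hex_view = " ".join(f"{b:02X}" for b in encoded)
print(hex_view)

# Decoding with the same codec that produced the bytes recovers the text.
decoded = encoded.decode("gbk")
print(decoded)
```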
-------------------------------------
UTF-8 encodes text in 8-bit units. The mapping from UCS-2 (the 2-byte Unicode character set) to UTF-8 is as follows:
UCS-2 code (hex)    UTF-8 byte stream (binary)
0000-007F           0xxxxxxx
0080-07FF           110xxxxx 10xxxxxx
0800-FFFF           1110xxxx 10xxxxxx 10xxxxxx
--------------------------------------
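To make the table concrete, here is a rough sketch of an encoder that follows its three rows literally. The function name ucs2_to_utf8 is made up for this illustration, and the sketch does not reject surrogate code points (D800-DFFF), which a real encoder would.

```python
def ucs2_to_utf8(code_point):
    """Encode one UCS-2 code point (0000-FFFF) using the three table rows."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("outside the UCS-2 range covered by the table")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11100000 | (code_point >> 12),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


# The replacement character U+FFFD falls in the third row.
print(ucs2_to_utf8(0xFFFD).hex())
```

The same function agrees with Python's own UTF-8 codec for ordinary characters, for example ucs2_to_utf8(ord("郁")) and "郁".encode("utf-8").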
Because the system believes that D3 F4 C3 C6 is UTF-8 encoded, it first converts the bytes to Unicode and then uses the GBK encoding table to turn the result into Chinese characters.
Apply the UTF-8 and Unicode conversion rules given above in reverse:
1, first analyze the byte D3, D3 binary represents 11010011, view the table above, starting with 110, must be two bytes of UTF-8 characters, so as to take D3 F4 as a whole analysis.
2. the binary representation of the dual-byte D3 F4 and D3 F4 is 110110011 11110100. In the preceding table, the binary value starting with 110 must start with 10, and F4 binary represents the beginning of 11, so D3 F4 can not find the corresponding encoding in the UTF-8.
3, because cannot match to the correct UTF-8 code, so discard D3, fill as UTF-8 missing characters ef bf bd, that is, Unicode placeholder U + fffd, symbol?
4. Analysis of F4, C3, C6 in turn can not match the correct UTF-8 code, is also filled with EF BF BD
5. The final converted byte stream is ef bf bd.
6. In the GBK encoding table, find the corresponding encoding and decode it into Chinese characters. Because ef bf corresponds to Baidu, and bd ef corresponds to bf bd, copy the corresponding data to get the result"
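The byte-by-byte analysis in the steps above can be sketched as a small scanner. The helper names (expected_length, is_continuation, scan) are invented for this illustration, and the logic only covers the three-row table given earlier. It is not a full UTF-8 validator and skips checks such as overlong forms.

```python
def expected_length(byte):
    """Sequence length implied by a lead byte per the table, or 0 if none fits."""
    if byte >> 7 == 0b0:
        return 1
    if byte >> 5 == 0b110:
        return 2
    if byte >> 4 == 0b1110:
        return 3
    return 0  # a continuation byte (10xxxxxx) or a pattern outside the table


def is_continuation(byte):
    """True if the byte has the form 10xxxxxx."""
    return byte >> 6 == 0b10


def scan(data):
    """Walk the bytes and report each valid sequence or invalid byte."""
    report = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        tail = data[i + 1:i + n]
        if n and len(tail) == n - 1 and all(is_continuation(b) for b in tail):
            report.append((data[i:i + n].hex(), "valid"))
            i += n
        else:
            # Discard one byte; a decoder would emit U+FFFD here.
            report.append((data[i:i + 1].hex(), "invalid"))
            i += 1
    return report


for item in scan(bytes.fromhex("D3F4C3C6")):
    print(item)
```

Under these assumptions all four bytes are reported as invalid, which matches the four EF BF BD placeholders in step 5.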
Therefore, when a GBK-encoded byte stream is decoded as UTF-8, the bytes cannot be matched and are converted to the byte stream of the Unicode replacement character, which in turn produces the classic garbled text "锟斤拷".
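The whole chain can be reproduced with Python's built-in codecs. This is a sketch of the scenario rather than of what Windows does internally; the errors="replace" handler stands in for the "fill with U+FFFD" behaviour, and the variable names are illustrative.

```python
original = "郁闷"

# Step A: the text is stored as GBK bytes.
gbk_bytes = original.encode("gbk")

# Step B: the bytes are wrongly read as UTF-8; every byte that cannot be
# matched is replaced by U+FFFD.
misread = gbk_bytes.decode("utf-8", errors="replace")

# Step C: the placeholders are written back out as UTF-8 (EF BF BD each).
placeholder_bytes = misread.encode("utf-8")

# Step D: those bytes are finally interpreted as GBK.
garbled = placeholder_bytes.decode("gbk")

print(placeholder_bytes.hex())
print(garbled)
```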