Encoding format details

Source: Internet
Author: User

Like me, many people have probably always been confused about character encoding. We know that gb2312 is a Chinese encoding, and when we see garbled characters we know the encoding is wrong. But what exactly went wrong, and why? When people asked me why, I could not explain it, so I just said: that is how it is; decode with the same encoding that was used to encode.
In fact, once you understand the problem, the reason becomes clear, and encoding turns out not to be so mysterious. I searched online for material on this long-standing problem and found two sources that were the easiest to understand. The first one is:
http://www.cnblogs.com/KevinYang/archive/2010/06/18/1760597.html
I think this article gives the clearest explanation. After my first read, I thought I really understood it: gb2312 is just a character set, Unicode is a character set intended to cover all characters in use, and UTF-8 is an encoding whose character set is Unicode. In addition, Unicode is a huge, universally agreed character set that includes the characters of gb2312.
With this "confidence", I began experimenting:
1. Use Notepad to write an XML file like this:

XML Code
   <?xml version="1.0" encoding="gb2312"?><root><person>Hi, big girl</person></root>

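I saved the file as UTF-8 and opened it: garbled. Without thinking much, my first impression was that a UTF-8 encoded file was being decoded as gb2312, and that some UTF-8 characters do not exist in gb2312, so something went wrong. (This understanding is wrong!) Note that this only happens when the element text contains Chinese characters; plain ASCII text is encoded identically in gb2312 and UTF-8.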
2. Use Notepad to write an XML file like this:

XML Code
   <?xml version="1.0" encoding="UTF-8"?><root><person>Hi, big girl</person></root>

I saved the file as ANSI (on a Simplified Chinese operating system, this means gb2312) and opened it: garbled again. Why? Doesn't Unicode contain the gb2312 character set? By my reasoning it should have parsed correctly. (This continues the first idea, judging only by the size of the character set. It is still wrong.)
With no other option, I went back to searching online, which led me to the second source:
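Both experiments can be reproduced without Notepad. The following is a minimal Python sketch, not part of the original post: the sample string is an assumption (any Chinese text would do), and `errors="replace"` is used only so that undecodable bytes show up as replacement characters instead of raising an exception.

```python
# Assumed sample text; not taken from the original post.
text = "你好"

# Experiment 1: the bytes are written as UTF-8 but read back as gb2312.
utf8_bytes = text.encode("utf-8")
as_gb2312 = utf8_bytes.decode("gb2312", errors="replace")

# Experiment 2: the bytes are written as gb2312 ("ANSI" on Simplified
# Chinese Windows) but read back as UTF-8.
gb_bytes = text.encode("gb2312")
as_utf8 = gb_bytes.decode("utf-8", errors="replace")

print(repr(as_gb2312))  # garbled: not the original text
print(repr(as_utf8))    # garbled: contains replacement characters

# Decoding with the same encoding that wrote the bytes restores the text.
print(utf8_bytes.decode("utf-8") == text)
print(gb_bytes.decode("gb2312") == text)
```

In both directions the characters themselves exist in the target character set; what goes wrong is only that the bytes are interpreted by the wrong rules.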
http://social.msdn.microsoft.com/Forums/zh-CN/2212/thread/f656ec85-2cd0-4d6a-a207-fe30523cc5a4/
The question there goes like this: "UTF-8 contains definitions for all the characters in gb2312, but the number assigned to each character is not exactly the same, so the text in the page may fail to match."
For the answer, see the fourth reply, by moderator Raymond Tang.
This explains why a UTF-8 decoder cannot parse gb2312-encoded characters.
However, I still saw a problem with this answer. As Raymond Tang said, "the number of each character is not exactly the same": a character's code in gb2312 differs from its code in UTF-8, so the correct characters cannot be parsed. But if only the codes differed, shouldn't the result at least still look like Chinese characters? Why is it complete garbage?
So I reread the first article carefully and found the answer: "The gb2312 and GBK character sets use at most 2 bytes to encode every character, and they also fix the byte order. Such an encoding system usually uses a simple lookup table: through the code page, characters map directly to the byte stream on the storage device." And: "Although every character has a unique number in the Unicode character set (its code point), the final byte stream is determined by the specific character encoding", which here is UTF-8. Simply put, the byte stream gets misread. UTF-8 is a variable-length encoding, so its decoder cannot parse a byte stream the way a gb2312 decoder does, and garbled characters appear.
So although Unicode contains the gb2312 character set, each encoding has its own way of encoding and decoding. UTF-8 defines its own rules for producing a byte stream, and a byte stream must be decoded with the same rules that produced it. If the two are inconsistent, garbled characters can appear. This is the root cause.
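To make the difference between a character's number and its byte stream concrete, here is a small Python sketch. It is an illustration added to the post, and the character 中 is chosen arbitrarily: the same character has one Unicode code point but different byte sequences under gb2312 and UTF-8.

```python
ch = "中"  # arbitrary example character

code_point = ord(ch)            # the character's number in Unicode
gb_bytes = ch.encode("gb2312")  # fixed two bytes for a Chinese character
u8_bytes = ch.encode("utf-8")   # variable length; three bytes here

print(hex(code_point))  # 0x4e2d
print(gb_bytes.hex())   # d6d0
print(u8_bytes.hex())   # e4b8ad

# UTF-8 is variable length: ASCII takes 1 byte, this character takes 3.
utf8_lengths = [len(c.encode("utf-8")) for c in "A中"]
# gb2312 uses 1 byte for ASCII and 2 bytes for a Chinese character.
gb_lengths = [len(c.encode("gb2312")) for c in "A中"]
print(utf8_lengths, gb_lengths)  # [1, 3] [1, 2]
```

Because the byte values and even the byte counts differ, a decoder that groups the bytes by the wrong rules produces unrelated characters rather than "nearby" Chinese ones.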

I also came across something interesting along the way: if you create a text document in Notepad, type only the word "联通" ("Unicom"), save it, and open it again, it shows up garbled.
See: http://baike.baidu.com/view/1273097.htm
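The post does not explain this behavior. A commonly given explanation is that the gb2312 bytes of "联通" happen to match the bit pattern of two-byte UTF-8 sequences (110xxxxx 10xxxxxx), so an encoding guess based on byte patterns can mistake the file for UTF-8. The Python sketch below only checks that bit pattern; the helper name is hypothetical, and how Notepad actually detects encodings is an assumption here.

```python
def looks_like_utf8_two_byte_pairs(data: bytes) -> bool:
    """Return True if data consists only of 110xxxxx 10xxxxxx byte pairs.

    Hypothetical helper for illustration; not Notepad's real detection logic.
    """
    if len(data) == 0 or len(data) % 2 != 0:
        return False
    return all(
        (data[i] & 0xE0) == 0xC0 and (data[i + 1] & 0xC0) == 0x80
        for i in range(0, len(data), 2)
    )


ansi_bytes = "联通".encode("gb2312")
other_bytes = "中".encode("gb2312")

print(ansi_bytes.hex())                             # c1aacda8
print(looks_like_utf8_two_byte_pairs(ansi_bytes))   # True
print(looks_like_utf8_two_byte_pairs(other_bytes))  # False
```

Most gb2312 text fails this check somewhere, which is why the effect only shows up for a few short strings.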

I wrote this post as a summary for myself, as a reminder in case I forget, and I hope it helps others who are still confused by this. Those who understood all of this long ago are welcome to have a laugh.
If my understanding is still off, please correct me. Thank you.
