Encoding format details

Last Update:2018-12-04 Source: Internet

Author: User


I believe many people, like me, have long been confused about character encodings. We know that gb2312 is a Chinese encoding, and when we see garbled characters we know the encoding is wrong. But what exactly went wrong, and why? When someone asks, we cannot really answer, so we just say: that is how it is, the file has to be decoded with the encoding it was written in. But why?
In fact, once you understand the cause, you can see why things behave this way, and the encoding problem turns out not to be so profound. I searched online for material on this long-standing problem and found two sources that are the easiest to understand. The first one is:
Http://www.cnblogs.com/KevinYang/archive/2010/06/18/1760597.html
I think this article explains encoding most clearly. After my first reading I thought I really understood it: gb2312 is just a character set; Unicode is the set of all characters in use; UTF-8 is an encoding whose character set is Unicode. Moreover, Unicode is a huge character set agreed on worldwide, and it includes the characters of the gb2312 character set.
With this "confidence", I began to experiment:
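To make the distinction between a character set and an encoding concrete, here is a minimal Python sketch. Python and the sample character 中 are my own choices for illustration and do not come from the original post. It prints the character's Unicode code point and then the bytes that two different encodings produce for it.

```python
# One character has one Unicode code point, but each encoding
# turns that character into a different byte sequence.
ch = "中"  # sample character chosen for illustration

code_point = ord(ch)              # position in the Unicode character set
utf8_bytes = ch.encode("utf-8")   # bytes under the UTF-8 encoding
gb_bytes = ch.encode("gb2312")    # bytes under the gb2312 encoding

print(f"U+{code_point:04X}")      # U+4E2D
print(utf8_bytes.hex(" "))        # e4 b8 ad
print(gb_bytes.hex(" "))          # d6 d0
```

The same character is three bytes in UTF-8 and two bytes in gb2312, and none of the byte values coincide.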
1. Use notepad to write an XML file, as shown below:

XML Code

   <?xml version="1.0" encoding="gb2312"?> <root> <person>Hi, big girl</person> </root>

Save the file from Notepad in UTF-8 format and open it again: the text is garbled. Without thinking much, my first impression was that a UTF-8 encoded file was being decoded as gb2312, and that some UTF-8 characters do not exist in gb2312, so something went wrong... (This understanding is wrong!)
2. Use notepad to write an XML file, as shown below:

XML Code

   <?xml version="1.0" encoding="UTF-8"?> <root> <person>Hi, big girl</person> </root>

Save the file from Notepad as ANSI (on a Simplified Chinese operating system this is gb2312 encoding), then open it: garbled again. Why? Doesn't Unicode contain the gb2312 character set? By my reasoning it should have been parsed correctly. (This continues the first idea and judges only by the size of the character set. It is still wrong.)
I had no choice but to search online for answers, which is how I found the second source:
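Both experiments can be reproduced without Notepad. The sketch below is only an approximation under stated assumptions: the sample text 你好 is hypothetical (the Chinese text of the original XML is not preserved in this translation), and plain `bytes.decode` stands in for whatever program opened the file.

```python
text = "你好"  # hypothetical sample text, not taken from the original post

# Experiment 1: bytes saved as UTF-8, then read as gb2312.
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("gb2312", errors="replace")
print(misread == text)        # False: the result is garbled

# Experiment 2: bytes saved as gb2312, then read as UTF-8.
gb_bytes = text.encode("gb2312")
try:
    gb_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
print(strict_decode_failed)   # True: these bytes are not valid UTF-8
```

In both directions the characters exist in both character sets; only the bytes on disk are being interpreted by the wrong rules.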
Http://social.msdn.microsoft.com/Forums/zh-CN/2212/thread/f656ec85-2cd0-4d6a-a207-fe30523cc5a4/
The question reads: "UTF-8 contains definitions for all the characters in gb2312, but the number assigned to each character is not exactly the same, so the text in the page may fail to match."
For the answer, see the reply from moderator Raymond Tang in the fourth post of the thread.
This explains why gb2312 characters cannot be parsed as UTF-8.
However, I think this answer still leaves a question open. As Raymond Tang said, "the number of each character is not exactly the same": the encoding of a character in gb2312 differs from that in UTF-8, so the correct characters cannot be parsed. But if only the numbers differ, the result should not change that much. Shouldn't it at least still look like Chinese characters? Why is it complete garbage?
So I reread the first article carefully and found the answer: "The gb2312 and GBK character sets use at most 2 bytes to encode every character, and they also specify the byte order. Such an encoding system usually uses a simple lookup table, the code page, to map characters directly to the byte stream on the storage device." And: "Although every character has a unique number in the Unicode character set (its code point, also called its Unicode code), the final byte stream is determined by the specific character encoding," in this case UTF-8. Simply put, the byte streams do not line up. UTF-8 is a variable-length encoding, so it cannot parse a byte stream the way gb2312 does. That is why garbled characters appear.
So Unicode does contain the gb2312 character set, but each encoding has its own way of encoding and decoding. In other words, UTF-8 defines its own rules for the byte stream, and a byte stream must be decoded with the same rules that produced it. If the rules do not match, garbled characters can appear. This is the root cause.
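The "variable length" point can be sketched in Python. The helper names `utf8_sequence_length` and `split_utf8` are hypothetical, written for this illustration, and the splitter deliberately skips validation of continuation bytes. It shows how a UTF-8 reader decides from the first byte how many bytes belong to one character.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judged from its first byte."""
    if lead_byte & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx: ASCII
    if lead_byte & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx
    if lead_byte & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx: most Chinese characters
    if lead_byte & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:#04x}")


def split_utf8(data: bytes) -> list[bytes]:
    """Cut a UTF-8 byte stream into per-character chunks (no validation)."""
    chunks, i = [], 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


sample = "A中"  # sample text chosen for illustration
utf8_chunks = split_utf8(sample.encode("utf-8"))
print([c.hex() for c in utf8_chunks])   # ['41', 'e4b8ad']
print(sample.encode("gb2312").hex())    # 41d6d0
```

A reader that applies these rules to the gb2312 bytes `d6 d0` takes `d6` as the start of a two-byte sequence, but `d0` is not a valid continuation byte, so the stream cannot be parsed as UTF-8.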
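As a rough illustration of this rule with files rather than in-memory bytes, the sketch below writes a temporary file as gb2312 and reads it back twice. The sample text and the use of a temporary file are assumptions for the example.

```python
import os
import tempfile

text = "你好"  # hypothetical sample text

# Write the file as gb2312, roughly what Notepad's ANSI option
# does on a Simplified Chinese system.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", encoding="gb2312") as f:
        f.write(text)

    # Reading with the same rule used for writing: the text survives.
    with open(path, encoding="gb2312") as f:
        same_rule = f.read()

    # Reading with a different rule: the bytes no longer mean the same thing.
    with open(path, encoding="utf-8", errors="replace") as f:
        other_rule = f.read()
finally:
    os.remove(path)

print(same_rule == text)    # True
print(other_rule == text)   # False
```

Passing `encoding=` explicitly on both sides avoids depending on the platform default, which differs between systems.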

I also found something interesting along the way: create a text document in Notepad, type only the word "联通" ("Unicom"), save it, and open it again, and it is garbled.
See: http://baike.baidu.com/view/1273097.htm
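The linked page is not reproduced here. The commonly cited explanation, which this post does not state, is that the gb2312 bytes of 联通 (the Chinese word behind the translated "Unicom") happen to match the bit pattern of two-byte UTF-8 sequences, so older versions of Notepad guessed the wrong encoding when reopening the file. The sketch below only checks the bit pattern. The Notepad detection behaviour is an assumption taken from that explanation, and `looks_like_two_byte_utf8` is a hypothetical helper.

```python
data = "联通".encode("gb2312")
print(data.hex(" "))  # c1 aa cd a8


def looks_like_two_byte_utf8(pair: bytes) -> bool:
    """True if the pair matches the UTF-8 pattern 110xxxxx 10xxxxxx."""
    lead, trail = pair
    return (lead & 0b1110_0000 == 0b1100_0000
            and trail & 0b1100_0000 == 0b1000_0000)


pairs = [data[i:i + 2] for i in range(0, len(data), 2)]
print([looks_like_two_byte_utf8(p) for p in pairs])  # [True, True]
```

A strict decoder such as Python's still rejects these bytes, because 0xC1 can only begin an overlong sequence. A looser check that looks only at the bit pattern would accept them.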

This post is my own summary, written to remind myself in case I forget, and in the hope that it helps fellow learners who do not yet understand this. Those who understood it long ago may have a laugh...
If my understanding is still off, please correct me. Thank you.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
