What you may not know about character sets and encodings (coded character sets vs. character encodings)

Source: Internet
Author: User


After my previous article, a reader raised the question of how character sets and encodings differ. I will discuss it here.

The difference between a character set and an encoding is really the difference between a coded character set and a character encoding. Strictly speaking, a bare character set has no notion of encoding at all: it is simply an abstract collection of characters (letters, symbols, and so on) defined by some standard.

An abstract character set by itself needs little discussion: it is uncontroversial and easy to understand. What people usually call a "character set" is really a coded character set. The common Unicode, ASCII, GB2312, and GBK are all called character sets, but they are in fact coded character sets; Unicode, for example, is essentially a character set that has already been "encoded". In a coded character set every character has its own unique integer number, and the same character may have different numbers in different coded character sets. Many coded character sets are supersets of ASCII, so ASCII characters keep the same numbers in them: the English letter "A" is 0x41 in ASCII, Unicode, and GB2312 alike. You may have noticed that I put "encoded" in quotation marks just now. I did so to stress that this "encoding" is not the encoding discussed below: here it only means that each character has been assigned a number. Even so, we are used to the name "coded character set".
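To make "each character has a unique integer number" concrete, here is a minimal sketch that asks .NET for the Unicode number of a character. The class and method names (CodePointDemo, CodePointOf) are made up for illustration, and the character "汉" is written as the escape \u6C49 so the result does not depend on how the source file is saved.

```csharp
using System;

public static class CodePointDemo
{
    // Returns the integer number (code point) that the Unicode coded
    // character set assigns to the first character of the given string.
    public static int CodePointOf(string s) => char.ConvertToUtf32(s, 0);

    public static void Main()
    {
        // "A" has the same number in ASCII and in Unicode: 0x41.
        Console.WriteLine("0x{0:X4}", CodePointOf("A"));       // 0x0041
        // "汉" has no number in ASCII; in Unicode it is 0x6C49.
        Console.WriteLine("0x{0:X4}", CodePointOf("\u6C49"));  // 0x6C49
    }
}
```

Nothing here says how the number is stored; it is only a lookup in the number-to-character table.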

We often say that "the article uses UTF-8 encoding".

My personal understanding of "encoding" in that sentence is: a rule that maps the integer number of a character to a specific binary value to be stored on the computer. This is very different from the "encoded" in "coded character set" above. I will call it a "character encoding"; it is what we usually mean when we say "encoding".

At this point many people will ask: what is the difference between Unicode and UTF-8? If Unicode is a coded character set, then what is UTF-8? Is it what we usually call an encoding?

"The article uses UTF-8 encoding": I would spell this out as "the article uses the UTF-8 encoding scheme, which is based on the Unicode coded character set".

In other words, Unicode as a coded character set says nothing about storage. It is only a table that maps numbers to characters. So how do we store it on a computer? You might think of storing each number directly as a binary value. That works: it is itself a character encoding, namely the UTF-32 encoding scheme based on the Unicode coded character set. Are there more economical schemes? Yes: UTF-8, UTF-16, and so on. Before explaining why UTF-8 and similar schemes are used, I need to clear up one thing:
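The idea of "storing the number directly" can be sketched as follows. This is only an illustration: the names Utf32Demo and ToUtf32Hex are hypothetical, and big-endian byte order is chosen here purely so that the bytes read in the same order as the hexadecimal number.

```csharp
using System;
using System.Text;

public static class Utf32Demo
{
    // Encodes a string as big-endian UTF-32 and returns the bytes as hex.
    // GetBytes never emits a byte order mark, so only the number is shown.
    public static string ToUtf32Hex(string s)
    {
        var utf32BigEndian = new UTF32Encoding(bigEndian: true, byteOrderMark: false);
        return BitConverter.ToString(utf32BigEndian.GetBytes(s));
    }

    public static void Main()
    {
        Console.WriteLine(ToUtf32Hex("A"));       // 00-00-00-41
        Console.WriteLine(ToUtf32Hex("\u6C49"));  // 00-00-6C-49 ("汉")
    }
}
```

Every character costs four bytes, including "A", which is the waste that UTF-8 and UTF-16 try to avoid.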

In my previous article, "Page encoding you don't know, the browser chooses encoding, get, and post various garbled Origins", I wrote: "How do you view the hexadecimal bytes of Chinese characters? Use BitConverter.ToString(System.Text.Encoding.UTF8.GetBytes("Dorf"))." Note that this can also be written with System.Text.Encoding.Unicode.GetBytes. In VS2013, typing "Encoding." brings up the following IntelliSense list:

(The list is too long for one screenshot, so it was shown in two images.)

Two questions:

1. If Unicode is a coded character set, why does it appear in the list alongside the UTF-8 encoding scheme?

2. Why are ASCII and GB2312, both coded character sets, listed side by side with the UTF-8 encoding scheme?

Let me offer two guesses:

1. The Unicode in this list is not the Unicode coded character set itself. It is probably something closely related to it.

2. ASCII and GB2312 (in the screenshot GB2312 actually appears as Default, the operating system's current encoding, which is generally GB2312 in China) are indeed coded character sets. But each of them has only one encoding scheme, so that scheme can simply be called the ASCII or GB2312 encoding, and that is why they can appear in the list alongside UTF-8.

Both guesses are easy to confirm.

1. The metadata of the Encoding class shows:

//
// Summary:
//     Gets an encoding for the UTF-16 format using the little endian byte order.
//
// Returns:
//     An encoding for the UTF-16 format using the little endian byte order.
public static Encoding Unicode { get; }

So Unicode here really means "the UTF-16 encoding that uses little-endian byte order", that is, a UTF-16 encoding scheme based on the Unicode coded character set. BigEndianUnicode is similar: it is the UTF-16 encoding that uses big-endian byte order.

2. The ASCII and GB2312 coded character sets each have only one encoding scheme, so it is natural to speak of "ASCII encoding" or "GB2312 encoding"; the name is the same as the character set, but the meaning is different. The ASCII and GB2312 shown in the list refer to those encoding schemes, not to the coded character sets. I think this is why gb2312 or ascii can so often be used to specify both a character set and an encoding scheme, for example:
Encoding gb2312 = Encoding.GetEncoding("gb2312");
Response.ContentEncoding = gb2312; // encoding
Response.Charset = "gb2312"; // character set
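The first point can be checked directly. The sketch below (UnicodePropertyDemo and Hex are illustrative names, not part of .NET) encodes "汉", whose number is 0x6C49, with Encoding.Unicode and Encoding.BigEndianUnicode and prints the bytes in storage order.

```csharp
using System;
using System.Text;

public static class UnicodePropertyDemo
{
    // Returns the bytes that the given encoding produces for s, as hex.
    public static string Hex(Encoding encoding, string s) =>
        BitConverter.ToString(encoding.GetBytes(s));

    public static void Main()
    {
        string han = "\u6C49"; // "汉", code point 0x6C49
        Console.WriteLine(Hex(Encoding.Unicode, han));          // 49-6C (little-endian UTF-16)
        Console.WriteLine(Hex(Encoding.BigEndianUnicode, han)); // 6C-49 (big-endian UTF-16)
    }
}
```

Both properties produce the same two bytes in opposite order, which is what you would expect from two byte orders of one UTF-16 encoding scheme.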

Summary:
Unicode is a character set, not an encoding! ASCII and GB2312 are character sets too. That statement is certainly correct, but calling them "ASCII encoding" or "GB2312 encoding" is not a serious mistake either, as long as you understand that it is shorthand for "the character encoding of the ASCII coded character set".

With both guesses confirmed, let us return to the topic paused earlier: why UTF-8 and the other UTF encoding schemes are used.
UTF-8 transforms the integer numbers of most characters in the Unicode character set before storing them on the computer. To borrow an example from a reference: the Unicode value of the character "汉" is 0x6C49, but after UTF-8 encoding it is stored as 0xE6B189 (note that it has become three bytes). The UTF-16 encoding scheme leaves the numbers of the first 65536 characters of the Unicode coded character set untransformed and stores them directly (characters beyond the first 65536 are stored as a pair of 16-bit units, four bytes in total). For example, "汉", whose Unicode number is 0x6C49, is still 0x6C49 when stored after UTF-16 encoding. The UTF-32 encoding scheme transforms nothing and stores every number directly, but it wastes a lot of storage space: even an English letter that would fit in one byte takes 4 bytes.
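How 0x6C49 becomes E6 B1 89 can be worked out by hand. The following is a sketch of the three-byte UTF-8 pattern only (1110xxxx 10xxxxxx 10xxxxxx, for numbers from 0x0800 to 0xFFFF); Utf8ByHand and EncodeThreeByte are hypothetical names, and the other UTF-8 lengths are deliberately left out. The result is compared with what Encoding.UTF8 produces.

```csharp
using System;
using System.Text;

public static class Utf8ByHand
{
    // Encodes a code point in the range 0x0800..0xFFFF with the three-byte
    // UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx. Other ranges, and the
    // surrogate range 0xD800..0xDFFF, are rejected rather than handled.
    public static byte[] EncodeThreeByte(int codePoint)
    {
        bool isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (codePoint < 0x0800 || codePoint > 0xFFFF || isSurrogate)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        return new[]
        {
            (byte)(0xE0 | (codePoint >> 12)),         // top 4 bits
            (byte)(0x80 | ((codePoint >> 6) & 0x3F)), // middle 6 bits
            (byte)(0x80 | (codePoint & 0x3F)),        // low 6 bits
        };
    }

    public static void Main()
    {
        byte[] byHand = EncodeThreeByte(0x6C49);
        byte[] byLibrary = Encoding.UTF8.GetBytes("\u6C49"); // "汉"
        Console.WriteLine(BitConverter.ToString(byHand));    // E6-B1-89
        Console.WriteLine(BitConverter.ToString(byLibrary)); // E6-B1-89
    }
}
```

In binary, 0x6C49 is 0110 110001 001001; prefixing the three groups with 1110, 10, and 10 gives 0xE6, 0xB1, and 0x89.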

Because the Unicode character set has several encoding schemes, the number of bytes per character differs:
UTF-8: ASCII letters, digits, and symbols take 1 byte; Chinese characters generally take 3 bytes
UTF-16: the first 65536 characters of the Unicode coded character set take 2 bytes each
UTF-32: every character takes 4 bytes
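These byte counts can be confirmed with GetByteCount. In this sketch the names ByteCountDemo and ByteCounts are illustrative, and U+1F600 is a character I picked only as an example of one that lies beyond the first 65536.

```csharp
using System;
using System.Text;

public static class ByteCountDemo
{
    // Returns how many bytes the string s needs in UTF-8, UTF-16, and UTF-32.
    public static (int Utf8, int Utf16, int Utf32) ByteCounts(string s) =>
        (Encoding.UTF8.GetByteCount(s),
         Encoding.Unicode.GetByteCount(s),
         Encoding.UTF32.GetByteCount(s));

    public static void Main()
    {
        // "A", "汉" (U+6C49), and U+1F600 (assumed sample beyond 0xFFFF).
        string[] samples = { "A", "\u6C49", char.ConvertFromUtf32(0x1F600) };
        foreach (string s in samples)
        {
            var counts = ByteCounts(s);
            Console.WriteLine("U+{0:X4}: UTF-8 = {1}, UTF-16 = {2}, UTF-32 = {3}",
                char.ConvertToUtf32(s, 0), counts.Utf8, counts.Utf16, counts.Utf32);
        }
        // U+0041: UTF-8 = 1, UTF-16 = 2, UTF-32 = 4
        // U+6C49: UTF-8 = 3, UTF-16 = 2, UTF-32 = 4
        // U+1F600: UTF-8 = 4, UTF-16 = 4, UTF-32 = 4
    }
}
```

The last line shows the case the list does not cover: beyond the first 65536 characters, UTF-8 and UTF-16 both need four bytes.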

If someone asks:
"How many bytes does Unicode encoding take?" we can answer with confidence: first, Unicode is not an encoding! Second, the number of bytes per character depends on the encoding scheme!

Many interview questions will be asked:

1 string param = "abcAdolf";
2 int length1 = System.Text.Encoding.Unicode.GetBytes (param) .Length; // Don't forget that the unicode here is essentially a UTF-16 encoding scheme
3 int length2 = param.Length; 

The answer is 12 and 6.

Finally, for the GB2312 and ASCII character sets, the number of a character is itself the binary value stored on the computer. In other words, GB2312 and ASCII each have only one encoding scheme. The ASCII characters inside the GB2312 coded character set keep their ASCII numbers, so they still occupy 1 byte when stored. A Chinese character is likewise stored as its number in the GB2312 coded character set, which takes two bytes. This process is what we call GB2312 encoding.
If the code above uses System.Text.Encoding.Default.GetBytes(param).Length instead (with GB2312 as the system default encoding), the answers are 9 and 6.

To learn more about how the encodings work internally, see:
Http://blog.csdn.net/nodeathphoenix/article/details/7057760

 

