Articles on encoding from jiasber's Java Hut (1): the difference between a coded character set and a character set encoding

Source: Internet
Author: User

It should be emphasized again that both the historical UCS and the current Unicode are coded character sets, not character set encodings. It takes a little time to grasp this distinction, but once you do, the logic behind converting back and forth between web pages, systems, and encoding standards becomes clear and easy to follow.

First, let's talk about the character set in the most general sense.
An abstract character set is simply a set of characters. For example, all the English letters form an abstract character set, and so do all the Chinese characters. Of course, all the symbols of every language in the world put together can also be called an abstract character set, so the division is quite arbitrary. It is called abstract because the characters in it are not concrete characters. Take the Chinese character "Han" (汉) as an example. The "Han" shown in this article is actually one concrete manifestation of the character: its graphical representation, written as a Chinese character rather than in pinyin. When people say "Han" aloud, they are using another concrete form of the same character, its sound. In either case, both refer to the same character "Han". One character may have many representations (dot matrix, vector, audio, printed typefaces, cursive script, and so on). If every form of the same character were included in the character set, the set would be too large, redundant, and hard to manage. Therefore the characters in an abstract character set are all unique abstract characters, independent of any concrete representation.

The characters in an abstract character set have no order. No one can say which character comes before which, and such abstract characters are something only people can understand. Once each character in an abstract character set has been assigned an integer number (there is no restriction on how large this integer may be), the character set becomes ordered and is then a coded character set. The number also uniquely identifies which character is meant. The same character may have different numbers in different coded character sets. Take the character "child" (儿): in Unicode its number is 0x513f, meaning it is character number 0x513f in Unicode (hexadecimal is used only for convenience; the number does not have to be written in hexadecimal). In another coded character set, such as Big5, the same character is 0xa449. On the other hand, many characters are assigned the same number in different coded character sets: the English letter "A" is number 0x41 in both ASCII and Unicode. When we speak of the Unicode character set, then, we mean a character set whose characters have been assigned integer numbers. To be clear, however, this number is not necessarily the value used when the character is stored in a computer. Which binary integer represents the character in storage is decided by the character set encoding.
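
As a rough illustration of the numbering described above, the following Python sketch compares the integer that Unicode assigns to a character with the values used by other character sets. It assumes Python's built-in `ord` and the standard `ascii` and `big5` codecs; the helper name `big5_number` is made up for this example.

```python
# Unicode numbers (code points) are plain integers; hex is only a display choice.
child = "\u513f"  # the character "child" (儿)

def big5_number(ch: str) -> int:
    """Hypothetical helper: read a character's Big5 bytes as one integer."""
    return int.from_bytes(ch.encode("big5"), "big")

print(hex(ord(child)))          # the character's number in Unicode
print(hex(big5_number(child)))  # the same character's number in Big5

# "A" has the same number, 0x41, in ASCII and in Unicode.
print(hex(ord("A")), hex("A".encode("ascii")[0]))
```

The two numbers printed for "child" differ because each character set does its own numbering, while the two numbers printed for "A" agree.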

(In object-oriented terms, the abstract character set is like an abstract class. It declares an abstract "Han", but whether "Han" is spoken or written is left to a concrete implementation that inherits from the class.
The coded character set is a subclass of that abstract class and declares its own "Han". How does this subclass implement "Han"? By assigning "Han" an integer.
Then comes the character set encoding. Er... if we treat that subclass as an interface, the character set encoding is its implementation inside a computer. This seems to get more complicated the more I explain it, so let's return to the original author's excellent explanation!
)

 
A character set encoding determines how a character's integer number is mapped to a binary integer value. Some encoding schemes simply store the integer number itself as the character's representation in the computer; for example, in almost all character set encoding schemes, the integer number of an English letter is identical to the binary value stored in the computer. Other encoding schemes, such as the UTF-8 encoding form used with the Unicode character set, transform the integer numbers of a large share of the characters before storing them. Take "Han" as an example: its Unicode value is 0x6c49, but after UTF-8 encoding its value is 0xe6b189 (note that it has become three bytes). This is only an example; for the detailed UTF-8 encoding rules, see "Mapping codepoints to Unicode encoding forms" at http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=iws-AppendixA#sec3.
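
The transformation of "Han" (汉, 0x6c49) into 0xe6b189 can be reproduced by hand. The sketch below covers only the three-byte case of UTF-8, which applies to numbers from 0x0800 to 0xFFFF, and checks the result against Python's built-in codec. The function name `utf8_three_bytes` is an assumption made for this example, not part of any library.

```python
def utf8_three_bytes(code_point: int) -> bytes:
    """Encode a number in the range 0x0800-0xFFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("this sketch only covers the three-byte range")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])

han = "\u6c49"  # the character "Han"
manual = utf8_three_bytes(ord(han))
print(manual.hex())                   # e6b189
print(manual == han.encode("utf-8"))  # True
print("A".encode("utf-8").hex())      # 41: an English letter is stored unchanged
```
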
Another encoding scheme we often hear about is UTF-16. It leaves the numbers of the first 65536 Unicode characters unconverted and uses them directly as the stored values (characters after the first 65536 are handled differently). For example, the Unicode number of the Chinese character "Han" is 0x6c49, and after UTF-16 encoding it is still stored as 0x6c49! I suspect it is because of UTF-16 that so many people think Unicode is an encoding (to repeat, it is a character set), and that when many people say Unicode they really mean UTF-16. To represent Unicode characters numbered 65536 (0x10000) and above, UTF-16 provides the surrogate pair mechanism.
The surrogate pair mechanism is not commonly used at present, and some UTF-16 implementations do not even support it, so I will not discuss it in detail here. The basic idea is to use two 16-bit code units to represent one character (this applies only to characters numbered 65536 and above). Unicode's strong attachment to the number 16 has historical reasons as well as practical ones.
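
For readers who want to see the two-code idea in concrete form, here is a minimal sketch of the UTF-16 rule: numbers below 0x10000 are stored as they are, and larger numbers are split into a surrogate pair. The function name `utf16_units` is hypothetical, and U+20000 is chosen only as a sample character beyond the first 65536.

```python
def utf16_units(code_point: int) -> list:
    """Return the 16-bit values UTF-16 uses for one Unicode number."""
    if code_point < 0x10000:
        return [code_point]          # stored unchanged
    offset = code_point - 0x10000    # a 20-bit value
    high = 0xD800 | (offset >> 10)   # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)  # low 10 bits
    return [high, low]

print([hex(u) for u in utf16_units(0x6C49)])   # ['0x6c49']
print([hex(u) for u in utf16_units(0x20000)])  # ['0xd840', '0xdc00']

# Cross-check against Python's codec (big-endian, no byte order mark).
print(chr(0x20000).encode("utf-16-be").hex())  # d840dc00
```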

 

(My own rough understanding: Unicode is a coded character set, where "coded" means assigning a number to each character. UTF-8 and UTF-16 are character set encodings, where "encoding" means how that number is stored in the computer; they differ only in their storage strategy.)

Of course, there is also the most direct encoding of all, UTF-32: it transforms no Unicode character at all and stores the number as is! This scheme wastes storage space (even an English character that fits in one byte must take four bytes), so although it is easy to use (no conversion is needed), it is not widely used.
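
The storage cost can be checked directly. The sketch below counts the bytes that UTF-8, UTF-16, and UTF-32 need for the letter "A" and for "Han" (汉), using Python's standard codecs; the helper name `encoded_sizes` is made up for this example.

```python
# Compare how many bytes each encoding form needs for the same character.
# The "-be" codec variants are used so that no byte order mark is prepended.
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")

def encoded_sizes(ch: str) -> dict:
    """Hypothetical helper: map each encoding name to the encoded byte count."""
    return {name: len(ch.encode(name)) for name in ENCODINGS}

for ch in ("A", "\u6c49"):  # the letter A and the character "Han"
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# UTF-32 stores the number itself, padded to four bytes.
print("\u6c49".encode("utf-32-be").hex())  # 00006c49
```
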
Back when Unicode and UCS had not yet merged, UCS needed some care and attention too: how could it get by without character set encodings of its own? UCS-2 and UCS-4 played that role. Apart from the name, UCS-4 follows exactly the same idea as UTF-32. UCS-2 and UTF-16 also treat the first 65536 characters identically; the only difference is that UCS-2 does not support the surrogate pair mechanism, so UCS-2 can encode only the first 65536 characters and has no way to represent the rest. Nowadays, though, UCS-2 and UCS-4 are terms that only computer historians use when discussing character encoding, so let them rest in the archives.

In the next section, we will talk about GB2312 and GBK, which are related to Chinese characters.
