Getting Started with Character Set Encoding in Java (II): The Difference Between a Coded Character Set and a Character Set Encoding

Source: Internet
Author: User

Once again, it must be emphasized that both the historical UCS and today's Unicode are coded character sets, not character set encodings. Take a little time to understand this distinction, and the tangle of code pages, systems, coding standards, and conversions between them will become clear and easy to follow.

First, the character set in its most everyday sense.

An abstract character set is simply a collection of characters. All English letters form an abstract character set, all Chinese characters form another, and the symbols of every language in the world taken together can also be called an abstract character set, so the division is fairly arbitrary. The set is called "abstract" because its characters have no particular form. Take the Chinese character "汉" (Han): the "汉" you see in this article is one concrete manifestation of the character, its visual form, written as a Chinese character (not in pinyin) and rendered in a Song typeface. When people say "汉" aloud, they are using another concrete manifestation, its sound. Either way, both are the character "汉". One character can have countless representations (bitmap, vector outline, audio, italic, cursive, and so on), and putting every form of every character into the set would make it too large, highly redundant, and hard to manage. A character in an abstract character set is therefore the single abstract character itself, independent of any specific representation.

The characters in an abstract character set have no order: no one can say which character comes before which, and the set is meaningful only to people. Once each character in an abstract character set is assigned an integer number (there is no restriction on how large the integer may be), the set gains an order and becomes a coded character set. The number also uniquely identifies which character is meant. The same character may of course have different numbers in different coded character sets. The character "儿" ("son"), for example, is numbered 0x513F in Unicode (hexadecimal is used here for convenience, but nothing requires the number to be written in hexadecimal), which means it is character number 0x513F in the Unicode coded character set. In another coded character set, Big5, the same character is number 0xA449. Conversely, many characters are assigned the same number in different coded character sets: the English letter "A" is character 0x41 in both ASCII and Unicode.

What we usually call the Unicode character set is exactly such a set with integer numbers assigned. Note, however, that the number assigned to a character in a coded character set is not necessarily the value used when the character is stored in a computer. Which binary values the computer actually stores is determined by the character set encoding, discussed next.
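The numbering described above can be observed directly in Java, where `String.codePointAt` returns a character's Unicode number. The following is only a minimal sketch: the class name `CodePointDemo` and its helper methods are hypothetical names chosen for illustration, and the character "儿" is written as the escape for U+513F so the example does not depend on the source file's encoding.

```java
class CodePointDemo {
    /** Returns the Unicode number (code point) of the first character in s. */
    static int unicodeNumber(String s) {
        return s.codePointAt(0);
    }

    /** Formats a number in the conventional U+XXXX hexadecimal notation. */
    static String format(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // The letter A has number 0x41 in both ASCII and Unicode.
        System.out.println(format(unicodeNumber("A")));       // U+0041

        // The character "son", number 0x513F in Unicode.
        System.out.println(format(unicodeNumber("\u513F")));  // U+513F

        // The number is just an integer; decimal notation is equally valid.
        System.out.println(unicodeNumber("\u513F"));          // 20799
    }
}
```

Nothing here says how the character is stored; the sketch only reads back the integer that the coded character set assigns.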

A character set encoding determines how a character's integer number is mapped to a binary value. Some encoding schemes simply store the number directly as its representation in the computer; for English letters, almost every character set encoding stores a binary value identical to the letter's number. Other schemes, such as the UTF-8 encoding of the Unicode character set, transform the numbers of a large share of characters before storing them. Take "汉" again: its Unicode number is 0x6C49, but its UTF-8 encoded value is 0xE6B189 (note that it becomes three bytes). This is only an example; for the detailed UTF-8 rules, see "Mapping codepoints to Unicode encoding forms" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-AppendixA#sec3).
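The difference between a character's number and its stored bytes can be checked with a short Java sketch. It assumes nothing beyond the standard `String.getBytes(Charset)` call; the class name `Utf8Demo` and the helper `utf8Hex` are hypothetical names for illustration, and "汉" is written as the escape for U+6C49.

```java
import java.nio.charset.StandardCharsets;

class Utf8Demo {
    /** Encodes s as UTF-8 and returns the bytes as uppercase hex. */
    static String utf8Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask to treat the signed Java byte as an unsigned value.
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // English letter: stored value equals the number 0x41.
        System.out.println(utf8Hex("A"));       // 41

        // Han, number 0x6C49: UTF-8 transforms it into three bytes.
        System.out.println(utf8Hex("\u6C49"));  // E6B189
    }
}
```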

Another encoding scheme we often hear about is UTF-16. It leaves the first 65,536 Unicode numbers untransformed and uses them directly as the stored value (characters numbered 65,536 and above still have to be transformed). For example, the Unicode number of "汉" is 0x6C49, and its stored UTF-16 representation is still 0x6C49. I suspect UTF-16 is the reason many people think Unicode is an encoding (once again, it is a character set), and why so many people say Unicode when they actually mean UTF-16. UTF-16 provides a surrogate pair mechanism so that Unicode characters with code points of 65,536 (0x10000) and above can be represented.
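As a rough illustration of the UTF-16 behavior just described, the sketch below encodes "汉" and one character beyond the first 65,536 using the big-endian, BOM-free `UTF_16BE` charset. The class name `Utf16Demo`, the helper `utf16Hex`, and the choice of U+20000 as the sample supplementary character are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    /** Encodes s as big-endian UTF-16 (no byte order mark) and returns hex. */
    static String utf16Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_16BE)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Han, number 0x6C49: stored unchanged as one 16-bit value.
        System.out.println(utf16Hex("\u6C49"));  // 6C49

        // U+20000 lies beyond the first 65,536 numbers, so it is transformed.
        String supplementary = new String(Character.toChars(0x20000));
        System.out.println(utf16Hex(supplementary));  // D840DC00

        // A Java String is a sequence of 16-bit units: two units, one character.
        System.out.println(supplementary.length());                                  // 2
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
    }
}
```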

The surrogate pair mechanism is not commonly used at present, and some UTF-16 implementations do not even support it, so I will not discuss it in detail here. The basic idea is to use two 16-bit code units to represent one character (note that this is done only for characters whose code points are 65,536 or above). Unicode's stubborn attachment to the number 16 has both historical and practical reasons.
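For readers who want to see the basic idea in concrete terms, here is a small sketch of the standard UTF-16 surrogate arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The class name `SurrogateDemo` and the method `toSurrogatePair` are hypothetical; the JDK's own `Character.toChars` is used only to cross-check the result.

```java
import java.util.Arrays;

class SurrogateDemo {
    /** Splits a code point at or above 0x10000 into two 16-bit code units. */
    static char[] toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x20000);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);  // D840 DC00

        // Cross-check against the JDK's own conversion.
        System.out.println(Arrays.equals(pair, Character.toChars(0x20000)));  // true
    }
}
```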

Then, of course, there is the most heavy-handed encoding of all, UTF-32. It transforms no Unicode character at all and stores the number directly (the "change nothing" approach, so to speak). The trouble is that this scheme wastes too much storage space: even an English letter that fits in 1 byte must take 4 bytes. So although it is easy to use (no conversion is needed), it is not popular.
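The storage cost can be compared with a short sketch that counts the bytes each encoding produces. It assumes the running JDK provides a charset named "UTF-32BE" via `Charset.forName` (common JDKs do, but a dedicated `StandardCharsets` constant for it exists only in recent Java versions); the class name `EncodingSizeDemo` and the helper `byteLength` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {
    /** Returns how many bytes s occupies when encoded with the given charset. */
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: the JDK in use ships a UTF-32BE charset.
        Charset utf32 = Charset.forName("UTF-32BE");

        String[] samples = { "A", "\u6C49" };  // letter A, and Han (U+6C49)
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d bytes%n",
                    s.codePointAt(0),
                    byteLength(s, StandardCharsets.UTF_8),
                    byteLength(s, StandardCharsets.UTF_16BE),
                    byteLength(s, utf32));
        }
        // U+0041: UTF-8=1, UTF-16=2, UTF-32=4 bytes
        // U+6C49: UTF-8=3, UTF-16=2, UTF-32=4 bytes
    }
}
```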

Back when Unicode and UCS had not yet merged, UCS needed character set encodings of its own too; it could not get by without them. UCS-2 and UCS-4 filled that role. UCS-4 and UTF-32 are the same in everything but name. UCS-2 and UTF-16 treat the first 65,536 characters identically; the only difference is that UCS-2 does not support the surrogate pair mechanism, so UCS-2 can encode only the first 65,536 characters and has no way to represent the rest. Today, though, UCS-2 and UCS-4 are terms that only computer historians use, so let them rest in the archives.

In the next section, we'll talk about GB2312 and GBK, the encodings used for Chinese.
