In detail: Unicode, UTF-8, UTF-16, UTF-32, UCS-2, UCS-4

Source: Internet
Author: User

1. Unicode and ISO 10646

Many countries and regions devised their own encodings for their own scripts, and these encodings are incompatible with one another: the same encoded value can represent different symbols in different encodings. For example, the bytes of a Korean word encoded in EUC-KR are exactly the bytes of an unrelated, meaningless string of Chinese characters in GBK. As a result, the same document copied to a machine set up for a different language may turn into garbled text. So people thought: why not define one large character set that holds every character in the world and encode it uniformly, so that each character has its own distinct value? Then garbled text would disappear.

If "every country encodes its own language independently" is a hundred competing schools, then "one unified character encoding for the whole world" is the attempt to unite the martial-arts world. So who wanted to be its leader? Two organizations set out early to do this:
(1) The International Organization for Standardization (ISO), which set up the ISO/IEC JTC1/SC2/WG2 working group in 1984 to develop a "Universal Character Set" (UCS) and eventually produced the ISO 10646 standard.
(2) The Unicode Consortium, formed in 1988 by Xerox, Apple and other software vendors, which developed The Unicode Standard (the prefix "Uni" is well chosen: Unique, Universal, and Uniform).

Around 1991, the participants in the two projects realized that the world did not need two incompatible character sets, so they began merging their work and cooperating on a single code table. Starting with Unicode 2.0, Unicode uses the same character repertoire and code points as ISO 10646-1, and ISO has promised that ISO 10646 will never assign values beyond U+10FFFF, so that the two stay consistent. The two projects still exist independently and publish their own standards. However, because the name Unicode is easier to remember, it is the more widely used of the two.

Unicode code points are divided into 17 planes, each containing 2^16 (65,536) code points. The code points of the 17 planes can be written as U+xx0000 to U+xxFFFF, where xx is a hexadecimal value from 00 to 10 (0 to 16 in decimal), which gives 17 planes in total.
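The plane layout can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function names `plane_of` and `plane_range` are made up for this example and are not part of any standard library.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that a Unicode code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    # Each plane holds 2**16 code points, so the plane number is
    # whatever remains above the low 16 bits.
    return code_point >> 16


def plane_range(plane: int) -> tuple[int, int]:
    """Return the first and last code point of the given plane."""
    if not 0 <= plane <= 0x10:
        raise ValueError("plane must be in the range 0..16")
    return plane << 16, (plane << 16) | 0xFFFF


print(plane_of(0x6C49))                      # 0
print(plane_of(0x2AEAB))                     # 2
print([hex(v) for v in plane_range(0x10)])   # ['0x100000', '0x10ffff']
```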

2. UTF-32 and UCS-4

Before Unicode merged with ISO 10646, the ISO 10646 standard defined a 31-bit encoding form (UCS-4) for the Universal Character Set (UCS). Every character is encoded in a fixed 4 bytes, and the code space is 0x00000000 to 0x7FFFFFFF (room for more than 2 billion characters).

UCS-4 has a code space of more than 2 billion values, but the range actually in use does not exceed 0x10FFFF, and for compatibility with the Unicode standard ISO has promised not to assign any UCS-4 value above 0x10FFFF. This led to the UTF-32 encoding: its encoded values are identical to those of UCS-4, except that its code space is restricted to 0 to 0x10FFFF. UTF-32 can therefore be described as a subset of UCS-4.
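As a rough illustration of "the encoded value is the code point itself", here is a minimal Python sketch. The helper name `utf32_be` is hypothetical; it only applies the 0 to 0x10FFFF limit described above and does none of the extra validation a real codec performs.

```python
import struct


def utf32_be(code_point: int) -> bytes:
    """Encode one code point as 4 big-endian bytes (UTF-32BE), as a sketch."""
    if not 0 <= code_point <= 0x10FFFF:
        # UCS-4 would allow values up to 0x7FFFFFFF; UTF-32 does not.
        raise ValueError("code point outside the UTF-32 range")
    # ">I" packs an unsigned 32-bit integer in big-endian byte order.
    return struct.pack(">I", code_point)


print(utf32_be(0x6C49).hex())                       # 00006c49
print(utf32_be(0x6C49) == "汉".encode("utf-32-be"))  # True
```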

3. UTF-16 and UCS-2

Besides UCS-4, the ISO 10646 standard also defines a 16-bit encoding form (UCS-2) for the Universal Character Set (UCS). Every character is encoded in a fixed 2 bytes, giving 65,536 code positions (enough for roughly the 63K most commonly used characters in the world; for compatibility with Unicode, the code points 0xD800 to 0xDFFF are not used). For example, the UCS-2 code of "汉" (Han) is 6C49.

But two bytes are not enough to truly unify the field (a fixed-width 2-byte encoding cannot cover enough characters to be universal), so UTF-16 was born. Like UCS-2, it uses two bytes for the roughly 63K most commonly used characters, but it uses 4 bytes for less common ones. UTF-16 is therefore a variable-length encoding.

As mentioned earlier, Unicode code points are divided into 17 planes, each containing 2^16 (65,536) code points. The first plane is called the Basic Multilingual Plane (BMP), and the remaining planes are called supplementary planes. Within the BMP (0 to 0xFFFF), the code points from 0xD800 to 0xDFFF are reserved and never assigned to characters. UCS-2 can only encode characters in the BMP, and for those characters UTF-16 produces the same encoding as UCS-2 (both use the Unicode code point directly as the encoded value). For example, "汉" (Han) is 6C49 in Unicode and also 6C49 in UTF-16. In addition, UTF-16 uses the reserved 0xD800 to 0xDFFF range to encode supplementary-plane code points, so UTF-16 can encode every character in Unicode.
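A quick way to see this is to ask Python's built-in `utf-16-be` codec. The wrapper `utf16_be_hex` below is just a convenience name invented for this example.

```python
def utf16_be_hex(ch: str) -> str:
    """Return the big-endian UTF-16 encoding of a string as hex digits."""
    return ch.encode("utf-16-be").hex()


# A BMP character: the code unit is the code point itself.
print(hex(ord("汉")))              # 0x6c49
print(utf16_be_hex("汉"))          # 6c49

# A supplementary-plane character needs two 16-bit units taken from the
# reserved 0xD800-0xDFFF range, so it occupies 4 bytes.
print(utf16_be_hex(chr(0x10000)))  # d800dc00
```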

How does UTF-16 encode supplementary-plane characters?

Unicode code points range from 0 to 0x10FFFF. Outside the Basic Multilingual Plane there are 0x100000 code points left, all with values of 0x10000 or greater. For a supplementary-plane character, subtracting 0x10000 from its code point yields a value in the range 0 to 0xFFFFF, which fits in 20 bits. Adding 0xD800 to the high 10 bits of that value gives the first two bytes of the four-byte UTF-16 encoding, and adding 0xDC00 to the low 10 bits gives the last two bytes. For example:
Take the character U+2AEAB (how is this one even pronounced? ^_^).
Its Unicode code point is 0x2AEAB; subtracting 0x10000 gives 0x1AEAB (binary 0001101011 1010101011, shown split into its high and low 10 bits). The high 10 bits (0x06B) plus 0xD800 give 0xD86B, and the low 10 bits (0x2AB) plus 0xDC00 give 0xDEAB. The UTF-16 encoding of the character is therefore D86B DEAB (this is big-endian; in little-endian byte order it is 6B D8 AB DE).
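The subtraction and bit-splitting above can be written out as a short Python sketch. The names `to_surrogates` and `from_surrogates` are chosen for this example only; real programs would normally just use the built-in `utf-16` codecs.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into two 16-bit UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000       # 20-bit value in 0..0xFFFFF
    high = 0xD800 + (offset >> 10)      # high 10 bits -> first unit
    low = 0xDC00 + (offset & 0x3FF)     # low 10 bits  -> second unit
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Reverse the split: rebuild the code point from the two units."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x2AEAB)
print(f"{high:04X} {low:04X}")          # D86B DEAB

# Byte order only affects how each 16-bit unit is laid out in memory.
big = high.to_bytes(2, "big") + low.to_bytes(2, "big")
little = high.to_bytes(2, "little") + low.to_bytes(2, "little")
print(big.hex(), little.hex())          # d86bdeab 6bd8abde
```

The result agrees with Python's own codec: `chr(0x2AEAB).encode("utf-16-be")` yields the same four bytes.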

4. UTF-8

As the preceding sections show, whether you use UTF-16/32 or UCS-2/4, every character takes multiple bytes to encode, which wastes bandwidth for text that is mostly English (especially in the days when Internet connections were slow). That is why UTF-8 was created. In UTF-8, an ASCII character keeps its ASCII value and takes a single byte; all other characters take 2, 3 or 4 bytes.

Coding rules for UTF-8:

(1) ASCII characters use a single-byte encoding whose value equals the ASCII value (see: U0000.pdf). ASCII values range from 0 to 0x7F, so the first bit of every such byte is 0, which distinguishes single-byte encodings from multibyte ones.

(2) All other characters are encoded in multiple bytes (say n bytes). A multibyte encoding must satisfy the following: the first n bits of the first byte are 1 and bit n+1 is 0; the first two bits of each of the following n-1 bytes are 10; and all remaining bits of the n bytes store the Unicode code point value.

Bytes   Unicode range      UTF-8 encoding
1       000000-00007F      0xxxxxxx
2       000080-0007FF      110xxxxx 10xxxxxx
3       000800-00FFFF      1110xxxx 10xxxxxx 10xxxxxx
4       010000-10FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
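The table translates almost line by line into code. The following Python sketch (the function name `utf8_encode` is invented here) builds the bytes by hand. Note that, as a simplification, it does not reject the reserved range 0xD800 to 0xDFFF, which a conforming encoder would refuse.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 by following the table row by row."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point <= 0x7F:                   # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                  # 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:                 # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    return bytes([                           # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


print(utf8_encode(0x41).hex())      # 41
print(utf8_encode(0x6C49).hex())    # e6b189
print(utf8_encode(0x2AEAB).hex())   # f0aabaab
```

For code points outside the reserved range, the output matches `chr(code_point).encode("utf-8")`.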

5. Summary

(1) To put it simply: Unicode is a character set, not an encoding; UTF-8, UTF-16 and the others are encodings of the Unicode character set.

(2) Comparison of UTF-8, UTF-16, UTF-32, UCS-2 and UCS-4:

Property                    UTF-8      UTF-16     UTF-32     UCS-2     UCS-4
Code space                  0-10FFFF   0-10FFFF   0-10FFFF   0-FFFF    0-7FFFFFFF
Minimum bytes per character 1          2          4          2         4
Maximum bytes per character 4          4          4          2         4
Depends on byte order       No         Yes        Yes        Yes       Yes
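The byte counts and the byte-order row can be checked with Python's built-in codecs. This is a small sketch; `encoded_sizes` is a name made up for this example, and it covers only the three UTF forms because Python ships no separate UCS-2 or UCS-4 codec.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Return how many bytes one character takes in each UTF encoding."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-be"), ("UTF-32", "utf-32-be"))
    return {name: len(ch.encode(codec)) for name, codec in codecs}


for ch in ("A", "汉", chr(0x2AEAB)):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# UTF-16 and UTF-32 come in big-endian and little-endian flavors;
# UTF-8 is a plain byte sequence with no byte-order variants.
print("汉".encode("utf-16-be").hex(), "汉".encode("utf-16-le").hex())  # 6c49 496c
```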

References:

    • Wikipedia: Unicode (Chinese version)
    • Wikipedia: Universal Coded Character Set (Chinese version)
    • Wikipedia: UTF-8 (Chinese version)
    • Wikipedia: UTF-16 (Chinese version)
    • Wikipedia: UTF-32 (Chinese version)
    • Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM
    • Unicode 8.0 Character Code Charts
    • CJK Unified Ideographs (Han)
    • Ruan Yifeng: Notes on character encoding: ASCII, Unicode and UTF-8
    • UCS vs UTF-8 as Internal String Encoding
