Unicode encoding and implementation

Source: Internet
Author: User

Generally speaking, the Unicode encoding system can be divided into two levels: the encoding method and the implementation method.
 

 

1. Encoding Method
 

 

Unicode is a character encoding scheme developed by international organizations to accommodate all of the world's scripts and symbols. Unicode maps these characters to the numbers 0-0x10FFFF, so it can hold at most 1114112 characters, or in other words it has 1114112 code points. A code point is a number that can be assigned to a character. UTF-8, UTF-16, and UTF-32 are encoding schemes that convert these numbers into program data.
 

 

The Unicode character set is also referred to as UCS (Universal Character Set, the name used by ISO/IEC 10646). Early Unicode standards spoke of UCS-2 and UCS-4. UCS-2 encodes each character in two bytes, and UCS-4 encodes each character in 4 bytes. UCS-4 is divided into 2^7 = 128 groups according to the highest byte, whose highest bit is always 0. Each group is further divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row has 256 cells. Plane 0 of group 0 is called the BMP (Basic Multilingual Plane). Removing the two leading zero bytes from the BMP of UCS-4 gives UCS-2.
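
The group/plane/row/cell division described above is just a byte-by-byte split of a 31-bit value. The following is a minimal sketch in Python; the function name `split_ucs4` is hypothetical and chosen only for illustration.

```python
def split_ucs4(code):
    """Split a UCS-4 value into (group, plane, row, cell), one byte each."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("the highest bit of a UCS-4 value must be 0")
    group = (code >> 24) & 0x7F   # highest byte, top bit is 0: 128 groups
    plane = (code >> 16) & 0xFF   # second-highest byte: 256 planes per group
    row = (code >> 8) & 0xFF      # third byte: 256 rows per plane
    cell = code & 0xFF            # lowest byte: 256 cells per row
    return group, plane, row, cell


print(split_ucs4(0x6C49))    # (0, 0, 108, 73): group 0, plane 0, i.e. the BMP
print(split_ucs4(0x20C30))   # (0, 2, 12, 48): group 0, plane 2
```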
 

 

Each plane has 2^16 = 65536 code points. Unicode uses 17 planes, for a total of 17*65536 = 1114112 code points. In Unicode 5.0.0, only 238605 code points are defined, distributed over plane 0, plane 1, plane 2, plane 14, plane 15, and plane 16. Of these, plane 15 and plane 16 each define only a private use area of 65534 code points, namely 0xF0000-0xFFFFD and 0x100000-0x10FFFD. A private use area, abbreviated PUA, is a region reserved for custom characters.
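
The plane arithmetic above can be checked with a few lines of Python. This is an illustrative sketch; `PLANE_SIZE`, `PLANE_COUNT`, and `plane_of` are hypothetical names, not part of any standard API.

```python
PLANE_SIZE = 1 << 16      # 2^16 = 65536 code points per plane
PLANE_COUNT = 17          # planes 0 through 16
MAX_CODE_POINT = PLANE_COUNT * PLANE_SIZE - 1


def plane_of(code_point):
    """Return the plane number (0-16) that a code point belongs to."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point out of range")
    return code_point // PLANE_SIZE


print(PLANE_COUNT * PLANE_SIZE)                 # 1114112
print(hex(MAX_CODE_POINT))                      # 0x10ffff
print(plane_of(0x6C49))                         # 0
print(plane_of(0xF0000), plane_of(0x10FFFD))    # 15 16
print(0xFFFFD - 0xF0000 + 1)                    # 65534 code points in the plane 15 PUA
```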
 

 

Plane 0 also has a private use area: 0xE000-0xF8FF, with 6400 code points. The range 0xD800-0xDFFF of plane 0, 2048 code points in total, is a special area called the surrogate area. Its purpose is to represent a character outside the BMP with two UTF-16 code units. It is described in the UTF-16 section below.
 

 

As mentioned above, Unicode 5.0.0 defines 238605 code points, and 238605-65534*2-6400-2048 = 99089. The remaining 99089 defined code points are distributed over plane 0, plane 1, plane 2, and plane 14. They correspond to the 99089 characters defined by Unicode 5.0.0, including 71226 Chinese characters. Planes 0, 1, 2, and 14 contain 52080, 3419, 43253, and 337 characters respectively. The 43253 characters in plane 2 are all Chinese characters, and plane 0 defines 27973 Chinese characters.
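
The counts above can be cross-checked mechanically. The short Python sketch below only redoes the arithmetic; all figures are the Unicode 5.0.0 numbers quoted in the text, and the variable names are illustrative.

```python
defined_total = 238605                  # defined code points in Unicode 5.0.0 (from the text)
supplementary_pua = 2 * 65534           # private use areas on planes 15 and 16
bmp_pua = 0xF8FF - 0xE000 + 1           # private use area on plane 0
surrogates = 0xDFFF - 0xD800 + 1        # surrogate area on plane 0

characters = defined_total - supplementary_pua - bmp_pua - surrogates
print(bmp_pua, surrogates, characters)  # 6400 2048 99089

# Per-plane character counts quoted in the text
per_plane = {0: 52080, 1: 3419, 2: 43253, 14: 337}
print(sum(per_plane.values()))          # 99089

# Chinese characters: all of plane 2 plus those defined on plane 0
print(43253 + 27973)                    # 71226
```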
 

 

2. Implementation Method
 

 

In Unicode, the number corresponding to the Chinese character "字" is 23383. There are several ways to express the number 23383 as program data, including UTF-8, UTF-16, and UTF-32. UTF is the abbreviation of "UCS Transformation Format", that is, a format for converting the numbers defined by Unicode into program data. For example, the numbers corresponding to the characters "汉字" are 0x6C49 and 0x5B57, and the encoded program data is:
 

 

BYTE data_utf8[] = {0xE6, 0xB1, 0x89, 0xE5, 0xAD, 0x97}; // UTF-8 encoding
WORD data_utf16[] = {0x6C49, 0x5B57}; // UTF-16 encoding
DWORD data_utf32[] = {0x6C49, 0x5B57}; // UTF-32 encoding
 

 

Here, BYTE, WORD, and DWORD represent unsigned 8-bit, unsigned 16-bit, and unsigned 32-bit integers respectively. UTF-8, UTF-16, and UTF-32 use BYTE, WORD, and DWORD as their respective code units. The UTF-8 encoding of "汉字" requires 6 bytes. The UTF-16 encoding of "汉字" requires two WORDs, 4 bytes in size. The UTF-32 encoding of "汉字" requires two DWORDs, 8 bytes in size. Depending on the byte order, UTF-16 can be implemented as UTF-16LE or UTF-16BE, and UTF-32 can be implemented as UTF-32LE or UTF-32BE. The following sections describe UTF-8, UTF-16, UTF-32, byte order, and the BOM.
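
The sizes above can be reproduced with Python's built-in codecs. This is only an illustrative check; the big-endian codec names are used so that no byte order mark is prepended and the bytes read in the same order as the code units.

```python
text = "\u6c49\u5b57"   # the two characters U+6C49 and U+5B57

encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, data in encoded.items():
    print(name, len(data), data.hex(" "))

# utf-8 6 e6 b1 89 e5 ad 97
# utf-16-be 4 6c 49 5b 57
# utf-32-be 8 00 00 6c 49 00 00 5b 57
```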
 

 

  UTF-8
 

 

UTF-8 encodes Unicode using the byte as its code unit. The mapping from Unicode to UTF-8 is as follows:
 

 

Unicode code point (hexadecimal) ║ UTF-8 byte stream (binary)
000000-00007F ║ 0xxxxxxx
000080-0007FF ║ 110xxxxx 10xxxxxx
000800-00FFFF ║ 1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF ║ 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 

 

UTF-8 is characterized by using encodings of different lengths for characters in different ranges. For characters in the range 0x00-0x7F, the UTF-8 encoding is exactly the same as the ASCII encoding. The maximum length of a UTF-8 encoding is 4 bytes. From the table above, the 4-byte template has 21 x's, so it can hold a 21-bit binary number. The largest Unicode code point is 0x10FFFF, which needs only 21 bits.
 

 

Example 1: The Unicode code point of the Chinese character "汉" is 0x6C49. 0x6C49 lies in the range 0x0800-0xFFFF, so the 3-byte template is used: 1110xxxx 10xxxxxx 10xxxxxx. Writing 0x6C49 in binary gives 0110 1100 0100 1001. Substituting this bit stream for the x's in the template, in order, yields 11100110 10110001 10001001, that is, E6 B1 89.
 

 

Example 2: The Unicode code point 0x20C30 lies in the range 0x10000-0x10FFFF, so the 4-byte template is used: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Writing 0x20C30 as a 21-bit binary number (padded with leading zeros if it has fewer than 21 bits) gives 0 0010 0000 1100 0011 0000. Substituting this bit stream for the x's in the template, in order, yields 11110000 10100000 10110000 10110000, that is, F0 A0 B0 B0.
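
The template substitution used in both examples can be written out with shifts and masks. The following Python sketch is a hand-rolled illustration of the table above, not a replacement for the built-in codec; `utf8_encode` is a hypothetical name, and for brevity it does not reject the surrogate range 0xD800-0xDFFF.

```python
def utf8_encode(cp):
    """Encode one Unicode code point into UTF-8 bytes using the templates."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x6C49).hex(" "))    # e6 b1 89
print(utf8_encode(0x20C30).hex(" "))   # f0 a0 b0 b0
```

The results agree with Python's own codec, for example `chr(0x20C30).encode("utf-8")`.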
 

 

  UTF-16
 

 

UTF-16 uses the 16-bit unsigned integer as its code unit. Let U denote the Unicode code point. The encoding rules are as follows:
 

 

If U < 0x10000, the UTF-16 encoding of U is the 16-bit unsigned integer corresponding to U (for brevity, a 16-bit unsigned integer is written as WORD below).
 

 

If U ≥ 0x10000, first compute U' = U - 0x10000, then write U' in binary as yyyy yyyy yyxx xxxx xxxx. The UTF-16 encoding of U (in binary) is then: 110110yyyyyyyyyy 110111xxxxxxxxxx.
 

 

Why can U' be written as 20 binary digits? The largest Unicode code point is 0x10FFFF. After subtracting 0x10000, the maximum value of U' is 0xFFFFF, so it can always be expressed in 20 binary digits. For example, take the Unicode code point 0x20C30. Subtracting 0x10000 gives 0x10C30, which in binary is 0001 0000 1100 0011 0000. Replace the y's in the template with the first 10 bits in order, and the x's with the last 10 bits in order. The result is 1101100001000011 1101110000110000, that is, 0xD843 0xDC30.
 

 

According to the above rules, the UTF-16 encoding of a Unicode code point in 0x10000-0x10FFFF consists of two WORDs: the high 6 bits of the first WORD are 110110, and the high 6 bits of the second WORD are 110111. It follows that the first WORD ranges (in binary) from 11011000 00000000 to 11011011 11111111, that is, 0xD800-0xDBFF. The second WORD ranges (in binary) from 11011100 00000000 to 11011111 11111111, that is, 0xDC00-0xDFFF.
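
The two UTF-16 rules and the resulting WORD ranges can be sketched in Python as follows. The function name `utf16_encode` is hypothetical; it returns a list of 16-bit code units rather than bytes, so byte order does not come into play yet.

```python
def utf16_encode(cp):
    """Encode one Unicode code point into a list of 16-bit UTF-16 code units."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x10000:
        return [cp]                      # one WORD, the value itself
    u = cp - 0x10000                     # 20-bit value yyyyyyyyyy xxxxxxxxxx
    high = 0xD800 | (u >> 10)            # 110110 followed by the 10 y bits
    low = 0xDC00 | (u & 0x3FF)           # 110111 followed by the 10 x bits
    return [high, low]


print([hex(w) for w in utf16_encode(0x6C49)])     # ['0x6c49']
print([hex(w) for w in utf16_encode(0x20C30)])    # ['0xd843', '0xdc30']
print([hex(w) for w in utf16_encode(0x10000)])    # ['0xd800', '0xdc00']
print([hex(w) for w in utf16_encode(0x10FFFF)])   # ['0xdbff', '0xdfff']
```

The last two calls show the extremes of the two-WORD range, matching 0xD800-0xDBFF for the first WORD and 0xDC00-0xDFFF for the second.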
 

 

To distinguish one-WORD UTF-16 encodings from two-WORD UTF-16 encodings, the designers of Unicode reserved 0xD800-0xDFFF, known as the surrogate area:
 

 

D800-DB7F ║ High Surrogates
DB80-DBFF ║ High Private Use Surrogates
DC00-DFFF ║ Low Surrogates
 

 

A high surrogate is a code point in this range that serves as the first WORD of a two-WORD UTF-16 encoding. A low surrogate is a code point in this range that serves as the second WORD of a two-WORD UTF-16 encoding. So what is a high private use surrogate? Let us answer this question and, along the way, see how the Unicode code point is recovered from the UTF-16 encoding.
 

 

If the first WORD of a character's UTF-16 encoding lies between 0xDB80 and 0xDBFF, in what range does its Unicode code point lie? We know that the second WORD ranges over 0xDC00-0xDFFF, so the UTF-16 encoding of such a character ranges from 0xDB80 0xDC00 to 0xDBFF 0xDFFF. Written in binary, this range is:
 

 

1101101110000000 1101110000000000 - 1101101111111111 1101111111111111
 

 

Take the last 10 bits of the high WORD and of the low WORD and put them together:
 

 

1110 0000 0000 0000 0000 - 1111 1111 1111 1111 1111
 

 

That is, 0xE0000-0xFFFFF. Reversing the encoding step by adding 0x10000 gives 0xF0000-0x10FFFF. This is the range of Unicode code points whose first UTF-16 WORD lies between 0xDB80 and 0xDBFF, that is, plane 15 and plane 16. Because the Unicode standard uses both plane 15 and plane 16 as private use areas, the reserved code points 0xDB80-0xDBFF are called high private use surrogates.
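
The derivation above is the UTF-16 decoding step run on the boundary values. A minimal Python sketch, with the hypothetical name `utf16_decode_pair`:

```python
def utf16_decode_pair(high, low):
    """Recover the Unicode code point from a UTF-16 surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first WORD is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second WORD is not a low surrogate")
    # Join the low 10 bits of each WORD, then undo the subtraction of 0x10000.
    return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000


print(hex(utf16_decode_pair(0xD843, 0xDC30)))   # 0x20c30
print(hex(utf16_decode_pair(0xDB80, 0xDC00)))   # 0xf0000, start of plane 15
print(hex(utf16_decode_pair(0xDBFF, 0xDFFF)))   # 0x10ffff, end of plane 16
```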
 

 

  UTF-32
 

 

UTF-32 uses the 32-bit unsigned integer as its code unit. The UTF-32 encoding of a Unicode code point is simply the corresponding 32-bit unsigned integer.
 

 

  Byte order
 

 

Depending on the byte order, UTF-16 can be implemented as UTF-16LE or UTF-16BE, and UTF-32 can be implemented as UTF-32LE or UTF-32BE. For example:
 

 

Unicode code point ║ UTF-16LE ║ UTF-16BE ║ UTF-32LE ║ UTF-32BE
0x006C49 ║ 49 6C ║ 6C 49 ║ 49 6C 00 00 ║ 00 00 6C 49
0x020C30 ║ 43 D8 30 DC ║ D8 43 DC 30 ║ 30 0C 02 00 ║ 00 02 0C 30
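
The byte sequences in the table can be reproduced by serializing the code units with an explicit byte order. The sketch below uses Python's standard `struct` module; the helper name `serialize` is hypothetical, and the UTF-16 code units for 0x20C30 (0xD843 0xDC30) are taken from the UTF-16 section.

```python
import struct


def serialize(units, unit_format, little_endian):
    """Pack code units into bytes; unit_format is 'H' (16-bit) or 'I' (32-bit)."""
    prefix = "<" if little_endian else ">"
    return struct.pack(prefix + unit_format * len(units), *units)


print(serialize([0x6C49], "H", True).hex(" "))             # 49 6c
print(serialize([0x6C49], "H", False).hex(" "))            # 6c 49
print(serialize([0xD843, 0xDC30], "H", True).hex(" "))     # 43 d8 30 dc
print(serialize([0xD843, 0xDC30], "H", False).hex(" "))    # d8 43 dc 30
print(serialize([0x20C30], "I", True).hex(" "))            # 30 0c 02 00
print(serialize([0x20C30], "I", False).hex(" "))           # 00 02 0c 30
```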
 

 

So how can we determine the byte order of a byte stream? Unicode recommends using a BOM (Byte Order Mark) to distinguish the byte order. That is, before transmitting a byte stream, first transmit the character "ZERO WIDTH NO-BREAK SPACE" as the BOM. The code point of this character is FEFF, while its byte-reversed forms FFFE (UTF-16) and FFFE0000 (UTF-32) are not defined as characters in Unicode and should not appear in actual transmission. The following table lists the BOM for each UTF encoding:
 

 

UTF encoding ║ Byte Order Mark
UTF-8 ║ EF BB BF
UTF-16LE ║ FF FE
UTF-16BE ║ FE FF
UTF-32LE ║ FF FE 00 00
UTF-32BE ║ 00 00 FE FF
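
A BOM table like the one above can be turned into a simple detector. The following Python sketch is illustrative only; `BOMS` and `detect_bom` are hypothetical names. The 4-byte marks are checked first because the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark. One consequence, assumed acceptable here, is that UTF-16LE data whose first character after the BOM is U+0000 would be reported as UTF-32LE.

```python
# Longer marks first, so FF FE 00 00 is not mistaken for FF FE.
BOMS = [
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
]


def detect_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(detect_bom(b"\xef\xbb\xbf\xe6\xb1\x89"))            # ('UTF-8', 3)
print(detect_bom(b"\xff\xfe\x49\x6c"))                    # ('UTF-16LE', 2)
print(detect_bom(b"\xff\xfe\x00\x00\x49\x6c\x00\x00"))    # ('UTF-32LE', 4)
print(detect_bom(b"plain ASCII"))                         # (None, 0)
```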
