Extended reading: the UTF-16 encoding format (how Unicode relates to UTF-16)

Source: Internet
Author: User

Transferred from: http://www.cnblogs.com/dragon2012/p/5020259.html

UTF-16 is an encoding form of the Unicode character set: it converts Unicode code points into a sequence of 16-bit code units for data storage or transmission. The UTF-16 encoding rules are as follows:


2.2.1 Code points from U+D800 to U+DFFF (surrogate area)

The Unicode character set has a code point range of 0 to 0x10FFFF, and code points in the supplementary planes (greater than or equal to 0x10000) cannot be represented in 2 bytes. The Unicode Standard therefore stipulates that, within the Basic Multilingual Plane (BMP), the values U+D800..U+DFFF do not correspond to any character; this range is the surrogate area. UTF-16 uses these reserved values, 0xD800-0xDFFF, to encode the code points of supplementary-plane characters.

In the era of UCS-2, however, values within U+D800..U+DFFF were occupied and used to map certain characters. As long as such values do not form a surrogate pair, many UTF-16 codecs can still recognize these non-compliant character mappings and convert them into compliant code units. According to the Unicode Standard, though, such a code unit sequence should be treated as an encoding error.
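As a small illustration of the reserved range, the sketch below tests whether a code point falls in U+D800..U+DFFF and checks that Python's strict UTF-16 encoder refuses a lone value from that range. The helper names `is_surrogate` and `strict_utf16_rejects` are hypothetical, chosen for this example; other codecs may be more lenient, as the paragraph above notes.

```python
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_surrogate(code_point: int) -> bool:
    """Return True if code_point lies in the reserved range U+D800..U+DFFF."""
    return SURROGATE_FIRST <= code_point <= SURROGATE_LAST


def strict_utf16_rejects(code_point: int) -> bool:
    """Return True if Python's strict UTF-16 encoder refuses this code point."""
    try:
        chr(code_point).encode("utf-16-le")
    except UnicodeEncodeError:
        # Lone surrogates are not valid scalar values, so strict mode fails.
        return True
    return False


if __name__ == "__main__":
    for cp in (0x0041, 0xD7FF, 0xD800, 0xDFFF, 0xE000):
        print(f"U+{cp:04X}", is_surrogate(cp), strict_utf16_rejects(cp))
```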


2.2.2 Code points from U+0000 to U+D7FF and from U+E000 to U+FFFF

The first Unicode plane (the BMP), covering code points U+0000 to U+FFFF with the surrogate area excluded, contains the most commonly used characters. In this range, both UTF-16 and UCS-2 encode each code point as a single 16-bit code unit whose value equals the code point. These BMP code points are the only ones that can be represented in UCS-2.
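A minimal sketch of this rule, using Python's built-in `utf-16-be` codec: for a BMP character, the single 16-bit code unit is numerically equal to the code point. The function name `bmp_code_unit` is an assumption made for this example.

```python
def bmp_code_unit(ch: str) -> int:
    """Return the single UTF-16 code unit for one BMP character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    if len(data) != 2:
        raise ValueError("character is outside the BMP")
    return int.from_bytes(data, "big")


if __name__ == "__main__":
    for ch in ("A", "\u4e2d"):
        # The code unit and the code point print as the same value.
        print(f"U+{ord(ch):04X} -> 0x{bmp_code_unit(ch):04X}")
```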


2.2.3 Code points from U+10000 to U+10FFFF

Code points in the supplementary planes, greater than or equal to 0x10000, are encoded in UTF-16 as a pair of 16-bit code units (that is, 32 bits or 4 bytes) called a surrogate pair. The method is:

• Subtract 0x10000 from the code point. The result is a 20-bit value in the range 0..0xFFFFF (the largest Unicode code point is 0x10FFFF, and 0x10FFFF minus 0x10000 is 0xFFFFF, so 20 bits always suffice). Written in binary: yyyy yyyy yyxx xxxx xxxx.

• Add 0xD800 to the high 10 bits (range 0..0x3FF) to get the first code unit, the high surrogate, whose range is 0xD800..0xDBFF. Because high surrogates have numerically smaller values than low surrogates, the Unicode Standard now calls them lead surrogates to avoid confusion.

• Add 0xDC00 to the low 10 bits (range also 0..0x3FF) to get the second code unit, the low surrogate, whose range is 0xDC00..0xDFFF. Because low surrogates have numerically larger values than high surrogates, the Unicode Standard now calls them trail surrogates to avoid confusion.

• The final UTF-16 encoding (4 bytes), in binary, is: 110110yyyyyyyyyy 110111xxxxxxxxxx.
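The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation; the function name `to_surrogate_pair` is hypothetical.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000        # 20-bit value, 0..0xFFFFF
    high = 0xD800 + (offset >> 10)       # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(f"{high:04X} {low:04X}")  # D83D DE00
```

For U+1F600 the offset is 0xF600, whose high 10 bits are 0x3D and low 10 bits are 0x200, giving the pair D83D DE00. Python's own `utf-16-be` codec produces the same four bytes for that character.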


By the rules above, a Unicode code point in 0x10000-0x10FFFF has a UTF-16 encoding of two 16-bit words: the high 6 bits of the first word are 110110, and the high 6 bits of the second word are 110111. It follows that the first word ranges (in binary) from 11011000 00000000 to 11011011 11111111, that is, 0xD800-0xDBFF, and the second word ranges from 11011100 00000000 to 11011111 11111111, that is, 0xDC00-0xDFFF. The surrogate area U+D800 to U+DFFF exists precisely to distinguish one-word (2-byte) UTF-16 encodings from two-word ones.
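Going the other way, a decoder only needs these ranges to tell one-word encodings from two-word ones. The sketch below assumes the input is a list of integer code units; the function name `decode_utf16_units` is hypothetical, and unpaired surrogates are treated as errors.

```python
from typing import List


def decode_utf16_units(units: List[int]) -> List[int]:
    """Convert a list of 16-bit code units into a list of code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            high_bits = unit - 0xD800
            low_bits = units[i + 1] - 0xDC00
            code_points.append(0x10000 + (high_bits << 10) + low_bits)
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            # One-word encoding: the code unit is the code point.
            code_points.append(unit)
            i += 1
    return code_points


if __name__ == "__main__":
    print([hex(cp) for cp in decode_utf16_units([0x0041, 0xD83D, 0xDE00])])
```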

Because high surrogates, low surrogates, and valid BMP characters occupy three non-overlapping ranges, searching is simple: part of one character's encoding can never coincide with a different part of another character's encoding. This means that UTF-16 is self-synchronizing: examining a single code unit is enough to tell whether it begins a character, so the start of the next character can be found from any position. UTF-8 has a similar advantage, but many earlier encodings did not, and the text had to be parsed from the beginning to determine where each character's code units start and end.
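A short sketch of what self-synchronization allows in practice, assuming well-formed UTF-16 held as a list of integer code units. Both helper names, `char_start` and `count_characters`, are hypothetical.

```python
from typing import List


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def char_start(units: List[int], index: int) -> int:
    """Return the index where the character containing units[index] begins.

    Assumes well-formed input, so a low surrogate always follows its
    high surrogate and stepping back one unit is enough.
    """
    if is_low_surrogate(units[index]) and index > 0:
        return index - 1
    return index


def count_characters(units: List[int]) -> int:
    """Count code points: every unit that is not a low surrogate starts one."""
    return sum(1 for unit in units if not is_low_surrogate(unit))


if __name__ == "__main__":
    units = [0x0041, 0xD83D, 0xDE00, 0x0042]  # "A", U+1F600, "B"
    print(char_start(units, 2), count_characters(units))
```

No scan from the beginning of the text is needed: landing in the middle of a surrogate pair is detected from the code unit at that position alone.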

Because the most commonly used characters are all in the Basic Multilingual Plane, the code paths that handle surrogate pairs in many programs are often not thoroughly tested. This has led to long-standing bugs and potential security vulnerabilities, even in widely used and well-regarded application software.
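One hypothetical example of this class of bug is truncating text at a fixed number of code units, which can split a surrogate pair and leave an unpaired high surrogate behind. The sketch below is illustrative only and is not taken from any particular application; all function names are assumptions, and the input is assumed to be well-formed.

```python
from typing import List


def utf16_units(text: str) -> List[int]:
    """Encode text as a list of 16-bit UTF-16 code units."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def naive_truncate(units: List[int], max_units: int) -> List[int]:
    """Cut at a fixed code-unit count; may split a surrogate pair."""
    return units[:max_units]


def safe_truncate(units: List[int], max_units: int) -> List[int]:
    """Cut at a code-unit count, dropping a trailing unpaired high surrogate."""
    cut = units[:max_units]
    if cut and 0xD800 <= cut[-1] <= 0xDBFF:
        cut = cut[:-1]
    return cut


if __name__ == "__main__":
    units = utf16_units("a\U0001F600")  # one BMP character, one surrogate pair
    print([hex(u) for u in naive_truncate(units, 2)])
    print([hex(u) for u in safe_truncate(units, 2)])
```

A test suite that only uses BMP text would never exercise the difference between the two functions, which is how such defects can go unnoticed.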
