UTF-8 and UTF-16 formats

Source: Internet
Author: User

UTF-8 format:

 

Note: x represents a bit whose value is 0 or 1. The range column is in hexadecimal notation, and the encoding format column is in binary notation.

 

Range                    Encoding format

0x00000000-0x0000007f    0xxxxxxx

0x00000080-0x000007ff    110xxxxx, 10xxxxxx

0x00000800-0x0000ffff    1110xxxx, 10xxxxxx, 10xxxxxx

0x00010000-0x0010ffff    11110xxx, 10xxxxxx, 10xxxxxx, 10xxxxxx
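
The bit layouts in the table can be applied mechanically. The following is a minimal sketch, not a reference implementation; the function name utf8_encode is an assumption made for illustration. It treats the last row as starting at 0x00010000 and rejects the surrogate values 0xd800-0xdfff, which UTF-8 does not encode.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the x bits of the matching table row."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not encoded in UTF-8")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx, 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:
        # 1110xxxx, 10xxxxxx, 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 11110xxx, 10xxxxxx, 10xxxxxx, 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


print(utf8_encode(0x4E2D).hex())  # e4b8ad
```

The result can be compared against Python's built-in encoder, for example chr(0x4E2D).encode("utf-8").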

 

 

The UTF-16 format is as follows:

 

Range                    Encoding format

0x00000000-0x0000ffff    xxxxxxxx, xxxxxxxx

0x00010000-0x0010ffff    110110xx, xxxxxxxx, 110111xx, xxxxxxxx

 

For code points in 0x00010000-0x0010ffff, first subtract 0x00010000 from the original code point, which leaves a 20-bit value. 0xd800 and 0xdc00 serve as the surrogate bases: add the high 10 bits of that value to 0xd800 and the low 10 bits to 0xdc00 to obtain the high and low surrogates, and then write the two one after the other.
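
The calculation can be sketched as follows. The function name to_surrogate_pair is hypothetical, and the sketch assumes the usual first step of subtracting 0x10000 from the code point so that 20 bits remain to be split.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in 0x10000-0x10ffff into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xffff need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x64321)
print(hex(high), hex(low))  # 0xd950 0xdf21
```

For 0x64321 the offset is 0x54321, whose top 10 bits are 0x150 and bottom 10 bits are 0x321, giving 0xd950 and 0xdf21.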

 

To recognize 4-byte UTF-16 characters in a stream where most characters take two bytes, we stipulate that a two-byte value in the range 0xd800-0xdbff must be combined with the following two bytes (a value in the range 0xdc00-0xdfff) to form a single character. The 0xd800-0xdfff region of 2-byte UTF-16 is therefore reserved for surrogates and never encodes a character on its own, which is where the name surrogate comes from. The region is divided as follows:

 

0xd800-0xdb7f: high surrogates

0xdb80-0xdbff: high private use surrogates

0xdc00-0xdfff: low surrogates

 

A high private use surrogate is a high surrogate used only for characters in the range 0xf0000-0x10ffff, that is, plane 15 and plane 16, which are the private use areas; this is why it is called a high private use surrogate.
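
As an illustration of how a reader of UTF-16 data might apply these ranges, here is a small sketch. The names classify_code_unit and decode_utf16_units are assumptions, the input is taken to be a list of 16-bit integers already in native order, and the low surrogate range is taken to extend to 0xdfff.

```python
def classify_code_unit(unit: int) -> str:
    """Name the sub-range that a 16-bit code unit falls into."""
    if 0xD800 <= unit <= 0xDB7F:
        return "high surrogate"
    if 0xDB80 <= unit <= 0xDBFF:
        return "high private use surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "ordinary"


def decode_utf16_units(units):
    """Turn a list of 16-bit code units into a list of code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("high surrogate without a low surrogate")
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("low surrogate without a preceding high surrogate")
        else:
            code_points.append(unit)
            i += 1
    return code_points


print(classify_code_unit(0xDB80))                            # high private use surrogate
print([hex(cp) for cp in decode_utf16_units([0xDB80, 0xDC00])])  # ['0xf0000']
```

The second call shows that the first high private use surrogate, paired with the first low surrogate, lands exactly on 0xf0000, the start of plane 15.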

 

UTF-16 can be stored in big-endian or little-endian byte order.

In big-endian order, the high byte is placed at the lower address (front) and the low byte at the higher address (back). Little-endian order is the reverse.

 

Assume that the encoded result is 0xd950 0xdf21:

In big-endian order, the bytes are: d9 50 df 21

In little-endian order, the bytes are: 50 d9 21 df
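
The two byte orders for this example can be reproduced with Python's struct module. This is only a sketch; the helper name pack_utf16 is hypothetical.

```python
import struct


def pack_utf16(units, big_endian: bool) -> bytes:
    """Write 16-bit code units to bytes in the requested byte order."""
    prefix = ">" if big_endian else "<"
    return struct.pack(f"{prefix}{len(units)}H", *units)


units = [0xD950, 0xDF21]
print(pack_utf16(units, big_endian=True).hex())   # d950df21
print(pack_utf16(units, big_endian=False).hex())  # 50d921df
```

Only the bytes inside each 16-bit unit are swapped; the high surrogate still comes before the low surrogate in both orders.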

 

If you write UTF-16 encoded characters to a byte buffer, pay attention to the byte order.

If they are stored in a wchar_t array (where wchar_t is 16 bits wide, as on Windows), each value is held as a native integer, so you do not need to swap the high and low bytes.

 

In addition, UCS-2 can be regarded as a subset of UTF-16: it is the encoding scheme for everything in UTF-16 except the four-byte (surrogate pair) encodings.
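
One practical consequence is that a piece of text fits in UCS-2 only if none of its characters needs the four-byte form. A minimal check might look like this; the function name fits_in_ucs2 is an assumption for illustration.

```python
def fits_in_ucs2(text: str) -> bool:
    """Return True if every character can be stored in a single 16-bit unit."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_ucs2("abc\u4e2d"))      # True
print(fits_in_ucs2(chr(0x1F600)))     # False
```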

 

 
