The UTF-8 format is as follows:
Note: x represents a bit whose value is 0 or 1. The range column is in hexadecimal notation, and the encoding format column is in binary notation.
Range                    Encoding format
0x00000000-0x0000007f    0xxxxxxx
0x00000080-0x000007ff    110xxxxx 10xxxxxx
0x00000800-0x0000ffff    1110xxxx 10xxxxxx 10xxxxxx
0x00010000-0x0010ffff    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
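The four rows of the UTF-8 table can be turned into code almost mechanically. The following is a minimal sketch, not a production encoder; the function name utf8_encode is hypothetical, and it does not reject the surrogate values that a strict encoder would refuse.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by picking the matching row of the UTF-8 table."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])
```

For example, utf8_encode(0x20AC) yields the three bytes e2 82 ac, which matches the third row of the table.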
The UTF-16 format is as follows:
Range                    Encoding format
0x00000000-0x0000ffff    xxxxxxxx xxxxxxxx
0x00010000-0x0010ffff    110110xx xxxxxxxx 110111xx xxxxxxxx
To encode a character in the range 0x00010000-0x0010ffff, first subtract 0x00010000 from the original code point, which leaves a 20-bit value, and split that value into its high 10 bits and low 10 bits. 0xd800 and 0xdc00 are used as the surrogate bases: combine the high 10 bits with 0xd800 and the low 10 bits with 0xdc00 (a bitwise OR) to obtain the high and low surrogates, and then concatenate the two.
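The surrogate calculation is easier to follow in code. This is a small sketch under the usual UTF-16 rules; the function name utf16_encode is hypothetical, and the result is a list of 16-bit code units rather than bytes.

```python
from typing import List


def utf16_encode(cp: int) -> List[int]:
    """Return the UTF-16 code units (one or two 16-bit values) for a code point."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0xFFFF:
        # Fits in a single 16-bit unit.
        return [cp]
    v = cp - 0x10000              # 20-bit value
    high = 0xD800 | (v >> 10)     # high 10 bits combined with 0xd800
    low = 0xDC00 | (v & 0x3FF)    # low 10 bits combined with 0xdc00
    return [high, low]
```

For the code point 0x64321, the subtraction gives 0x54321, whose high 10 bits are 0x150 and low 10 bits are 0x321, so the result is 0xd950 0xdf21.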
To be able to recognize 4-byte UTF-16 characters among UTF-16 characters that are otherwise all expressed in two bytes, we stipulate that if a two-byte value falls in 0xd800-0xdfff, those two bytes and the following two bytes together constitute a single character. The 0xd800-0xdfff region of 2-byte UTF-16 is therefore reserved for surrogates, which is where the term comes from. The region is divided as follows:
0xd800-0xdb7f are the high surrogates
0xdb80-0xdbff are the high private use surrogates
0xdc00-0xdfff are the low surrogates
The high private use surrogates are used specifically to represent the 0xf0000-0x10ffff range, that is, plane 15 and plane 16, which are the private use planes; hence the name high private use surrogates.
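The recognition rule and the three sub-ranges above can be sketched as follows. Both function names (surrogate_kind and utf16_decode) are hypothetical, and raising an error on an unpaired surrogate is just one possible policy, chosen here for simplicity.

```python
from typing import List


def surrogate_kind(unit: int) -> str:
    """Classify a 16-bit code unit by the surrogate sub-range it falls in."""
    if 0xD800 <= unit <= 0xDB7F:
        return "high surrogate"
    if 0xDB80 <= unit <= 0xDBFF:
        return "high private use surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "not a surrogate"


def utf16_decode(units: List[int]) -> List[int]:
    """Turn a sequence of 16-bit code units back into code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            out.append(u)
            i += 1
    return out
```

Decoding the pair 0xdb80 0xdc00 gives 0xf0000, the first code point of plane 15, which is why 0xdb80 marks the start of the high private use surrogates.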
UTF-16 can be stored in either big-endian or little-endian byte order.
In big-endian order, the high byte is placed at the lower address (front) and the low byte at the higher address (back); little-endian order is the reverse.
Assume that the encoded result is 0xd950 0xdf21:
In big-endian order: 0xd950 0xdf21
In little-endian order: 0x50d9 0x21df
If you write UTF-16 encoded characters to a byte buffer, pay attention to the byte order.
If they are stored in a wchar_t array, you do not need to swap the high and low bytes, because each element is kept in the machine's native byte order.
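The byte order example above can be reproduced with a short sketch that writes code units into a byte buffer. The helper name utf16_to_bytes and its big_endian parameter are hypothetical; struct.pack from the Python standard library does the actual byte layout.

```python
import struct
from typing import List


def utf16_to_bytes(units: List[int], big_endian: bool = True) -> bytes:
    """Serialize 16-bit code units into a byte buffer in the chosen byte order."""
    # ">" selects big-endian, "<" little-endian; "H" is an unsigned 16-bit value.
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)


units = [0xD950, 0xDF21]
big = utf16_to_bytes(units, big_endian=True)       # bytes d9 50 df 21
little = utf16_to_bytes(units, big_endian=False)   # bytes 50 d9 21 df
```

Only the two bytes inside each 16-bit unit are swapped; the order of the units themselves, high surrogate first, stays the same in both layouts.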
In addition, UCS-2 is a subset of UTF-16: it is the encoding scheme for the part of UTF-16 that excludes the four-byte encodings.