If a Unicode character is represented by 2 bytes, it will likely take 3 bytes to encode in UTF-8. If a Unicode character is represented by 4 bytes, it may take up to 6 bytes in UTF-8 under the original design. Spending 4 or 6 bytes on a single character may seem excessive, but you will rarely encounter such characters. The UTF-8 conversion table is as follows:
| Unicode/UCS-4 (hex) | Bits | UTF-8 (binary) | Bytes | Note |
|---|---|---|---|---|
| 0000 ~ 007F | 0~7 | 0xxx xxxx | 1 | |
| 0080 ~ 07FF | 8~11 | 110x xxxx 10xx xxxx | 2 | |
| 0800 ~ FFFF | 12~16 | 1110 xxxx 10xx xxxx 10xx xxxx | 3 | Basic definition range: 0 ~ FFFF |
| 1 0000 ~ 1F FFFF | 17~21 | 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx | 4 | Unicode 6.1 definition range: 0 ~ 10 FFFF |
| 20 0000 ~ 3FF FFFF | 22~26 | 1111 10xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx | 5 | Not a Unicode range (see description) |
| 400 0000 ~ 7FFF FFFF | 27~31 | 1111 110x 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx | 6 | Not a Unicode range (see description) |

Description: the 5-byte and 6-byte ranges are not Unicode encoding ranges. They belong to the early UCS-4 specification, under which a UTF-8 sequence could be up to 6 bytes long and cover 31 bits (the original limit of the Universal Character Set). In November 2003, however, RFC 3629 restricted UTF-8 to the range defined by Unicode, U+0000 to U+10FFFF. Under that specification, these byte values never appear in a legal UTF-8 sequence.
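As a rough illustration of how the table can be read, the Python sketch below maps a code point to the sequence length of its row. The names `RANGES` and `utf8_length` are made up for this example, and the upper bound of the 4-byte row (1F FFFF) is an assumption derived from its 21-bit width.

```python
# Upper bound of each row in the table, paired with its byte count.
# The 5- and 6-byte rows are the legacy UCS-4 forms that RFC 3629 disallows.
RANGES = [
    (0x7F, 1),
    (0x7FF, 2),
    (0xFFFF, 3),
    (0x1FFFFF, 4),      # assumed bound: 21 bits
    (0x3FFFFFF, 5),
    (0x7FFFFFFF, 6),
]


def utf8_length(code_point):
    """Return the sequence length that the table assigns to code_point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for upper, length in RANGES:
        if code_point <= upper:
            return length
    raise ValueError("code point needs more than 31 bits")


# Cross-check a few characters against Python's built-in encoder.
for ch in ["A", "é", "中", "😀"]:
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
```

The cross-check only uses code points up to U+10FFFF, because Python's own encoder follows RFC 3629 and has no 5- or 6-byte forms to compare against.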
A Unicode character that is an ASCII character is encoded in 1 byte, and its UTF-8 representation is identical to its ASCII representation. Every other Unicode character takes at least 2 bytes in UTF-8. In a multi-byte sequence, each byte starts with marker bits. The first byte starts with n consecutive 1 bits followed by a single 0 bit, and the number of consecutive 1 bits equals the number of bytes used to encode the character. Every following byte starts with the bits 10.

To convert Unicode to UTF-8, take the binary digits of the code point from the low end to the high end, 6 bits at a time, and place them in the x positions of the matching format in the table above, filling the last byte first. Any x positions left over in the first byte are filled with 0.

Note: the number of bytes needed to convert a Unicode character to UTF-8 can be calculated with this rule: if the code point is less than 0x80 (an ASCII character), it converts to 1 byte. Otherwise, the number of bytes is the number of significant bits in the code point, minus 1, divided by 5 and rounded up.
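The conversion steps and the byte-count rule can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names `encode_utf8` and `byte_count_by_rule` are hypothetical, the encoder is limited to the range U+0000 to U+10FFFF allowed by RFC 3629, it does not reject surrogate code points, and the byte-count rule is assumed to round the division up.

```python
def encode_utf8(code_point):
    """Encode one code point by hand, 6 bits at a time from the low end."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the range allowed by RFC 3629")
    if code_point < 0x80:
        return bytes([code_point])  # ASCII: one byte, unchanged
    if code_point < 0x800:
        length = 2
    elif code_point < 0x10000:
        length = 3
    else:
        length = 4

    tail = []
    remaining = code_point
    for _ in range(length - 1):
        # Each following byte is 10xxxxxx and carries the next 6 low bits.
        tail.append(0b10000000 | (remaining & 0b00111111))
        remaining >>= 6

    # First byte: `length` one bits, a zero bit, then the leftover high bits.
    prefix = (0xFF << (8 - length)) & 0xFF
    lead = prefix | remaining
    return bytes([lead] + tail[::-1])


def byte_count_by_rule(code_point):
    """Byte count from the rule: (bits - 1) / 5, rounded up (assumed)."""
    if code_point < 0x80:
        return 1
    bits = code_point.bit_length()
    return -(-(bits - 1) // 5)  # ceiling division


# Compare both functions with Python's built-in encoder.
for ch in ["A", "é", "中", "😀"]:
    encoded = encode_utf8(ord(ch))
    assert encoded == ch.encode("utf-8")
    assert len(encoded) == byte_count_by_rule(ord(ch))
```

For example, U+4E2D (中) needs 3 bytes. Its lowest 6 bits go into the last byte, the next 6 bits into the middle byte, and the remaining 4 bits into the first byte after the 1110 prefix, giving E4 B8 AD.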
About UTF-8 (Online search)