I. Universal Character Set (UCS)
ISO/IEC 10646-1 [ISO-10646] defines a multi-octet character set called the Universal Character Set (UCS), which covers most of the world's writing systems. Two multi-octet encodings are defined: UCS-4, which uses four octets (8-bit bytes) per character, and UCS-2, which uses two octets per character. UCS-2 can address only the first 64K characters of the UCS; outside that range there are currently no assignments.
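As a rough illustration of the two fixed-width encodings, the sketch below packs a code value into four octets (UCS-4) or two octets (UCS-2) with Python's `struct` module. The helper names `ucs4_bytes` and `ucs2_bytes` are hypothetical, and big-endian byte order is an assumption made only for this example.

```python
import struct

def ucs4_bytes(code: int) -> bytes:
    """Pack a UCS code value into four octets (big-endian assumed)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS range")
    return struct.pack(">I", code)

def ucs2_bytes(code: int) -> bytes:
    """Pack a UCS code value into two octets; only the first 64K values fit."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("UCS-2 can address only the first 64K characters")
    return struct.pack(">H", code)

print(ucs4_bytes(0x4E2D).hex())  # 00004e2d
print(ucs2_bytes(0x4E2D).hex())  # 4e2d
```

Calling `ucs2_bytes(0x10000)` raises `ValueError`, which mirrors the 64K limit described above.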
II. Basic Multilingual Plane (BMP)
ISO 10646 defines a 31-bit character set. However, of this huge code space, only the first 65,534 code positions (0x0000 to 0xFFFD) have been assigned so far. This 16-bit subset of the UCS is called the Basic Multilingual Plane (BMP).
III. Unicode encoding
Historically, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode project, organized by a consortium of manufacturers of multilingual software (mostly in the United States at first). Fortunately, around 1991 the participants of both projects realized that the world does not need two different unified character sets. They merged their work and cooperated on a single code table. Both projects still exist and publish their respective standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and to coordinate closely on any future extensions. The Unicode standard additionally defines much of the semantics associated with characters and is generally the better reference for implementers of high-quality typographic publishing systems.
IV. UTF-8 encoding
The UCS-2 and UCS-4 encodings are hard to use in many current applications and protocols that assume 8-bit or even 7-bit characters. Even newer systems that can handle 16-bit characters cannot process UCS-4 data. This situation has led to the development of the so-called UCS transformation formats (UTF), each with different characteristics. UTF-8 (RFC 2279) uses all 8 bits of an octet but preserves the full US-ASCII range: US-ASCII characters are encoded in one octet with the usual US-ASCII value, and any octet with such a value can stand only for a US-ASCII character, never for anything else. UTF-8 has the following characteristics:
1) Conversion between UTF-8 and either UCS-4 or UCS-2 is easy in both directions.
2) The first octet of a multi-octet sequence indicates the number of octets in the sequence.
3) The octet values FE and FF never appear.
4) Character boundaries are easy to find from anywhere in an octet stream.
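Characteristics 2 to 4 can be checked with a few lines of code. The following is only a sketch: the function names `sequence_length` and `char_start` are hypothetical, the 1-to-6-octet layout of RFC 2279 is assumed, and the sample data comes from Python's built-in `utf-8` codec, which follows the later, narrower definition of UTF-8.

```python
def sequence_length(first_octet: int) -> int:
    """Number of octets in the sequence, read from the first octet alone."""
    if first_octet & 0x80 == 0:
        return 1                      # 0xxxxxxx: a single US-ASCII octet
    n = 0
    while first_octet & (0x80 >> n):  # count the leading 1 bits
        n += 1
    if n == 1 or n > 6:               # 10xxxxxx, FE and FF cannot start a sequence
        raise ValueError("not a valid first octet")
    return n

def char_start(data: bytes, i: int) -> int:
    """Back up from index i to the first octet of the character containing it."""
    while i > 0 and data[i] & 0xC0 == 0x80:  # 10xxxxxx marks a continuation octet
        i -= 1
    return i

# 'a' (1 octet), U+00E9 (2 octets), U+4E2D (3 octets)
data = "a\u00e9\u4e2d".encode("utf-8")

print(sequence_length(data[3]))          # 3
print(char_start(data, 5))               # 3
print(0xFE in data or 0xFF in data)      # False
```

Because continuation octets always have the form 10xxxxxx, `char_start` never needs to back up more than five positions under the assumed layout.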
UTF-8 definition:
In UTF-8, characters are encoded as sequences of 1 to 6 octets. In a sequence of one octet, the high-order bit is 0 and the remaining 7 bits encode the character value. In a sequence of n octets (n > 1), the first octet has its n high-order bits set to 1, followed by a bit set to 0; the remaining bits of that octet hold bits of the character value. Each following octet has its high-order bit set to 1 and the next bit set to 0, leaving 6 bits in each to hold bits of the character value.
The following table summarizes the format of these different octet types. The letter x indicates a bit taken from the UCS-4 character value being encoded.
UCS-4 range (hex)      UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx
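The table can be read as a recipe for an encoder. The following is a minimal sketch under that reading, not a reference implementation: the names `ROWS` and `ucs4_to_utf8` are hypothetical, and the code follows the 1-to-6-octet scheme of the table rather than Python's built-in `utf-8` codec, which accepts only values up to 0x10FFFF.

```python
# One entry per table row:
# (upper bound of the UCS-4 range, number of octets, high-order bits of the first octet)
ROWS = [
    (0x0000007F, 1, 0x00),  # 0xxxxxxx
    (0x000007FF, 2, 0xC0),  # 110xxxxx
    (0x0000FFFF, 3, 0xE0),  # 1110xxxx
    (0x001FFFFF, 4, 0xF0),  # 11110xxx
    (0x03FFFFFF, 5, 0xF8),  # 111110xx
    (0x7FFFFFFF, 6, 0xFC),  # 1111110x
]

def ucs4_to_utf8(code: int) -> bytes:
    """Encode one UCS-4 value using the row of the table that contains it."""
    if code < 0:
        raise ValueError("negative code value")
    for upper, length, lead in ROWS:
        if code <= upper:
            break
    else:
        raise ValueError("outside the 31-bit UCS-4 range")

    octets = []
    # Fill the continuation octets from the low-order end, 6 bits at a time.
    for _ in range(length - 1):
        octets.append(0x80 | (code & 0x3F))  # 10xxxxxx
        code >>= 6
    # The bits that are left go into the first octet, below its marker bits.
    octets.append(lead | code)
    return bytes(reversed(octets))

print(ucs4_to_utf8(0x4E2D).hex())      # e4b8ad
print(ucs4_to_utf8(0x7FFFFFFF).hex())  # fdbfbfbfbfbf
```

Because the rows do not overlap, the first row whose upper bound is not exceeded is the only candidate, so each value gets exactly one encoding.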
The encoding rules from UCS-4 to UTF-8 are as follows:
1) Determine the number of octets required from the character value and the first column of the table above. Note that the rows of the table are mutually exclusive, that is, there is only one valid way to encode a given UCS-4 character.
2) Prepare the high-order bits of the octets as shown in the second column of the table.