Common Character Encodings Explained

Source: Internet
Author: User

I. Universal Character Set (UCS)

ISO/IEC 10646-1 [ISO-10646] defines a multi-byte character set called the Universal Character Set (UCS), which covers most of the world's writing systems. Two multi-byte encodings are defined: UCS-4, which uses four 8-bit bytes per character, and UCS-2, which uses two 8-bit bytes per character. UCS-2 can address only the first 64K characters of the UCS; at the time of RFC 2279, nothing outside that range had been assigned.
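The difference between the two fixed-width encodings can be sketched in a few lines of Python. The function names `ucs4_bytes` and `ucs2_bytes` are made up for this illustration, and big-endian byte order is an assumption; the text above does not specify one.

```python
import struct

def ucs4_bytes(code_point: int) -> bytes:
    """Pack a character value into four 8-bit bytes (big-endian assumed)."""
    return struct.pack(">I", code_point)

def ucs2_bytes(code_point: int) -> bytes:
    """Pack a character value into two 8-bit bytes (big-endian assumed)."""
    if not 0 <= code_point <= 0xFFFF:
        # Two bytes can only address the first 64K characters.
        raise ValueError("value does not fit in UCS-2")
    return struct.pack(">H", code_point)

print(ucs4_bytes(0x4E2D).hex())  # 00004e2d
print(ucs2_bytes(0x4E2D).hex())  # 4e2d
```

A value such as 0x10000 can still be packed by `ucs4_bytes`, while `ucs2_bytes` rejects it.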

II. Basic Multilingual Plane (BMP)

ISO 10646 defines a 31-bit character set. In this vast code space, however, only the first 65,534 code positions (0x0000 to 0xFFFD) had been assigned when this was written. This 16-bit subset of the UCS is called the Basic Multilingual Plane (BMP).
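As a small illustration of the two ranges mentioned here, the following sketch checks whether a character value lies in the 16-bit subset. The names `UCS_MAX`, `BMP_MAX` and `in_bmp` are hypothetical and chosen only for this example.

```python
UCS_MAX = 0x7FFFFFFF  # largest value in a 31-bit code space
BMP_MAX = 0xFFFF      # largest value in the 16-bit subset

def in_bmp(code_point: int) -> bool:
    """Return True if the value lies in the 16-bit subset of the UCS."""
    if not 0 <= code_point <= UCS_MAX:
        raise ValueError("value is outside the 31-bit code space")
    return code_point <= BMP_MAX

print(in_bmp(0x4E2D))   # True
print(in_bmp(0x10000))  # False
```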

III. Unicode encoding

Historically, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode project, organized by a consortium of multilingual software manufacturers (at first mostly in the United States). Fortunately, around 1991 the participants of both projects realized that the world does not need two different unified character sets. They merged their work and cooperated on a single code table. Both projects still exist and publish their respective standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and to coordinate closely on any future extensions. The Unicode standard additionally defines much of the semantics associated with characters and is generally the better reference for implementing high-quality typographic publishing systems.

IV. UTF-8 encoding

The UCS-2 and UCS-4 encodings are difficult to use in many current applications and protocols, which assume that a character is an 8-bit or even a 7-bit byte. Even newer systems that can handle 16-bit characters cannot process UCS-4 data. This situation has led to the development of the UCS transformation formats (UTF), each of which has different characteristics. UTF-8 (RFC 2279) uses all 8 bits of a byte and preserves the entire US-ASCII range: US-ASCII characters are encoded in a single 8-bit byte with their usual US-ASCII value, and any 8-bit byte with such a value can only stand for a US-ASCII character, never for part of another character. It has the following characteristics:

1) It is easy to convert in either direction between UTF-8 and UCS-4 or UCS-2.

2) The first 8-bit byte of a multi-byte sequence indicates the number of 8-bit bytes in the sequence.

3) The 8-bit byte values FE and FF never appear.

4) It is easy to find where a character begins, starting from anywhere in a stream of 8-bit bytes.
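Several of these properties can be observed directly with Python's built-in UTF-8 codec. The sketch below is only an illustration: the helper names `is_continuation` and `char_starts` are invented here, and the boundary test relies on the fact that every byte after the first in a sequence has the form 10xxxxxx.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx, which never start a character."""
    return byte & 0xC0 == 0x80

def char_starts(data: bytes) -> list:
    """Offsets at which a character begins in a UTF-8 byte string."""
    return [i for i, b in enumerate(data) if not is_continuation(b)]

text = "a\u00e9\u4e2d"       # 'a', U+00E9 and U+4E2D
data = text.encode("utf-8")  # Python's built-in UTF-8 codec

print(data.hex(" "))                 # 61 c3 a9 e4 b8 ad
print(char_starts(data))             # [0, 1, 3]
print(0xFE in data or 0xFF in data)  # False
```

The US-ASCII character 'a' keeps its usual value 0x61, and no other byte in the output is below 0x80.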

UTF-8 definition:

In UTF-8, characters are encoded as sequences of 1 to 6 8-bit bytes. In a sequence of one byte, the high-order bit of the byte is 0 and the other 7 bits encode the character value. In a sequence of n bytes (n > 1), the first byte has its n high-order bits set to 1, followed by a bit set to 0; the remaining bits of that byte hold bits of the character value. Each following byte has its high-order bit set to 1 and the next bit set to 0, leaving 6 bits per byte for bits of the character value.

The following table summarizes the format of these different byte types. The letter x marks a bit that comes from the UCS-4 character value being encoded.
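The rule for the first byte can be turned into a small helper that counts the leading 1 bits. This is a minimal sketch of the 1-to-6-byte scheme described here; `sequence_length` is a name chosen for this example.

```python
def sequence_length(first_byte: int) -> int:
    """Number of bytes in the sequence introduced by first_byte (1 to 6)."""
    if first_byte & 0x80 == 0:
        return 1                      # 0xxxxxxx: single-byte sequence
    if first_byte & 0xC0 == 0x80:
        raise ValueError("10xxxxxx is not the first byte of a sequence")
    n = 0
    mask = 0x80
    while first_byte & mask:          # count the leading 1 bits
        n += 1
        mask >>= 1
    if n > 6:
        raise ValueError("FE and FF never appear in UTF-8")
    return n

print(sequence_length(0x61))  # 1
print(sequence_length(0xC3))  # 2
print(sequence_length(0xE4))  # 3
```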

UCS-4 range (hex)        UTF-8 byte sequence (binary)
  0000 0000<->0000 007F  0xxxxxxx
  0000 0080<->0000 07FF  110xxxxx 10xxxxxx
  0000 0800<->0000 FFFF  1110xxxx 10xxxxxx 10xxxxxx
  0001 0000<->001F FFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  0020 0000<->03FF FFFF  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
  0400 0000<->7FFF FFFF  1111110x 10xxxxxx ... 10xxxxxx
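Read from right to left, the table also shows how a byte sequence maps back to a UCS-4 value: strip the fixed high-order bits and concatenate the x bits. The following is a minimal sketch of that reading, not a production decoder; `decode_sequence` and `MIN_VALUE` are names invented for this example, and the function expects exactly one complete sequence.

```python
# Smallest value that belongs in a sequence of n bytes, per the table rows.
MIN_VALUE = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}

def decode_sequence(seq: bytes) -> int:
    """Recover the UCS-4 value from one complete UTF-8 byte sequence."""
    first = seq[0]
    if first < 0x80:                      # 0xxxxxxx
        if len(seq) != 1:
            raise ValueError("unexpected trailing bytes")
        return first
    n = 0
    while first & (0x80 >> n):            # count the leading 1 bits
        n += 1
    if n < 2 or n > 6 or len(seq) != n:
        raise ValueError("malformed sequence")
    value = first & (0x7F >> n)           # x bits of the first byte
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a 10xxxxxx byte")
        value = (value << 6) | (b & 0x3F)  # append six x bits
    if value < MIN_VALUE[n]:
        raise ValueError("value belongs in a shorter row of the table")
    return value

print(hex(decode_sequence(b"\xc3\xa9")))      # 0xe9
print(hex(decode_sequence(b"\xe4\xb8\xad")))  # 0x4e2d
```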

The encoding rules from UCS-4 to UTF-8 are as follows:

1) Determine the number of 8-bit bytes required from the character value and the first column of the table above. Note that the rows of the table are mutually exclusive, that is, there is only one valid way to encode a given UCS-4 character.

2) Prepare the high-order bits of the bytes as shown in the second column of the table.
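The steps above can be sketched in Python as follows. The names `ROWS` and `encode_ucs4` are hypothetical. The final filling of the x positions is not spelled out in the steps shown here; the sketch assumes, in line with the table, that the bits of the character value are placed into the x positions starting from the low-order end.

```python
# One entry per table row:
# (upper bound of the range, number of bytes, high-order bits of first byte)
ROWS = [
    (0x0000007F, 1, 0x00),
    (0x000007FF, 2, 0xC0),
    (0x0000FFFF, 3, 0xE0),
    (0x001FFFFF, 4, 0xF0),
    (0x03FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]

def encode_ucs4(value: int) -> bytes:
    """Encode a UCS-4 character value as a UTF-8 sequence of 1 to 6 bytes."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value is outside the 31-bit code space")
    # Step 1: the first row whose range contains the value gives the length.
    for upper, length, lead_bits in ROWS:
        if value <= upper:
            break
    # Step 2: set the fixed high-order bits, then fill in the x bits,
    # six at a time from the low-order end (assumed fill order).
    out = bytearray(length)
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (value & 0x3F)    # 10xxxxxx
        value >>= 6
    out[0] = lead_bits | value            # remaining bits go in the first byte
    return bytes(out)

print(encode_ucs4(0x41).hex(" "))    # 41
print(encode_ucs4(0xE9).hex(" "))    # c3 a9
print(encode_ucs4(0x4E2D).hex(" "))  # e4 b8 ad
```

For values that Python's own codec accepts, the result can be compared with `chr(value).encode("utf-8")`.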
