Encoding knowledge: Unicode and UTF-8

Source: Internet
Author: User

Some problems are really knowledge problems. In our programs we often write things like char szstr[32], strcpy(), and sprintf(). Once different encodings are involved, the results can be surprising, usually because we overlooked some basic facts. For example, we tend to assume that a Chinese character occupies 2 bytes, whereas in UTF-8 it may take 3 to 4 bytes. String processing code written under the 2-byte assumption can then write past the end of a buffer without our noticing. That is the motivation for this article.
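The overflow risk described above can be sketched in C. This is a minimal illustration rather than code from the original article: the helper name copy_bounded and the variable two_hanzi are hypothetical, and the string is written as explicit UTF-8 byte escapes so that the result does not depend on the encoding of the source file.

```c
#include <stdio.h>
#include <string.h>

/* Two Chinese characters (U+4E2D U+6587) as UTF-8 bytes: 3 bytes each,
 * 6 bytes in total, not the 4 bytes a "2 bytes per character" rule predicts. */
const char two_hanzi[] = "\xE4\xB8\xAD\xE6\x96\x87";

/* Hypothetical helper: copy src into dst without writing past dst_size bytes.
 * Returns 1 if the whole string fit, 0 if it had to be truncated.
 * Caveat: truncation works on bytes, so it may cut a character in half. */
int copy_bounded(char *dst, size_t dst_size, const char *src)
{
    int needed = snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

With a buffer sized for two 2-byte characters plus the terminator (char small[5]), copy_bounded(small, sizeof small, two_hanzi) reports truncation instead of overflowing, whereas strcpy() would write 7 bytes into the 5-byte buffer.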

 

ISO 10646 defines the Universal Character Set (UCS). A UCS code is 31 bits long, so it can represent 2^31 characters. If the codes of two characters have the same high bits and differ only in the low 16 bits, they belong to the same plane, so a plane consists of 2^16 characters. At present, most commonly used characters are located in the first plane (code range U-00000000 ~ U-0000FFFD), known as the BMP (Basic Multilingual Plane) or Plane 0. For backward compatibility, the characters numbered 0 ~ 255 are the same as in Latin-1. UCS codes are usually written in the form U-XXXXXXXX, while BMP codes are usually written in the form U+XXXX, where X is a hexadecimal digit. While ISO was developing the UCS, a consortium of manufacturers was working on a similar encoding, known as Unicode. The two groups later cooperated on a unified encoding but still publish their own standard documents, so UCS codes and Unicode codes are the same.

Once characters have codes, the next problem is how those codes are represented in a computer. One byte per character is no longer enough. The most direct idea is to use four bytes per character, which is called UCS-4 or UTF-32 (UTF stands for Unicode Transformation Format). This wastes storage space: because common characters are concentrated in the BMP, the two high bytes are usually 0, and if only ASCII or Latin-1 is used, the three high bytes are all 0. A more compact approach is to use two bytes per character, known as UCS-2, which can represent only the characters in the BMP. UTF-16 extends this: the BMP reserves a range of codes for this purpose, and two of these reserved codes together represent one character from another plane, which is called a surrogate pair. Both UTF-32 and UTF-16 have a more serious problem, which is that they are not compatible with the C language. In C, a 0 byte marks the end of a string, and library functions such as strlen and strcpy all rely on this. If a string is stored in UTF-32, it contains many 0 bytes that do not mark the end of the string, and these functions no longer work correctly.
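Both points in the paragraph above can be sketched in C. The names ab_utf32le, naive_length and utf16_surrogates are hypothetical, chosen for this illustration. The surrogate arithmetic is the standard UTF-16 rule: subtract 0x10000, then add the high 10 bits to 0xD800 and the low 10 bits to 0xDC00.

```c
#include <stdint.h>
#include <string.h>

/* The two characters "AB" stored as UTF-32 little-endian, followed by a
 * four-byte terminator. Three of every four bytes are zero for ASCII text. */
const char ab_utf32le[] = { 0x41, 0, 0, 0,  0x42, 0, 0, 0,  0, 0, 0, 0 };

/* Hypothetical helper: what strlen() reports when handed UTF-32 data.
 * It stops at the first zero byte, which is inside the first character. */
size_t naive_length(const char *bytes)
{
    return strlen(bytes);
}

/* Split a code point outside the BMP (0x10000 ~ 0x10FFFF) into a UTF-16
 * surrogate pair. Returns 1 on success, 0 if cp is not in that range. */
int utf16_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                              /* now a 20-bit value      */
    *high = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits             */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits          */
    return 1;
}
```

Here naive_length(ab_utf32le) returns 1 even though the buffer holds two characters, and utf16_surrogates(0x1F600, &hi, &lo) yields the pair 0xD83D, 0xDE00.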

The UTF-8 encoding, proposed by Ken Thompson, one of the creators of UNIX, solves these problems well and is now widely used. UTF-8 has the following properties:

  • Characters U+0000 ~ U+007F occupy only one byte, 0x00 ~ 0x7F, which is compatible with ASCII.

  • Characters above U+007F are encoded with 2 ~ 6 bytes, and the highest bit of each of those bytes is 1, whereas the highest bit of an ASCII byte is 0. The encoding of a non-ASCII character therefore never contains an ASCII byte (and, in particular, never a 0 byte).

  • In a multi-byte sequence representing a non-ASCII character, the first byte is in the range 0xC0 ~ 0xFD, and its value tells you how many of the following bytes belong to the current character. Each subsequent byte is in the range 0x80 ~ 0xBF. See the detailed description below.

  • All 2^31 characters defined by the UCS can be expressed in UTF-8 encoding.

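  • A UTF-8 encoding is at most 6 bytes long, and the UTF-8 encoding of a BMP character is at most 3 bytes long. (The current UTF-8 standard, RFC 3629, restricts codes to U+10FFFF, so a valid sequence today is at most 4 bytes long.)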

  • The bytes 0xFE and 0xFF never appear in a UTF-8 encoding.
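The byte ranges listed above are enough to classify any single byte. The following is a minimal sketch, assuming the original 6-byte scheme described in this article; the function name utf8_classify and its return convention are hypothetical.

```c
/* Classify one byte of UTF-8 text (original scheme, sequences up to 6 bytes).
 * Returns: 1..6  lead byte; the value is the total length of the sequence
 *          0     continuation byte (0x80 ~ 0xBF)
 *         -1     byte that never appears in UTF-8 (0xFE, 0xFF)            */
int utf8_classify(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: ASCII, one byte on its own */
    if (b < 0xC0) return 0;   /* 10xxxxxx: continuation byte          */
    if (b < 0xE0) return 2;   /* 110xxxxx                             */
    if (b < 0xF0) return 3;   /* 1110xxxx                             */
    if (b < 0xF8) return 4;   /* 11110xxx                             */
    if (b < 0xFC) return 5;   /* 111110xx                             */
    if (b < 0xFE) return 6;   /* 1111110x                             */
    return -1;                /* 0xFE and 0xFF                        */
}
```

For example, utf8_classify(0xE4) returns 3, meaning that two continuation bytes should follow, and utf8_classify(0xB8) returns 0.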

Specifically, UTF-8 encoding has the following formats:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
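The format table can be turned into an encoder fairly directly. The sketch below follows the table as given in this article, including the 5-byte and 6-byte forms; it does not apply the restrictions of the current standard (codes above U+10FFFF and surrogate codes are not rejected). The name utf8_encode is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS code into out, which must have room for 6 bytes.
 * Returns the number of bytes written, or 0 if cp does not fit in 31 bits. */
size_t utf8_encode(uint32_t cp, unsigned char out[6])
{
    size_t len, i;

    if (cp < 0x80) {                 /* one byte, identical to ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    else if (cp < 0x800)       len = 2;
    else if (cp < 0x10000)     len = 3;
    else if (cp < 0x200000)    len = 4;
    else if (cp < 0x4000000)   len = 5;
    else if (cp < 0x80000000u) len = 6;
    else return 0;

    /* Fill the continuation bytes from last to second: 10xxxxxx, taking the
     * low 6 bits each time, so the high bits end up in the earlier bytes. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }

    /* Lead byte: len one-bits, a zero bit, then the remaining high bits. */
    out[0] = (unsigned char)((0xFF << (8 - len)) | cp);
    return len;
}
```

Calling utf8_encode(0x4E2D, out) returns 3 and stores 0xE4 0xB8 0xAD, matching the third row of the table.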

The first byte either has its highest bit set to 0 (an ASCII byte) or has its two highest bits set to 1. The number of 1 bits after the highest bit determines how many of the following bytes also belong to the current character. For example, in 111110xx there are four more 1 bits after the highest bit, which means that the next four bytes also belong to the encoding of the current character. The two highest bits of every subsequent byte are 10, so these bytes can be told apart from a first byte. This design helps with resynchronization after errors. For example, if several bytes are lost during network transmission, it is easy to tell that the current character is incomplete and easy to find where the next character starts. At most one or two characters are lost, and the rest of the data is not misinterpreted. In the formats above, the bits marked x carry the UCS code. The last, 6-byte format has 31 x bits, which can represent a 31-bit UCS code. UTF-8 is like a train: the first byte is the locomotive, each following byte is a carriage, and the cargo they carry is the UCS code. UTF-8 specifies that the UCS code is carried in big-endian order, that is, the x bits in the first byte are the high bits of the UCS code and the x bits in the following bytes are the low bits.
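The resynchronization property can be sketched as follows: because every continuation byte has the form 10xxxxxx, a reader that lands in the middle of a character only has to skip bytes of that form to reach the start of the next one. The function names utf8_resync and utf8_count_chars are hypothetical.

```c
#include <stddef.h>

/* Starting at index pos, return the index of the next byte that can start a
 * character (any byte that is not 10xxxxxx), or len if there is none. */
size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    while (pos < len && (buf[pos] & 0xC0) == 0x80)
        pos++;
    return pos;
}

/* Count the character starts in a buffer by counting every byte that is
 * not a continuation byte. Orphaned continuation bytes are ignored. */
size_t utf8_count_chars(const unsigned char *buf, size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if ((buf[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```

Given the damaged buffer 0xB8 0xAD 0xE6 0x96 0x87 0x41 (a 3-byte character whose first byte was lost, then an intact 3-byte character, then "A"), utf8_resync from index 0 returns 2, and only the one damaged character is lost.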

For example, U+00A9 (the © character) is 10101001 in binary, and its UTF-8 encoding is 11000010 10101001 (0xC2 0xA9). It must not be encoded as 11100000 10000010 10101001, because UTF-8 requires every character to be encoded in as few bytes as possible.
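A decoder can enforce the shortest-form rule by comparing the decoded value with the smallest code that legitimately needs that many bytes. The sketch below assumes the 6-byte scheme from the format table in this article; the name utf8_decode and the min_cp table are hypothetical, and the restrictions of the current standard (codes above U+10FFFF, surrogate codes) are not checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one character from buf, of which len bytes are available.
 * On success, stores the code in *cp and returns the bytes consumed.
 * Returns 0 for truncated input, a bad lead or continuation byte, or an
 * overlong (non-shortest) encoding. */
size_t utf8_decode(const unsigned char *buf, size_t len, uint32_t *cp)
{
    /* Smallest code that needs n bytes, indexed by n. */
    static const uint32_t min_cp[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t n, i;
    uint32_t value;

    if (len == 0) return 0;
    if (buf[0] < 0x80)      { n = 1; value = buf[0]; }
    else if (buf[0] < 0xC0) return 0;   /* stray continuation byte */
    else if (buf[0] < 0xE0) { n = 2; value = buf[0] & 0x1F; }
    else if (buf[0] < 0xF0) { n = 3; value = buf[0] & 0x0F; }
    else if (buf[0] < 0xF8) { n = 4; value = buf[0] & 0x07; }
    else if (buf[0] < 0xFC) { n = 5; value = buf[0] & 0x03; }
    else if (buf[0] < 0xFE) { n = 6; value = buf[0] & 0x01; }
    else return 0;                      /* 0xFE and 0xFF never appear */

    if (n > len) return 0;              /* sequence is cut off */
    for (i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80) return 0;
        value = (value << 6) | (buf[i] & 0x3F);   /* high bits come first */
    }
    if (value < min_cp[n]) return 0;    /* overlong: fewer bytes would do */

    *cp = value;
    return n;
}
```

With this check, 0xC2 0xA9 decodes to U+00A9, while the 3-byte form 0xE0 0x82 0xA9 is rejected. Rejecting overlong forms matters in practice because otherwise the same character could slip past byte-level comparisons in several different spellings.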
