UTF-8 encoding rules

Source: Internet
Author: User

 

UTF-8 concepts. Address: http://www.utf.com.cn/article/s41-3

What is UTF-8?

First of all, Unicode and UCS are just code tables that assign an integer to each character. There are several ways to represent a sequence of such characters as a sequence of bytes. The two most obvious methods store Unicode text as sequences of 2-byte or 4-byte units. The formal names of these two encodings are UCS-2 and UCS-4, respectively. Unless otherwise specified, the most significant byte comes first (big-endian convention). To convert an ASCII or Latin-1 file to UCS-2, simply insert a 0x00 byte before each ASCII byte. To convert it to UCS-4, insert three 0x00 bytes before each ASCII byte.
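The conversion just described can be sketched in a few lines of C. This is only an illustration: the function name ascii_to_ucs_be and its parameters are assumptions made for this example, not part of any standard library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: widen `n` ASCII (or Latin-1) bytes from `src` into
 * big-endian code units of `unit` bytes each (2 for UCS-2, 4 for UCS-4).
 * `dst` must have room for n * unit bytes. Returns the bytes written. */
size_t ascii_to_ucs_be(const unsigned char *src, size_t n,
                       unsigned char *dst, size_t unit)
{
    size_t i, k, out = 0;

    assert(unit == 2 || unit == 4);
    for (i = 0; i < n; i++) {
        for (k = 0; k + 1 < unit; k++)
            dst[out++] = 0x00;      /* leading zero bytes */
        dst[out++] = src[i];        /* the original byte comes last */
    }
    return out;
}
```

With unit set to 2, the two bytes of "Hi" become 00 48 00 69. With unit set to 4, each input byte is preceded by three zero bytes.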

 

Using UCS-2 (or UCS-4) under Unix would cause very serious problems. Strings in these encodings can contain bytes such as '\0' or '/' as parts of wide characters, and these bytes have a special meaning in file names and in other C library function parameters. In addition, most Unix tools expect ASCII files and cannot read 16-bit words as characters without major changes. For these reasons, UCS-2 is not suitable as an external encoding of Unicode in file names, text files, environment variables, and similar places.
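A small sketch makes the problem concrete. The helper has_special_byte and the three sample arrays are hypothetical names chosen for this illustration. U+012F is used as the sample character because its UCS-2 low byte happens to be 0x2F, the ASCII code of '/'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if the `n` bytes at `buf` include a byte
 * that C strings or Unix paths treat specially: 0x00 or 0x2F ('/'). */
int has_special_byte(const unsigned char *buf, size_t n)
{
    assert(buf != NULL);
    return memchr(buf, 0x00, n) != NULL || memchr(buf, 0x2F, n) != NULL;
}

/* UCS-2 big-endian for "A" (U+0041): the high byte is 0x00. */
const unsigned char ucs2_a[2] = { 0x00, 0x41 };

/* UCS-2 big-endian for U+012F: the low byte 0x2F is the ASCII code of '/'. */
const unsigned char ucs2_u012f[2] = { 0x01, 0x2F };

/* UTF-8 for the same U+012F: both bytes are above 0x7F. */
const unsigned char utf8_u012f[2] = { 0xC4, 0xAF };
```

Both UCS-2 arrays trigger the check, and strlen() on the first one returns 0 because it stops at the leading zero byte. The UTF-8 form of the same character contains neither special byte.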

 

The UTF-8 encoding defined in ISO 10646-1 Annex R and in RFC 2279 does not have these problems. It is clearly the way to go for using Unicode under Unix-style operating systems.

 

UTF-8 has the following properties:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files containing only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, an ASCII byte (0x00-0x7F) can never appear as part of any other character. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes the character occupies. All remaining bytes of a multibyte sequence are in the range 0x80 to 0xBF. This makes resynchronization easy, makes the encoding stateless, and makes it robust against lost bytes.

All 2^31 possible UCS codes can be encoded. In theory a UTF-8 encoded character can be up to 6 bytes long, but 16-bit BMP characters need at most 3 bytes. (RFC 3629, which obsoletes RFC 2279, later restricted UTF-8 to U+0000 through U+10FFFF, that is, to at most 4 bytes.)

The sorting order of big-endian UCS-4 byte strings is preserved. The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
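The resynchronization property can be illustrated with a short sketch. The function names below are hypothetical, and the code assumes its input is valid UTF-8. Because continuation bytes always look like 10xxxxxx, a program that lands in the middle of a character only needs to step back until it sees a byte that is not a continuation byte.

```c
#include <assert.h>
#include <stddef.h>

/* A continuation byte has the bit pattern 10xxxxxx (0x80..0xBF). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: starting from an arbitrary index `pos` in `buf`,
 * step backwards to the first byte of the character that contains it. */
size_t utf8_char_start(const unsigned char *buf, size_t pos)
{
    assert(buf != NULL);
    while (pos > 0 && is_continuation(buf[pos]))
        pos--;
    return pos;
}

/* Count characters by counting every byte that is not a continuation. */
size_t utf8_count_chars(const unsigned char *buf, size_t n)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        if (!is_continuation(buf[i]))
            count++;
    return count;
}
```

For the six bytes 61 C2 A9 E2 89 A0 (the letter "a", U+00A9 and U+2260), utf8_char_start() maps index 5 back to index 3, and utf8_count_chars() reports 3 characters. No state from earlier in the stream is needed.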

 

The following byte sequences are used to represent a character. The sequence to use depends on the character's Unicode code number.

 

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

 

The x bit positions are filled with the bits of the character's code number in binary. The rightmost x is the least significant bit. Only the shortest multibyte sequence that can represent the code number may be used. Note that in a multibyte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence.
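The filling rule can be sketched as a small encoder. The function name utf8_encode and its signature are assumptions made for this illustration. It follows the original scheme of 1 to 6 bytes described here, whereas current UTF-8 as defined in RFC 3629 stops at 4 bytes (U+10FFFF). It always picks the shortest sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoder for the 1-to-6-byte scheme. Writes the shortest
 * UTF-8 sequence for code number `cp` (at most 31 bits) into `out`, which
 * must hold 6 bytes, and returns the number of bytes written. */
size_t utf8_encode(unsigned long cp, unsigned char out[6])
{
    /* Lead-byte tag for sequences of 2..6 bytes; index = length. */
    static const unsigned char lead[7] =
        { 0, 0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t len, i;

    assert(cp <= 0x7FFFFFFFUL);

    if (cp < 0x80UL) {              /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if      (cp < 0x800UL)     len = 2;
    else if (cp < 0x10000UL)   len = 3;
    else if (cp < 0x200000UL)  len = 4;
    else if (cp < 0x4000000UL) len = 5;
    else                       len = 6;

    /* Fill continuation bytes from the right, 6 payload bits each. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead[len] | cp);   /* remaining high bits */
    return len;
}
```

Calling utf8_encode(0x00A9, buf) yields the two bytes C2 A9, and utf8_encode(0x2260, buf) yields E2 89 A0.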

 

For example, the Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as:

 

11000010 10101001 = 0xc2 0xa9

 

The character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

 

11100010 10001001 10100000 = 0xe2 0x89 0xa0

 

The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Do not write UTF-8 in any other way (such as utf8 or UTF_8) in documentation, unless you are referring to a variable name rather than the encoding itself.

 

 

Verify in Linux:

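#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *p = "中";   /* U+4E2D; the file must be saved as UTF-8 */
    size_t i;

    /* Print each byte of the string as two hex digits. */
    for (i = 0; i < strlen(p); i++)
        printf("%02x\n", p[i] & 0xff);
    return 0;
}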

Save the source file as UTF-8, then compile it with GCC.

Running output:

e4
b8
ad

The character "中" (U+4E2D) uses three bytes, whose binary values are:

E4: 11100100
B8: 10111000
AD: 10101101

 

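These bytes match the three-byte pattern for U-00000800 - U-0000FFFF:

1110xxxx 10xxxxxx 10xxxxxx

11100100 10111000 10101101

The x positions hold 0100 111000 101101 = 0100 1110 0010 1101, which is U+4E2D, the code number of "中".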
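The same matching can be done in code. The sketch below is a hypothetical decoder (the names utf8_seq_len and utf8_decode are assumptions for this example). It derives the sequence length from the leading 1 bits of the first byte, then collects 6 payload bits from each following byte. It does not reject overlong forms, so it is an illustration rather than a validating decoder.

```c
#include <assert.h>
#include <stddef.h>

/* Number of leading 1 bits in the lead byte = length of the sequence.
 * Returns 1 for an ASCII byte and 0 for a byte that cannot start a
 * character (a continuation byte, 0xFE or 0xFF). */
static size_t utf8_seq_len(unsigned char lead)
{
    size_t n = 0;

    if (lead < 0x80)
        return 1;
    while (lead & 0x80) {
        n++;
        lead = (unsigned char)(lead << 1);
    }
    return (n >= 2 && n <= 6) ? n : 0;
}

/* Hypothetical decoder: reads one character from `s` (at most `n` bytes),
 * stores its code number in `*cp` and returns the bytes consumed, or 0 if
 * the input is truncated or malformed. Overlong forms are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;
    unsigned long v;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;
    len = utf8_seq_len(s[0]);
    if (len == 0 || len > n)
        return 0;
    if (len == 1) {
        *cp = s[0];
        return 1;
    }
    v = s[0] & (0xFFu >> (len + 1));      /* payload bits of the lead byte */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a 10xxxxxx byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    *cp = v;
    return len;
}
```

Given the bytes E4 B8 AD, utf8_decode() consumes 3 bytes and stores 0x4E2D. Starting at the second byte instead returns 0, because B8 is a continuation byte and cannot begin a character.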


 


In addition, the assembly that GCC generates for this program contains:

.string "\344\270\255"

Each escape is the octal value of one byte: \344 = 0xE4, \270 = 0xB8, and \255 = 0xAD.
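The relationship between the escapes in the assembler output and the hexadecimal bytes printed earlier can be checked with a short sketch. The names from_octal, from_hex and escapes_match are hypothetical. The two string literals are written with octal and hexadecimal escapes, and should hold the same bytes.

```c
#include <assert.h>
#include <string.h>

/* The same three bytes written two ways. */
static const char from_octal[] = "\344\270\255";   /* as in the GCC output */
static const char from_hex[]   = "\xE4\xB8\xAD";   /* hexadecimal escapes  */

/* Nonzero if both literals hold identical bytes (including the NUL). */
int escapes_match(void)
{
    assert(sizeof from_octal == sizeof from_hex);
    return memcmp(from_octal, from_hex, sizeof from_octal) == 0;
}
```

Octal 344 is 3*64 + 4*8 + 4 = 228 = 0xE4, and the other two escapes convert the same way.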

 
