Relationships among Unicode, UTF-8, UTF-16, UTF-32, UCS-2, and UCS-4

A summary of character encoding knowledge

The earliest encoding is ASCII, which covers only the values 0-127 and is expressed in one byte whose highest bit is always 0.

Later, many countries found the ASCII repertoire too small; Chinese characters, for example, cannot be expressed in it. Each country or region therefore developed its own extended encoding, such as GB2312 in mainland China, Big5 in Taiwan, and Shift-JIS in Japan. These encodings all follow the same idea: a variable-length code with a maximum length of 2 bytes, chosen mainly to stay compatible with ASCII. Characters that exist in ASCII keep their ASCII form, one byte whose highest bit is 0. Native characters are represented with two bytes; two bytes give at most 2^16 = 65536 combinations, which is generally enough. The usual practice is to set the highest bit of the first byte to 1. When the computer sees a byte whose highest bit is 1, it knows this is an extended code, reads the next byte as well, and treats the two bytes together as one character.

GB2312 is such a variable-length code with a maximum length of 2 bytes. It contains a total of 7445 characters, including 6763 Chinese characters. Its code range is 0x2121-0x777E (stored as 0xA1A1-0xF7FE in the common EUC-CN form, where the highest bit of both bytes is set). As we can see, GB2312 does not use all of the two-byte space, so it can be extended.

GBK extends GB2312. Because the more than 6000 Chinese characters defined by GB2312 are not enough, GBK was created. It remains compatible with GB2312 and adds less common Chinese characters. GBK contains a total of 21886 Chinese characters and symbols. It is also a variable-length code with a maximum length of 2 bytes, and its two-byte code range is 0x8140-0xFEFE.
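As a small illustration of the lead-byte convention and of GBK's compatibility with GB2312, here is a minimal sketch using Python's built-in `gb2312` and `gbk` codecs. Python is chosen for the examples in this article; the helper names `encoded_hex` and `can_encode` are hypothetical.

```python
def encoded_hex(text, encoding):
    """Hex bytes of each character of `text` in the given legacy encoding."""
    return [ch.encode(encoding).hex() for ch in text]

def can_encode(ch, encoding):
    """True if the codec has a code for this character."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# ASCII stays one byte with the high bit clear; a hanzi takes two bytes
# and its first byte has the high bit set.
print(encoded_hex("A中", "gb2312"))   # ['41', 'd6d0']

# Characters present in both encodings get identical bytes in GBK.
print(encoded_hex("A中", "gbk"))      # ['41', 'd6d0']

# The traditional form 龍 is absent from GB2312 but present in GBK.
print(can_encode("龍", "gb2312"), can_encode("龍", "gbk"))   # False True
```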

The extended encodings of other countries and regions (Big5, Shift-JIS) adopt the same idea as GB2312/GBK: a variable-length code of at most two bytes, because that keeps them compatible with ASCII.

However, there are more Chinese characters than these, including some especially uncommon ones. There appear to be more than 70 thousand Chinese characters in total, which is beyond what two bytes can represent. GB18030 was created for this situation.

GB18030 is a variable-length code with a maximum length of 4 bytes. It is backward compatible with GB2312 and GBK and adds many more characters, more than 70 thousand in total. A GB18030 code may therefore be 1, 2, or 4 bytes long. The part compatible with GB2312 and GBK uses two bytes; where two bytes are not enough, four bytes are used. The GB18030 code space holds about 1.6 million code positions, of which more than 20 thousand had been assigned at the time of writing. The ranges are as follows. The one-byte section is compatible with ASCII. In the two-byte section, the first byte is 0x81-0xFE and the second byte is 0x40-0x7E or 0x80-0xFE, which is compatible with GBK. In the four-byte section, the first byte is 0x81-0xFE and the second byte is 0x30-0x39; the third and fourth bytes have the same ranges as the first and second. The four-byte section covers, starting from U+0080, every Unicode 3.1 code point that the two-byte section has not already covered. In other words, GB18030 corresponds one-to-one with the Unicode standard in code space.
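The three possible lengths, and the size of the four-byte space, can be checked with a short sketch. It assumes Python's built-in `gb18030` codec; the helper name `gb18030_width` is hypothetical.

```python
def gb18030_width(ch):
    """Number of bytes GB18030 uses for a single character."""
    return len(ch.encode("gb18030"))

for ch in ["A", "中", "\U0001F600"]:
    print(f"U+{ord(ch):04X} -> {gb18030_width(ch)} byte(s)")

# Size of the four-byte code space from the byte ranges given above:
# 0x81-0xFE has 126 values and 0x30-0x39 has 10 values.
four_byte_space = 126 * 10 * 126 * 10
print(four_byte_space)   # 1587600, roughly 1.6 million
```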

Although every country defined its own encoding, the encodings are not interoperable. Each can use only the 65536 positions that two bytes provide, and there is no unified allocation: you use one segment, I use another, so the code assignments conflict. When a piece of GB2312 text is parsed as Shift-JIS, it becomes garbled. Unicode was created to unify them.
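The garbling is easy to reproduce. The following is a minimal sketch using Python's built-in `gb2312` and `shift_jis` codecs; the variable names are illustrative only.

```python
original = "中文"
raw = original.encode("gb2312")

# Decoding with the wrong legacy codec yields different characters.
# errors="replace" avoids an exception if a byte sequence is invalid there.
misread = raw.decode("shift_jis", errors="replace")

print(raw.hex())                         # d6d0cec4
print(misread)                           # garbled text, not "中文"
print(raw.decode("gb2312") == original)  # True
```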

Unicode defines a code space of more than one million code points and includes the characters of each national encoding (such as GB2312, GBK, and Big5). Representing every code point at a single fixed width takes four bytes. This is UTF-32, which is what the 4-byte wchar_t on Linux holds. Analysis shows, however, that most characters in use fit in two bytes, which saves space. Windows, for example, uses a two-byte Unicode scheme known as UTF-16. For characters that do not fit in two bytes, UTF-16 uses surrogates: a marker in the first two bytes indicates that they are only half of a code and that the following two bytes must be combined with them to form a single character. As a result, "Unicode" on Windows usually means UTF-16, while wide-character "Unicode" on Linux means UTF-32.

The formal name of the Unicode character set is "Universal Multiple-Octet Coded Character Set", or UCS for short. UCS can also be read as short for "Unicode Character Set", so UCS in effect refers to the standard character set of Unicode. UCS specifies how to use multiple bytes to represent characters.

UCS has two formats: UCS-2 and UCS-4. UCS-2 uses two bytes per code and UCS-4 uses four bytes (only 31 bits are actually used; the highest bit must be 0).

Both UCS-2 and UCS-4 are fixed-length codes. UCS-2 has 2^16 = 65536 code positions and UCS-4 has 2^31 = 2147483648 code positions.

UCS-4 is divided into 2^7 = 128 groups according to the highest byte, whose top bit is 0. Each group is divided into 256 planes according to the second-highest byte, each plane into 256 rows according to the third byte, and each row contains 256 cells. Plane 0 of group 0 is called the Basic Multilingual Plane (BMP). UCS-2 can represent only the BMP part of UCS-4, that is, code points up to 65535. Converting BMP codes between UCS-2 and UCS-4 is very simple: for UCS-2 to UCS-4, prepend two zero bytes; for UCS-4 to UCS-2, remove the two leading zero bytes. Early versions of the UCS-4 specification did not assign any characters outside the BMP. Newer versions of the standard do (since Unicode 3.1); otherwise the extended characters of GB18030 could not all be placed.
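The conversion and the group/plane/row/cell split can be sketched as follows. Python has no codec named UCS-2 or UCS-4, so this sketch assumes that for a BMP character the big-endian UTF-16 bytes equal the UCS-2 code and the big-endian UTF-32 bytes equal the UCS-4 code. The function names are hypothetical.

```python
def ucs2_to_ucs4(ucs2: bytes) -> bytes:
    """Widen one big-endian UCS-2 code to UCS-4 by prepending two zero bytes."""
    if len(ucs2) != 2:
        raise ValueError("expected exactly two bytes")
    return b"\x00\x00" + ucs2

def ucs4_to_ucs2(ucs4: bytes) -> bytes:
    """Narrow a big-endian UCS-4 code in the BMP by dropping two zero bytes."""
    if len(ucs4) != 4 or ucs4[:2] != b"\x00\x00":
        raise ValueError("code point is outside the BMP")
    return ucs4[2:]

def split_ucs4(code_point: int):
    """Split a UCS-4 value into (group, plane, row, cell), one byte each."""
    return tuple(code_point.to_bytes(4, "big"))

ch = "中"  # U+4E2D
ucs2 = ch.encode("utf-16-be")
ucs4 = ch.encode("utf-32-be")
print(ucs2_to_ucs4(ucs2) == ucs4)   # True
print(ucs4_to_ucs2(ucs4) == ucs2)   # True
print(split_ucs4(ord(ch)))          # (0, 0, 78, 45): group 0, plane 0 is the BMP
```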

Unicode is a coded character set. To use it in practice, the code points still have to be encoded into bytes, because storing UCS codes directly is not well suited to files and file systems. When ASCII is converted to UCS-2, a 0x00 byte is simply added in front of each code, and bytes with special meaning such as 0x00 and '/' turn up inside the codes of other characters, which causes serious errors in UNIX file names and in some C library functions. This led to the UCS Transformation Formats: UTF-7, UTF-8, UTF-16, and UTF-32.
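The problem with raw two-byte codes can be seen directly. In this sketch, big-endian UTF-16 stands in for UCS-2 (an assumption that holds for BMP characters), and U+012F is just one convenient example of a character whose code contains the byte 0x2F.

```python
NUL, SLASH = b"\x00", b"/"

# 'A' as a two-byte code carries a zero byte.
print("A".encode("utf-16-be"))        # b'\x00A'

# U+012F is stored as 01 2F; the second byte equals ASCII '/'.
print("\u012f".encode("utf-16-be"))   # b'\x01/'

# In UTF-8, bytes below 0x80 only ever stand for the ASCII character itself.
utf8 = "A\u012f中".encode("utf-8")
print(NUL in utf8, SLASH in utf8)     # False False
```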

UTF-8 is an encoding built from 8-bit units with a variable length. The original design allows 1 to 6 bytes per character; since RFC 3629 it is limited to 4 bytes, enough for every code point up to U+10FFFF. UTF-8 remains compatible with ASCII. In general, UTF-8 uses one byte for ASCII characters, two bytes for Western European characters, and three bytes for most Asian characters. UNIX platforms generally support UTF-8, and most HTML, file storage, and network transmission use UTF-8.
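A quick way to see these lengths is to encode one sample character from each class. The sample characters and the helper name `utf8_len` are illustrative choices.

```python
def utf8_len(ch):
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

samples = {
    "A": "ASCII",
    "é": "Western European",
    "中": "CJK",
    "\U0001F600": "outside the BMP",
}
for ch, label in samples.items():
    print(f"U+{ord(ch):04X} ({label}): {utf8_len(ch)} byte(s), {ch.encode('utf-8').hex()}")
```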

UTF-16 is also a variable-length encoding, but it is not ASCII compatible. UTF-16 is a superset of UCS-2: it is UCS-2 plus support for supplementary characters, which brings it in line with the Unicode 4.0 specification. UTF-16 needs at least two bytes for a character and four bytes for the characters beyond UCS-2, so a UTF-16 character is either 2 or 4 bytes. UTF-16 is the main encoding scheme of the Windows platform, from Windows 2000 onward. The Windows wchar_t is two bytes and holds UTF-16.
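The four-byte form is a pair of two-byte surrogates. The sketch below computes the pair by hand, following the standard UTF-16 arithmetic; the function name `to_surrogates` is hypothetical.

```python
def to_surrogates(code_point: int):
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))   # 0xd83d 0xde00

# A BMP character takes 2 bytes, a supplementary character takes 4.
print(len("中".encode("utf-16-be")), len("\U0001F600".encode("utf-16-be")))   # 2 4
```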

UTF-32 is a fixed-length code that is almost the same as UCS-4. It uses 4 bytes for every code point, and the wide-character type on Linux uses UTF-32.

Because UTF-16 and UTF-32 are built from multi-byte units, byte order matters. A byte order mark (BOM) may appear at the beginning of a Unicode stream to indicate the encoding and byte order. The BOM is a clever idea: UCS contains a character called "ZERO WIDTH NO-BREAK SPACE" with the code U+FEFF, while U+FFFE is not a character in UCS and therefore should never appear in real data. The UCS specification recommends transmitting ZERO WIDTH NO-BREAK SPACE before the byte stream itself. If the receiver reads FE FF, the stream is big-endian; if it reads FF FE, the stream is little-endian. The commonly used BOMs are listed below:

> UTF-16 big endian: FE FF
> UTF-16 little endian: FF FE
> UTF-32 big endian: 00 00 FE FF
> UTF-32 little endian: FF FE 00 00
> UTF-8: EF BB BF

When we read these bytes at the beginning of a Unicode stream, we can determine the encoding and byte order. UTF-8 is encoded byte by byte, so it has no byte order problem, but a BOM can still be used to mark the encoding: the UTF-8 encoding of ZERO WIDTH NO-BREAK SPACE is EF BB BF, so a byte stream that starts with EF BB BF is very likely UTF-8. On Windows, Notepad has traditionally saved UTF-8 files with a leading EF BB BF. On Linux, UTF-8 files usually omit the EF BB BF header in order to stay ASCII compatible.
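BOM detection can be sketched with the constants in Python's standard `codecs` module. The function name `sniff_bom` and the table `BOMS` are hypothetical; this is one reasonable way to do the check, not a complete encoding detector.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM (FF FE 00 00) starts with the
# UTF-16 LE BOM (FF FE), so checking UTF-16 first would misclassify it.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return (encoding, payload) based on a leading BOM, or (None, data)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

encoding, payload = sniff_bom(b"\xef\xbb\xbfhello")
print(encoding, payload.decode(encoding))   # utf-8 hello
```

When the input is known to be UTF-8, Python's `utf-8-sig` codec strips an optional leading EF BB BF during decoding.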

One advantage of the UTF-8 design is that it encodes code points into a stream of bytes rather than words or dwords, so the endianness of the underlying machine does not matter. You can exchange UTF-8 streams between a little-endian machine and a big-endian machine without reordering bytes or adding a BOM. In other words, you can completely ignore the underlying architecture.
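The difference shows up as soon as the same character is encoded in both byte orders. This is a minimal sketch with Python's built-in codecs; the variable names are illustrative.

```python
text = "中"  # U+4E2D

utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")
utf8 = text.encode("utf-8")

print(utf16_be.hex())   # 4e2d
print(utf16_le.hex())   # 2d4e
print(utf8.hex())       # e4b8ad, the only byte sequence UTF-8 has for this character
```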

Another advantage of UTF-8 is that it stores the bits of the code point from most significant to least significant, so sorting strings by their raw bytes gives the same order as sorting by code point. Although this is not as good as sorting by locale collation rules, it gives low-level systems that do not understand UTF-8 a very simple sort method: they only need to know how to sort raw bytes.
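The sorting property can be checked with a handful of strings. UTF-16 is included for contrast, because its surrogates (0xD800-0xDFFF) sort below BMP characters such as U+FF5E. The sample strings are arbitrary.

```python
words = ["\U0001F600", "\uFF5E", "中", "A"]

# Python compares str values by code point.
by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16_bytes = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_utf8_bytes == by_code_point)    # True
print(by_utf16_bytes == by_code_point)   # False
```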

Summary:

Unicode is a standard whose character set, UCS, has two formats: UCS-2 and UCS-4. Both are fixed length. Roughly speaking, UCS-4 corresponds to UTF-32, and UTF-16 is compatible with UCS-2 but extends it.

UTF is the implementation of Unicode. It comes in several forms: UTF-8, UTF-16, and UTF-32. UTF-8 and UTF-16 are variable length, and UTF-32 is fixed length. (There are also UTF-7 and other encodings.)

Attachment: byte layout of UTF-8:

> Bytes   Payload bits   Layout
> 1       7              0bbbbbbb
> 2       11             110bbbbb 10bbbbbb
> 3       16             1110bbbb 10bbbbbb 10bbbbbb
> 4       21             11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
> 5       26             111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
> 6       31             1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb

The 5-byte and 6-byte forms belong to the original design and are no longer valid under RFC 3629. The bytes 0xFE and 0xFF never occur in UTF-8, so there are no 7-byte or 8-byte forms.
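The layout table translates directly into bit operations. The following is a hand-written sketch covering the 1-byte to 4-byte forms, checked against Python's built-in encoder; the function name `utf8_encode` is hypothetical, and unlike a real encoder it does not reject surrogate code points (U+D800-U+DFFF).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the 1- to 4-byte layouts."""
    if code_point < 0x80:            # 7 bits:  0bbbbbbb
        return bytes([code_point])
    if code_point < 0x800:           # 11 bits: 110bbbbb 10bbbbbb
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 16 bits: 1110bbbb 10bbbbbb 10bbbbbb
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:       # 21 bits: 11110bbb followed by three 10bbbbbb
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("beyond U+10FFFF")

print(utf8_encode(0x4E2D).hex())   # e4b8ad
```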

From: http://blog.csdn.net/meteor1113/archive/2009/07/15/4350390.aspx
