Introduction and mutual conversion of Chinese character encoding formats such as UTF-8 and GBK

Source: Internet
Author: User

We often work with Chinese encoding formats such as GBK and GB2312. Because these formats were designed mainly for Chinese text, they are not universal, so we frequently need to convert between encodings, for example to and from UTF-8. In my own use, however, I found that encoding conversion is not as simple as it seems and can go wrong even when you use the system API. I ran into some confusion, and the material I found by searching did not fully solve my problem, so I put together this article. Some of my references and code implementations are listed at the end of the article; my thanks to their authors.

This article first gives a brief introduction to the Chinese encodings, without going into detailed principles (see the links at the end of this article), and then uses examples to examine the problems that arise in encoding conversion.

Introduction to each encoding format

GB2312

GB2312 is the national standard character set of the People's Republic of China for information interchange. Its full name is "Code of Chinese Graphic Character Set for Information Interchange, Primary Set". It was issued by the National Bureau of Standards, came into effect on May 1, 1981, and is used throughout mainland China. Singapore and other regions also use this encoding. GB2312 contains simplified Chinese characters together with symbols, letters, Japanese kana, and so on, for a total of 7,445 graphic characters, of which 6,763 are Chinese characters. GB2312 specifies that "every graphic character is represented by two bytes, and each byte uses a seven-bit code". By convention the first byte is called the "high byte" and the second the "low byte".

GB2312's code range is 2121H-777EH, which overlaps with ASCII. The usual solution is to set the highest bit of both bytes of the GB code to 1 so that they can be told apart from ASCII.
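The relationship between the seven-bit GB code and the bytes actually stored can be sketched in Python. The helper name `gb2312_to_bytes` is made up for this illustration, and the character 啊 (GB code 3021H) is just a convenient sample.

```python
def gb2312_to_bytes(gb_code: int) -> bytes:
    """Turn a 7-bit GB2312 code (0x2121-0x777E) into the two stored bytes."""
    high, low = gb_code >> 8, gb_code & 0xFF
    if not (0x21 <= high <= 0x77 and 0x21 <= low <= 0x7E):
        raise ValueError("outside the GB2312 code range")
    # Set the highest bit of each byte so neither byte collides with ASCII.
    return bytes([high | 0x80, low | 0x80])


sample = gb2312_to_bytes(0x3021)   # GB code of 啊
print(sample.hex())                # b0a1
print(sample.decode("gb2312"))     # 啊
```

The stored form is what Python's built-in `gb2312` codec reads and writes, which is why the result can be decoded directly.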

GBK

GB2312 contains only 6,763 Chinese characters, far fewer than the number of Chinese characters in existence. With the passage of time and the continued development of Chinese culture, some characters that were once rarely used have become common. For example, the character 镕 in Zhu Rongji's name was not included in GB2312-80, so newspapers and other publications had to write it as "(金+容)", "(金容)", or "(left 金, right 容)". These different forms all stand for the same character, which makes display, storage, input, and processing very inconvenient, and there is no unified standard for such workarounds. To solve these problems, and to prepare for the adoption of Unicode, the National Information Technology Standardization Technical Committee formulated the "Chinese Internal Code Extension Specification" (GBK) on December 1, 1995.

GBK is fully compatible with GB2312 and supports the characters of the international standard ISO 10646, so it serves as a bridge between the two during the transition. GBK also represents each Chinese character with two bytes, while ASCII characters remain single-byte. To distinguish Chinese characters from ASCII, the highest bit of the first byte is set to 1. The overall code range is 8140-FEFE: the first byte lies in 81-FE and the trailing byte lies in 40-FE, excluding the xx7F line.

GBK contains a total of 21,886 Chinese characters and graphic symbols, including:

* All Chinese characters and non-Chinese symbols in GB2312;

* All Chinese characters in Big5;

* The other CJK characters in GB13000, the national standard corresponding to ISO 10646; together with the two items above, this gives 20,902 Chinese characters;

* Other Chinese characters, radicals, and symbols, 984 in total.
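As a rough illustration, the byte ranges above and the 镕 example can be checked with Python's built-in `gb2312` and `gbk` codecs. The function name `is_gbk_double_byte` is an assumption of this sketch, and the check only tests the ranges quoted in this section, not whether a code is actually assigned.

```python
def is_gbk_double_byte(pair: bytes) -> bool:
    """Check a two-byte sequence against the GBK ranges quoted above."""
    if len(pair) != 2:
        return False
    lead, trail = pair
    return 0x81 <= lead <= 0xFE and 0x40 <= trail <= 0xFE and trail != 0x7F


rong = "镕"
try:
    rong.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False              # 镕 is not part of GB2312-80

gbk_bytes = rong.encode("gbk")     # GBK does include the character
print(in_gb2312, len(gbk_bytes), is_gbk_double_byte(gbk_bytes))
print("A".encode("gbk"))           # ASCII stays a single byte
```

This also shows one way a conversion can fail: encoding to a smaller character set raises an error (or loses data, depending on the error handler) for any character the target set lacks.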

GB18030

GB18030 is the latest national standard for Chinese coded character sets and is backward-compatible with the GBK and GB2312 standards. GB18030 is a variable-length encoding that uses one, two, or four bytes per character. The one-byte portion, 0x00-0x7F, is compatible with ASCII. In the two-byte portion, the first byte ranges over 0x81-0xFE and the trailing byte over 0x40-0x7E and 0x80-0xFE, which is basically compatible with the GBK standard. In the four-byte portion, the first byte ranges over 0x81-0xFE and the second over 0x30-0x39; the third and fourth bytes use the same ranges as the first and second. The four-byte portion covers every Unicode 3.1 code point from U+0080 onward that is not already covered by the two-byte portion. In other words, GB18030 maps one-to-one onto the Unicode code space, much as UTF-8 does.
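The length rule described above (look at the first byte, then at the second) can be sketched as follows. `split_gb18030` is a hypothetical helper written for this article; it assumes well-formed input and performs no validation beyond choosing the sequence length.

```python
def split_gb18030(data: bytes) -> list[bytes]:
    """Split well-formed GB18030 bytes into one byte string per character."""
    parts, i = [], 0
    while i < len(data):
        if data[i] <= 0x7F:
            n = 1                                   # ASCII-compatible byte
        elif i + 1 < len(data) and 0x30 <= data[i + 1] <= 0x39:
            n = 4                                   # second byte is a digit
        else:
            n = 2                                   # GBK-compatible pair
        parts.append(data[i:i + n])
        i += n
    return parts


text = "A中\u0080\U0001F600"
parts = split_gb18030(text.encode("gb18030"))
print([len(p) for p in parts])   # [1, 2, 4, 4]
print(parts[2].hex())            # 81308130, the four-byte code for U+0080
```

Because the second byte of a four-byte sequence is always in 0x30-0x39, which is never a valid trailing byte in the two-byte portion, the two forms cannot be confused.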

Unicode


Unicode is a character encoding scheme developed by an international organization, intended to hold all the scripts and symbols of the world. Unicode maps characters to the numbers 0-0x10FFFF, which gives room for up to 1,114,112 code points. Unicode itself is only a character set: it assigns a number to each symbol but does not specify how that number is stored as bytes. UTF-8, UTF-16, and UTF-32 are all encodings of Unicode, and UTF-8 is the most widely used of them.
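The difference between a code point and its stored bytes is easy to see in Python, where `ord` gives the Unicode number and `str.encode` gives the bytes for a chosen encoding. The sample character 中 is arbitrary.

```python
ch = "中"
code_point = ord(ch)
print(hex(code_point))                  # 0x4e2d

# One code point, three different byte representations.
encoded = {name: ch.encode(name).hex()
           for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, hex_bytes in encoded.items():
    print(name, hex_bytes)

print(0x10FFFF + 1)                     # 1114112 possible code points
```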

UTF-8

UTF-8 (Unicode Transformation Format, 8-bit) may carry a BOM but usually does not. It is a multi-byte encoding designed to handle international characters, and it is the most widely used Unicode encoding on the Internet. The most notable feature of UTF-8 is that it is a variable-length encoding: it uses 1 to 4 bytes per symbol, with the length depending on the symbol, which saves storage space. An English letter takes 8 bits (one byte), while a Chinese character typically takes 24 bits (three bytes). UTF-8 covers the characters needed by every country in the world; it is an international encoding with strong universality. UTF-8 encoded text can be displayed in any country on browsers that support the UTF-8 character set. For example, a page encoded in UTF-8 can display Chinese in an English version of IE abroad without the user downloading the IE Chinese language support package.

UTF-8's encoding rules are simple; there are only two:

1. For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, therefore, the UTF-8 encoding is identical to ASCII;

2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0, and the first two bits of every following byte are set to 10. The remaining bits are filled with the symbol's Unicode code point.
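The two rules can be turned into a small encoder. This is only a sketch for illustration: `utf8_encode` is a made-up name, the byte-count thresholds (0x80, 0x800, 0x10000) come from the UTF-8 definition rather than from this article, and surrogate code points are not rejected as a real encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point following the two rules above."""
    if cp < 0x80:
        return bytes([cp])                 # rule 1: 0xxxxxxx
    if cp < 0x800:
        n = 2
    elif cp < 0x10000:
        n = 3
    elif cp <= 0x10FFFF:
        n = 4
    else:
        raise ValueError("code point out of range")
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))    # continuation byte: 10xxxxxx
        cp >>= 6
    lead_marker = (0xFF << (8 - n)) & 0xFF # n one-bits followed by a zero
    return bytes([lead_marker | cp] + tail[::-1])


print(utf8_encode(ord("A")).hex())   # 41
print(utf8_encode(ord("中")).hex())  # e4b8ad
```

For 中 (U+4E2D) the 16 payload bits are spread over the pattern 1110xxxx 10xxxxxx 10xxxxxx, which is why a typical Chinese character takes three bytes.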

In addition, just as ASCII is used only for English characters, Big5 is an encoding scheme for Traditional Chinese used in Taiwan and Hong Kong. Although it has some flaws, it is widely used in the computer industry, especially on the Internet, and has become a de facto industry standard.

Summary

ASCII represents English characters using 7 bits, for 128 characters; its extensions use 8 bits, for 256 characters;

GB2312 is a Simplified Chinese encoding format that supports only 6,763 commonly used Chinese characters;

GBK is an extension of GB2312 and is compatible with it; it covers many more Chinese characters and supports both Simplified and Traditional Chinese;

GBK is less universal than UTF-8, but Chinese text stored as UTF-8 takes more space in a database than the same text stored as GBK (three bytes per character instead of two);

GB2312 and GBK are double-byte character sets (DBCS); GB18030 extends them with four-byte sequences;

From ASCII through GB2312 and GBK to GB18030, each encoding is backward-compatible with the previous one: the same character always has the same encoding in all of these schemes, and each later standard supports more characters. In these encodings, English and Chinese text can be processed in a uniform way. A Chinese character is recognized by the fact that the highest bit of its high byte is not 0.
