ASCII, GB2312, GBK, GB18030, Unicode, UTF-8 character set encoding details

Last Update:2018-12-07 Source: Internet

Author: User

Tags control characters


ASCII character set encoding

ASCII is a 7-bit code with an encoding range of 0x00-0x7f. The ASCII character set includes English letters, Arabic numerals, punctuation marks, and other characters. The ranges 0x00-0x1f and 0x7f hold the 33 control characters; 0x20 is the space character.
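The count of control characters can be checked directly. This is a minimal sketch; the variable names are illustrative only.

```python
# Control characters are the codes 0x00-0x1f plus 0x7f (DEL); 0x20 is the space.
control_codes = [c for c in range(0x80) if c < 0x20 or c == 0x7F]
printable_codes = [c for c in range(0x80) if c not in control_codes]

print(len(control_codes))    # 33
print(len(printable_codes))  # 95 (space plus the visible characters)

# Every ASCII code fits in 7 bits.
print(all(c & 0x80 == 0 for c in range(0x80)))
```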

A system that supports only ASCII ignores the highest bit of each byte and treats the low 7 bits as the valid bits. The HZ character encoding was designed to transmit Chinese characters through such 7-bit ASCII systems. Many early email systems supported only ASCII, so Chinese email had to be sent with Base64 or another 7-bit-safe encoding method.
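The effect of a 7-bit channel, and the Base64 workaround, can be sketched as follows. This is only an illustration with an arbitrary sample string; it is not how any particular mail system was implemented.

```python
import base64

text = "中文"
raw = text.encode("gb2312")                # every byte has the high bit set

# What a 7-bit-only system would keep: the high bit of each byte is lost.
stripped = bytes(b & 0x7F for b in raw)

# Base64 turns the same bytes into pure 7-bit ASCII that survives transport.
encoded = base64.b64encode(raw)
restored = base64.b64decode(encoded).decode("gb2312")

print(raw, stripped, encoded, restored)
```

After stripping, the bytes no longer decode to the original text, while the Base64 form round-trips intact.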

GB2312 character set encoding

GB2312 is designed around the location code (qu-wei code). The encoding table is divided into 94 areas, each of which holds 94 positions. The combination of a character's area code and position code is the location code of that Chinese character. Location codes are usually written as decimal numbers. For example, 1601 means area 16, position 01, and the corresponding character is "啊" (a). Adding 0xa0 to the area code and to the position code gives the two bytes of the GB2312 encoding, so 1601 becomes 0xb0a1.
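The conversion between a location code and its GB2312 bytes can be sketched like this. The function names `location_to_gb2312` and `gb2312_to_location` are hypothetical helpers written for this example.

```python
def location_to_gb2312(location_code: int) -> bytes:
    """Convert a decimal location code such as 1601 to two GB2312 bytes."""
    area, position = divmod(location_code, 100)
    if not (1 <= area <= 94 and 1 <= position <= 94):
        raise ValueError("area and position must both be in 1..94")
    return bytes([area + 0xA0, position + 0xA0])


def gb2312_to_location(encoded: bytes) -> int:
    """Convert two GB2312 bytes back to the decimal location code."""
    high, low = encoded
    return (high - 0xA0) * 100 + (low - 0xA0)


code = location_to_gb2312(1601)
print(code.hex(), code.decode("gb2312"))   # b0a1 啊
```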

In the location code, areas 01-09 hold symbols and digits, areas 16-87 hold Chinese characters, and areas 10-15 and 88-94 are undefined blank areas. The Chinese characters are divided into two levels. Level 1 contains 3755 commonly used characters in areas 16-55, arranged in pinyin order (then by stroke). Level 2 contains 3008 less commonly used characters in areas 56-87, arranged by radical and stroke count. Because level-1 characters are sorted by pinyin, each pinyin occupies a contiguous range of location codes among the level-1 characters. Many programs that derive pinyin from Chinese characters are written on this principle.
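The area layout described above can be turned into a small classifier. This is a sketch only; `gb2312_area_kind` is a hypothetical name, and a real pinyin lookup would additionally need a table of the boundary codes for each pinyin initial, which is not reproduced here.

```python
def gb2312_area_kind(ch: str) -> str:
    """Classify one character by the GB2312 area it is encoded in."""
    encoded = ch.encode("gb2312")       # raises UnicodeEncodeError if not in GB2312
    if len(encoded) == 1:
        return "ascii"
    area = encoded[0] - 0xA0            # high byte minus 0xa0 is the area code
    if 1 <= area <= 9:
        return "symbols"
    if 16 <= area <= 55:
        return "level 1"                # sorted by pinyin
    if 56 <= area <= 87:
        return "level 2"                # sorted by radical and strokes
    return "undefined"


print(gb2312_area_kind("啊"))           # area 16, so level 1
```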

In addition to common simplified Chinese characters, the GB2312 character set also contains Greek letters, Japanese hiragana and katakana, and Russian (Cyrillic) letters. Traditional Chinese characters are generally absent, so you can use them to test whether a system supports only GB2312 encoding.
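One way to run such a test is to try encoding a string as GB2312 and see whether it fails. A minimal sketch, where `fits_gb2312` is a hypothetical helper name:

```python
def fits_gb2312(text: str) -> bool:
    """Return True if every character of text can be encoded in GB2312."""
    try:
        text.encode("gb2312")
    except UnicodeEncodeError:
        return False
    return True


print(fits_gb2312("简体"))        # simplified characters
print(fits_gb2312("α あ ア Я"))   # Greek, hiragana, katakana, Cyrillic
print(fits_gb2312("體"))          # a traditional character
print("體".encode("gbk").hex())   # the same character is available in GBK
```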

The full encoding range of GB2312 is 0xa1a1-0xfefe. After the undefined areas are removed, the actual encoding range is 0xa1a1-0xf7fe.

EUC-CN can be understood as an alias for GB2312; the two are exactly the same.
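Python's standard codec registry reflects this aliasing, which gives a quick way to check it. The sample string is arbitrary.

```python
import codecs

# Both names resolve to the same codec in Python's standard library.
euc_name = codecs.lookup("euc-cn").name
gb_name = codecs.lookup("gb2312").name
print(euc_name, gb_name)

# The encoded bytes are therefore identical.
same_bytes = "汉字".encode("euc-cn") == "汉字".encode("gb2312")
print(same_bytes)
```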

The location code should be regarded as the definition of the character set: it defines which characters are included and where each one sits. GB2312 and EUC-CN are the encodings of that character set actually used in computing environments. HZ and ISO-2022-CN are two other encodings of the location-code character set, both of which use a 7-bit encoding space to support Chinese characters. The relationship between the location code and the GB2312 encoding is somewhat like that between Unicode and UTF-8.
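The idea of one character set with several encodings can be illustrated with Python's `gb2312` and `hz` codecs. This is a sketch for a single character; the exact escape sequences HZ wraps around the payload are left to the codec.

```python
ch = "啊"
area, position = 16, 1                 # location code 1601

gb = ch.encode("gb2312")               # 8-bit form: location code + 0xa0
hz = ch.encode("hz")                   # 7-bit form, wrapped in escape sequences

# Inside the HZ escapes, the payload is the location code + 0x20.
payload = bytes([area + 0x20, position + 0x20])

print(gb.hex(), hz, payload in hz)
print(hz.decode("hz") == gb.decode("gb2312"))
```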

GBK character set encoding

GBK encoding is a superset of GB2312 encoding and is fully compatible with it. GBK also contains all the CJK Chinese characters in the Unicode Basic Multilingual Plane. Like GB2312, GBK supports Greek letters, Japanese kana, Russian letters, and other characters, but it does not support the phonetic characters (non-Chinese characters) of Korean. GBK also includes Chinese character radicals, vertical punctuation, and other characters.

The overall GBK encoding range is 0x8140-0xfefe, excluding any code whose low byte is 0x7f. The high byte range is 0x81-0xfe, and the low byte range is 0x40-0x7e and 0x80-0xfe.
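These byte ranges translate directly into a validity check. A minimal sketch; `is_gbk_double_byte` is a hypothetical name, and the check only tests the byte ranges, not whether a character is actually assigned to the code.

```python
def is_gbk_double_byte(high: int, low: int) -> bool:
    """Check whether (high, low) falls inside the GBK double-byte range."""
    high_ok = 0x81 <= high <= 0xFE
    low_ok = 0x40 <= low <= 0xFE and low != 0x7F
    return high_ok and low_ok


print(is_gbk_double_byte(0x81, 0x40))   # first code in the range
print(is_gbk_double_byte(0xB0, 0x7F))   # low byte 0x7f is excluded
```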

GBK characters whose low byte is in 0x40-0x7e are special because that byte falls in the ASCII range, which can cause trouble in some systems.

Some systems use characters in 0x40-0x7e (such as "|") as special symbols. When scanning for these symbols, they do not check whether the byte is actually the low byte of a GBK character, which can lead to incorrect matches. This problem does not exist in environments that support only GB2312. Note that in a GBK environment, a byte smaller than 0x80 is not necessarily an ASCII symbol. It is also best to choose special delimiter characters from the ASCII characters below 0x40, so that they can be located quickly without any risk of being the second half of a Chinese character. Big5 encoding has the same problem.

cp936 and GBK differ slightly. In most cases, cp936 can be treated as an alias of GBK.
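The delimiter problem can be reproduced with a GBK character whose low byte is 0x7c, the ASCII code of "|". In this sketch the character is built from the bytes 0x81 0x7c purely for illustration, and the record layout is hypothetical.

```python
# A GBK character whose second byte equals the ASCII code of "|".
ch = bytes([0x81, 0x7C]).decode("gbk")
record = f"{ch}|abc".encode("gbk")       # two fields separated by "|"

# Byte-level split: the low byte of the character is mistaken for a delimiter.
naive_fields = record.split(b"|")

# Decoding first and splitting the text keeps the character intact.
safe_fields = record.decode("gbk").split("|")

print(len(naive_fields), len(safe_fields))
```

Decoding before searching avoids the false match, at the cost of processing the whole text rather than scanning raw bytes.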

GB18030 character set encoding

GB18030 encoding is backward compatible with GBK and GB2312. Compatibility here means not only that the same characters are supported, but also that each shared character has the same encoding. GB18030 contains all the characters in Unicode 3.1, including the scripts of China's ethnic minorities and the Korean characters that GBK does not support. In other words, it covers the writing systems of most peoples in the world.
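Both halves of this claim can be checked with Python's built-in codecs. The sample characters are arbitrary choices for this sketch.

```python
# Shared characters have identical bytes in all three encodings.
samples = "中文啊"
same_bytes = all(
    c.encode("gb2312") == c.encode("gbk") == c.encode("gb18030")
    for c in samples
)

# A Korean Hangul syllable is outside GBK but inside GB18030.
hangul = "한"
try:
    hangul.encode("gbk")
    gbk_has_hangul = True
except UnicodeEncodeError:
    gbk_has_hangul = False
hangul_gb18030 = hangul.encode("gb18030")

print(same_bytes, gbk_has_hangul, len(hangul_gb18030))
```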

Both GBK and GB2312 are double-byte encodings. If their single-byte compatibility with ASCII is taken into account, they can also be understood as variable-length encodings that mix single-byte and double-byte codes. GB18030 is a variable-length encoding whose codes are one, two, or four bytes long.

The single-byte encoding range of GB18030 is 0x00-0x7f, exactly the same as ASCII. The double-byte encoding range is the same as that of GBK: the high byte is 0x81-0xfe, and the low byte is 0x40-0x7e or 0x80-0xfe. In the four-byte encoding, the first and third bytes are 0x81-0xfe, and the second and fourth bytes are 0x30-0x39.
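Because the second byte of a four-byte code (0x30-0x39) never overlaps the low-byte range of a double-byte code, the length of each code can be decided from its first two bytes. The following is a minimal sketch that assumes well-formed input; `split_gb18030` is a hypothetical helper, not a library function.

```python
def split_gb18030(data: bytes) -> list:
    """Split well-formed GB18030 bytes into one-, two-, and four-byte codes."""
    units = []
    i = 0
    while i < len(data):
        if data[i] <= 0x7F:
            size = 1                                   # ASCII
        elif i + 1 < len(data) and 0x30 <= data[i + 1] <= 0x39:
            size = 4                                   # second byte is a digit
        else:
            size = 2                                   # GBK-compatible pair
        units.append(data[i:i + size])
        i += size
    return units


data = "A中한".encode("gb18030")
print([len(u) for u in split_gb18030(data)])
```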

The Windows cp936 code page uses 0x80 to represent the euro symbol, whereas GB18030 leaves the 0x80 code position unused and represents the euro symbol at a different position. This can be seen as a small flaw in GB18030's backward compatibility. It can also be understood that 0x80 is a cp936 extension to GBK, and that GB18030 is only guaranteed to be compatible with GBK itself.
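The GB18030 side of this difference can be observed with Python's `gb18030` codec. The sketch deliberately avoids the cp936 side, since the behaviour of a codec named cp936 may vary between platforms and libraries.

```python
euro = "€"
euro_gb18030 = euro.encode("gb18030")      # encoded at some position other than 0x80

# A lone 0x80 byte is not a valid GB18030 code.
try:
    b"\x80".decode("gb18030")
    lone_0x80_decodes = True
except UnicodeDecodeError:
    lone_0x80_decodes = False

print(euro_gb18030.hex(), lone_0x80_decodes)
```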

Unicode Character Set Encoding

Each language having its own code pages increases the complexity of software that must support multiple languages. A world standard called Unicode was therefore developed. Unicode provides a unique value for each character, regardless of platform, software, or language. That is, all the characters used in the world are enumerated and each one is assigned a unique, specific value.
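In Python, this unique value is what `ord` returns and what `chr` accepts, independent of any byte encoding. A small illustration with arbitrary sample characters:

```python
# Each character maps to exactly one code point, whatever encoding is used later.
for ch in "A中€":
    print(ch, f"U+{ord(ch):04X}")

# The mapping is reversible.
print(chr(0x4E2D))
```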

The original goal of Unicode was to use a 16-bit encoding to provide mappings for more than 65,000 characters. However, this was not enough: it could not cover all historical scripts, nor did it solve the implementation headaches, especially in network-based applications, because existing software would need substantial rework to handle 16-bit data.

For this reason, Unicode defines three encoding forms, along with some reserved code values: UTF-8, UTF-16, and UTF-32. As the name suggests, UTF-8 encodes characters as sequences of 8-bit units, representing each character in one or more bytes. The biggest benefit of this approach is that UTF-8 retains the ASCII encoding as a subset. For example, "A" is encoded as 0x41 in both UTF-8 and ASCII.

UTF-16 and UTF-32 are the 16-bit and 32-bit Unicode encoding forms, respectively. Because of Unicode's original 16-bit design, "Unicode" often means UTF-16 in practice. When discussing Unicode, it is therefore important to establish which encoding form is meant.
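The three encoding forms can be compared by encoding the same characters with each of them. In this sketch, `encoded_sizes` is a hypothetical helper, and the little-endian codec names are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character takes in each encoding form."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("A"))             # ASCII letter
print(encoded_sizes("中"))            # a common Chinese character
print(encoded_sizes(chr(0x20000)))    # a character beyond the 16-bit range

# UTF-8 keeps the ASCII encoding unchanged.
print("A".encode("utf-8") == "A".encode("ascii") == b"\x41")
```

The last character shows why 16 bits were not enough: UTF-16 needs two 16-bit units for it.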

UTF-8 character set encoding

UTF-8 stands for Unicode Transformation Format, 8-bit. It permits a BOM but usually does not include one. It is a multi-byte encoding designed for international text: it uses 8 bits (one byte) for English characters and 24 bits (three bytes) for most Chinese characters. UTF-8 can represent the characters needed by every country in the world, which makes it a universal, international encoding. UTF-8 encoded text can be displayed in any browser that supports the UTF-8 character set, whatever the country. For example, UTF-8 encoded Chinese text displays correctly in a foreign user's Internet Explorer without the Chinese language support package for Internet Explorer having to be downloaded.
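The byte counts and the optional BOM can be seen with Python's `utf-8` and `utf-8-sig` codecs. A short sketch with an arbitrary sample string:

```python
import codecs

text = "A中"
plain = text.encode("utf-8")          # no BOM: 1 byte for "A", 3 bytes for "中"
with_bom = text.encode("utf-8-sig")   # same bytes preceded by the UTF-8 BOM

print(plain.hex(), with_bom.hex())
print(with_bom.decode("utf-8-sig") == text)   # the BOM is stripped on decode
```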

GBK text uses two bytes for each Chinese character, with the highest bit of the first byte set to 1 to distinguish Chinese characters from single-byte ASCII characters. GBK covers the Chinese characters in common use; it is a national standard encoding and is less universal than UTF-8. However, Chinese text stored in UTF-8 takes up more database space than the same text stored in GBK.
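The storage difference is easy to measure. The sample strings below are arbitrary, and actual database overhead depends on the database engine, which this sketch does not model.

```python
chinese = "汉字编码"
english = "encoding"

sizes = {
    "chinese": (len(chinese.encode("gbk")), len(chinese.encode("utf-8"))),
    "english": (len(english.encode("gbk")), len(english.encode("utf-8"))),
}
print(sizes)   # (GBK bytes, UTF-8 bytes) for each sample

# The first byte of a GBK Chinese character always has its highest bit set.
print(all(c.encode("gbk")[0] & 0x80 for c in chinese))
```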

Conversion between GBK, GB2312, and UTF-8 must go through Unicode:

GBK, GB2312 -> Unicode -> UTF-8

UTF-8 -> Unicode -> GBK, GB2312
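In Python this two-step path is explicit: decoding produces a Unicode string, and encoding turns it into the target bytes. A minimal sketch, where `transcode` is a hypothetical helper name:

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another via a Unicode string."""
    text = data.decode(source)     # source bytes -> Unicode
    return text.encode(target)     # Unicode -> target bytes


gbk_bytes = "中文".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
print(gbk_bytes.hex(), utf8_bytes.hex())
print(transcode(utf8_bytes, "utf-8", "gbk") == gbk_bytes)
```

Characters that do not exist in the target encoding raise `UnicodeEncodeError`, so a real converter would need an error-handling policy.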

For a website or forum whose content is mostly English, UTF-8 is recommended to save space. However, many forum plug-ins support only GBK.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
