ASCII, GB2312, GBK, Unicode, and UTF-8 encoding ranges

Source: Internet
Author: User
Tags: control characters

ASCII
ASCII is a 7-bit code with an encoding range of 0x00-0x7f. The ASCII character set includes English letters, Arabic numerals, punctuation marks, and other characters. 0x00-0x1f and 0x7f are the 33 control characters; 0x20 is the space character.
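The count of control characters can be checked directly. The following is a minimal sketch; the variable names are illustrative only, and it treats 0x20 (space) as a printable character.

```python
# Collect the ASCII control characters: 0x00-0x1f plus 0x7f (DEL).
control_codes = [c for c in range(0x80) if c < 0x20 or c == 0x7F]

# Everything else in the 7-bit range is printable, starting with space (0x20).
printable_codes = [c for c in range(0x80) if c not in control_codes]

print(len(control_codes))        # 33
print(hex(printable_codes[0]))   # 0x20
print(len(printable_codes))      # 95
```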
A system that supports only ASCII ignores the highest bit of each byte and treats only the low 7 bits as significant. The HZ character encoding was designed to transmit Chinese characters through such 7-bit ASCII systems. In the early days, many email systems supported only ASCII, so Chinese emails had to be transmitted using Base64 or another 7-bit-safe encoding.
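As a rough illustration of why Base64 helps here, the sketch below encodes a short Chinese string as GB2312, wraps it in Base64, and recovers it. The sample text and variable names are assumptions made for the example; real mail software also adds headers that declare the charset and transfer encoding.

```python
import base64

text = "中文邮件"

# Raw GB2312 bytes all have the high bit set, so a 7-bit channel would corrupt them.
raw = text.encode("gb2312")
high_bit_bytes = sum(1 for b in raw if b >= 0x80)

# Base64 re-expresses the same bytes using only 7-bit-safe ASCII characters.
wire = base64.b64encode(raw)
is_seven_bit = all(b < 0x80 for b in wire)

# The receiver reverses both steps to recover the text.
restored = base64.b64decode(wire).decode("gb2312")

print(high_bit_bytes, len(raw))  # 8 8
print(is_seven_bit)              # True
print(restored == text)          # True
```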
GB2312
GB2312 is designed around the location code. The code table is divided into 94 areas, each of which holds 94 positions. The area code combined with the position code of a character is the location code of that character. The location code is usually written as a decimal number. For example, 1601 means area 16, position 01, and the corresponding character is "啊" ("ah"). Adding 0xa0 to the area code and to the position code gives the GB2312 encoding, so 1601 becomes 0xb0a1.
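The conversion described above can be sketched in a few lines. The function names are hypothetical, and the check only verifies that the area and position are within 1-94, not that a character is actually assigned there.

```python
def location_to_gb2312(location: int) -> bytes:
    """Convert a decimal location code such as 1601 to its two GB2312 bytes."""
    area, position = divmod(location, 100)
    if not (1 <= area <= 94 and 1 <= position <= 94):
        raise ValueError(f"invalid location code: {location}")
    return bytes([area + 0xA0, position + 0xA0])


def gb2312_to_location(encoded: bytes) -> int:
    """Inverse mapping: two GB2312 bytes back to the decimal location code."""
    return (encoded[0] - 0xA0) * 100 + (encoded[1] - 0xA0)


code = location_to_gb2312(1601)
print(code.hex())                # b0a1
print(code.decode("gb2312"))     # 啊
print(gb2312_to_location(code))  # 1601
```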
In the location code, areas 01-09 hold symbols and numerals, areas 16-87 hold Chinese characters, and areas 10-15 and 88-94 are undefined blank areas. The Chinese characters are divided into two levels. The 3755 first-level characters are the commonly used ones; they occupy areas 16-55 and are ordered by pinyin letter, then by stroke shape. The 3008 second-level characters are less commonly used; they occupy areas 56-87 and are ordered by radical, then by stroke count. Because the first-level characters are sorted by pinyin, each pinyin initial corresponds to a contiguous range of location codes in the first-level area. Many programs that look up pinyin from Chinese characters are built on this principle.
In addition to common simplified Chinese characters, the GB2312 character set also contains Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters. Traditional Chinese characters that have a simplified form are generally absent, so you can use them to test whether a system supports only GB2312.
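The test with traditional characters can be tried out as follows. This is only a sketch: Python's gb2312 and gbk codecs stand in for the system under test, and the helper name and sample strings are chosen for the example.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


simplified, traditional = "汉语", "漢語"

print(can_encode(simplified, "gb2312"))   # True
print(can_encode(traditional, "gb2312"))  # False
print(can_encode(traditional, "gbk"))     # True
```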
The nominal encoding range of GB2312 is 0xa1a1-0xfefe. After the undefined areas are removed, the actual encoding range is 0xa1a1-0xf7fe.
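A byte pair can be checked against the defined areas (01-09 and 16-87, that is 0xa1a1-0xf7fe without areas 10-15) with a small helper. The function name is hypothetical, and the check works at the level of areas only; a few positions inside the defined areas are still unassigned.

```python
def in_gb2312_defined_range(hi: int, lo: int) -> bool:
    """True if (hi, lo) falls in a defined GB2312 area (01-09 or 16-87)."""
    area, position = hi - 0xA0, lo - 0xA0
    return (1 <= area <= 9 or 16 <= area <= 87) and 1 <= position <= 94


print(in_gb2312_defined_range(0xB0, 0xA1))  # True: area 16, position 01
print(in_gb2312_defined_range(0xAA, 0xA1))  # False: area 10 is undefined
print(in_gb2312_defined_range(0xF8, 0xA1))  # False: area 88 is undefined
```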
EUC-CN can be regarded as an alias of GB2312; the two are exactly the same encoding.
The location code is best viewed as the definition of the character set: it defines which characters are included and where each one sits. GB2312 and EUC-CN are the encoding of that character set in an actual computer environment. HZ and ISO-2022-CN are two other encodings of the location code character set; both use a 7-bit encoding space to carry Chinese characters. The relationship between the location code and the GB2312 encoding is a bit like the relationship between Unicode and UTF-8.
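To make the relationship concrete, the sketch below derives the location code and a 7-bit HZ form from the same GB2312 bytes. It assumes the usual HZ framing, in which Chinese text is wrapped in "~{" and "~}" and each byte has its high bit cleared; the variable names are illustrative.

```python
ch = "啊"

# 8-bit form (GB2312 / EUC-CN): area and position bytes, each plus 0xa0.
euc = ch.encode("gb2312")
location = [b - 0xA0 for b in euc]

# 7-bit form: clear the high bit of each byte and wrap it in HZ escapes.
hz_bytes = b"~{" + bytes(b & 0x7F for b in euc) + b"~}"

print(euc.hex())                   # b0a1
print(location)                    # [16, 1]
print(hz_bytes)                    # b'~{0!~}'
print(hz_bytes.decode("hz") == ch) # True
```

Both byte sequences name the same cell of the location code table; only the way the cell number is packed into bytes differs.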
GBK
GBK is a superset of GB2312 and is fully compatible with it. In addition, GBK contains all the CJK Chinese characters that the Unicode basic multilingual plane defined when GBK was created (the 20,902 characters U+4E00-U+9FA5). Like GB2312, GBK also supports Greek letters, Japanese kana, Russian letters, and other characters, but it does not support the phonetic characters of Korean (Hangul, which are not Chinese characters). GBK also contains Chinese radicals and vertical punctuation marks that are not included in GB2312.
The overall GBK encoding range is 0x8140-0xfefe, excluding every code whose low byte is 0x7f. The high byte range is 0x81-0xfe, and the low byte range is 0x40-0x7e and 0x80-0xfe.
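These ranges translate into a simple validity check. The helper below is a sketch with a hypothetical name; it tests only whether a byte pair lies in the GBK code space, not whether a character is assigned to it.

```python
def is_gbk_double_byte(hi: int, lo: int) -> bool:
    """True if (hi, lo) lies in the GBK double-byte code space."""
    return 0x81 <= hi <= 0xFE and 0x40 <= lo <= 0xFE and lo != 0x7F


# Count the code positions in the space: 126 high bytes times 190 low bytes.
positions = sum(is_gbk_double_byte(hi, lo)
                for hi in range(0x100) for lo in range(0x100))

print(is_gbk_double_byte(0x81, 0x40))  # True
print(is_gbk_double_byte(0x81, 0x7F))  # False
print(positions)                       # 23940
```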
GBK characters with a low byte in 0x40-0x7e need special care: that byte falls in the ASCII range, which can cause trouble in some systems.
Some systems treat certain characters in 0x40-0x7e (such as "|") as special symbols. If such a system searches for these symbols without checking whether the byte is the low byte of a GBK character, it will misinterpret the data. This problem does not arise in an environment that uses only GB2312, where both bytes of a Chinese character are 0xa1 or above. Keep in mind that in an environment that supports GBK, a byte smaller than 0x80 is not necessarily an ASCII character. It is also best to choose special symbols from the ASCII characters below 0x40; then a symbol can be located quickly without worrying that it is the second half of a Chinese character. Big5 encoding has the same problem.
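The misjudgment is easy to reproduce. In the sketch below, the sample data contains one GBK character whose low byte is 0x7c (the same value as "|") followed by a real "|" separator. A naive byte split cuts the character in half, while a scan that skips the byte after each lead byte finds only the real separator. The function name and sample bytes are made up for the illustration.

```python
def split_gbk(data: bytes, delimiter: bytes = b"|") -> list:
    """Split GBK bytes on a one-byte ASCII delimiter, skipping low bytes."""
    target = delimiter[0]
    fields, start, i = [], 0, 0
    while i < len(data):
        if 0x81 <= data[i] <= 0xFE and i + 1 < len(data):
            i += 2                       # high byte + low byte: one character
        elif data[i] == target:
            fields.append(data[start:i])
            start = i = i + 1            # next field begins after the delimiter
        else:
            i += 1
    fields.append(data[start:])
    return fields


# "a", then the GBK code 0x817c, then a real "|" separator, then "b".
data = b"a" + bytes([0x81, 0x7C]) + b"|b"

print(data.split(b"|"))   # [b'a\x81', b'', b'b']  (character cut in half)
print(split_gbk(data))    # [b'a\x81|', b'b']
```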
CP936 and GBK differ in a few details, but in most cases CP936 can be treated as an alias of GBK.
GB18030
GB18030 is backward compatible with GBK and GB2312. Compatibility here means not only that the same characters are supported, but also that the same character has the same encoding. GB18030 contains all the characters in Unicode 3.1, including the scripts of China's ethnic minorities and the Korean characters that GBK does not support. Put another way, it covers the writing systems of most peoples in the world.
Both GBK and GB2312 are double-byte encodings. If their single-byte compatibility with ASCII is taken into account, they can also be understood as variable-length encodings that mix single-byte and double-byte codes. GB18030 is a variable-length encoding in which a character takes one, two, or four bytes.
The single-byte range of GB18030 is 0x00-0x7f, which is exactly ASCII. The double-byte range is the same as that of GBK: the high byte is 0x81-0xfe, and the low byte is 0x40-0x7e or 0x80-0xfe. In the four-byte range, the first and third bytes are 0x81-0xfe, and the second and fourth bytes are 0x30-0x39.
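Because the second byte of a two-byte code (0x40-0x7e or 0x80-0xfe, as in GBK) never overlaps the second byte of a four-byte code (0x30-0x39), the length of each character can be decided from its first two bytes. The following sketch shows the idea; the function name is hypothetical, and it does not validate the third and fourth bytes.

```python
def gb18030_lengths(data: bytes) -> list:
    """Return the byte length (1, 2, or 4) of each character in GB18030 data."""
    lengths, i = [], 0
    while i < len(data):
        first = data[i]
        if first <= 0x7F:
            n = 1                                   # ASCII
        elif 0x81 <= first <= 0xFE and i + 1 < len(data):
            second = data[i + 1]
            if 0x30 <= second <= 0x39:
                n = 4                               # four-byte code
            elif 0x40 <= second <= 0xFE and second != 0x7F:
                n = 2                               # GBK-compatible code
            else:
                raise ValueError(f"invalid second byte at offset {i + 1}")
        else:
            raise ValueError(f"invalid or truncated sequence at offset {i}")
        if i + n > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        lengths.append(n)
        i += n
    return lengths


print(gb18030_lengths("A中😀".encode("gb18030")))  # [1, 2, 4]
```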
In Windows, code page CP936 uses 0x80 to represent the euro sign, whereas GB18030 leaves 0x80 unassigned and represents the euro sign at another position. This can be seen as a small flaw in the backward compatibility of GB18030. It can also be seen the other way: 0x80 is a CP936 extension to GBK, and GB18030 only promises compatibility with GBK itself.
Unicode
Having a different code page for each language increases the complexity of software that must support several languages. A worldwide standard called Unicode was therefore developed. Unicode provides a unique value for each character, regardless of the platform, software, or language. In other words, all the characters used in the world are listed, and each one is given a unique, specific value.
The original goal of Unicode was to use a 16-bit encoding to provide mappings for more than 65,000 characters. However, this turned out not to be enough. It cannot cover all historical scripts, and it does not solve the implementation headaches, especially in network-based applications: existing software must do a lot of work to handle 16-bit data.
Therefore, Unicode, using some basic reserved characters, defines three encoding methods: UTF-8, UTF-16, and UTF-32. As the name suggests, UTF-8 encodes text as a sequence of 8-bit units and represents each character with one or several bytes. The biggest benefit of this approach is that UTF-8 keeps the ASCII encoding as a subset; for example, "A" is encoded as 0x41 in both UTF-8 and ASCII.
UTF-16 and UTF-32 are the 16-bit and 32-bit Unicode encoding methods, respectively. Because of Unicode's original 16-bit design, "Unicode" used as the name of an encoding usually means UTF-16. When discussing Unicode, it is therefore very important to state which encoding method is meant.
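The difference between the three methods shows up in the byte counts. The sketch below uses Python's little-endian codec variants so that no byte order mark is added to the output; the helper name and sample characters are chosen for the example.

```python
def encoded_sizes(ch: str) -> dict:
    """Byte length of one character in each Unicode encoding method."""
    return {enc: len(ch.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for ch in ("A", "中", "😀"):
    print(ch, encoded_sizes(ch))

# UTF-8 keeps ASCII unchanged: "A" is the single byte 0x41 in both.
print("A".encode("utf-8") == "A".encode("ascii") == b"\x41")  # True
```

The last sample character lies outside the basic multilingual plane, which is why UTF-16 needs two 16-bit units for it.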
UTF-8
UTF-8 (8-bit Unicode Transformation Format) allows a BOM but usually does not include one. It is a multi-byte encoding for international text. It uses 8 bits (one byte) for each English letter and, for most Chinese characters, 24 bits (three bytes). UTF-8 can represent the characters needed by every country in the world, so it is a universal, international encoding.

UTF-8-encoded text can be displayed in any browser that supports the UTF-8 character set, in any country. For example, a UTF-8-encoded page containing Chinese can be displayed in Internet Explorer on a non-Chinese system without downloading the Chinese language support package for Internet Explorer.
GBK represents each Chinese character in two bytes; full-width English letters and symbols also take two bytes, while ASCII characters still take one. To distinguish a Chinese character from ASCII, the highest bit of its first byte is set to 1. GBK covers the Chinese characters in common use. It is a national standard and is less universal than UTF-8. However, a database of Chinese text stored as UTF-8 is larger than the same data stored as GBK.
Conversion between GBK or GB2312 and UTF-8 must go through Unicode:
GBK, GB2312 -> Unicode -> UTF-8
UTF-8 -> Unicode -> GBK, GB2312
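In Python, a str object is the Unicode middle step, so the two-stage conversion can be sketched as a decode followed by an encode. The function name is hypothetical.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another by way of Unicode."""
    text = data.decode(source)   # source bytes -> Unicode string
    return text.encode(target)   # Unicode string -> target bytes


gbk_bytes = "编码".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
round_trip = transcode(utf8_bytes, "utf-8", "gbk")

print(len(gbk_bytes), len(utf8_bytes))  # 4 6
print(round_trip == gbk_bytes)          # True
```

Converting from UTF-8 to GBK or GB2312 raises UnicodeEncodeError when the text contains a character that the target character set lacks, so real code needs to decide how to handle that case.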
For a website or forum whose content is mostly English, UTF-8 is recommended: English text takes one byte per character, so the extra space needed for Chinese characters matters little. However, many forum plug-ins support only GBK.
