UTF-8, GB2312, GB18030, GBK, and Big5: character sets and their encoding ranges explained

Source: Internet
Author: User

I. Prerequisites
1. Character: the smallest unit of abstract text. It has no fixed shape (its shape depends on the font) and no numeric value. "A" is a character, and "€" (the currency symbol used by Germany, France, and many other European countries) is also a character. "中" and "国" are two Chinese characters. A character merely stands for a symbol and carries no numeric value of its own.
2. Character set: a collection of characters. For example, Han characters (Chinese characters) are a character set first invented in China and used in written Chinese, Japanese, Korean, and Vietnamese. This also shows the relationship between the two: characters make up character sets (ISO-8859-1, GB2312/GBK, Unicode).
3. Code point: each character in a character set is assigned a "code point". Each code point has a unique numeric value, called its scalar value, which is usually written in hexadecimal.
4. Code unit: in each encoding form, a code point is mapped to one or more code units. A "code unit" is the basic unit of a given encoding form, and its size is the number of bits that encoding form works in:
UTF-8: code units are 8 bits wide. Because the code unit is small, a code point is often mapped to several code units: one, two, three, or four of them;
UTF-16: code units are 16 bits wide, twice the size of an 8-bit code unit. Code points with a scalar value below U+10000 are therefore encoded as a single code unit, and code points from U+10000 upward as two code units (a surrogate pair);
UTF-32: code units are 32 bits wide, large enough that every code point is encoded as a single code unit;
GB18030: code units are 8 bits wide. Because the code unit is small, a code point is often mapped to several code units: one, two, or four of them.
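The code-unit counts described above can be checked with a small Python sketch. The helper name `code_units` and the sample characters are illustrative choices, not part of the original article; the explicit `-be` codec names are used only so that Python does not prepend a byte order mark.

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Return how many code units `text` occupies in `encoding`."""
    return len(text.encode(encoding)) // unit_bytes

# Sample characters: U+0041, U+4E2D, and U+1F600 (outside the BMP).
SAMPLES = ["A", "中", "\U0001F600"]

for ch in SAMPLES:
    print(
        f"U+{ord(ch):04X}",
        "utf-8:", code_units(ch, "utf-8", 1),
        "utf-16:", code_units(ch, "utf-16-be", 2),
        "utf-32:", code_units(ch, "utf-32-be", 4),
        "gb18030:", code_units(ch, "gb18030", 1),
    )
```

A character outside the Basic Multilingual Plane needs four UTF-8 code units, two UTF-16 code units, one UTF-32 code unit, and four GB18030 code units.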
5. Example:
"中国北京香蕉是个大笨蛋" (roughly "China, Beijing, the banana is a big dummy") is the AKA character set, which I defined myself. The code point of each character is:
北 00000001
京 00000010
香 10000001
蕉 10000010
是 10000100
个 10001000
大 10010000
笨 10100000
蛋 11000000
中 00000100
国 00001000
Below is the Zixia encoding scheme (8-bit), also defined by me. It assigns a code unit to every character in the AKA character set:
北 10000001
京 10000010
香 00000001
蕉 00000010
是 00000100
个 00001000
大 00010000
笨 00100000
蛋 01000000
中 10000100
国 10001000
A so-called text file is a file whose binary data is interpreted as text under some encoding. Take the binary data 00000001000000100000010000001000000100000010000001000000: when I open it in a notepad that supports the Zixia encoding and the AKA character set, it is displayed as "香蕉是个大笨蛋" ("the banana is a big dummy") according to that encoding scheme.
If I save the same characters to another file as GBK, the bytes are certainly different:
1100111111100011 1011110110110110 1100101011000111 1011100011110110 1011010011110011 1011000110111111 1011010110110000
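The toy example can be replayed as a short Python sketch. It rests on one assumption: that the eleven example characters are 北京香蕉是个大笨蛋中国, which is consistent with the GBK bytes shown above (they decode to 香蕉是个大笨蛋). Comparing the two tables, each Zixia code unit is the AKA code point with its top bit flipped, which is how `zixia_encode` is written here; the function names are made up for illustration.

```python
# Toy "AKA" character set: character -> code point (values from the article).
AKA = {
    "北": 0b00000001, "京": 0b00000010, "香": 0b10000001, "蕉": 0b10000010,
    "是": 0b10000100, "个": 0b10001000, "大": 0b10010000, "笨": 0b10100000,
    "蛋": 0b11000000, "中": 0b00000100, "国": 0b00001000,
}

def zixia_encode(text: str) -> bytes:
    """Map each AKA code point to its Zixia code unit (top bit flipped)."""
    return bytes(AKA[ch] ^ 0x80 for ch in text)

def zixia_decode(data: bytes) -> str:
    """Map Zixia code units back to characters."""
    by_unit = {cp ^ 0x80: ch for ch, cp in AKA.items()}
    return "".join(by_unit[b] for b in data)

def bits(data: bytes, group: int = 1) -> str:
    """Render bytes in binary, `group` bytes per space-separated chunk."""
    chunks = [data[i:i + group] for i in range(0, len(data), group)]
    return " ".join("".join(f"{b:08b}" for b in chunk) for chunk in chunks)

text = "香蕉是个大笨蛋"
zixia = zixia_encode(text)
print(bits(zixia))                        # the toy encoding, one byte per character
print(zixia_decode(zixia))                # round trip back to the text
print(bits(text.encode("gbk"), group=2))  # the same text saved as GBK
```

The same seven characters produce 7 bytes under the toy encoding and 14 bytes under GBK, which is the point of the example: the bytes only mean something once the encoding is known.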
II. Character sets
1. Common character sets
ASCII and its extended character sets
Purpose: represents English and Western European languages.
Size: ASCII uses 7 bits and can represent 128 characters; its extensions use 8 bits and can represent 256 characters.
Range: ASCII runs from 00 to 7F; the extensions run from 00 to FF.
ISO-8859-1 character set
Purpose: extends ASCII to cover Western European languages (Greek and other scripts are covered by other parts of ISO-8859, such as ISO-8859-7).
Size: 8 bits.
Range: 00 to FF; compatible with the ASCII character set.
GB2312 character set
Purpose: the Chinese national standard character set for Simplified Chinese; compatible with ASCII.
Size: 2 bytes per character. It can represent 7445 characters, including 6763 Chinese characters, which covers almost all high-frequency Chinese characters.
Range: high byte from A1 to F7, low byte from A1 to FE. The high and low bytes are obtained by adding 0xA0 to the character's row number and cell number respectively.
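The 0xA0 offset can be illustrated with a minimal Python sketch. The function name `quwei_to_bytes` is made up for this example; the sample character 啊 sits at row 16, cell 1 of the GB2312 table.

```python
def quwei_to_bytes(row: int, cell: int) -> bytes:
    """Convert a GB2312 row/cell pair (each 1..94) to its two encoded bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")
    # Adding 0xA0 moves both bytes above 0x7F, out of the ASCII range.
    return bytes([row + 0xA0, cell + 0xA0])

encoded = quwei_to_bytes(16, 1)
print(encoded.hex().upper(), encoded.decode("gb2312"))
```

Because both bytes end up in A1 to FE, a GB2312 double-byte character can never be confused with an ASCII byte.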
Big5 character set
Purpose: a unified character set for Traditional Chinese.
Size: 2 bytes per character; it can represent 13053 Chinese characters.
Range: high byte from A1 to F9; low byte from 40 to 7E and from A1 to FE.
GBK character set
Purpose: an extension of GB2312 that adds Traditional Chinese characters while remaining compatible with GB2312.
Size: 2 bytes per character; it can represent 21886 characters.
Range: high byte from 81 to FE; low byte from 40 to FE (7F is not used).
GB18030 character set
Purpose: covers Chinese, Japanese, and Korean text; compatible with GBK.
Size: variable length (1 byte for ASCII, otherwise 2 or 4 bytes). It can represent 27484 characters.
Range: 1-byte codes from 00 to 7F; 2-byte codes with the high byte from 81 to FE and the low byte from 40 to 7E or 80 to FE; 4-byte codes with the first and third bytes from 81 to FE and the second and fourth bytes from 30 to 39.
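These byte ranges are enough to tell how long a GB18030 sequence is from its first two bytes. The following is a rough sketch under that reading of the ranges (second byte 30 to 39 signals a 4-byte code); `gb18030_seq_len` is a hypothetical helper, not a library function, and it only checks ranges, not whether the code is actually assigned.

```python
def gb18030_seq_len(data: bytes) -> int:
    """Return the length (1, 2 or 4) of the GB18030 sequence starting `data`."""
    if not data:
        raise ValueError("empty input")
    b0 = data[0]
    if b0 <= 0x7F:
        return 1  # single-byte (ASCII) range
    if 0x81 <= b0 <= 0xFE and len(data) >= 2:
        b1 = data[1]
        if 0x40 <= b1 <= 0x7E or 0x80 <= b1 <= 0xFE:
            return 2  # GBK-compatible two-byte range
        if (0x30 <= b1 <= 0x39 and len(data) >= 4
                and 0x81 <= data[2] <= 0xFE and 0x30 <= data[3] <= 0x39):
            return 4  # four-byte range
    raise ValueError("not a valid GB18030 sequence start")

for ch in ["A", "中", "\U00010000"]:
    raw = ch.encode("gb18030")
    print(f"U+{ord(ch):04X}", raw.hex().upper(), gb18030_seq_len(raw))
```

The digit range 30 to 39 never appears as the second byte of a two-byte code, which is what makes the four-byte form unambiguous.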
UCS character set
Purpose: the international standard ISO 10646 defines the Universal Character Set (UCS). It is maintained by an organization parallel to the Unicode Consortium, and UCS-2 is compatible with Unicode.
Size: it has two forms, UCS-2 and UCS-4, which use 2 bytes and 4 bytes respectively.
Range: at present, UCS-4 simply prepends 0x0000 to the UCS-2 value.
Unicode character set
Purpose: a unified encoding for some 650 of the world's languages; compatible with ISO-8859-1.
Size: the Unicode character set has several encoding forms: UTF-8, UTF-16, and UTF-32.
2. By language
Language                            Character set        Formal name
English, Western European           ASCII, ISO-8859-1    MBCS (multibyte)
Simplified Chinese                  GB2312               MBCS (multibyte)
Traditional Chinese                 Big5                 MBCS (multibyte)
Simplified and Traditional Chinese  GBK                  MBCS (multibyte)
Chinese, Japanese, and Korean       GB18030              MBCS (multibyte)
All languages                       Unicode              Unicode (wide characters)
III. Encoding
UTF-8: uses a variable number of bytes (1 for ASCII, 2 for Greek letters, 3 for Chinese characters, 4 for supplementary-plane symbols). It suits network transmission: if one byte is wrong, the other characters are not affected, whereas in a double-byte encoding a single wrong byte can make everything after it wrong as well. It is encoded as follows:
A single-byte character has its highest bit set to 0. In a multi-byte character, the number of consecutive 1 bits at the top of the first byte gives the number of bytes in the sequence, and each remaining byte starts with 10. The original design allowed UTF-8 sequences of up to 6 bytes; RFC 3629 limits them to 4 bytes, which is enough for every code point up to U+10FFFF.
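The bit layout just described can be written out by hand. The sketch below is a minimal illustration limited to the modern 4-byte maximum; `utf8_encode` is a made-up name, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using the lead/continuation rules."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in ["A", "α", "中", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X}", " ".join(f"{b:08b}" for b in encoded))
```

Since continuation bytes always start with 10 and lead bytes never do, a decoder that lands in the middle of a character can skip forward to the next lead byte, which is why a damaged byte does not spoil the characters around it.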
UTF-16: uses 2 bytes for each character in the Basic Multilingual Plane (characters above U+FFFF take 4 bytes, as a surrogate pair). The various blocks of Unicode are based on existing standards, which makes conversion easier. 0x0000 to 0x007F are the ASCII characters, and 0x0080 to 0x00FF are the ISO-8859-1 extension to ASCII. The Greek alphabet uses 0x0370 to 0x03FF, the Slavic (Cyrillic) alphabets use 0x0400 to 0x04FF, Armenian uses 0x0530 to 0x058F, and Hebrew uses 0x0590 to 0x05FF. The ideographs of China, Japan, and Korea (CJK) occupy 0x3000 to 0x9FFF. Because the byte 0x00 has a special meaning in the C language and in operating system file names, text often needs to be stored as UTF-8 instead, which gets rid of these 0x00 bytes. Example:
UTF-16: 0x0080 = 0000 0000 1000 0000
UTF-8: 0xC280 = 1100 0010 1000 0000
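The U+0080 example, and the remark about 0x00 bytes, can be reproduced in Python. This is only an illustrative sketch; `utf-16-be` is chosen so that no byte order mark is added.

```python
def bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit binary groups."""
    return " ".join(f"{b:08b}" for b in data)

ch = "\u0080"
utf16 = ch.encode("utf-16-be")
utf8 = ch.encode("utf-8")

print("UTF-16:", utf16.hex().upper(), "=", bits(utf16))
print("UTF-8: ", utf8.hex().upper(), "=", bits(utf8))

# The UTF-16 form contains a 0x00 byte, which C string functions would
# treat as a terminator; the UTF-8 form does not.
print("0x00 in UTF-16:", 0 in utf16, "| 0x00 in UTF-8:", 0 in utf8)
```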
UTF-32: 4 bytes.
Advantages and disadvantages
UTF-8, UTF-16, and UTF-32 can all represent every Unicode character in the valid code space (U+000000 to U+10FFFF).
With UTF-8, ASCII characters occupy only 1 byte each, so storage is efficient; it suits text that is mostly Latin characters and saves space there.
For most non-Latin characters (such as Chinese and Japanese), UTF-16 needs the least storage, since each such character occupies only 2 bytes.
The Windows NT kernel is Unicode-based (UTF-16), so text already in UTF-16 can be passed to system APIs without conversion, which is faster.
UTF-16 and UTF-32 come in big endian and little endian variants, whereas UTF-8 has no byte-order problem, so UTF-8 is well suited to transmission and communication.
UTF-32 uses 4 bytes per character. Processing is relatively fast, but it wastes a lot of space and slows transmission, so it is seldom used.
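The storage trade-off is easy to measure. The sketch below compares byte counts for one Latin and one Chinese sample string; the strings and the helper name `sizes` are arbitrary illustrations, and the `-le` codec names are used so that no byte order mark is counted.

```python
def sizes(text: str) -> dict:
    """Return the encoded size in bytes of `text` in each UTF encoding."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for sample in ["hello", "中文编码"]:
    print(repr(sample), sizes(sample))
```

For the Latin string UTF-8 is the smallest, for the Chinese string UTF-16 is, and UTF-32 is the largest in both cases.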
IV. How to identify character sets
1. Byte order
First, consider how byte order affects encoding. There are two byte orders, big endian and little endian, and different processors may use different ones. During transmission, therefore, the receiver must be told which byte order the encoded data uses. In big endian order the high byte is stored at the low address and the low byte at the high address; little endian is the opposite. For example, 0x03AB is stored as follows.
Big endian byte order:
0000: 03
0001: AB
Little endian byte order:
0000: AB
0001: 03
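The two layouts of 0x03AB can be reproduced with Python's built-in `int.to_bytes`. This is a small illustration only; the `dump` helper is made up to mimic the address listing above.

```python
import sys

def dump(data: bytes) -> list:
    """Format bytes as 'address: value' lines, one per byte."""
    return [f"{addr:04d}: {b:02X}" for addr, b in enumerate(data)]

value = 0x03AB
big = value.to_bytes(2, "big")        # high byte at the low address
little = value.to_bytes(2, "little")  # low byte at the low address

print("big endian:   ", dump(big))
print("little endian:", dump(little))
print("this machine is", sys.byteorder, "endian")
```

Reading the bytes back with the wrong order gives a different number (0xAB03 instead of 0x03AB), which is why the byte order has to be agreed on or signalled.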
2. Encoding recognition
Unicode: the encoding of Unicode text is determined from its first few bytes. This is called the byte order mark (BOM) method:
UTF-8: EFBBBF (a valid UTF-8 sequence, see the rules above, but it carries no meaning as text in UCS, that is, in Unicode)
UTF-16 big endian: FEFF (no meaning in UCS-2)
UTF-16 little endian: FFFE (no meaning in UCS-2)
UTF-32 big endian: 0000FEFF (no meaning in UCS-4)
UTF-32 little endian: FFFE0000 (no meaning in UCS-4)
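BOM detection along these lines can be sketched with the constants from Python's standard `codecs` module. The function name `sniff_bom` and the returned labels are choices made for this example. The UTF-32 marks are tested first because the UTF-32 little endian mark begins with the same two bytes as the UTF-16 little endian mark.

```python
import codecs

# Longest marks first, so FF FE 00 00 is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom("hi".encode("utf-16")))  # Python's utf-16 codec writes a BOM
print(sniff_bom(b"plain ascii"))
```

A missing BOM proves nothing: UTF-8 files are often written without one, and legacy encodings such as GBK never have one, so a `None` result means other clues are needed.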
GB2312: the highest bit of both the high byte and the low byte is 1.
Big5, GBK, and GB18030: the highest bit of the high byte is 1. The operating system has a default encoding, which is usually GBK on Simplified Chinese systems; support for other encodings can be downloaded and installed. Checking the highest bit of the high byte tells you whether a byte is ASCII or the start of a Chinese character.
From [http://blog.minidx.com/2008/12/06/1689.html], I want to thank the original author
