Character encoding: ASCII, Unicode, UTF-8, GB2312
1. ASCII code
In a computer, all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can form 256 combinations. A group of eight bits is called a byte. A single byte can therefore represent 256 different states, from 00000000 to 11111111, and each state can correspond to one symbol, 256 symbols in all.
In the 1960s, the United States developed a character code that defines the mapping between English characters and binary values. It is called ASCII, and it is still in use today.
ASCII defines 128 characters in total. For example, the space is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printing control characters, codes 0-31 and 127) occupy only the low seven bits of a byte; the highest bit is always 0.
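As a quick check of the values quoted above, here is a minimal sketch in Python (the language is chosen only for illustration; the variable names are mine). It prints the code and the 8-bit pattern for the space and for the letter A, and confirms that the top bit is 0 for every ASCII code.

```python
# Look up the ASCII codes quoted above and show them as 8-bit binary.
for ch in (" ", "A"):
    code = ord(ch)                # numeric code of the character
    bits = format(code, "08b")    # zero-padded to one full byte
    print(repr(ch), code, bits)

space_bits = format(ord(" "), "08b")
upper_a_bits = format(ord("A"), "08b")

# All 128 ASCII codes fit in seven bits, so bit 7 of the byte is always 0.
top_bit_clear = all((code >> 7) == 0 for code in range(128))
```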
2. Non-ASCII Encoding
128 symbols are enough to encode English, but not other languages. In French, for example, letters can carry accent marks, which ASCII cannot represent. Some European countries therefore decided to use the idle highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). The encoding systems used in these countries can thus represent up to 256 symbols.
This created a new problem. Different countries use different letters, so even though they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, however, codes 0-127 represent the same symbols; only codes 128-255 differ.
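The ambiguity of byte 130 can be reproduced with Python's built-in codecs. The text does not say which concrete encodings it has in mind, so the three DOS code pages below (cp437, cp862, cp866) are my assumption; they happen to match the examples given.

```python
raw = bytes([130])  # the single byte 0x82 (binary 10000010)

# Assumed code pages: cp437 (Western), cp862 (Hebrew), cp866 (Russian).
decoded = {name: raw.decode(name) for name in ("cp437", "cp862", "cp866")}
for name, ch in decoded.items():
    print(name, ch, f"U+{ord(ch):04X}")

# Codes 0-127 are shared: byte 65 is "A" in all three code pages.
low_half = {bytes([65]).decode(name) for name in decoded}
```

The same byte comes out as three different letters, while a byte below 128 decodes identically everywhere.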
Asian languages use far more symbols; Chinese alone has on the order of 100,000 characters. A single byte can represent only 256 symbols, so multiple bytes are needed for one character. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character and can therefore represent at most 256 x 256 = 65,536 symbols in theory.
The details of Chinese encoding need an article of their own and are not the focus here. Note only that, although the GB family of Chinese encodings also uses multiple bytes per symbol, it is unrelated to the Unicode and UTF-8 encodings discussed below.
3. Unicode
The Universal Character Set (UCS): in April 1984 the International Organization for Standardization set up the ISO/IEC JTC1/SC2/WG2 working group to encode the scripts and symbols of all countries in a unified way. In 1991 a group of multinational companies founded the Unicode Consortium, which agreed with WG2 in October 1991 to use the same coded character set. At that stage Unicode used a 16-bit encoding, and its character set was identical to the BMP (Basic Multilingual Plane) of ISO 10646. Unicode passed the DIS (Draft International Standard) stage in June 1992. The version described here is from 1996 and contains 6,811 symbols, 20,902 Chinese characters, 11,172 Korean Hangul syllables, 6,400 private-use code points, and 20,249 reserved code points, 65,534 in total. In this encoding every character has the same size: the English letter "a" and a Chinese character such as "好" (good) both occupy two bytes.
In this two-byte form (UCS-2; there is also a four-byte form, UCS-4), Unicode is a fixed-length encoding that covers the characters of all languages, English letters included. Strictly speaking, it is therefore byte-compatible with neither ISO 8859-1 nor any other encoding. Compared with ISO 8859-1, however, the Unicode encoding simply adds a 00 byte in front: the letter 'a', for example, becomes "00 61".
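The fixed two-byte layout can be observed with a short Python sketch. Python has no codec literally named UCS-2, so "utf-16-be" is used here as a stand-in; that substitution is my assumption, and it gives the same bytes for characters inside the 16-bit range. The sample character "好" (good) is also just an illustrative choice.

```python
# Each character in the 16-bit range takes exactly two bytes.
for ch in ("a", "好"):
    encoded = ch.encode("utf-16-be")   # high byte first, no byte order mark
    print(ch, encoded.hex(" "), len(encoded))

latin1_a = "a".encode("latin-1")       # ISO 8859-1: the single byte 61
ucs2_a = "a".encode("utf-16-be")       # two-byte form: 00 61
```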
Fixed-length encodings are easy for computers to process (note that GB2312 and GBK are not fixed-length), and Unicode can represent all characters, so many software systems, such as Java, use Unicode internally.
Unicode is, of course, a very large set: its code space now has room for more than 1 million symbols. Each symbol has a unique code. For example, U+0639 is the Arabic letter ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character "严" (strict). Full code charts are available at unicode.org, and a dedicated Chinese character table is at http://www.chi2ko.com/tool/CJK.htm
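The mapping between symbols and code numbers can be explored with Python's built-in chr() and ord(). This is only an illustrative sketch; the descriptions in the dictionary are my own labels.

```python
samples = {
    0x0639: "Arabic letter ain",
    0x0041: "English capital letter A",
    0x4E25: "Chinese character yan (strict)",
}
for code_point, description in samples.items():
    # chr() turns a code number into the symbol; ord() goes the other way.
    print(f"U+{code_point:04X}", chr(code_point), description)

# The code space runs from 0 to 10FFFF (the upper bound used in the UTF-8 table).
code_space_size = 0x10FFFF + 1
```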
4. Unicode Problems
Note that Unicode is only a set of symbols. It specifies the binary code of each symbol but not how that binary code should be stored.
For example, the Unicode code of the Chinese character "严" (strict) is hexadecimal 4E25, which is 15 bits long in binary (100111000100101). Representing this symbol therefore takes at least two bytes. Symbols with larger codes may take three or four bytes, or even more.
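The bit count quoted above can be verified directly. A small Python sketch (variable names are mine):

```python
code_point = 0x4E25

bits = format(code_point, "b")        # binary digits without leading zeros
bit_count = code_point.bit_length()   # how many bits the value needs
min_bytes = (bit_count + 7) // 8      # round up to whole bytes

print(bits, bit_count, min_bytes)
```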
This raises two serious problems. First, how can Unicode be told apart from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? Second, one byte is already enough for an English letter. If Unicode required every symbol to be stored in three or four bytes, each English letter would be preceded by two or three zero bytes, a huge waste of storage: text files would become several times larger, which is unacceptable.
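To make the storage cost concrete, the sketch below compares one byte per letter with a uniform four-byte layout. UTF-32 is used as the four-byte layout purely for illustration (my assumption; the paragraph above does not name a format), and the sample text is arbitrary.

```python
text = "hello world"

ascii_size = len(text.encode("ascii"))       # one byte per letter
fixed4_size = len(text.encode("utf-32-be"))  # four bytes per letter, no BOM

# The first letter, 'h' (hex 68), is preceded by three zero bytes.
first_symbol = text.encode("utf-32-be")[:4]
print(ascii_size, fixed4_size, first_symbol.hex(" "))
```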
The result was twofold: 1) several ways of storing Unicode appeared, that is, many different binary formats can represent Unicode; 2) Unicode could not gain wide adoption for a long time, until the Internet arrived.
5. UTF-8
With the spread of the Internet, a unified encoding became an urgent need. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat the relationship: UTF-8 is one of the ways of implementing Unicode.
The defining feature of UTF-8 is that it is a variable-length encoding. It uses one to four bytes per symbol, and the byte length varies with the symbol.
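The variable length is easy to observe with Python's built-in UTF-8 codec. The four sample symbols are my own picks, one for each possible length.

```python
samples = [
    "A",           # U+0041, English letter
    "\u00e9",      # U+00E9, e with acute accent
    "\u4e25",      # U+4E25, Chinese character yan (strict)
    "\U0001F600",  # U+1F600, an emoji outside the 16-bit range
]
lengths = [len(ch.encode("utf-8")) for ch in samples]
for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X}", n, "byte(s)")
```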
The UTF-8 encoding rules are simple. There are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining seven bits hold the symbol's Unicode code. For English letters, UTF-8 encoding is therefore identical to ASCII.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits hold the symbol's Unicode code.
The following table summarizes the encoding rules. The letter x marks the bits available for the code.
Unicode symbol range | UTF-8 encoding method
(hexadecimal)        | (binary)
---------------------+---------------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
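The four rows of the table translate almost line for line into code. The function below is a minimal sketch, not a production encoder: the name utf8_encode is hypothetical, and it does not reject the surrogate range D800-DFFF, which real encoders refuse.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code number using the four rows of the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code number out of range")
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])
```

Each branch shifts the code number right in steps of six bits, because every continuation byte carries six payload bits after its 10 prefix.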
Next, the Chinese character "严" (strict) serves as an example of UTF-8 encoding.
The Unicode code of "严" is 4E25 (100111000100101). According to the table, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so its UTF-8 encoding takes three bytes in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of "严", fill the x positions from back to front, and pad any remaining x positions with 0. The resulting UTF-8 encoding of "严" is "11100100 10111000 10100101", or E4B8A5 in hexadecimal.
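The fill-in procedure just described can be replayed literally on strings of binary digits. This is only an illustrative sketch with names of my choosing; padding the digits with leading zeros and filling left to right gives the same result as filling from the back.

```python
code_point = 0x4E25
template = "1110xxxx 10xxxxxx 10xxxxxx"   # third row of the table

digits = format(code_point, "b")          # '100111000100101', 15 digits
slots = template.count("x")               # 16 positions to fill
padded = digits.zfill(slots)              # extra leading position becomes 0

# Replace each x, in order, with the next digit of the padded code.
supply = iter(padded)
filled = "".join(next(supply) if c == "x" else c for c in template)

encoded = bytes(int(group, 2) for group in filled.split())
print(filled, encoded.hex())
```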
6. Conversion between Unicode and UTF-8
The example in the previous section shows that the Unicode code of "严" (strict) is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. A program can convert between them.
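As one possible illustration of such a program, the sketch below converts in the reverse direction, from the UTF-8 bytes of a single symbol back to its Unicode code. The function name utf8_decode_one is hypothetical, and the checks are deliberately minimal: for instance, it does not reject overlong encodings, which a strict decoder must.

```python
def utf8_decode_one(data: bytes) -> int:
    """Return the Unicode code stored in one complete UTF-8 sequence."""
    first = data[0]
    if first < 0x80:                  # 0xxxxxxx
        length, value = 1, first
    elif first >> 5 == 0b110:         # 110xxxxx
        length, value = 2, first & 0b11111
    elif first >> 4 == 0b1110:        # 1110xxxx
        length, value = 3, first & 0b1111
    elif first >> 3 == 0b11110:       # 11110xxx
        length, value = 4, first & 0b111
    else:
        raise ValueError("invalid leading byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes")
    for byte in data[1:]:
        if byte >> 6 != 0b10:         # every following byte starts with 10
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0b111111)
    return value

print(hex(utf8_decode_one(bytes.fromhex("e4b8a5"))))
```

In everyday Python the built-in codec does both directions: `chr(0x4E25).encode("utf-8")` and `ord(bytes.fromhex("e4b8a5").decode("utf-8"))`.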
On Windows, one of the simplest ways to convert is the built-in Notepad program, notepad.exe. After opening a file, choose "Save As" from the "File" menu; the dialog box that appears has an "Encoding" drop-down list at the bottom.
It offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.
1) ANSI is the default. For English files it means ASCII; for Simplified Chinese files it means GB2312 (on the Simplified Chinese edition of Windows only; the Traditional Chinese edition uses Big5).
2) Unicode means UCS-2, that is, each character's Unicode code is stored directly in two bytes. This option uses the little endian format.
3) Unicode big endian is the counterpart of the previous option. The next section explains what little endian and big endian mean.
4) UTF-8 is the encoding described in the previous section.
After choosing an encoding, click "Save" and the file is converted to that encoding immediately.
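The same kind of conversion can be scripted. The sketch below is not what Notepad does internally; it is a hypothetical helper (convert_file is my name) that reads a whole file in one encoding and writes it back in another, demonstrated on a temporary GB2312 file holding the single character "严" (strict).

```python
import os
import tempfile

def convert_file(src_path, dst_path, src_encoding, dst_encoding):
    """Re-encode a text file; the whole file is read into memory."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)

# Demo: a GB2312 file with one character, converted to UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "gb2312.txt")
    dst_path = os.path.join(tmp, "utf8.txt")
    with open(src_path, "wb") as f:
        f.write(bytes.fromhex("d1cf"))    # GB2312 bytes of the character
    convert_file(src_path, dst_path, "gb2312", "utf-8")
    with open(dst_path, "rb") as f:
        converted = f.read()

print(converted.hex())
```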
7. little endian and big endian
As mentioned in the previous section, Unicode codes can be stored directly in UCS-2 format. Take the Chinese character "严" (yan, strict) as an example. Its Unicode code is 4E25, which needs two bytes: one byte is 4E and the other is 25. Storing 4E first and 25 second is the big endian order; storing 25 first and 4E second is the little endian order.
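Both byte orders can be produced from the same number with Python's int.to_bytes. A minimal sketch:

```python
code_point = 0x4E25

big = code_point.to_bytes(2, byteorder="big")        # 4E first, then 25
little = code_point.to_bytes(2, byteorder="little")  # 25 first, then 4E
print(big.hex(" "), "|", little.hex(" "))

# Reading the bytes back with the matching order recovers the same code.
restored = int.from_bytes(little, byteorder="little")
```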
These two odd names come from Gulliver's Travels by the writer Jonathan Swift. In the book, a civil war breaks out in Lilliput over whether an egg should be cracked open at the big end (Big-Endian) or at the little end (Little-Endian). Six wars were fought over the matter; one emperor lost his life and another lost his throne.
Hence, storing the first (most significant) byte in front is called "big endian", and storing the second (least significant) byte in front is called "little endian".
Naturally, a problem arises: how does a computer know which encoding method is used for a file?
The Unicode specification defines a character that can be placed at the very start of a file to indicate the byte order. The character is called "zero width no-break space" and its code is FEFF. That is exactly two bytes, and FF is one greater than FE.
If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
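The check on the first two bytes can be sketched as follows. The function name and its return strings are hypothetical; the two constants come from Python's standard codecs module. The sketch looks only at the two-byte marks described above and makes no attempt to guess the encoding of files that lack one.

```python
import codecs

def detect_byte_order(data: bytes) -> str:
    """Classify data by the two-byte mark at its start, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little endian"
    return "unknown"

print(detect_byte_order(bytes.fromhex("feff4e25")))
print(detect_byte_order(bytes.fromhex("fffe254e")))
```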
The following is an example.
Open Notepad (notepad.exe), create a new text file containing the single character "严" (strict), and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.
Then use the hex mode of the text editor UltraEdit to inspect how each file is encoded internally.
1) ANSI: the file is the two bytes "D1 CF", which is the GB2312 encoding of "严". This also shows that GB2312 stores the high byte first (big endian).
2) Unicode: the file is the four bytes "FF FE 25 4E". "FF FE" indicates little endian storage, and the actual code is 4E25.
3) Unicode big endian: the file is the four bytes "FE FF 4E 25". "FE FF" indicates big endian storage.
4) UTF-8: the file is the six bytes "EF BB BF E4 B8 A5". The first three bytes "EF BB BF" indicate that this is UTF-8, and the last three, "E4 B8 A5", are the encoding of "严"; they are stored in the same order as they are encoded.
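The four byte sequences can be reproduced without Notepad or a hex editor. In this sketch the dictionary labels mirror Notepad's option names, while the Python codec names on the right are my mapping of those options, not something Notepad documents: "gb2312" stands in for ANSI on a Simplified Chinese system, and "utf-8-sig" is Python's UTF-8 variant that writes the EF BB BF prefix.

```python
import codecs

ch = "\u4e25"  # the character yan (strict)

saved = {
    "ANSI": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),
}
for label, data in saved.items():
    print(f"{label}: {data.hex(' ').upper()}")
```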
GB2312-80, "Chinese Coded Character Set for Information Interchange: Basic Set", was released in 1980 and is the national standard for Chinese information processing. It is the only mandatory Chinese encoding in mainland China and in overseas regions that use Simplified Chinese (such as Singapore). P-Windows 3.2 and Apple OS used GB2312 as their basic Chinese character encoding; Windows 95/98 used GBK as the basic Chinese character encoding while remaining compatible with GB2312.
Range: A1A1 ~ FEFE
A1-A9: symbol area, containing 682 symbols
B0-F7: Chinese character area, containing 6,763 Chinese characters
GB2312 (1980) contains 7,445 characters in total: 6,763 Chinese characters and 682 other symbols. In the Chinese character area, the high byte of the internal code runs from B0 to F7 and the low byte from A1 to FE, which gives 72 * 94 = 6,768 code positions; five of them, D7FA-D7FE, are unassigned. Each character is encoded in two bytes. In the GB2312-80 standard itself the highest bit of each byte is 0; the internal code sets it to 1, which gives the A1A1 ~ FEFE range. The GB2312-80 code is known for short as the national standard (guobiao) code.
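The counts above follow from the byte ranges by simple arithmetic, which this short sketch spells out (variable names are mine):

```python
high_bytes = range(0xB0, 0xF7 + 1)   # first byte of a Chinese character
low_bytes = range(0xA1, 0xFE + 1)    # second byte

positions = len(high_bytes) * len(low_bytes)   # 72 * 94
unassigned = len(range(0xD7FA, 0xD7FE + 1))    # D7FA through D7FE
hanzi = positions - unassigned
total = hanzi + 682                            # plus the symbol area

print(len(high_bytes), len(low_bytes), positions, hanzi, total)
```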
GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area contains 21,003 characters.
In 1990 the traditional Chinese character encoding standard GB12345-90, "Chinese Coded Character Set for Information Interchange: First Supplementary Set", was issued to standardize the use of traditional Chinese characters where they are needed, for example in ancient books. The standard contains 6,866 Chinese characters (103 more than GB2312; most vendors' font libraries do not include these extra characters), of which about 2,200 are purely traditional forms.
Range: A1A1 ~ FEFE
A1-A9: symbol area, with vertical-layout symbols added
B0-F9: Chinese character area, containing 6,866 Chinese characters
GBK (the Chinese internal code specification) is a Chinese encoding extension developed in mainland China and aligned with UCS. GBK can represent both traditional and simplified Chinese characters, whereas GB2312 can represent only simplified ones, and GBK is compatible with GB2312. The GBK working group was active from October 1995 and completed the GBK specification in December of the same year. The specification contains 21,003 Chinese characters and 883 symbols, and provides 1,894 user-defined code positions; simplified and traditional characters share a single repertoire. In the Simplified Chinese edition of Windows 95/98, the surface encoding of the font library is GBK, which is linked to the underlying font library through a one-to-one mapping table between GBK and UCS.
English name: Chinese Internal Code Specification
Chinese name: Chinese character internal code extension specification, version 1.0
Double-byte encoding; an extension of GB2312-80 that is compatible with its code positions
Range: 8140 ~ FEFE (excluding xx7F), 23,940 code positions in total
Contains 21,003 Chinese characters, including all CJK (Chinese, Japanese, and Korean) Han characters in ISO/IEC 10646-1
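The figure of 23,940 code positions can be rederived from the stated range. The sketch assumes that "8140 ~ FEFE (excluding xx7F)" means a first byte from 81 to FE and a second byte from 40 to FE other than 7F; that reading is my interpretation of the range line.

```python
lead_bytes = range(0x81, 0xFE + 1)                              # first byte
trail_bytes = [b for b in range(0x40, 0xFE + 1) if b != 0x7F]   # second byte

code_positions = len(lead_bytes) * len(trail_bytes)
assigned = 21003 + 883    # Chinese characters plus symbols, as listed above

print(len(lead_bytes), len(trail_bytes), code_positions, assigned)
```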