Key points of Chinese character encoding
ASCII is the American standard code used for Western (English) text. It uses 7-bit encoding, so it has 2^7 = 128 code points in total. Of these, 34 are control or non-printing characters (such as line feed LF and carriage return CR, and counting the space), and the remaining 94 are English letters, digits, punctuation, and arithmetic symbols. In the computer, an ASCII code occupies 8 bits, and the highest bit (b7) can be used as a parity bit. Odd parity rule: in a correct byte the number of 1 bits must be odd; if the 7 data bits do not contain an odd number of 1s, b7 is set to 1. Even parity rule: in a correct byte the number of 1 bits must be even; if the 7 data bits do not contain an even number of 1s, b7 is set to 1.
To make double-byte characters identifiable, characters such as Chinese characters, Japanese characters, and Korean Hangul occupy 2 bytes each, with the highest bit of each byte set to 1. Ordinary Western characters occupy only one byte, of which seven bits carry the code; the highest bit is padding and is usually 0. This Chinese character encoding format is commonly referred to as the 10 format: a Chinese character occupies 2 bytes but counts as only one character.
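The parity rule above can be sketched in a few lines. This is only an illustration: the function name `with_parity` and its `odd` flag are made up for this example and do not come from any standard library.

```python
def with_parity(code: int, odd: bool = True) -> int:
    """Return the 8-bit byte for a 7-bit ASCII code, with the parity bit in b7.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code <= 0x7F:
        raise ValueError("expected a 7-bit ASCII code")
    ones = bin(code).count("1")  # number of 1 bits in the 7 data bits
    if odd:
        need_bit = ones % 2 == 0  # make the total number of 1s odd
    else:
        need_bit = ones % 2 == 1  # make the total number of 1s even
    return code | 0x80 if need_bit else code


# "A" is 1000001 (two 1 bits), "C" is 1000011 (three 1 bits)
print(format(with_parity(ord("A"), odd=True), "08b"))
print(format(with_parity(ord("C"), odd=True), "08b"))
```

With odd parity, "A" gets b7 set because its data bits contain an even number of 1s, while "C" is left unchanged.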
First, GB2312 code
GB2312 contains 6,763 Chinese characters: 3,755 level-1 characters and 3,008 level-2 characters, which are distinguished by their location codes. It also contains 682 full-width characters, including the Latin alphabet, the Greek alphabet, Japanese hiragana and katakana, and the Russian Cyrillic alphabet.
GB2312 location code:
GB2312 divides the characters it includes into areas ("zones"), and each area contains 94 characters/symbols. A character is identified by its area number followed by its position within the area; this representation is known as the location code.
(1) Areas 01-09 hold special symbols
(2) Areas 16-55 hold the level-1 Chinese characters, sorted by pinyin
(3) Areas 56-87 hold the level-2 Chinese characters, sorted by radical/stroke
(4) Areas 10-15 and 88-94 are unassigned
For example, the character "啊" is the first Chinese character in GB2312, and its location code is 1601 (area 16, position 01). For byte encoding, the EUC storage method is usually used so that the result is compatible with ASCII. Each character or symbol is represented by two bytes; the first is called the "high byte" and the second the "low byte". The high byte uses 0xA1-0xF7 (the area numbers 01-87 plus 0xA0), and the low byte uses 0xA1-0xFE (the positions 01-94 plus 0xA0). For example, most programs store "啊" as 0xB0A1 (compared with the location code: 0xB0 = 0xA0 + 16, 0xA1 = 0xA0 + 1).
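The mapping between a location code and its two stored bytes is just the 0xA0 offset described above. A minimal sketch, where `location_to_euc` and `euc_to_location` are hypothetical helper names and Python's built-in `gb2312` codec is used only as a cross-check:

```python
def location_to_euc(area: int, position: int) -> bytes:
    """Convert a GB2312 location code (area, position) to its two EUC bytes."""
    if not (1 <= area <= 94 and 1 <= position <= 94):
        raise ValueError("area and position must both be in 1..94")
    return bytes([0xA0 + area, 0xA0 + position])


def euc_to_location(data: bytes) -> tuple:
    """Convert two EUC bytes back to (area, position)."""
    if len(data) != 2:
        raise ValueError("expected exactly two bytes")
    high, low = data
    return high - 0xA0, low - 0xA0


stored = location_to_euc(16, 1)           # location code 1601
print(stored.hex())                       # b0a1
print(stored == "啊".encode("gb2312"))     # cross-check with the built-in codec
print(euc_to_location(stored))            # (16, 1)
```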
So in decimal, the high byte (area byte) of a GB2312 Chinese character runs from 176 to 247, and the low byte (position byte) from 161 to 254. The number of characters stored, 6,763, is less than 72 * 94 = 6,768 because the five codes with high byte 215 (0xD7, area 55) and low byte 250-254 (0xFA-0xFE) have no Chinese character assigned, so 6,768 - 5 = 6,763. The full 94 * 94 grid spans A1A1-FEFE, of which GB2312 assigns codes only up to F7FE. The encoding range of Chinese characters is B0A1 (decimal 45217) to F7FE (decimal 63486): the first byte is 0xB0-0xF7 (areas 16-87) and the second byte is 0xA1-0xFE (positions 01-94).
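The counting argument can be checked with plain arithmetic. The variable names below are chosen only for this illustration:

```python
# Areas 16..87 hold Chinese characters; each area has 94 positions.
hanzi_areas = 87 - 16 + 1                  # 72 areas
grid_cells = hanzi_areas * 94              # 72 * 94 cells in total

# Area 55 (high byte 0xD7) leaves positions 90..94 (0xFA..0xFE) empty.
unassigned = 0xFE - 0xFA + 1               # 5 codes

hanzi_count = grid_cells - unassigned

print(hanzi_areas, grid_cells, unassigned, hanzi_count)
print(0xA0 + 16, 0xA0 + 87)                # high byte range in decimal
print(0xA0 + 1, 0xA0 + 94)                 # low byte range in decimal
print(0xB0A1, 0xF7FE)                      # first and last code as decimal numbers
```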
GB2312 encoding rules:
(1) 2-byte encoding: high byte 0xA1-0xF7, low byte 0xA1-0xFE
(2) Chinese character area: high byte 0xB0-0xF7, low byte 0xA1-0xFE
(3) Special symbols: high byte 0xA1-0xA9, low byte 0xA1-0xFE
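These rules translate directly into range checks on the two bytes. The sketch below uses a hypothetical function `classify_gb2312` and only the ranges stated in this article; the returned labels are illustrative:

```python
def classify_gb2312(high: int, low: int) -> str:
    """Classify a byte pair using the GB2312 ranges listed above."""
    if not (0xA1 <= high <= 0xF7 and 0xA1 <= low <= 0xFE):
        return "not GB2312"
    if high <= 0xA9:
        return "special symbol"            # areas 01-09
    if high < 0xB0:
        return "unassigned"                # areas 10-15
    if high == 0xD7 and low >= 0xFA:
        return "unassigned"                # the five empty cells of area 55
    if high <= 0xD7:
        return "level-1 Chinese character" # areas 16-55
    return "level-2 Chinese character"     # areas 56-87


print(classify_gb2312(0xB0, 0xA1))  # the bytes of "啊"
print(classify_gb2312(0xA1, 0xA1))
print(classify_gb2312(0x41, 0x42))  # plain ASCII bytes
```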
Second, GBK code
GB2312 is a Chinese national standard for Chinese character encoding, and can also be described as the Simplified Chinese character set encoding. GBK is an extension of GB2312: besides being compatible with GB2312, it also includes traditional Chinese characters, Japanese kana, and so on.
GBK uses 1-byte and 2-byte codes at the same time. When the first byte is in 0x00-0x7F, the character is a single byte; when the first byte is 0x80 or above, the character takes 2 bytes. So when a leading byte greater than 0x7F is found, it must begin a Chinese character (it is combined with the following byte to form one character). The number after 0x7F (01111111) is 0x80 (10000000), so a byte greater than 0x7F necessarily has its highest bit set to 1, and we only need to test whether the highest bit is 1. Note that the test applies to the leading byte only: the second byte of a GBK character may fall in 0x40-0x7E. For example, the ASCII code of "a" is 01100001 and the ASCII code of "A" is 01000001; both have the highest bit 0.
Third, Unicode encoding
Unicode is a character encoding designed by international organizations that can accommodate all of the world's writing systems. Its formal name is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. Unicode specifies how characters are numbered, but not how those numbers are transmitted or stored. For example, the UCS code of the character "汉" is 6C49; we can transmit and store it as the 4 ASCII characters "6C49", or use UTF-8 and represent it with the 3 consecutive bytes E6 B1 89. The key is that both communicating parties must agree on the scheme. UTF-8, UTF-7, and UTF-16 are widely accepted schemes. A notable feature of UTF-8 is that it is fully compatible with ASCII: code points 0x00-0x7F are encoded as the same single bytes. (The first 256 Unicode code points match ISO-8859-1, but in UTF-8 the values 0x80-0xFF take two bytes.) UTF is the abbreviation of "UCS Transformation Format".
UTF-8 is one implementation of Unicode, with its own requirements on byte structure. So when we say that the range of Chinese characters is 0x4E00 to 0x9FA5, we mean the Unicode values; in UTF-8 each of them is laid out in three bytes. In other words, Unicode assigns each character a code value, and the concrete byte representation can be realized in several ways. In the UTF-8 encoding of a character, if there is only one byte, its highest bit is 0. If there are several bytes, the first byte begins with a run of 1 bits whose length gives the number of bytes in the encoding, and each remaining byte begins with 10. The original UTF-8 design allowed up to 6 bytes; the current standard (RFC 3629) limits it to 4 bytes, since Unicode ends at 0x10FFFF.
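The difference between the code value and its stored forms can be seen with Python's built-in codecs. This is only an illustration of the "汉" example; the variable names are arbitrary:

```python
ch = "汉"

code_point = ord(ch)                                 # the Unicode (UCS) value
as_ascii_digits = format(code_point, "04X").encode("ascii")
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")                    # big-endian, no byte order mark

print(hex(code_point))        # 0x6c49
print(as_ascii_digits)        # b'6C49', four ASCII characters
print(as_utf8.hex())          # e6b189, three bytes
print(as_utf16.hex())         # 6c49, two bytes
```

All three byte strings carry the same code value; the receiver has to know which scheme was used in order to recover it.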
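The rule about the leading 1 bits can be sketched as a small function that reads the sequence length from the first byte. The name `utf8_length` is hypothetical, and the sketch follows the 4-byte limit of current UTF-8:

```python
def utf8_length(lead: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged from its first byte."""
    if lead & 0x80 == 0x00:      # 0xxxxxxx
        return 1
    if lead & 0xE0 == 0xC0:      # 110xxxxx
        return 2
    if lead & 0xF0 == 0xE0:      # 1110xxxx
        return 3
    if lead & 0xF8 == 0xF0:      # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte
    raise ValueError("not a valid UTF-8 lead byte")


for ch in ["A", "é", "汉"]:
    encoded = ch.encode("utf-8")
    print(ch, utf8_length(encoded[0]), len(encoded))
```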
The Unicode range of Chinese characters is 0x4E00-0x9FA5. UTF-8 is somewhat similar to Huffman coding: it is a variable-length encoding that keeps the encoded size small for low code values. It encodes Unicode as follows:
0x00-0x7F characters are expressed in a single byte;
0x80-0x7FF characters are expressed in 2 bytes;
0x800-0xFFFF characters are expressed in 3 bytes;
0x10000-0x10FFFF characters are expressed in 4 bytes (the original scheme continued in the same way up to 6 bytes).
Unicode symbol range | UTF-8 encoding
(hex)                | (binary)
---------------------+------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
For example, the Unicode value of "严" is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of "严" requires three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last bit of "严", the x positions in the format are filled from back to front, and any remaining x positions are filled with 0. This gives the UTF-8 encoding of "严" as "11100100 10111000 10100101", which in hexadecimal is E4B8A5.
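The fill-in procedure can be written out by hand with shifts and masks. The function `utf8_encode` below is a hypothetical teaching sketch, not a replacement for the built-in codec; it does not reject surrogate code points:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point by filling the x bits of the UTF-8 table."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp <= 0x7F:                              # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                             # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                            # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                          # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point beyond 0x10FFFF")


encoded = utf8_encode(0x4E25)                   # the code value of "严"
print(" ".join(format(b, "08b") for b in encoded))
print(encoded.hex())
print(encoded == "严".encode("utf-8"))           # compare with the built-in codec
```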
So when we need to convert between GB2312, GBK, etc. and UTF-8, we have to go through the Unicode code value as the intermediate step:
GB2312, GBK -----> Unicode -----> UTF-8
UTF-8 -----> Unicode -----> GB2312, GBK
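In Python this two-step path is explicit: decoding bytes produces a Unicode string, and encoding the string produces bytes in the target encoding. A minimal sketch, where `transcode` is a hypothetical helper built on the standard `bytes.decode` and `str.encode` methods:

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another via Unicode."""
    text = data.decode(source)      # source bytes -> Unicode string
    return text.encode(target)      # Unicode string -> target bytes


gbk_bytes = bytes([0xB0, 0xA1])     # "啊" as stored in GB2312/GBK
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
round_trip = transcode(utf8_bytes, "utf-8", "gb2312")

print(utf8_bytes.hex())
print(round_trip == gbk_bytes)
```

A character that does not exist in the target character set makes `str.encode` raise `UnicodeEncodeError`, so converting from UTF-8 to GB2312 can fail for text outside the GB2312 repertoire.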