Research on GB18030 encoding and the mapping between GBK, GB18030, and Unicode
GB18030 has two versions: GB18030-2000 and GB18030-2005. In this article, GB18030 without a version qualifier means GB18030-2005. This article discusses the following questions:
- GB2312 has 682 graphical symbols, all placed in Zone 1. GBK has 717 graphical symbols in Zone 1 and 166 in Zone 5, 883 in total. GB18030 has 728 graphical symbols in Zone 1 and 166 in Zone 5. Which 35 symbols did GBK add to Zone 1 on top of GB2312? Which symbols did GB18030 add?
- GBK supports 21003 Chinese characters and 883 graphical symbols, 21886 characters in total. What are these 21886 characters? How does their encoding change in GB18030?
- How does GB18030 map all 0x110000 Unicode code points?
- How do GB18030-2000 and GB18030-2005 differ in character repertoire and in encoding?
- GB18030-2005 maps 2067 code positions in the double-byte part to the PUA of the Unicode BMP. What rules govern these code positions? How many of them have characters defined? In fact, only 24 of the 2067 code positions have characters defined.
- 95 of the 21886 GBK characters are mapped to the PUA of the Unicode BMP. How does the encoding of these 95 characters change in GB18030? Which characters keep their original encoding?
- How many of GBK's 23940 code positions are mapped to the PUA of the Unicode BMP? How do the mappings of these code positions change in GB18030?
Before discussing these questions, we first define a notation for code spaces.
0 Code space notation
0.1 Convention
GBK is a double-byte encoding: each character is expressed in two bytes. GB18030 is a multi-byte character set: its characters can be expressed in one, two, or four bytes. A code space is determined by the range of each byte. For example, the four-byte code space of GB18030 is:
- The first byte is in the range 0x81 ~ 0xfe
- The second byte is in the range 0x30 ~ 0x39
- The third byte is in the range 0x81 ~ 0xfe
- The fourth byte is in the range 0x30 ~ 0x39
For ease of expression, we write this code space as 0x81308130 ~ 0xfe39fe39. That is, in this article 0x81308130 ~ 0xfe39fe39 does not mean the 2097773834 (0xfe39fe39 - 0x81308130 + 1) consecutive values from 0x81308130 to 0xfe39fe39. It means the code space in which every byte lies within its corresponding range. The number of code positions in this space is:
(0xfe-0x81 + 1) * (0x39-0x30 + 1) * (0xfe-0x81 + 1) * (0x39-0x30 + 1) = 126*10*126*10 = 1587600
Likewise, 0xb0a1 ~ 0xf7fe denotes all code positions whose first byte is in 0xb0 ~ 0xf7 and whose second byte is in 0xa1 ~ 0xfe. The number of code positions is:
(0xf7-0xb0 + 1) * (0xfe-0xa1 + 1) = 72*94 = 6768
This code space is Zone 2 of GBK and GB18030; 6763 characters are defined in its 6768 code positions.
This article uses ~ to denote a code space as defined above, and uses - to denote an ordinary consecutive range. That is:
- 0xa1a1 ~ 0xa9fe denotes the 846 ((0xa9-0xa1 + 1) * (0xfe-0xa1 + 1) = 9*94) code positions whose first byte is between 0xa1 and 0xa9 and whose second byte is between 0xa1 and 0xfe.
- 0xe000-0xf8ff denotes the 6400 (0xf8ff-0xe000 + 1) consecutive code points from 0xe000 to 0xf8ff.
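The two notations can be sketched in a few lines of Python. This is only an illustration of the convention; the function names `code_space_size` and `linear_range_size` are made up for this example and do not come from any standard.

```python
def code_space_size(low: int, high: int) -> int:
    """Count the code positions in the space `low ~ high`.

    Each byte of `low` and `high` bounds the corresponding byte of the
    encoding, so the size is the product of the per-byte range sizes.
    """
    n_bytes = max(1, (high.bit_length() + 7) // 8)
    size = 1
    for lo, hi in zip(low.to_bytes(n_bytes, "big"), high.to_bytes(n_bytes, "big")):
        size *= hi - lo + 1
    return size


def linear_range_size(first: int, last: int) -> int:
    """Count the values in the ordinary consecutive range `first-last`."""
    return last - first + 1


print(code_space_size(0x81308130, 0xFE39FE39))    # 1587600
print(code_space_size(0xB0A1, 0xF7FE))            # 6768
print(code_space_size(0xA1A1, 0xA9FE))            # 846
print(linear_range_size(0xE000, 0xF8FF))          # 6400
print(linear_range_size(0x81308130, 0xFE39FE39))  # 2097773834
```

The last two calls show why the distinction matters: the same pair of endpoints gives 1587600 as a code space but 2097773834 as a consecutive range.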
0.2 Exercises
If you have understood the convention above, try the following two exercises:
- Exercise 1: Calculate the number of code positions in the code space 0x8140 ~ 0xfe7e.
- Exercise 2: Calculate the number of code positions in the code space 0x8180 ~ 0xfefe.
0.3 Answers
The answers to the exercises in 0.2 are:
- Exercise 1: (0xfe-0x81 + 1) * (0x7e-0x40 + 1) = 126*63 = 7938
- Exercise 2: (0xfe-0x81 + 1) * (0xfe-0x80 + 1) = 126*127 = 16002
The code space of the GB18030 double-byte part is 0x8140 ~ 0xfe7e and 0x8180 ~ 0xfefe, so the number of double-byte code positions is 7938 + 16002 = 23940. 0x8140 ~ 0xfe7e and 0x8180 ~ 0xfefe is also the entire code space of GBK. GBK defines 21886 characters in these 23940 code positions.
1 GBK review
1.1 Introduction
GBK is a double-byte encoding. Its code space is 0x8140 ~ 0xfe7e and 0x8180 ~ 0xfefe, 23940 code positions in total. These 23940 code positions define 21886 characters: 21003 Chinese characters and 883 graphical symbols. The 21003 Chinese characters are described in detail in "Chinese Characters in Unicode, GB2312, GBK, and GB18030". Section 3 of this article discusses the graphical symbols of GB2312, GBK, and GB18030.
The code space of GBK can be divided into the following zones:

| Category | Zone | Code position range | Code positions | Characters |
|---|---|---|---|---|
| Symbol area | Zone 1 | 0xa1a1 ~ 0xa9fe | 846 | 717 |
| Symbol area | Zone 5 | 0xa840 ~ 0xa97e and 0xa880 ~ 0xa9a0 | 192 | 166 |
| Chinese character area | Zone 2 | 0xb0a1 ~ 0xf7fe | 6768 | 6763 |
| Chinese character area | Zone 3 | 0x8140 ~ 0xa07e and 0x8180 ~ 0xa0fe | 6080 | 6080 |
| Chinese character area | Zone 4 | 0xaa40 ~ 0xfe7e and 0xaa80 ~ 0xfea0 | 8160 | 8160 |
| User-defined area | User zone 1 | 0xaaa1 ~ 0xaffe | 564 | |
| User-defined area | User zone 2 | 0xf8a1 ~ 0xfefe | 658 | |
| User-defined area | User zone 3 | 0xa140 ~ 0xa77e and 0xa180 ~ 0xa7a0 | 672 | |
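As a sanity check on the zone table, the following sketch enumerates every zone as a set of (first byte, second byte) pairs and confirms that the zones are disjoint and together cover the whole GBK code space. The helper `positions` and the dictionary `GBK_ZONES` are names invented for this example; the ranges are copied from the table.

```python
def positions(low: int, high: int) -> set:
    """Enumerate the double-byte code space `low ~ high` as (byte1, byte2) pairs."""
    lo1, lo2 = divmod(low, 0x100)
    hi1, hi2 = divmod(high, 0x100)
    return {(b1, b2) for b1 in range(lo1, hi1 + 1) for b2 in range(lo2, hi2 + 1)}


# Zone ranges as listed in the table above.
GBK_ZONES = {
    "Zone 1": [(0xA1A1, 0xA9FE)],
    "Zone 5": [(0xA840, 0xA97E), (0xA880, 0xA9A0)],
    "Zone 2": [(0xB0A1, 0xF7FE)],
    "Zone 3": [(0x8140, 0xA07E), (0x8180, 0xA0FE)],
    "Zone 4": [(0xAA40, 0xFE7E), (0xAA80, 0xFEA0)],
    "User zone 1": [(0xAAA1, 0xAFFE)],
    "User zone 2": [(0xF8A1, 0xFEFE)],
    "User zone 3": [(0xA140, 0xA77E), (0xA180, 0xA7A0)],
}

zone_sets = {
    name: set().union(*(positions(lo, hi) for lo, hi in spaces))
    for name, spaces in GBK_ZONES.items()
}
zone_sizes = {name: len(members) for name, members in zone_sets.items()}

full_space = positions(0x8140, 0xFE7E) | positions(0x8180, 0xFEFE)
covered = set().union(*zone_sets.values())

# If no position belongs to two zones, the sizes add up to the size of the union.
zones_are_disjoint = sum(zone_sizes.values()) == len(covered)
zones_cover_gbk = covered == full_space

print(zone_sizes)
print(sum(zone_sizes.values()), zones_are_disjoint, zones_cover_gbk)
```

The eight zone sizes add up to 23940, matching the total number of GBK code positions.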
1.2 Mapping between GBK characters and Unicode
I created an Excel file, Attachment 1, which contains three tables:
- The full code table of all 21886 GBK characters, sorted by GBK encoding. It has three columns: character, GBK encoding, and Unicode encoding.
- The full code table of all 21886 GBK characters, sorted by Unicode encoding. It has three columns: character, Unicode encoding, and GBK encoding.
- In the table sorted by Unicode encoding, it is easy to find the characters mapped to the PUA (0xe000-0xf8ff). Of the 21886 GBK characters, 95 are mapped to the PUA. The third table lists these 95 characters (column A), their GBK encoding (column B), their Unicode encoding (column C), and the Unicode encoding that GB18030 assigns to them (column D).
Column D may need some explanation. GB18030 is compatible with GBK, so these characters have the same encoding in GBK and in GB18030. For example, the character ǹ is encoded as 0xa8bf in both GBK and GB18030. However, GBK and GB18030 map it to different Unicode code points. In GBK, 0xa8bf is mapped to Unicode 0xe7c8, which is a PUA code point reserved for private use. In GB18030, 0xa8bf is mapped to Unicode 0x01f9, which belongs to the "Latin Extended-B" block and is defined as LATIN SMALL LETTER N WITH GRAVE (ǹ).
1.3 Mapping between GBK code positions and Unicode
GBK's 23940 code positions define 21886 characters, leaving 23940 - 21886 = 2054 free code positions, which are mapped to the Unicode PUA. When GBK was designed, 95 of its 21886 characters had no counterpart in Unicode, so these 95 characters were also mapped to the PUA. Of GBK's 23940 code positions, a total of 2054 + 95 = 2149 are therefore mapped to the PUA, and the corresponding PUA range is 0xe000-0xe864, which is exactly 2149 code points. These 2149 code positions are allocated as follows:

| Location | Code positions | PUA range mapped |
|---|---|---|
| User zone 1: 0xaaa1 ~ 0xaffe | 564 | 0xe000-0xe233 |
| User zone 2: 0xf8a1 ~ 0xfefe | 658 | 0xe234-0xe4c5 |
| User zone 3: 0xa140 ~ 0xa77e and 0xa180 ~ 0xa7a0 | 672 | 0xe4c6-0xe765 |
| 170 code positions in the symbol area (Zones 1 and 5) | 170 | 0xe766-0xe80f |
| Five free code positions in Zone 2: 0xd7fa-0xd7fe | 5 | 0xe810-0xe814 |
| 80 characters in Zone 4 that Unicode had not yet defined at the time: 0xfe50 ~ 0xfe7e and 0xfe80 ~ 0xfea0 | 80 | 0xe815-0xe864 |
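The allocation table can be checked mechanically: each PUA range should start right after the previous one, and its length should equal the stated number of code positions. The sketch below does this; `PUA_ALLOCATION` and `check_allocation` are illustrative names, and the rows are copied from the table. Note that the zone sizes in the GBK table leave (846 - 717) + (192 - 166) = 155 free positions in the symbol area, so the 170 symbol-area positions here also include 95 - 80 = 15 defined characters that were mapped to the PUA.

```python
# (location, number of code positions, first PUA code point, last PUA code point)
PUA_ALLOCATION = [
    ("User zone 1", 564, 0xE000, 0xE233),
    ("User zone 2", 658, 0xE234, 0xE4C5),
    ("User zone 3", 672, 0xE4C6, 0xE765),
    ("Symbol area (Zones 1 and 5)", 170, 0xE766, 0xE80F),
    ("Zone 2 free positions", 5, 0xE810, 0xE814),
    ("Zone 4 characters", 80, 0xE815, 0xE864),
]


def check_allocation(rows, start):
    """Verify the rows are contiguous from `start`; return the total count."""
    expected_first = start
    for location, count, first, last in rows:
        if first != expected_first:
            raise ValueError(f"{location}: gap or overlap before 0x{first:04x}")
        if last - first + 1 != count:
            raise ValueError(f"{location}: range size does not match {count}")
        expected_first = last + 1
    return expected_first - start


total_pua_positions = check_allocation(PUA_ALLOCATION, 0xE000)
print(total_pua_positions)  # 2149

# Symbol area: free positions plus defined characters that were mapped to the PUA.
symbol_free = (846 - 717) + (192 - 166)
symbol_pua_characters = 95 - 80
print(symbol_free, symbol_pua_characters)  # 155 15
```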
Attachment 2 contains two tables:
- The mapping between the 23940 GBK code positions and Unicode, given as two groups of data sorted by GBK and by Unicode respectively.
- The 2149 code positions mapped to the PUA, arranged in Unicode order.
2 GB18030 encoding
2.1 Overview
GB18030 is a multi-byte character set. Its characters can be expressed in one, two, or four bytes. The GB18030 code space is defined as follows:

| Bytes | Code space | Code positions | Characters |
|---|---|---|---|
| Single byte | 0x00 ~ 0x7f | 128 | 128 |
| Double byte | 0x8140 ~ 0xfe7e and 0x8180 ~ 0xfefe | 23940 | 21897 |
| Four bytes | 0x81308130 ~ 0xfe39fe39 | 1587600 | 54531 |
GB18030 has 128 + 23940 + 1587600 = 1611668 code positions. Unicode has 0x110000 (1114112) code points, fewer than GB18030. GB18030 therefore has enough space to map every Unicode code point.
The 1611668 code positions of GB18030 currently define 128 + 21897 + 54531 = 76556 characters. Unicode 5.0 defines 99089 characters.
2.2 Design ideas
GB18030 encoding can be divided into a single-byte part, a double-byte part, and a four-byte part. The single-byte part is identical to Unicode 0x00-0x7f. The double-byte part differs from GBK in two ways:
- 11 characters are added in Zone 1, which therefore has 717 + 11 = 728 characters. The 11 added characters are the euro sign (0xa2e3) and 10 vertical punctuation marks (0xa6d9-0xa6df, 0xa6ec-0xa6ed, and 0xa6f3).
- Some characters that had been mapped to the PUA because Unicode did not yet include them were included in later versions of Unicode. These characters are now mapped to non-PUA code points.
The Unicode BMP has 65536 code points in total. The surrogate area (0xd800-0xdfff) has 2048 code points, which cannot be assigned characters. The single-byte part of GB18030 maps 128 code points, and the double-byte part maps 23940. That leaves 65536 - 2048 - 128 - 23940 = 39420 code points.
GB18030 maps these 39420 code points, in order, to the four-byte code space starting at 0x81308130. It maps the 16 supplementary planes of Unicode (0x10000-0x10ffff, 1048576 code points in total) to the four-byte code space starting at 0x90308130. Only these two parts of the four-byte space are assigned; the rest is reserved or user-defined. The double-byte and four-byte parts of GB18030 are discussed in detail in Sections 2.4 and 2.5.
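The counting argument in this section can be written out as a short calculation. This is only a restatement of the arithmetic in the text; the constant names are invented for the example, and it relies on the text's statement that the single-byte and double-byte parts together account for 128 + 23940 distinct BMP code points.

```python
BMP_SIZE = 0x10000                         # 65536 code points in the BMP
SURROGATES = 0xDFFF - 0xD800 + 1           # 2048 code points that cannot be assigned
SINGLE_BYTE = 128                          # 0x00 ~ 0x7f
DOUBLE_BYTE = 23940                        # 0x8140 ~ 0xfe7e and 0x8180 ~ 0xfefe

# BMP code points still unmapped after the single-byte and double-byte parts.
remaining_bmp = BMP_SIZE - SURROGATES - SINGLE_BYTE - DOUBLE_BYTE

SUPPLEMENTARY = 0x10FFFF - 0x10000 + 1     # 16 planes of 65536 code points each
FOUR_BYTE_CAPACITY = 126 * 10 * 126 * 10   # 0x81308130 ~ 0xfe39fe39

# Both four-byte regions together must fit in the four-byte code space.
fits = remaining_bmp + SUPPLEMENTARY <= FOUR_BYTE_CAPACITY

print(remaining_bmp)       # 39420
print(SUPPLEMENTARY)       # 1048576
print(FOUR_BYTE_CAPACITY)  # 1587600
print(fits)                # True
```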
The design of GB18030 can be summarized as follows:
- The single-byte part is identical to the corresponding part of Unicode.
- The double-byte part is compatible with GBK, with the Unicode mapping of some characters adjusted. These characters were originally mapped to the PUA because Unicode did not include them, and are now mapped to non-PUA code points because Unicode has since included them.
- The 39420 code points of the Unicode BMP that are not yet mapped are mapped, in order, to the four-byte part starting at 0x81308130.
- The 16 supplementary planes outside the Unicode BMP are mapped, in order, to the four-byte part starting at 0x90308130.
Of the 76556 characters currently defined in GB18030, only 24 are mapped to the PUA of Unicode. These 24 characters are the 10 vertical punctuation marks in Zone 1 (0xa6d9-0xa6df, 0xa6ec-0xa6ed, and 0xa6f3) and 14 Chinese characters in Zone 4 (0xfe51, 0xfe52, 0xfe53, 0xfe59, 0xfe61, 0xfe66, 0xfe67, 0xfe6c, 0xfe6d, 0xfe76, 0xfe7e, 0xfe90, 0xfe91, 0xfea0). The 14 Chinese characters in Zone 4 also have non-PUA code points in Unicode 5.0; for details, see "Chinese Characters in Unicode, GB2312, GBK, and GB18030". However, according to GB18030, they should still be mapped to PUA code points.
2.3 Differences between GB18030-2000 and GB18030-2005, and later versions
The coding architecture of GB18030-2005 is exactly the same as that of GB18030-2000. Relative to GB18030-2000, GB18030-2005 makes the following major changes:
- The four-byte character table adds glyphs for CJK Unified Ideographs Extension B and for the Chinese minority scripts already encoded in GB 13000. In fact, GB18030-2000 had already mapped these code positions; it just did not list the characters in its character table.
- The encoding of individual characters was adjusted.
The encoding adjustment is interesting. The character ḿ has GB18030 encoding 0xa8bc and Unicode 5.0 code point 0x1e3f. In GB18030-2000, 0xa8bc was mapped to Unicode 0xe7c7. Because no double-byte code position was mapped to 0x1e3f, that code point was treated as an unmapped BMP code point and assigned to 0x8135f437 in the four-byte part. GB18030-2005 maps 0xa8bc to 0x1e3f, so what happens to the Unicode code point 0xe7c7? To minimize the impact on the existing encoding, the designers mapped 0xe7c7 to 0x8135f437, the code position that was originally mapped to 0x1e3f.
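The adjustment amounts to swapping the Unicode targets of two GB18030 code positions while leaving every other mapping alone. A minimal sketch, using only the two mappings quoted above (the dictionary and function names are illustrative, and a real mapping table would of course contain many more entries):

```python
# GB18030 code position -> Unicode code point, limited to the two entries discussed.
GB18030_2000 = {
    0xA8BC: 0xE7C7,      # double-byte position, PUA code point
    0x8135F437: 0x1E3F,  # four-byte position, U+1E3F
}


def swap_targets(mapping, position_a, position_b):
    """Return a copy of `mapping` with the targets of two positions exchanged."""
    updated = dict(mapping)
    updated[position_a], updated[position_b] = mapping[position_b], mapping[position_a]
    return updated


GB18030_2005 = swap_targets(GB18030_2000, 0xA8BC, 0x8135F437)

for position, code_point in sorted(GB18030_2005.items()):
    print(f"0x{position:x} -> U+{code_point:04X}")
```

Because the two targets are only exchanged, the set of Unicode code points covered stays the same and the mapping remains one-to-one, which is why no other code position has to move.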
GB18030 has already mapped every Unicode code point. Therefore, however Unicode evolves, GB18030 only needs to add glyphs at existing code positions, and the encoding itself will not change. Only the 24 characters mapped to the PUA may later be remapped to non-PUA code points, presumably in the same way as the adjustment described above.
2.4 GB18030 double-byte part
We have already introduced the differences between the GB18030 double-byte part and GBK. This section adds some details. As mentioned earlier, GB18030 maps every Unicode code point except the surrogate area. The 6400 PUA code points of the Unicode BMP therefore all have corresponding code positions in GB18030. The double-byte part of GB18030 maps 2067 of these PUA code points.
As mentioned above, GBK maps 2149 PUA code points, while the double-byte part of GB18030 maps 2067. The mapping of 2149 - 2067 = 82 code positions has therefore changed. Of the 95 GBK characters originally mapped to the PUA, 81 are mapped to non-PUA code points in GB18030. The remaining 14 are the Chinese characters mentioned in "Chinese Characters in Unicode, GB2312, GBK, and GB18030" (0xfe51, 0xfe52, 0xfe53, 0xfe59, 0xfe61, 0xfe66, 0xfe67, 0xfe6c, 0xfe6d, 0xfe76, 0xfe7e, 0xfe90, 0xfe91, 0xfea0). Attachment 1 lists the encoding changes of these characters. Besides these 81 code positions, the 82nd changed mapping is the euro sign: its GB18030 encoding is 0xa2e3 and its Unicode code point is 0x20ac. In GBK, the code position 0xa2e3 has no character defined and is mapped to 0xe76c.
The mapping between the GB18030 double-byte part and Unicode follows no formula. Double-byte code positions can only be converted by table lookup.
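The figures quoted for the double-byte part can be cross-checked against each other. The sketch below is just bookkeeping over numbers taken from this article; the variable names are invented, and it assumes (as the article implies) that every double-byte code position without a defined character is mapped to the PUA.

```python
POSITIONS = 23940                  # double-byte code positions (GBK and GB18030)

# GBK
gbk_characters = 21886
gbk_pua_characters = 95
gbk_free = POSITIONS - gbk_characters                  # 2054
gbk_pua_positions = gbk_free + gbk_pua_characters      # 2149

# Changes in GB18030
remapped_characters = 81           # GBK PUA characters that now have non-PUA code points
euro = 1                           # 0xa2e3: free in GBK, euro sign (U+20AC) in GB18030
gb18030_pua_positions = gbk_pua_positions - remapped_characters - euro   # 2067

# GB18030 characters still mapped to the PUA
vertical_punctuation = 10          # new in Zone 1, mapped to the PUA
zone4_characters = gbk_pua_characters - remapped_characters              # 14
gb18030_pua_characters = vertical_punctuation + zone4_characters         # 24

# Cross-check: free positions plus PUA characters give the PUA positions.
gb18030_characters = gbk_characters + 11               # 21897
gb18030_free = POSITIONS - gb18030_characters          # 2043
consistent = gb18030_free + gb18030_pua_characters == gb18030_pua_positions

print(gbk_pua_positions, gb18030_pua_positions, gb18030_pua_characters, consistent)
```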
2.5 GB18030 four-byte part
For the characters of the GB18030 four-byte part, see "Table 3: Code position arrangement of the four-byte part" in GB18030-2005, 54531 characters in total. For its code positions, see "7.3 Character arrangement order of the four-byte part" in GB18030-2005. Characters are defined in only two regions:
- GB18030 code positions 0x81308130 ~ 0x8439fe39: 50400 code positions in total, mapping the Unicode BMP code points not already mapped by the single-byte and double-byte parts of the standard.
- GB18030 code positions 0x90308130 ~ 0xe339fe39: 1058400 code positions in total, mapping the 16 supplementary planes of Unicode (Plane 1 to Plane 16), 65536*16 = 1048576 code points.
For convenience, 0x81308130 ~ 0x8439fe39 is called the "BMP extension part" and 0x90308130 ~ 0xe339fe39 is called the "supplementary plane part". The code space of the GB18030 four-byte part is 0x81308130 ~ 0xfe39fe39. The second byte has (0x39-0x30 + 1) = 10 possible values, the third byte has (0xfe-0x81 + 1) = 126 possible values, and the fourth byte again has (0x39-0x30 + 1) = 10 possible values. To simplify the calculations below, this article defines a few terms for this code space:
- A region of the four-byte space sharing the same first byte is called a level-1 zone. Each level-1 zone has 10*126*10 = 12600 code positions.
- A region sharing the same first and second bytes is called a level-2 zone. Each level-2 zone has 126*10 = 1260 code positions.
- A region sharing the same first three bytes is called a level-3 zone. Each level-3 zone has 10 code positions.
The four-byte part has (0xfe-0x81 + 1) = 126 level-1 zones in total. The BMP extension part occupies 4 level-1 zones and the supplementary plane part occupies 84. The remaining 38 level-1 zones are reserved or user-defined.
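The zone counts follow from rounding each region up to whole level-1 zones. A short sketch of that reasoning (the names `LEVEL1_SIZE` and `level1_zones_needed` are made up for this example):

```python
LEVEL3_SIZE = 10                   # one value of the fourth byte each
LEVEL2_SIZE = 126 * LEVEL3_SIZE    # 1260
LEVEL1_SIZE = 10 * LEVEL2_SIZE     # 12600


def level1_zones_needed(code_points: int) -> int:
    """Number of whole level-1 zones needed to hold `code_points` (ceiling division)."""
    return -(-code_points // LEVEL1_SIZE)


bmp_zones = level1_zones_needed(39420)              # unmapped BMP code points
supplementary_zones = level1_zones_needed(0x100000) # 16 planes * 65536 code points
total_zones = 0xFE - 0x81 + 1
other_zones = total_zones - bmp_zones - supplementary_zones

# First bytes used by each region.
bmp_first_bytes = range(0x81, 0x81 + bmp_zones)
supplementary_first_bytes = range(0x90, 0x90 + supplementary_zones)

print(bmp_zones, supplementary_zones, other_zones)   # 4 84 38
print(hex(bmp_first_bytes[-1]), hex(supplementary_first_bytes[-1]))  # 0x84 0xe3
```

The 38 remaining level-1 zones are the first bytes 0x85 to 0x8f (11 zones) and 0xe4 to 0xfe (27 zones).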
2.5.1 BMP extension part
The BMP extension part occupies the first four level-1 zones of the four-byte part, 4*12600 = 50400 code positions in total. Its Unicode mapping is simple in principle: the BMP code points not mapped by the single-byte and double-byte parts are mapped in order. These mappings were fixed in GB18030-2000. Later adjustments (such as the one described in Section 2.3) affect only individual characters and do not move any other character. However, because the BMP code points mapped by the double-byte part follow no regular pattern, the Unicode mapping of the BMP extension part cannot be computed by a formula either. A lookup table is still required.
Obviously, only 39420 of the 50400 code positions are used, and the rest are reserved. For fun, let's calculate the position of the last non-reserved code point (0xffff). The calculation is as follows:
- M1 = (39420-1) / 12600 = 3
- N1 = (39420-1) % 12600 = 1619
- M2 = N1 / 1260 = 1619 / 1260 = 1
- N2 = N1 % 1260 = 1619 % 1260 = 359
- M3 = N2 / 10 = 359 / 10 = 35
- N3 = N2 % 10 = 359 % 10 = 9
- The first byte is 0x81 + M1 = 0x81 + 3 = 0x84
- The second byte is 0x30 + M2 = 0x30 + 1 = 0x31
- The third byte is 0x81 + M3 = 0x81 + 35 = 0xa4
- The fourth byte is 0x30 + N3 = 0x30 + 9 = 0x39
Therefore, the GB18030 code position mapped to Unicode code point 0xffff is 0x8431a439. In the BMP extension part, the code positions after 0x8431a439 are reserved. In the calculation above, / denotes integer division (for example, 5 / 3 = 1) and % denotes the remainder (for example, 5 % 3 = 2).
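The calculation above can be sketched as a small function that turns a zero-based position within the BMP extension part into its four bytes. The function name `linear_to_four_bytes` is invented for this example. It only converts a position into bytes; which Unicode code point sits at which position still has to come from a lookup table, as explained above.

```python
def linear_to_four_bytes(index: int, first_byte_base: int = 0x81) -> bytes:
    """Convert a zero-based position in a four-byte region into its four bytes.

    `first_byte_base` is the first byte of the region's first level-1 zone
    (0x81 for the BMP extension part).
    """
    m1, n1 = divmod(index, 12600)   # level-1 zone, offset inside it
    m2, n2 = divmod(n1, 1260)       # level-2 zone, offset inside it
    m3, n3 = divmod(n2, 10)         # level-3 zone, offset inside it
    return bytes([first_byte_base + m1, 0x30 + m2, 0x81 + m3, 0x30 + n3])


# 0xffff is the last of the 39420 code points, so its position is 39420 - 1.
print(linear_to_four_bytes(39420 - 1).hex())  # 8431a439
print(linear_to_four_bytes(0).hex())          # 81308130
print(linear_to_four_bytes(50400 - 1).hex())  # 8439fe39
```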
2.5.2 Supplementary plane part
The 84 level-1 zones 0x90308130 ~ 0xe339fe39 map the 16 supplementary planes of Unicode. These mappings can be computed directly with formulas. Let's see how.
- The mapping from Unicode encoding to GB18030 encoding is computed as follows:
- U = Unicode - 0x10000
- M1 = U / 12600
- N1 = U % 12600
- M2 = N1 / 1260
- N2 = N1 % 1260
- M3 = N2 / 10
- N3 = N2 % 10
- First byte B1 = M1 + 0x90
- Second byte B2 = M2 + 0x30
- Third byte B3 = M3 + 0x81
- Fourth byte B4 = N3 + 0x30
By this method, 0x10ffff maps to 0xe3329a35. In the supplementary plane part, the code positions after 0xe3329a35 are reserved. The algorithm above is easy to write in C/C++. Non-programmers can use Excel formulas instead. Assuming the Unicode encoding (as a hexadecimal string) is in cell A12, the calculation is as follows:
- Put M1 in B12: B12 = INT((HEX2DEC(A12)-65536)/12600)
- Put N1 in C12: C12 = MOD(HEX2DEC(A12)-65536, 12600)
- Put M2 in D12: D12 = INT(C12/1260)
- Put N2 in E12: E12 = MOD(C12, 1260)
- Put M3 in F12: F12 = INT(E12/10)
- Put N3 in G12: G12 = MOD(E12, 10)
- Put the first byte in H12: H12 = DEC2HEX(B12 + 144)
- Put the second byte in I12: I12 = DEC2HEX(D12 + 48)
- Put the third byte in J12: J12 = DEC2HEX(F12 + 129)
- Put the fourth byte in K12: K12 = DEC2HEX(G12 + 48)
Attachment 3 provides an Excel table containing these formulas. To use the HEX2DEC and DEC2HEX functions, you need to enable the "Analysis ToolPak" via "Tools -> Add-Ins".
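For readers who prefer code to spreadsheets, the same Unicode-to-GB18030 steps can be sketched in Python. The function name `supplementary_to_gb18030` is made up for this example, and the function covers only the supplementary planes (0x10000 to 0x10ffff), not the BMP.

```python
def supplementary_to_gb18030(code_point: int) -> bytes:
    """Map a supplementary-plane Unicode code point to its four GB18030 bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points are handled here")
    u = code_point - 0x10000
    m1, n1 = divmod(u, 12600)   # M1, N1
    m2, n2 = divmod(n1, 1260)   # M2, N2
    m3, n3 = divmod(n2, 10)     # M3, N3
    return bytes([m1 + 0x90, m2 + 0x30, m3 + 0x81, n3 + 0x30])


print(supplementary_to_gb18030(0x10000).hex())   # 90308130
print(supplementary_to_gb18030(0x10FFFF).hex())  # e3329a35
```

Python's built-in `gb18030` codec should give the same bytes for supplementary-plane characters, for example `chr(0x20000).encode("gb18030")`, which makes it a convenient cross-check.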
- The mapping from GB18030 encoding to Unicode encoding is computed as follows:
- Let the four bytes of the GB18030 encoding be b1, b2, b3, and b4, in order.
Unicode encoding = 0x10000 + (b1-0x90) * 12600 + (b2-0x30) * 1260 + (b3-0x81) * 10 + (b4-0x30)
If b1, b2, b3, and b4 are placed in cells A4, B4, C4, and D4 (as hexadecimal strings), and the Unicode encoding is to appear in E4, the Excel formula is:
- E4 = DEC2HEX((HEX2DEC(A4)-144) * 12600 + (HEX2DEC(B4)-48) * 1260 + (HEX2DEC(C4)-129) * 10 + (HEX2DEC(D4)-48) + 65536)
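The reverse direction can be sketched in Python in the same way. The function name `gb18030_to_supplementary` is invented for this example; it accepts only byte sequences inside 0x90308130 ~ 0xe339fe39 and rejects the reserved positions after 0xe3329a35.

```python
def gb18030_to_supplementary(encoded: bytes) -> int:
    """Map four GB18030 bytes in the supplementary plane part to a code point."""
    if len(encoded) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = encoded
    in_space = (
        0x90 <= b1 <= 0xE3 and 0x30 <= b2 <= 0x39
        and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39
    )
    if not in_space:
        raise ValueError("bytes are outside 0x90308130 ~ 0xe339fe39")
    code_point = (
        0x10000
        + (b1 - 0x90) * 12600
        + (b2 - 0x30) * 1260
        + (b3 - 0x81) * 10
        + (b4 - 0x30)
    )
    if code_point > 0x10FFFF:
        raise ValueError("reserved code position after 0xe3329a35")
    return code_point


print(hex(gb18030_to_supplementary(bytes.fromhex("90308130"))))  # 0x10000
print(hex(gb18030_to_supplementary(bytes.fromhex("e3329a35"))))  # 0x10ffff
```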
2.6 GB18030 and Unicode mapping tables
Attachment 3 provides mapping tables between GB18030 and Unicode. This Excel file was created from the mapping table made by Mr. Xie Zhenbin, a fellow netizen. It contains three tables:
- The mapping between the 23940 code positions of the double-byte part and Unicode, given as two groups of data sorted by GB18030 and by Unicode respectively.
- The mapping between the 39420 code positions of the BMP extension part and Unicode, given as two groups of data sorted by GB18030 and by Unicode respectively.
- The conversion formulas between GB18030 encoding and Unicode encoding for the supplementary plane part.
3 Graphical symbols in GB2312, GBK, and GB18030
While studying the GB18030 encoding, I sorted out the graphical symbols of GB2312, GBK, and GB18030 in Zone 1 and Zone 5 and compiled Attachment 4. This Excel file contains three tables:
- The Zone 1 table of GB2312, and the Zone 1 and Zone 5 tables of GBK and GB18030. The 35 characters added by GBK and the 11 characters added by GB18030 are marked in different colors.
- The encoding of the 682 symbols in Zone 1 of GB2312.
- The encoding of the 717 symbols in Zone 1 of GBK.
Conclusion
Having read this article, can you now answer the questions posed at the beginning?
Whether on Windows XP or Vista, the default code page for the Chinese (China) region is still GBK. We can only set the region; we cannot set the default code page for a region. In the Windows world, unless Microsoft decides otherwise, GB18030 is just an ordinary code page. At present, Simplified Chinese documents are mostly encoded in Unicode or GBK, and there are probably no documents saved in GB18030. This article studies the GB18030 encoding purely out of a programmer's curiosity, and I hope it helps readers who are equally curious.