For example, the Chinese character 广 ("wide") is encoded as %B9%E3: %B9 is the section code (the first byte), and %E3 is the character code within that section (the second byte).
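As a quick sanity check (a Python 3 sketch using the standard library's gbk codec and urllib.parse, not part of the original post), the same two percent-escaped bytes fall out of the built-in codec:

```python
from urllib.parse import quote

# Encode the character with the built-in GBK codec, then percent-escape
# the raw bytes: the first escape is the section byte, the second is the
# character byte within the section.
raw = '广'.encode('gbk')   # b'\xb9\xe3'
encoded = quote(raw)       # '%B9%E3'

print(raw, encoded)
```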
Ideas:
Collect the Chinese characters from the GBK code table page at http://ff.163.com/newflyff/gbk-list/
From a practical point of view, use only the section "● GBK/2: GB2312 Chinese characters", which contains 3755 characters in total.
Note the regularity: the section (first) byte runs from B0 to D7, and the character (second) byte runs from A1 to FE, that is, 16*6-2 = 94 possible values per section.
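These ranges can be verified against Python 3's built-in gbk codec (an illustrative sketch of mine, not from the original post). Note that 40 * 94 = 3760, slightly more than 3755: the last section, D7, is only partially used.

```python
# 94 possible second bytes per section (A1..FE inclusive)
second_bytes = 0xFE - 0xA1 + 1
# 40 level-1 sections (B0..D7 inclusive)
sections = 0xD7 - 0xB0 + 1

# The first level-1 character, 啊, sits at the very start of the range,
# so its GBK encoding should be the two bytes B0 A1.
first = '啊'.encode('gbk')

print(second_bytes, sections, first)
```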
Step 1: use Python to extract the commonly used Chinese characters and store them, in order, in a dictionary file, with the characters separated by spaces.
Step 2: since the second byte runs from A1 to FE and each section holds 94 characters, first locate the section code, then use the character's position within its section to locate the second (character) code.
Implementation:
Step 1: extract the Chinese characters. The code is as follows:

with open('e:/GBK.txt') as f:
    gbk = f.read().split()
The split list still contains the repeated section labels such as B0/B1, i.e. the full-width 0-9 and A-F characters, which must be removed. Decode every item of the split list first, and then delete those characters one by one:

gbk.remove(u'\uff10')

To delete all of them (\uff10-\uff19 are full-width 0-9, \uff21-\uff26 are full-width A-F), I generated the series of strings with range and touched them up in Notepad++; I did not find a simpler way. The code is as follows:

for t in [u'\uff10', u'\uff11', u'\uff12', u'\uff13', u'\uff14',
          u'\uff15', u'\uff16', u'\uff17', u'\uff18', u'\uff19',
          u'\uff21', u'\uff22', u'\uff23', u'\uff24', u'\uff25', u'\uff26']:
    gbk.remove(t)
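Instead of maintaining the removal list by hand, the tokens can also be filtered by whether they really are GBK level-1 characters, using Python 3's built-in gbk codec (a sketch of mine, not the original approach; the sample token list is made up):

```python
def is_level1(ch):
    # A level-1 character encodes to exactly two GBK bytes:
    # first byte in B0-D7, second byte in A1-FE.
    try:
        b = ch.encode('gbk')
    except UnicodeEncodeError:
        return False
    return len(b) == 2 and 0xB0 <= b[0] <= 0xD7 and 0xA1 <= b[1] <= 0xFE

# Full-width '0' (\uff10), full-width 'A' (\uff21) and the full-width
# space (\u3000) are all rejected; real characters pass through.
tokens = ['啊', '\uff10', '\uff21', '广', '\u3000']
cleaned = [c for c in tokens if is_level1(c)]
print(cleaned)  # -> ['啊', '广']
```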
Next, remove the B0-D7 section labels themselves. Since extracting the character codes also needs the A1-FE sequence, it is convenient to generate such a list once, for both the deletions and the later index lookups.
Generate the code sequence:

Within each two-character code, the second digit cycles through 0-9 and A-F, and the first digit runs from A to F. Start from A1 and increment; the boundaries (e.g. A9 to AA, AF to B0) need special handling, using the ord() and chr() functions to convert between ASCII codes and numbers. The code is as follows:

t = ['A1']
while True:
    if t[-1] == 'FE':
        break
    # second digit in 0-8 or A-E: just increment it
    if (48 <= ord(t[-1][1]) < 57) or (65 <= ord(t[-1][1]) < 70):
        t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
        continue
    # second digit is 9: jump to A
    if 57 <= ord(t[-1][1]) < 65:
        t.append(t[-1][0] + chr(65))
        continue
    # second digit is F: carry into the first digit, second digit back to 0
    if ord(t[-1][1]) >= 70:
        t.append(chr(ord(t[-1][0]) + 1) + chr(48))
        continue
The resulting list runs A1, A2, ..., A9, AA, ..., AF, B0, ..., FE.
With this code sequence, the B0-D7 section labels can be deleted from the gbk list.
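As a cross-check (a sketch of mine, not from the original post), the same 94-entry sequence can be produced in one line, since A1-FE are simply consecutive byte values:

```python
# Every byte value from 0xA1 to 0xFE, formatted as two uppercase hex digits
t = ['%02X' % v for v in range(0xA1, 0xFF)]

print(len(t), t[0], t[9], t[10], t[-1])
```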
Finally, check whether any spaces were left undeleted; the Unicode code point of the full-width space is \u3000:

gbk.remove(u'\u3000')
Finally, encode each character as UTF-8 and save the result to the dictionary file.
I have put this dictionary file on a net disk; external link: http://dl.dbank.com/c0m9selr6h
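If the link is unavailable, the same 3755-character dictionary can be regenerated locally from Python 3's built-in gbk codec, with no scraping at all (my own sketch; the file name gbk_level1.txt is a placeholder of mine, not the author's):

```python
# Decode every level-1 code point in order: sections B0..D7, second
# bytes A1..FE, stopping after the 3755 assigned characters.
chars = [bytes([0xB0 + i // 94, 0xA1 + i % 94]).decode('gbk')
         for i in range(3755)]

# Save the characters, space-separated, UTF-8 encoded, as in the post.
with open('gbk_level1.txt', 'w', encoding='utf-8') as f:
    f.write(' '.join(chars))
```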
Step 2: Index Chinese Characters
The indexing algorithm is simple. Because the characters in the dictionary file are stored in their original order, and the 3755 Chinese characters in GBK table 2 strictly follow the rule of 94 characters per section, integer division of a character's zero-based index by 94 gives the section offset (added to B0 to obtain the section code), and the index modulo 94 picks the second code from the A1-FE list generated above.
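To illustrate the arithmetic (a sketch of mine, verified against Python 3's gbk codec): using the zero-based index keeps the section boundaries correct, e.g. the 94th character (zero-based index 93) must map to B0 FE, still inside the first section.

```python
def section_and_char(idx):
    # idx is the zero-based position in the 3755-character dictionary
    return 0xB0 + idx // 94, 0xA1 + idx % 94

# First character (index 0) -> B0 A1, i.e. 啊
hi, lo = section_and_char(0)
# Last character of the first section (index 93) -> B0 FE
hi2, lo2 = section_and_char(93)

print('%02X %02X' % (hi, lo), '%02X %02X' % (hi2, lo2))
```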
With the algorithm worked out, it only remains to code and debug it.
The Python (2.x) code with comments is attached:
def getGBKCode(gbkFile='e:/GBK1.1.txt', s=''):
    # gbkFile is the dictionary file containing the 3755 Chinese characters
    # s is the string to convert; it is gb2312-encoded, i.e. the encoding
    # of Chinese characters entered from IDLE
    # Read the dictionary
    with open(gbkFile) as f:
        gbk = f.read().split()
    # Generate the A1-FE index sequence
    t = ['A1']
    while True:
        if t[-1] == 'FE':
            break
        if (48 <= ord(t[-1][1]) < 57) or (65 <= ord(t[-1][1]) < 70):
            t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
            continue
        if 57 <= ord(t[-1][1]) < 65:
            t.append(t[-1][0] + chr(65))
            continue
        if ord(t[-1][1]) >= 70:
            t.append(chr(ord(t[-1][0]) + 1) + chr(48))
            continue
    # Index each Chinese character in turn
    l = list()
    for st in s.decode('gb2312'):
        st = st.encode('utf-8')
        # Zero-based position of the character in the dictionary
        i = gbk.index(st)
        # Section codes start from B0; integer division by 94 gives the section
        t1 = '%' + t[t.index('B0'):][i / 94]
        # The remainder gives the character's position within its section
        t2 = '%' + t[i % 94]
        l.append(t1 + t2)
    # Concatenate the %XX pairs and return
    return ''.join(l)
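For what it is worth, on Python 3 the same result can be obtained without the dictionary file at all, because the standard codec already knows the GBK byte values (my own sketch, not the original code):

```python
from urllib.parse import quote

def gbk_percent(s):
    # Percent-escape the GBK bytes of each character, uppercase hex,
    # producing the same %XX%XX form as the dictionary lookup.
    return ''.join(quote(ch.encode('gbk')) for ch in s)

print(gbk_percent('广州'))  # -> %B9%E3%D6%DD
```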
I must admit that my Python code is not particularly neat.
My Weibo ID: James Cooper