Today I wanted to use Python to fetch search results from the Baidu suggestion box, and I noticed that the Chinese characters in the URL are GBK-encoded. Although Chinese characters can be added to the URL directly, I had already written a Python function for converting simplified Chinese characters to their GBK codes; it was still a little clumsy, so today I reworked it.
For example, "广" is encoded as %B9%E3: %B9 is the section (row) byte, and %E3 is the position byte within the section (the second byte).
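In Python 3 the GBK codec is built in, so this example can be checked directly (a quick sketch of the encoding itself; the exact Baidu URL format is not reproduced here):

```python
from urllib.parse import quote

# The two GBK bytes of the character 广
raw = '广'.encode('gbk')
print(raw.hex().upper())                     # B9E3

# Percent-encoding those bytes gives the form seen in the URL
print(quote('广', safe='', encoding='gbk'))  # %B9%E3
```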
Ideas:
Collect the Chinese characters from the GBK encoding page http://ff.163.com/newflyff/gbk-list/
From a practical point of view, only the section "● GBK/2: GB2312 Chinese characters" is used, 3755 characters in total.
Look at the pattern: the section bytes run from B0 to D7, and the position bytes run from A1 to FE, i.e. 16 * 6 - 2 = 94 positions per section. Very regular.
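The arithmetic behind that rule, as a sanity check (nothing here beyond the byte ranges just described):

```python
# Position bytes A1..FE inclusive
assert 0xFE - 0xA1 + 1 == 94
# Same count viewed as six 16-cell columns minus the unused A0 and FF cells
assert 16 * 6 - 2 == 94
# Section bytes B0..D7 inclusive give 40 sections
assert 0xD7 - 0xB0 + 1 == 40
# 40 sections of 94 cells are enough for 3755 characters (the last section is only partly full)
assert 40 * 94 >= 3755
print('ok')
```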
Step 1: use Python to extract the commonly used Chinese characters and store them, in order, in a dictionary file, separated by spaces.
Step 2: since position bytes run from A1 to FE with 94 characters per section, first locate the section byte, then use the character's position within its section to find the second byte.
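The two steps amount to simple index arithmetic. A minimal sketch (the helper name `locate` is mine; it assumes a 0-based index into the ordered dictionary):

```python
def locate(i):
    # i: 0-based index of the character in the ordered dictionary file
    section = 0xB0 + i // 94    # section (first) byte, starting at B0
    position = 0xA1 + i % 94    # position (second) byte, starting at A1
    return '%%%02X%%%02X' % (section, position)

print(locate(0))    # first dictionary entry -> %B0%A1
print(locate(95))   # second entry of the second section -> %B1%A2
```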
Implementation:
Step 1: extract Chinese characters
The code is as follows:
with open('E:/GBK.txt') as f:
    gbk = f.read().split()   # split on any whitespace; splitlines() alone returns a list, which has no split()
The split list still contains the repeated section codes (B0, B1, ...) and the full-width 0-9/A-F header characters from the table, which must be removed. First decode every entry of the list to Unicode, then delete each unwanted character:
The code is as follows:
gbk.remove(u'\uff10')
To delete these characters I generated the strings with range() and then tidied them up in Notepad++; I didn't find a simpler way.
The code is as follows:
for t in [u'\uff10', u'\uff11', u'\uff12', u'\uff13', u'\uff14', u'\uff15',
          u'\uff16', u'\uff17', u'\uff18', u'\uff19', u'\uff21', u'\uff22',
          u'\uff23', u'\uff24', u'\uff25', u'\uff26']:
    gbk.remove(t)
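The same cleanup can be written without the long literal list by building the full-width ranges from their code points (Python 3 syntax, shown on a toy list; the real gbk list comes from the file above):

```python
# Full-width ０-９ are U+FF10..U+FF19; full-width Ａ-Ｆ are U+FF21..U+FF26
drop = {chr(cp) for cp in range(0xFF10, 0xFF1A)} | \
       {chr(cp) for cp in range(0xFF21, 0xFF27)}

gbk = ['０', '１', 'Ａ', '啊', '阿']      # toy example
gbk = [ch for ch in gbk if ch not in drop]
print(gbk)                                # ['啊', '阿']
```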
Next, the B0-D7 section codes have to be removed; extracting the position byte also needs the A1-FE series, so it is convenient to generate that list once and use it for both the deletion and the indexing.
Generate the code series:
The low digit cycles over 0-9 and A-F, and the high digit over A-F.
Start from A1 and increment, handling the boundaries (e.g. A9 → AA) explicitly, using the ord() and chr() functions to convert between ASCII codes and numbers.
The code is as follows:
t = ['A1']
while True:
    if t[-1] == 'FE':
        break
    c = ord(t[-1][1])             # ASCII code of the low hex digit
    if (c >= 48 and c < 57) or (c >= 65 and c < 70):   # '0'-'8' or 'A'-'E': bump it
        t.append(t[-1][0] + chr(c + 1))
        continue
    if c >= 57 and c < 65:        # '9': wrap to 'A'
        t.append(t[-1][0] + chr(65))
        continue
    if c >= 70:                   # 'F': advance the high digit, reset the low one to '0'
        t.append(chr(ord(t[-1][0]) + 1) + chr(48))
        continue
This yields the list ['A1', 'A2', ..., 'FE'].
With this code sequence, the B0-D7 section codes can be deleted from the gbk list.
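For reference, the whole A1-FE series can also be produced in one line from the numeric range, which makes the B0-D7 slice easy to take (a Python 3 alternative, not the original code):

```python
t = ['%02X' % v for v in range(0xA1, 0xFF)]       # 'A1', 'A2', ..., 'FE'
print(t[0], t[-1], len(t))                        # A1 FE 94

# The section codes to delete from the gbk list
sections = t[t.index('B0'):t.index('D7') + 1]
print(sections[0], sections[-1], len(sections))   # B0 D7 40
```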
Finally, check whether any spaces were left undeleted; the full-width space has the Unicode code point \u3000:
gbk.remove(u'\u3000')
Finally, encode the characters as UTF-8 and save them to the dictionary file.
I have put this dictionary file on my online drive; external link: http://dl.dbank.com/c0m9selr6h
Step 2: Index Chinese characters
Indexing is a simple calculation: because the characters in the dictionary are stored in their original order, and the 3755 Chinese characters in GBK table 2 strictly follow the rule of 94 characters per section, integer division of the character's index by 94 locates the section byte, and index - section * 94 gives the character's position within that section; that position is then looked up in the A1-FE list generated above to obtain the second byte.
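One detail worth checking is the section boundary: with a 1-based index, i/94 ticks over one character too early at exact multiples of 94, whereas the 0-based formulas stay consistent. A small check (the helper name is mine):

```python
def locate(i):
    # 0-based index -> (section byte, position byte)
    return 0xB0 + i // 94, 0xA1 + i % 94

# Index 93 is the last cell of the first section; index 94 opens the next one
assert locate(93) == (0xB0, 0xFE)
assert locate(94) == (0xB1, 0xA1)
print('boundary ok')
```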
With the algorithm worked out, the rest is coding and debugging.
Python code and comments are attached:
The code is as follows:
def getGBKCode(gbkFile='E:/GBK1.1.txt', s=''):
    # gbkFile: the dictionary file holding the 3755 Chinese characters in order
    # s: the Chinese characters to convert, currently gb2312-encoded bytes
    #    (i.e. Chinese input typed at the IDLE prompt)
    # Read the dictionary
    with open(gbkFile) as f:
        gbk = f.read().split()
    # Generate the A1-FE index series
    t = ['A1']
    while True:
        if t[-1] == 'FE':
            break
        c = ord(t[-1][1])
        if (c >= 48 and c < 57) or (c >= 65 and c < 70):
            t.append(t[-1][0] + chr(c + 1))
            continue
        if c >= 57 and c < 65:
            t.append(t[-1][0] + chr(65))
            continue
        if c >= 70:
            t.append(chr(ord(t[-1][0]) + 1) + chr(48))
            continue
    # Index each Chinese character in turn
    l = list()
    for st in s.decode('gb2312'):
        st = st.encode('utf-8')
        # 0-based index; using index + 1 here goes wrong at exact multiples of 94
        i = gbk.index(st)
        # Section bytes start at B0
        t1 = '%' + t[t.index('B0') + i // 94]
        # Position of the character within its section
        t2 = '%' + t[i % 94]
        l.append(t1 + t2)
    # Finally, separate the outputs with spaces
    return ' '.join(l)
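For comparison, in Python 3 the whole lookup can lean on the built-in GBK codec instead of a dictionary file; a hypothetical equivalent (function name mine) that keeps the space-separated output:

```python
from urllib.parse import quote

def get_gbk_code(s):
    # Percent-encode each character's GBK bytes, one token per character
    return ' '.join(quote(ch, safe='', encoding='gbk') for ch in s)

print(get_gbk_code('广'))   # %B9%E3
```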
I must admit that my Python code is not the neatest.
Finally, my Weibo ID: James Cooper