Note:
Today I wanted to use Python to fetch the search results computed by the Baidu search box. Looking at the URL, I could see that the Chinese characters in it are encoded as GBK. Chinese characters can be put into the URL directly, and I had already written a Python function that converts Simplified Chinese characters to GBK codes, but it was a little troublesome, so I reworked it today.
For example, 广 ("wide") is encoded as %B9%E3. %B9 is called the section code (first byte), and %E3 is the character code (second byte).
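As a cross-check on the byte values quoted above, here is a minimal sketch that uses the standard library's urllib.parse.quote (Python 3) with encoding='gbk'. It is not the approach this post takes; it only shows what the percent-encoded form looks like. The helper name gbk_percent_encode is my own.

```python
from urllib.parse import quote

def gbk_percent_encode(text):
    # Encode `text` as GBK and percent-encode every resulting byte.
    # safe='' keeps quote() from leaving any character unescaped.
    return quote(text, safe='', encoding='gbk')

print(gbk_percent_encode('广'))  # %B9%E3
```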
Ideas:
Collect the Chinese characters from a GBK encoding table page: http://ff.163.com/newflyff/gbk-list/
From a practical point of view, select only the section "● GBK/2: GB2312 Chinese characters", with a total of 3755 Chinese characters.
Look at the pattern: the section code runs from B0 to D7 and the character code from A1 to FE, that is, 16*6-2 = 94 codes per section; very regular.
Step 1: extract the commonly used Chinese characters with Python and store them in a dictionary file in their original order, separated by spaces.
Step 2: since the character code runs from A1 to FE with 94 Chinese characters per section, first locate the section code, then use the character's position within the section to locate the character code.
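The arithmetic behind that pattern can be checked in a few lines. This is only a sketch; the names positions and sections are mine, not from the original code.

```python
# Character codes (second byte) A1..FE and section codes (first byte) B0..D7.
positions = range(0xA1, 0xFE + 1)
sections = range(0xB0, 0xD7 + 1)

print(len(positions))                  # 94, the same as 16*6 - 2
print(len(sections))                   # 40
print(len(sections) * len(positions))  # 3760
```

40 sections of 94 give 3760 slots, slightly more than the 3755 characters listed, so the last section cannot be completely full. That does not disturb the indexing, because every earlier section is full.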
Implementation:
Step 1: extract Chinese Characters
with open('E:/GBK.txt') as f:
    # split() with no arguments splits on any whitespace, line breaks included
    s = f.read().split()
The split list still contains the repeated section codes (B0, B1, ... and similar labels) as well as the full-width 0-9/A-F characters, all of which have to be removed.
Decode every item in the split list first (the decoded list is called gbk below), and then delete these characters, for example:
gbk.remove(u'\uff10')
To delete all of these characters, I generated the series of strings with range and then tidied it up in Notepad++. I did not find a simpler method.
for t in [u'\uff10',u'\uff11',u'\uff12',u'\uff13',u'\uff14',u'\uff15',u'\uff16',u'\uff17',u'\uff18',u'\uff19',u'\uff21',u'\uff22',u'\uff23',u'\uff24',u'\uff25',u'\uff26']:
    gbk.remove(t)
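One possible shortcut for building that list, sketched here for Python 3 (in Python 2, unichr would take the place of chr): the full-width digits occupy U+FF10 to U+FF19 and the full-width letters A-F occupy U+FF21 to U+FF26, so both runs can be produced with range. Filtering with a set also covers the case where a character occurs more than once, since list.remove only deletes the first match. The names FULLWIDTH_HEADERS and drop_fullwidth_headers are hypothetical.

```python
# Full-width 0-9 are U+FF10..U+FF19; full-width A-F are U+FF21..U+FF26.
FULLWIDTH_HEADERS = set(
    [chr(c) for c in range(0xFF10, 0xFF1A)] +
    [chr(c) for c in range(0xFF21, 0xFF27)]
)

def drop_fullwidth_headers(items):
    # Keep every item that is not one of the 16 full-width characters.
    return [item for item in items if item not in FULLWIDTH_HEADERS]

# Made-up sample: two real characters mixed with repeated full-width ones.
sample = ['\uff10', '啊', '\uff21', '阿', '\uff10']
print(drop_fullwidth_headers(sample))  # ['啊', '阿']
```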
Next, remove the section codes B0-D7. Extracting the character code later also needs a similar run of codes, A1-FE, so I wanted to generate such a list once; it is convenient for both the deletion and the indexing.
Generate encoding series:
One hex digit of each code runs through 0-9 and A-F, and the other (leading) digit runs through A-F.
Start from A1 and increment; the boundaries (for example A9 to AA) have to be handled by hand, using the ord() and chr() functions to convert between ASCII characters and their numeric codes.
t=['A1']
while True:
    if t[-1]=='FE':
        break
    # second digit is 0-8 or A-E: just move to the next character
    if (ord(t[-1][1])>=48 and ord(t[-1][1])<57) or (ord(t[-1][1])>=65 and ord(t[-1][1])<70):
        t.append(t[-1][0]+chr(ord(t[-1][1])+1))
        continue
    # second digit is 9: jump to A (ASCII 65)
    if ord(t[-1][1])>=57 and ord(t[-1][1])<65:
        t.append(t[-1][0]+chr(65))
        continue
    # second digit is F: advance the first digit and reset the second to 0 (ASCII 48)
    if ord(t[-1][1])>=70:
        t.append(chr(ord(t[-1][0])+1)+chr(48))
        continue
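For comparison, the same A1-FE sequence can also be produced by formatting integers as hexadecimal, which avoids the boundary handling. This is an alternative sketch, not the method used in this post; hex_codes is a hypothetical helper name.

```python
def hex_codes(first=0xA1, last=0xFE):
    # Two-digit upper-case hex strings for every value from first to last inclusive.
    return ['%02X' % n for n in range(first, last + 1)]

codes = hex_codes()
print(codes[0], codes[9], codes[-1], len(codes))  # A1 AA FE 94
```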
The resulting list runs from A1 through FE, 94 entries in all.
With this code sequence, the B0-D7 section labels can be deleted from the gbk list.
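A minimal sketch of that deletion, assuming the section labels appear in the split list as plain two-character ASCII strings such as 'B0' (the page layout is not shown here, so that is an assumption). SECTION_LABELS and drop_section_labels are names I made up for the example.

```python
# The 40 section labels B0..D7 as upper-case hex strings.
SECTION_LABELS = set('%02X' % n for n in range(0xB0, 0xD7 + 1))

def drop_section_labels(items):
    # Remove every occurrence of a section label, keeping the original order.
    return [item for item in items if item not in SECTION_LABELS]

# Made-up sample mimicking labels interleaved with characters.
sample = ['B0', '啊', '阿', 'B1', '薄']
print(drop_section_labels(sample))  # ['啊', '阿', '薄']
```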
Finally, check whether any spaces were left undeleted. The Unicode code point of the (full-width) space is \u3000.
gbk.remove(u'\u3000')
Finally, encode the characters as UTF-8 and save them to the dictionary file.
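The save step is not shown in the post, so here is one way it might look: a sketch that writes the characters space-separated as UTF-8 and reads them back in the same order. The function names, the sample data and the temporary path are all assumptions made for the example.

```python
import io
import os
import tempfile

def save_dictionary(chars, path):
    # Write the characters separated by single spaces, encoded as UTF-8.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(u' '.join(chars))

def load_dictionary(path):
    # Read the file back and split on whitespace; the order is preserved.
    with io.open(path, encoding='utf-8') as f:
        return f.read().split()

# Round trip through a temporary file with a tiny sample.
sample = [u'啊', u'阿', u'埃']
path = os.path.join(tempfile.mkdtemp(), 'GBK_sample.txt')
save_dictionary(sample, path)
print(load_dictionary(path) == sample)  # True
```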
I put this dictionary file on a network drive; external link: http://dl.dbank.com/c0m9selr6h
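If that link is not reachable, an equivalent character list can probably be rebuilt without scraping any page, by decoding every B0-D7 / A1-FE byte pair with Python's built-in gb2312 codec (Python 3 shown). This is a substitute for the extraction described above, not the author's procedure, and build_level1_chars is a hypothetical name.

```python
def build_level1_chars():
    # Walk the sections B0..D7 and the character codes A1..FE in order,
    # decoding each two-byte pair; unassigned pairs are skipped.
    chars = []
    for hi in range(0xB0, 0xD7 + 1):
        for lo in range(0xA1, 0xFE + 1):
            try:
                chars.append(bytes([hi, lo]).decode('gb2312'))
            except UnicodeDecodeError:
                pass
    return chars

chars = build_level1_chars()
print(len(chars), chars[0])
```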
Step 2: Index Chinese Characters
Indexing is a simple algorithm. The Chinese characters in the dictionary are stored in their original order, and the 3755 Chinese characters of GBK/2 strictly follow the rule of 94 characters per section. Counting positions from zero, integer division of a character's dictionary index by 94 gives its section index, which locates the section code counting up from B0. The character's index minus section index * 94 then gives its position inside the section, and the character code (second byte) is looked up at that position in the A1-FE list generated above.
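The index arithmetic can be tried on its own before any file is involved. The sketch below computes the two bytes directly from a zero-based dictionary index instead of looking them up in the A1-FE list; index_to_code is a hypothetical helper, not part of the original function.

```python
def index_to_code(i):
    # i is the zero-based position of a character in the dictionary.
    # Quotient = section index (from B0), remainder = position in the section (from A1).
    section, position = divmod(i, 94)
    return '%%%02X%%%02X' % (0xB0 + section, 0xA1 + position)

print(index_to_code(0))    # %B0%A1
print(index_to_code(93))   # %B0%FE
print(index_to_code(94))   # %B1%A1
```

The last character of a section (index 93) and the first of the next (index 94) are the cases worth checking, since an off-by-one there puts the character in the wrong section.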
With the algorithm worked out, what remains is coding and debugging.
Python code and comments are attached:
import io

def getGBKCode(gbkFile='e:/GBK1.1.txt', s=''):
    # gbkFile: the dictionary file (UTF-8) containing the 3755 Chinese characters
    # s: the Chinese characters to convert, either a gb2312 byte string
    #    (that is, Chinese characters as entered from IDLE) or a unicode string

    # Read the dictionary
    with io.open(gbkFile, encoding='utf-8') as f:
        gbk = f.read().split()

    # Generate the index codes A1-FE
    t = ['A1']
    while True:
        if t[-1] == 'FE':
            break
        if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
            t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
            continue
        if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
            t.append(t[-1][0] + chr(65))
            continue
        if ord(t[-1][1]) >= 70:
            t.append(chr(ord(t[-1][0]) + 1) + chr(48))
            continue

    # Decode gb2312 input so it can be compared with the dictionary entries
    if isinstance(s, bytes):
        s = s.decode('gb2312')

    # Index each Chinese character in sequence
    l = list()
    for st in s:
        # zero-based position in the dictionary, so the last character
        # of a section still divides into the correct section
        i = gbk.index(st)
        # The section codes start from B0; get the section code of the character
        t1 = '%' + t[t.index('B0'):][i // 94]
        # Position of the character within its section
        t2 = '%' + t[i % 94]
        l.append(t1 + t2)
    # Join the codes of all characters into one string
    return ''.join(l)
I must admit that my python code is not so neat.
My Weibo ID: James Cooper
That's all. Please give it a try.