As shown in the figure, the character "广" ("wide") is encoded as %B9%E3. Here %B9 is called the section code (the first byte) and %E3 the character code (the second byte).
Approach:
Collect the Chinese characters from a GBK code table page: http://ff.163.com/newflyff/gbk-list/
For practical purposes, only the "GBK/2: GB2312 Chinese Characters" section is used, 3,755 Chinese characters in total.
Observe the pattern: section codes run from B0 to D7, and character codes within each section run from A1 to FE, that is, 16*6-2=94 codes per section, which is very regular.
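The count of 94 can be checked directly. This is a minimal Python 3 sketch that assumes only the byte ranges stated above; the variable names are illustrative.

```python
# Second byte (character code) runs from 0xA1 to 0xFE inclusive.
codes_per_section = 0xFE - 0xA1 + 1      # six rows of 16, minus A0 and FF
assert codes_per_section == 16 * 6 - 2

# First byte (section code) runs from 0xB0 to 0xD7 inclusive.
sections = 0xD7 - 0xB0 + 1
slots = sections * codes_per_section

# Slots left over once the 3,755 characters are placed.
unused = slots - 3755
print(codes_per_section, sections, slots, unused)
```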
Step one: use Python to extract the commonly used Chinese characters and save them, in order, to a dictionary file, with the characters separated by spaces.
Step two: since every section holds 94 characters coded A1-FE, first locate the section code, then use the character's position within its section to locate the character code.
Implementation:
Step one: extract the Chinese characters
The code is as follows:
with open('E:/gbk.txt') as f:
    # split() with no arguments splits on newlines and spaces alike
    s = f.read().split()
The split list still contains the repeated section codes, so labels such as B0, B1, ... and the full-width 0-9 and A-F characters have to be removed.
Decode the extracted characters to inspect them:
To remove these characters, first decode the split list, and then call remove on it:
The code is as follows:
gbk.remove(u'\uff10')
To delete all of these characters, I generated the series of strings with range and then tidied them up in Notepad++; I did not find a simpler way.
The code is as follows:
for t in [u'\uff10', u'\uff11', u'\uff12', u'\uff13', u'\uff14', u'\uff15', u'\uff16', u'\uff17', u'\uff18', u'\uff19', u'\uff21', u'\uff22', u'\uff23', u'\uff24', u'\uff25', u'\uff26']:
    gbk.remove(t)
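The list of full-width characters does not have to be typed out by hand. Here is a minimal sketch in Python 3 (where chr returns a Unicode character), assuming gbk is the decoded list from the previous step. The sample list is made up for illustration.

```python
# Full-width digits U+FF10..U+FF19 and full-width letters A-F U+FF21..U+FF26.
fullwidth = {chr(c) for c in range(0xFF10, 0xFF1A)}
fullwidth |= {chr(c) for c in range(0xFF21, 0xFF27)}

# Hypothetical sample standing in for the decoded list.
gbk = ['\uff10', '啊', '\uff21', '阿', '\uff19', '\uff26']

# Keep only the entries that are not full-width row/column labels.
gbk = [ch for ch in gbk if ch not in fullwidth]
print(gbk)
```

Unlike list.remove, which deletes only the first match, the filter drops every occurrence in a single pass.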
Next, remove the section codes B0-D7. Locating the character code later also needs a similar A1-FE sequence, so I want to generate such a list to make both deletion and indexing easy.
Generating the code sequence:
The low hex digit runs through 0-9 and A-F, and the high hex digit through A-F.
Increment from A1. Boundaries such as A9 to AA have to be handled by hand, using the ord() and chr() functions to convert between ASCII characters and numbers.
The code is as follows:
t = ['A1']
while True:
    if t[-1] == 'FE':
        break
    # Low digit is 0-8 or A-E: just increment it
    if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
        t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
        continue
    # Low digit is 9: jump to A
    if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
        t.append(t[-1][0] + chr(65))
        continue
    # Low digit is F: carry into the high digit and reset to 0
    if ord(t[-1][1]) >= 70:
        t.append(chr(ord(t[-1][0]) + 1) + chr(48))
        continue
The resulting list runs from 'A1' through 'FE', 94 entries in total.
With this code sequence, the B0-D7 section codes can be removed from the gbk list.
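The same A1-FE sequence can also be produced straight from the numeric range, and the section codes filtered out in one pass. This is a sketch in Python 3; the names codes and section_codes and the sample list are assumptions made for illustration.

```python
# All two-digit uppercase hex codes from A1 to FE.
codes = ['%02X' % n for n in range(0xA1, 0xFF)]

# Section codes B0..D7 are a contiguous slice of that sequence.
section_codes = set(codes[codes.index('B0'):codes.index('D7') + 1])

# Hypothetical sample: section labels mixed in with characters.
gbk = ['B0', '啊', '阿', 'B1', '薄']
gbk = [item for item in gbk if item not in section_codes]
print(len(codes), codes[0], codes[-1], gbk)
```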
A last check shows that one kind of space has still not been removed; its Unicode code point is \u3000 (the ideographic space):
gbk.remove(u'\u3000')
Finally, encode the characters as UTF-8 and save them to the dictionary file.
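Here is a minimal Python 3 sketch of this last step, assuming gbk holds the cleaned characters. The file name gbk_dict.txt and the sample list are placeholders, not the author's actual file.

```python
import os
import tempfile

# Sample data with one leftover ideographic space.
gbk = ['啊', '\u3000', '阿', '埃']

# Drop ideographic spaces (U+3000), then join with ASCII spaces.
chars = [ch for ch in gbk if ch != '\u3000']
path = os.path.join(tempfile.gettempdir(), 'gbk_dict.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(' '.join(chars))

# Reading the file back with split() restores the list in its original order.
with open(path, encoding='utf-8') as f:
    restored = f.read().split()
print(restored)
```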
I have uploaded this dictionary file; external link: http://dl.dbank.com/c0m9selr6h
Step two: index the Chinese characters
The indexing algorithm is simple. The characters in the dictionary are stored in their original order, and the 3,755 characters of GBK code table 2 strictly follow the rule of 94 characters per section. An integer division of a character's index by 94 therefore locates its section code, and the character's index minus the section index * 94 gives its position within that section. The A1-FE list generated above is then indexed with that position to find the second byte.
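The arithmetic can be sketched as follows in Python 3, assuming a zero-based index into the dictionary (an assumption of this sketch). The function name code_from_index is hypothetical. The result is cross-checked against Python's built-in gb2312 codec.

```python
def code_from_index(i):
    """Map a zero-based position among the 3,755 characters to '%XX%YY'."""
    section, position = divmod(i, 94)   # section number, offset inside it
    first = 0xB0 + section              # section codes start at B0
    second = 0xA1 + position            # character codes start at A1
    return '%%%02X%%%02X' % (first, second)

# The first character of the table sits at B0A1.
print(code_from_index(0))

# Cross-check one position against the codec.
i = 912
raw = bytes([0xB0 + i // 94, 0xA1 + i % 94])
print(code_from_index(i), raw.decode('gb2312'))
```

With a zero-based index, the last character of a section (offset 93) maps to FE without any special case.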
With the algorithm worked out, what remains is coding and debugging.
The Python code with comments is attached:
The code is as follows:
def getgbkcode(gbkFile='E:/gbk1.1.txt', s=''):
    # gbkFile: the dictionary file, 3,755 characters in total
    # s: the Chinese characters to convert, GB2312-encoded (that is, as typed into IDLE)
    # Read the dictionary
    with open(gbkFile) as f:
        gbk = f.read().split()
    # Generate the A1-FE index codes
    t = ['A1']
    while True:
        if t[-1] == 'FE':
            break
        if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
            t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
            continue
        if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
            t.append(t[-1][0] + chr(65))
            continue
        if ord(t[-1][1]) >= 70:
            t.append(chr(ord(t[-1][0]) + 1) + chr(48))
            continue
    # Index each Chinese character in turn
    l = list()
    for st in s.decode('gb2312'):
        st = st.encode('utf-8')
        # Zero-based index, so the last character of a section (offset 93)
        # is not pushed into the next section
        i = gbk.index(st)
        # Section codes start at B0; get the character's section code
        t1 = '%' + t[t.index('B0'):][i // 94]
        # Index of the character within its section
        i = i % 94
        t2 = '%' + t[i]
        l.append(t1 + t2)
    # Finally, join the output with spaces
    return ' '.join(l)
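For comparison, Python 3 can produce the same output without a dictionary file, since the standard library ships a gb2312 codec and urllib.parse.quote accepts an encoding argument. This is a different technique from the table lookup above and is shown only as a cross-check. The function name gbk_percent_encode is hypothetical.

```python
from urllib.parse import quote

def gbk_percent_encode(text):
    """Percent-encode each character's GB2312 bytes, space-separated."""
    return ' '.join(quote(ch, safe='', encoding='gb2312') for ch in text)

print(gbk_percent_encode('广'))   # %B9%E3
```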
I have to admit, my Python code is not so neat.
My microblog ID: small Luan Cooper