, the code for "Guang" (广) is %B9%E3; for now, call %B9 the section code and %E3 the character code (the second byte).
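As a quick check of this notation, the minimal Python 3 sketch below percent-encodes 广 with the standard library's `urllib.parse.quote`, assuming GB2312 as the target encoding. The variable names are illustrative and do not come from the original code.

```python
from urllib.parse import quote

# Percent-encode the character using its GB2312 bytes instead of UTF-8.
encoded = quote("广", encoding="gb2312")

# Split the result into the two parts named in the text.
section_code = encoded[1:3]  # first byte, the "section code"
char_code = encoded[4:6]     # second byte, the "character code"

print(encoded, section_code, char_code)
```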
Approach:
Collect the Chinese characters from the GBK encoding page at http://ff.163.com/newflyff/gbk-list/
From a practical point of view, select only the "GBK/2: GB2312 Chinese characters" section, 3,755 Chinese characters in total.
Observe the pattern: the section code runs from B0 to D7, and the character code runs from A1 to FE, that is, 16*6-2=94 values (six leading hex digits A-F times sixteen trailing digits, minus the unused A0 and FF), which is very regular.
Step one: extract the commonly used Chinese characters with Python and store them in a dictionary file in order, separated by spaces.
Step two: following the rule that each section holds 94 characters coded A1-FE, first locate the section code, then use the character's position within its section to locate the character code.
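The count 16*6-2=94 from the pattern above can be reproduced with a few lines of arithmetic. This is only a sketch in Python 3; the names `first_digits`, `second_digits`, and `section_count` are made up for illustration.

```python
first_digits = "ABCDEF"             # 6 possible leading hex digits
second_digits = "0123456789ABCDEF"  # 16 possible trailing hex digits

# All 96 two-digit combinations, then drop the two unused ends.
pairs = [a + b for a in first_digits for b in second_digits]
char_codes = [p for p in pairs if p not in ("A0", "FF")]

# Sections B0..D7 inclusive.
section_count = 0xD7 - 0xB0 + 1
slots = section_count * len(char_codes)

print(len(pairs), len(char_codes), section_count, slots)
# prints: 96 94 40 3760
```

The 40 sections offer 3,760 slots, slightly more than the 3,755 characters, so the last section (D7) is not completely filled. That does not disturb the index arithmetic, because the gap sits at the very end.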
Implementation:
Step one: extracting the Chinese characters
The code is as follows:
with open('E:/gbk.txt') as f:
    # split() already breaks on both newlines and spaces
    s = f.read().split()
The resulting list contains repeated section codes. Labels such as B0/B1 ... and the full-width 0-9/A-F characters need to be removed.
Decoding the extracted characters shows what they are.
Remove these characters:
The split list is decoded first (it is called gbk below), and then each unwanted character is removed, as follows:
gbk.remove(u'\uff10')
To delete all of these characters, I used range to generate the series of strings and then tidied it up in Notepad++; I did not find a simpler way.
The code is as follows:
for t in [u'\uff10', u'\uff11', u'\uff12', u'\uff13', u'\uff14', u'\uff15', u'\uff16', u'\uff17', u'\uff18', u'\uff19', u'\uff21', u'\uff22', u'\uff23', u'\uff24', u'\uff25', u'\uff26']:
    gbk.remove(t)
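A list like the one above can also be generated instead of typed out. The following is a minimal Python 3 sketch (in Python 2, `unichr` would take the place of `chr`); the sample `gbk` list is a made-up stand-in for the real decoded list.

```python
# Full-width digits U+FF10..U+FF19 and full-width letters A-F U+FF21..U+FF26.
fullwidth = [chr(c) for c in range(0xFF10, 0xFF1A)]
fullwidth += [chr(c) for c in range(0xFF21, 0xFF27)]

# Hypothetical sample of the decoded list: labels mixed with real characters.
gbk = ["\uff10", "\uff11", "\uff21", "啊", "阿", "\uff26", "埃"]

# Keep everything that is not a full-width label.
unwanted = set(fullwidth)
cleaned = [ch for ch in gbk if ch not in unwanted]

print(cleaned)
```

Unlike `list.remove`, which deletes only the first match, the list comprehension drops every occurrence in one pass.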
Next, remove section codes such as B0-D7. Extracting the character code also relies on codes of the form A1-FE, so it is worth generating such a list to make the delete and index operations easy.
Generating the code sequence:
The second hex digit runs through 0-9 and A-F; the first hex digit runs through A-F.
Increment from A1, handle the boundaries (such as A9 to AA) manually, and use the ord() and chr() functions to convert between ASCII characters and numbers.
The code is as follows:
t = ['A1']
while True:
    if t[-1] == 'FE':
        break
    # Second digit is 0-8 or A-E: step to the next digit
    if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
        t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
        continue
    # Second digit is 9: jump to A
    if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
        t.append(t[-1][0] + chr(65))
        continue
    # Second digit is F: advance the first digit and reset the second to 0
    if ord(t[-1][1]) >= 70:
        t.append(chr(ord(t[-1][0]) + 1) + chr(48))
        continue
This produces the list of codes from A1 to FE.
With this code sequence, the B0-D7 section codes can be deleted from the gbk list.
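For comparison, the same A1-FE sequence can be produced with hex formatting, and the B0-D7 slice can then be used to filter out the section labels. This is a sketch of an alternative, not the method used in the post; the `tokens` sample is hypothetical.

```python
# All two-digit uppercase hex codes from A1 to FE, in order.
codes = ["%02X" % b for b in range(0xA1, 0xFF)]

# Section codes are the slice from B0 through D7.
sections = set(codes[codes.index("B0"):codes.index("D7") + 1])

# Hypothetical sample of page tokens: section labels mixed with characters.
tokens = ["B0", "啊", "阿", "B1", "薄", "D7", "座"]
chars = [tok for tok in tokens if tok not in sections]

print(len(codes), len(sections), chars)
```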
Finally, check whether any spaces were left in; the Unicode code point of the full-width space is \u3000:
gbk.remove(u'\u3000')
Then encode the result as UTF-8 and save it to the dictionary file.
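Saving the cleaned list might look like the sketch below, which writes the characters separated by spaces as UTF-8 and reads them back. The three-character list, the temporary directory, and the file name `gbk_dict.txt` are placeholders, not the author's actual data or path.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the cleaned character list.
chars = ["啊", "阿", "埃"]

# Hypothetical output location; the post uses a file on drive E:.
path = os.path.join(tempfile.mkdtemp(), "gbk_dict.txt")

# Write the characters in order, separated by single spaces, as UTF-8.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(" ".join(chars))

# Read the file back the same way the lookup code would.
with io.open(path, encoding="utf-8") as f:
    restored = f.read().split()

print(restored == chars)
```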
I have put this dictionary file online; external link: http://dl.dbank.com/c0m9selr6h
Step two: indexing the Chinese characters
Indexing is a simple algorithm. The characters in the dictionary are stored in their original order, and the 3,755 Chinese characters of GBK table 2 strictly follow the rule of 94 characters per section. An integer division of the character's position by 94 therefore locates the section code, and the character's position minus the section index * 94 gives its index within that section. The A1-FE list generated above then maps that index to the second byte.
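The index arithmetic described above can be sketched with `divmod`, assuming a zero-based position in the dictionary. The function name `locate` is hypothetical.

```python
codes = ["%02X" % b for b in range(0xA1, 0xFF)]  # A1..FE, 94 entries
sections = codes[codes.index("B0"):]             # B0, B1, B2, ...

def locate(index):
    """Map a zero-based dictionary index to (section code, character code)."""
    section, offset = divmod(index, 94)
    return sections[section], codes[offset]

print(locate(0))    # first character of the first section
print(locate(93))   # last character of the first section
print(locate(94))   # first character of the second section
print(locate(912))  # 9 * 94 + 66, the position of 广
```

With a zero-based index, the last slot of each section (offset 93) stays in its own section. A one-based index divided by 94 would push it into the next one.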
With the algorithm worked out, the next step is to write the code and debug it.
The Python code, with comments, is as follows:
def getgbkcode(gbkfile='E:/gbk1.1.txt', s=''):
    # gbkfile: the dictionary file, 3,755 Chinese characters in total
    # s: the Chinese characters to convert, in gb2312 encoding,
    #    i.e. the encoding of characters typed into IDLE (Python 2)
    # Read in the dictionary
    with open(gbkfile) as f:
        gbk = f.read().split()
    # Generate the A1-FE index codes
    t = ['A1']
    while True:
        if t[-1] == 'FE':
            break
        if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
            t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
            continue
        if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
            t.append(t[-1][0] + chr(65))
            continue
        if ord(t[-1][1]) >= 70:
            t.append(chr(ord(t[-1][0]) + 1) + chr(48))
            continue
    # Index each Chinese character in turn
    l = list()
    for st in s.decode('gb2312'):
        st = st.encode('utf-8')
        # Zero-based position in the dictionary, so that the last
        # character of a section is not assigned to the next section
        i = gbk.index(st)
        # Section codes start at B0; get the character's section code
        t1 = '%' + t[t.index('B0'):][i // 94]
        # Index of the character within its section
        i = i % 94
        t2 = '%' + t[i]
        l.append(t1 + t2)
    # Finally, join the output with spaces
    return ' '.join(l)
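The function above targets Python 2. A rough Python 3 counterpart is sketched below under two assumptions: the dictionary is rebuilt from Python's own gb2312 codec instead of being read from the author's file, and the helper names (`build_dictionary`, `gbk_code`, `direct`) are invented for this example. The `direct` helper encodes each character straight to bytes, which gives an independent cross-check of the lookup result.

```python
def build_dictionary():
    """Assumption: stands in for the dictionary file (sections B0-D7, in order)."""
    chars = []
    for hi in range(0xB0, 0xD8):
        for lo in range(0xA1, 0xFF):
            try:
                chars.append(bytes([hi, lo]).decode("gb2312"))
            except UnicodeDecodeError:
                pass  # slot with no assigned character
    return chars

def gbk_code(text, chars):
    """Look up each character by its dictionary position, 94 per section."""
    codes = ["%02X" % b for b in range(0xA1, 0xFF)]
    sections = codes[codes.index("B0"):]
    out = []
    for ch in text:
        section, offset = divmod(chars.index(ch), 94)
        out.append("%" + sections[section] + "%" + codes[offset])
    return " ".join(out)

def direct(text):
    """Cross-check: format the gb2312 bytes of each character directly."""
    return " ".join(
        "".join("%%%02X" % b for b in ch.encode("gb2312")) for ch in text
    )

chars = build_dictionary()
print(gbk_code("广州", chars))
print(direct("广州"))
```

If both lines print the same string, the dictionary order and the divide-by-94 rule agree with the codec for those characters.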
I have to admit, my Python code is not so neat.
My Weibo ID: little Luan Cooper