Python implementation for converting Chinese characters to GBK codes

Source: Internet
Author: User

For example, the GBK code of the character "广" (guang) is %B9%E3. For now, call %B9 the section code (the first code) and %E3 the character code (the second code).
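
As a quick check of the example above, the following minimal sketch encodes the character with Python's built-in "gbk" codec and formats the two bytes as percent codes. It assumes Python 3 (the article's own code is Python 2), and the variable names are only illustrative.

```python
# Python 3 sketch: encode one character with the built-in GBK codec.
ch = "广"
raw = ch.encode("gbk")                    # two bytes: section code, then character code
section_code = "%{:02X}".format(raw[0])   # first byte
char_code = "%{:02X}".format(raw[1])      # second byte
print(section_code + char_code)
```

The dictionary-based method described in the rest of the article reproduces this mapping by hand.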

Approach:
Collect the Chinese characters from the GBK encoding page http://ff.163.com/newflyff/gbk-list/
For practical purposes, select only the "GBK/2: GB2312 Chinese characters" section, 3,755 Chinese characters in total.
Look for the pattern: the section codes run from B0 to D7, and the character codes within a section run from A1 to FE, that is, 16*6-2=94 per section, which is very regular.
Step 1: use Python to extract the commonly used Chinese characters and store them in order in a dictionary file, separated by spaces.
Step 2: because the character codes run from A1 to FE with 94 characters per section, first locate the section code, then use the character's position within its section to find the character code.
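
The 16*6-2=94 count can be checked with a few lines of arithmetic. This is only a sketch in Python 3 with illustrative variable names: the high hex digit takes 6 values (A-F), the low hex digit takes 16 values (0-F), and the two byte values A0 and FF are not used.

```python
# High hex digit A-F (6 values) times low hex digit 0-F (16 values),
# minus the two unused byte values A0 and FF.
second_bytes = [b for b in range(0xA0, 0x100) if b not in (0xA0, 0xFF)]
sections = range(0xB0, 0xD7 + 1)           # section codes B0..D7

print(len(second_bytes))                   # 16 * 6 - 2
print(len(sections))                       # number of sections
print(len(sections) * len(second_bytes))   # available slots in B0A1..D7FE
```

The number of slots comes out slightly larger than 3,755, which suggests that the last section is not completely filled.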

Implementation:
Step 1: extracting the Chinese characters
The code is as follows:


with open('E:/gbk.txt') as f:
    # split() with no arguments splits on any whitespace, including line breaks
    s = f.read().split()


The resulting list contains the repeated section codes, so symbols such as b0/b1... and the full-width (Chinese) 0-9/A-F characters have to be removed.
The characters that were read need to be decoded first.


Removing these characters:
Decode the split list first, then remove each unwanted character.
The code is as follows:


gbk.remove(u'\uff10')


To delete all of these characters, I generated the series of strings with range and then tidied it up in Notepad++; I did not find a simpler way.
The code is as follows:


for t in [u'\uff10', u'\uff11', u'\uff12', u'\uff13', u'\uff14', u'\uff15', u'\uff16', u'\uff17', u'\uff18', u'\uff19', u'\uff21', u'\uff22', u'\uff23', u'\uff24', u'\uff25', u'\uff26']:
    gbk.remove(t)
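
One possibly simpler way to handle this step is to build the set of full-width characters from their code points and filter the list in one pass. This is a sketch in Python 3, not the author's method; `drop_fullwidth` and the sample tokens are hypothetical. Note that `list.remove` deletes only the first match and raises ValueError when the item is absent, while a filter removes every occurrence.

```python
# Full-width digits start at U+FF10, full-width capital letters at U+FF21.
fullwidth = {chr(0xFF10 + n) for n in range(10)} | {chr(0xFF21 + n) for n in range(6)}

def drop_fullwidth(tokens):
    """Return tokens with the full-width 0-9 / A-F labels filtered out."""
    return [tok for tok in tokens if tok not in fullwidth]

# Hypothetical sample: two full-width labels mixed with three characters.
sample = ["\uff10", "啊", "阿", "\uff21", "埃"]
print(drop_fullwidth(sample))
```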


Next, remove the section codes such as B0-D7. Extracting the character code also needs codes like A1-FE, so I wanted to generate such a list to make the delete and index operations easy.

Generating the code sequence:
The low hex digit runs over 0-9 and A-F, and the high hex digit runs over A-F.
Increment from A1, handle the boundaries (such as A9 to AA) manually, and use the ord() and chr() functions to convert between ASCII characters and numbers.
The code is as follows:


t = ['A1']
while True:
    if t[-1] == 'FE':
        break
    # Second digit is 0-8 or A-E: increment it
    if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
        t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
        continue
    # Second digit is 9: jump to A
    if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
        t.append(t[-1][0] + chr(65))
        continue
    # Second digit is F: carry into the first digit and restart at 0
    if ord(t[-1][1]) >= 70:
        t.append(chr(ord(t[-1][0]) + 1) + chr(48))
        continue


This produces the list of 94 codes from A1 to FE.
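
The same sequence can probably be generated more directly by formatting the integers 0xA1 to 0xFE as two-digit uppercase hex strings. The sketch below assumes Python 3; `codes` and `section_codes` are illustrative names, not from the article.

```python
# Format every byte value from A1 to FE as a two-digit uppercase hex string.
codes = ["{:02X}".format(b) for b in range(0xA1, 0xFE + 1)]
print(len(codes), codes[0], codes[-1])

# The section codes B0..D7 are a slice of the same list.
section_codes = codes[codes.index("B0"):codes.index("D7") + 1]
print(len(section_codes), section_codes[0], section_codes[-1])
```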

With this code sequence, the B0-D7 section codes can be deleted from the gbk list.
Finally, check whether the space has been deleted; the Unicode code point of the (full-width) space is \u3000:
gbk.remove(u'\u3000')
Then save the dictionary file in UTF-8 encoding.
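
As an alternative to cleaning a copied table, a comparable dictionary can likely be built from Python's own "gb2312" codec by decoding every byte pair from B0A1 to D7FE and skipping the unassigned positions. This is a sketch in Python 3 under that assumption, not the author's procedure; `build_dictionary` and the output file name are hypothetical.

```python
def build_dictionary():
    """Collect the characters at B0A1..D7FE that the gb2312 codec can decode."""
    chars = []
    for first in range(0xB0, 0xD7 + 1):
        for second in range(0xA1, 0xFE + 1):
            try:
                chars.append(bytes([first, second]).decode("gb2312"))
            except UnicodeDecodeError:
                # Unassigned position in the table: skip it.
                pass
    return chars

chars = build_dictionary()
print(len(chars), chars[0])

# To save in the article's format (space-separated, UTF-8); hypothetical file name:
# with open("gbk_dict.txt", "w", encoding="utf-8") as f:
#     f.write(" ".join(chars))
```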


I have put this dictionary file online; external link: http://dl.dbank.com/c0m9selr6h

Step 2: indexing the Chinese characters

The indexing algorithm is simple. The Chinese characters in the dictionary are stored in their original order, and the 3,755 Chinese characters of GBK table 2 strictly follow the rule of 94 characters per section. Integer division of a character's index by 94 therefore locates the section code. Subtracting the section index times 94 from the character's index gives its index within the section, and that index into the A1-FE list generated above locates the second code.
With the algorithm worked out, write the code and then debug it.
The Python code, with comments, is attached.
The code is as follows:
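
The arithmetic can be sketched as follows, assuming Python 3 and a zero-based index into the dictionary; `index_to_percent` is a hypothetical helper. Counting from zero keeps the last character of each section (index 93, 187, ...) inside its own section, whereas a one-based index that is a multiple of 94 would spill over into the next section.

```python
def index_to_percent(i):
    """Map a zero-based dictionary index to '%XX%YY', assuming 94 characters per section."""
    section, offset = divmod(i, 94)      # section number and position inside it
    return "%{:02X}%{:02X}".format(0xB0 + section, 0xA1 + offset)

# First character, last of section B0, first of section B1, and the index of "广".
for i in (0, 93, 94, 912):
    print(i, index_to_percent(i))
```
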


def getGBKCode(gbkFile='e:/gbk1.1.txt', s=''):
    # Python 2 code: s is a byte string.
    # gbkFile: dictionary file, 3,755 Chinese characters in total
    # s: the Chinese characters to convert, GB2312-encoded (the encoding of characters typed into IDLE)

    # Read the dictionary
    with open(gbkFile) as f:
        gbk = f.read().split()

    # Generate the A1-FE index codes
    t = ['A1']
    while True:
        if t[-1] == 'FE':
            break
        if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
            t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
            continue
        if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
            t.append(t[-1][0] + chr(65))
            continue
        if ord(t[-1][1]) >= 70:
            t.append(chr(ord(t[-1][0]) + 1) + chr(48))
            continue
    # Look up each Chinese character in turn
    l = list()
    for st in s.decode('gb2312'):
        st = st.encode('utf-8')
        # Zero-based index, so the last character of a section stays in its own section
        i = gbk.index(st)
        # Section codes start at B0; get the section code of the character
        t1 = '%' + t[t.index('B0'):][i // 94]
        # Index of the character within its section
        i = i % 94
        t2 = '%' + t[i]
        l.append(t1 + t2)
    # Finally, join the output with spaces
    return ' '.join(l)
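
For comparison, the output of a dictionary lookup like this can be cross-checked against the standard library. The sketch below assumes Python 3 and uses `urllib.parse.quote` with `encoding="gbk"`; `gbk_percent` is a hypothetical helper, not part of the article's code. ASCII letters and digits pass through `quote` unchanged.

```python
from urllib.parse import quote

def gbk_percent(text):
    """Percent-encode each character as GBK and join the results with spaces."""
    return " ".join(quote(ch, encoding="gbk") for ch in text)

print(gbk_percent("广州"))
```

The space-separated format matches what the function above returns, so the two results can be compared character by character.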



I have to admit that my Python code is not very neat.
My Weibo ID: little Luan Cooper
