Python implementation of Chinese character transcoding based on GBK codes

Source: Internet
Author: User
Tags: chr, ord

As shown in the figure, the code for the character "wide" (广) is %B9%E3; %B9 is called the section code and %E3 is the character code (the second code).
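As a quick check of this two-byte layout, the sketch below uses Python 3's built-in gbk codec and urllib.parse.quote to reproduce the percent-encoded form. It assumes the character rendered as "wide" is 广; the variable names are illustrative and not part of the original article, whose own code targets Python 2.

```python
from urllib.parse import quote

ch = '广'  # assumption: the character shown as "wide" in the figure
raw = ch.encode('gbk')            # two bytes: section byte, then character byte
section_byte, char_byte = raw     # unpacking bytes yields ints in Python 3

percent_form = quote(ch, encoding='gbk')         # percent-encode the GBK bytes
section_code = '%{:02X}'.format(section_byte)    # first byte as %XX
char_code = '%{:02X}'.format(char_byte)          # second byte as %XX
print(percent_form, section_code, char_code)
```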

Ideas:
Collect the Chinese characters from the GBK code table page http://ff.163.com/newflyff/gbk-list/
For practical purposes, select only the section "GBK/2: GB2312 Chinese Characters", 3,755 Chinese characters in total.
Look for the pattern: section codes run from B0 to D7, and character codes run from A1 to FE, that is 16*6-2=94 per section, which is very regular.
Step one: use Python to extract the commonly used Chinese characters and save them in order in a dictionary file, separated by spaces.
Step two: following the rule of 94 characters per section with codes A1-FE, first locate the section code, then use the character's position within its section to locate the character code.
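The regularity described above can be checked without the web page. The following Python 3 sketch is an alternative to the article's scraping approach, not the author's method: it walks sections B0-D7 and character codes A1-FE and decodes each pair with the built-in gb2312 codec. The function name gb2312_level1_chars is made up for this illustration.

```python
def gb2312_level1_chars():
    """Return the Chinese characters in sections B0-D7, in code order."""
    chars = []
    for hi in range(0xB0, 0xD8):        # section codes B0..D7
        for lo in range(0xA1, 0xFF):    # character codes A1..FE, 94 slots per section
            try:
                chars.append(bytes([hi, lo]).decode('gb2312'))
            except UnicodeDecodeError:
                pass                    # unassigned slot in the code table
    return chars

chars = gb2312_level1_chars()
sections = 0xD7 - 0xB0 + 1              # number of sections
print(len(chars), sections, sections * 94 - len(chars))
```

The few slots that do not decode are the unassigned ones at the end of section D7, so every character before them still sits at section * 94 + position, which is what the indexing step relies on.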

Implementation:
Step one: Extract Chinese characters
The code is as follows:

with open('E:/gbk.txt') as f:
    s = f.read().split()

The split list still contains items that must be removed: the repeated section codes (B0, B1, ... and similar labels) and the full-width 0-9 and A-F characters.
Decode the extracted strings to see them:


Remove these characters:
First decode the split list into Unicode strings (the list gbk below), and then

The code is as follows:

gbk.remove(u'\uff10')

To delete these characters I generated the series of strings with range and then tidied it up in Notepad++; I did not find a simpler way.
The code is as follows:

for t in [u'\uff10', u'\uff11', u'\uff12', u'\uff13', u'\uff14', u'\uff15', u'\uff16', u'\uff17', u'\uff18', u'\uff19', u'\uff21', u'\uff22', u'\uff23', u'\uff24', u'\uff25', u'\uff26']:
    gbk.remove(t)
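One possibly simpler way to build that list is to generate the code points directly. The sketch below is written for Python 3 (where chr() returns a Unicode character) and uses a small made-up sample in place of the real decoded list; the name unwanted is illustrative.

```python
# Full-width digits 0-9 are U+FF10..U+FF19, full-width letters A-F are U+FF21..U+FF26
unwanted = {chr(c) for c in range(0xFF10, 0xFF1A)}
unwanted |= {chr(c) for c in range(0xFF21, 0xFF27)}

# Hypothetical sample standing in for the decoded list from the page
gbk = ['\uff10', '\uff11', '\uff21', '啊', '阿', '\uff26', '埃']

# Keep only the entries that are not full-width labels
gbk = [item for item in gbk if item not in unwanted]
print(gbk)
```

Note that list.remove() deletes only the first occurrence of a value, whereas the filter above drops every occurrence, which matters if the same label appears more than once.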

Next, remove the section codes such as B0-D7. Extracting the character codes also needs a similar A1-FE series, so I wanted to generate such a list to make deletion and indexing easy.

Generate the code series:
The row code (second hex digit) runs over 0-9 and A-F; the column code (first hex digit) runs over A-F.
Start from A1 and increment; the boundaries (such as A9 to AA) have to be handled by hand, using the ord() and chr() functions to convert between ASCII characters and numbers.
The code is as follows:

t = ['A1']
while True:
    if t[-1] == 'FE':
        break
    # Second digit is '0'-'8' or 'A'-'E': just increment it
    if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
        t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
        continue
    # Second digit is '9': jump to 'A'
    if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
        t.append(t[-1][0] + chr(65))
        continue
    # Second digit is 'F': carry into the first digit and reset to '0'
    if ord(t[-1][1]) >= 70:
        t.append(chr(ord(t[-1][0]) + 1) + chr(48))
        continue
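Because A1-FE is just the hexadecimal range 0xA1 to 0xFE, the same series can also be produced by formatting integers, which avoids the manual boundary handling. This is a minimal alternative sketch, with codes as an illustrative name:

```python
# Format every integer from 0xA1 to 0xFE as two upper-case hex digits
codes = ['{:02X}'.format(n) for n in range(0xA1, 0xFF)]

print(len(codes), codes[0], codes[8], codes[9], codes[-1])
```

The A9 to AA boundary falls out of the hex formatting automatically, and codes.index('B0') gives the offset at which the section codes start.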

The resulting list:

With this code series you can remove the B0-D7 labels from the gbk list.
Finally, a check shows that some spaces have still not been removed; the Unicode code of this space is \u3000:
gbk.remove(u'\u3000')
Finally, encode the result as UTF-8 and save it to the dictionary file.
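The save step might look like the following in Python 3. This is only a sketch: the three-character sample and the temporary file name gbk_dict.txt are placeholders, not the author's actual data or path.

```python
import os
import tempfile

chars = ['啊', '阿', '埃']   # hypothetical sample of the cleaned character list

# Write the characters separated by spaces, encoded as UTF-8
path = os.path.join(tempfile.mkdtemp(), 'gbk_dict.txt')  # placeholder file name
with open(path, 'w', encoding='utf-8') as f:
    f.write(' '.join(chars))

# Read the dictionary back the same way the lookup step would
with open(path, encoding='utf-8') as f:
    restored = f.read().split()
print(restored == chars)
```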


I have put this dictionary file online; external link: http://dl.dbank.com/c0m9selr6h

Step Two: Index Chinese characters

Indexing uses a simple algorithm. The Chinese characters in the dictionary are stored in their original order, and the 3,755 characters of GBK code table 2 strictly follow the rule of 94 characters per section. An integer division of a character's index by 94 therefore locates its section code, and the character's index minus the section index times 94 gives its index within the section. The A1-FE list generated above is then indexed to find the second code.
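The arithmetic can be sketched on its own, independent of the dictionary file. The function name index_to_code and the constants are illustrative; the index is assumed to be 0-based, as returned by list.index().

```python
SECTION_SIZE = 94      # characters per section
FIRST_SECTION = 0xB0   # section codes start at B0
FIRST_CHAR = 0xA1      # character codes start at A1

def index_to_code(idx):
    """Map a 0-based dictionary index to a '%XX%XX' GBK code."""
    section, offset = divmod(idx, SECTION_SIZE)
    return '%{:02X}%{:02X}'.format(FIRST_SECTION + section, FIRST_CHAR + offset)

print(index_to_code(0), index_to_code(93), index_to_code(94), index_to_code(912))
```

Working with a 0-based index keeps the last character of each section (offset 93, code FE) in the right section; counting from 1 and dividing by 94 would push it into the next one.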
With the algorithm worked out, what remains is coding and debugging.
The Python code, with comments, follows:

The code is as follows:

def getgbkcode(gbkFile='e:/gbk1.1.txt', s=''):
    # gbkFile: the dictionary file, 3,755 Chinese characters in total
    # s: the Chinese characters to convert, GB2312-encoded, i.e. as typed into IDLE (a Python 2 byte string)

    # Read the dictionary
    with open(gbkFile) as f:
        gbk = f.read().split()

    # Generate the A1-FE index codes
    t = ['A1']
    while True:
        if t[-1] == 'FE':
            break
        if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
            t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
            continue
        if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
            t.append(t[-1][0] + chr(65))
            continue
        if ord(t[-1][1]) >= 70:
            t.append(chr(ord(t[-1][0]) + 1) + chr(48))
            continue
    # Look up each Chinese character in turn
    l = list()
    for st in s.decode('gb2312'):
        st = st.encode('utf-8')
        # 0-based position of the character in the dictionary
        i = gbk.index(st)
        # Section codes start at B0; get the section code of the character
        t1 = '%' + t[t.index('B0'):][i // 94]
        # Index of the character within its section
        i = i % 94
        t2 = '%' + t[i]
        l.append(t1 + t2)
    # Finally, join the output with spaces
    return ' '.join(l)


I have to admit, my Python code is not so neat.
My microblog ID: small Luan Cooper
