Converting Chinese characters to GBK codes in Python

Source: Internet
Author: User


For example, the character 广 ("wide") is encoded as %B9%E3. %B9 is called the section code (first byte), and %E3 is the character code (second byte).
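As a quick cross-check, Python's built-in gbk codec yields the same two bytes. This is a minimal sketch assuming Python 3; the helper name gbk_percent is illustrative and does not come from the original article.

```python
def gbk_percent(text):
    """Return the GBK bytes of text as %XX pairs (illustrative helper)."""
    return ''.join('%{:02X}'.format(byte) for byte in text.encode('gbk'))


print(gbk_percent('广'))  # %B9%E3
```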

Ideas:
Collect the Chinese characters from the GBK encoding page at http://ff.163.com/newflyff/gbk-list/
For practical purposes, use only the section "● GBK/2: GB2312 Chinese characters", 3755 Chinese characters in total.
The pattern is very regular: section codes run from B0 to D7, and character codes within a section run from A1 to FE, that is, 16*6-2 = 94 per section.
Step 1: Extract the commonly used Chinese characters with Python and store them in order in a dictionary file, separated by spaces.
Step 2: Since each section holds 94 characters coded A1 to FE, first locate the section code, then use the character's position within its section to locate the character code.
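The arithmetic behind Step 2 can be sketched numerically. This is a Python 3 illustration under the assumptions stated above (sections start at B0, 94 characters per section); the constant and function names are made up for the example.

```python
SECTION_START = 0xB0   # first section byte of the block described above
CHAR_START = 0xA1      # first character byte inside a section
PER_SECTION = 94       # 0xA1..0xFE inclusive


def position_to_code(index):
    """Map a zero-based dictionary position to (section byte, character byte)."""
    section, offset = divmod(index, PER_SECTION)
    return SECTION_START + section, CHAR_START + offset


for index in (0, 93, 94, 3754):
    first, second = position_to_code(index)
    print(index, '%{:02X}%{:02X}'.format(first, second))
# 0 %B0%A1
# 93 %B0%FE
# 94 %B1%A1
# 3754 %D7%F9
```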

Implementation:
Step 1: Extract Chinese characters
    with open('e:/GBK.txt') as f:
        s = f.read().split()

The split list also contains the section labels (B0, B1, ...) and the full-width characters 0-9 and A-F from the table headers, all of which must be removed.
Decode every item in the split list first, and then delete these characters:

    gbk.remove(u'\uff10')

To delete all of these characters, I generated the series of strings with range and then tidied them up in Notepad++; I did not find a simpler way.
    for t in [u'\uff10', u'\uff11', u'\uff12', u'\uff13', u'\uff14', u'\uff15', u'\uff16', u'\uff17', u'\uff18', u'\uff19', u'\uff21', u'\uff22', u'\uff23', u'\uff24', u'\uff25', u'\uff26']:
        gbk.remove(t)
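One possibly simpler route, sketched here in Python 3, is to build the set of full-width characters from their code point ranges and filter the list in a single pass. The names FULLWIDTH_HEX and drop_fullwidth_hex and the sample data are hypothetical. Unlike list.remove, which deletes only the first match, this drops every occurrence.

```python
# Full-width digits 0-9 are U+FF10..U+FF19; full-width A-F are U+FF21..U+FF26.
FULLWIDTH_HEX = {chr(cp) for cp in range(0xFF10, 0xFF1A)}
FULLWIDTH_HEX |= {chr(cp) for cp in range(0xFF21, 0xFF27)}


def drop_fullwidth_hex(tokens):
    """Return tokens without the full-width 0-9 / A-F header characters."""
    return [tok for tok in tokens if tok not in FULLWIDTH_HEX]


# Hypothetical sample standing in for the decoded list from the web page.
sample = ['\uff10', '\uff11', '啊', '阿', '\uff21', '埃']
cleaned = drop_fullwidth_hex(sample)
print(len(sample), '->', len(cleaned))  # 6 -> 3
```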

Next, remove the section labels B0-D7. Extracting the character code later also needs a sequence like A1-FE, so it is worth generating such a list, which makes both the deletion and the index lookup convenient.

Generate the code sequence:
The first hex digit runs over A-F, and the second over 0-9 and A-F.
Start from A1 and increment; the boundary from 9 to A has to be handled explicitly, using the ord() and chr() functions to convert between characters and their ASCII values.
    t = ['A1']
    while True:
        if t[-1] == 'FE':
            break
        # Second digit is 0-8 or A-E: increment it
        if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
            t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
            continue
        # Second digit is 9: jump to A
        if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
            t.append(t[-1][0] + chr(65))
            continue
        # Second digit is F: carry into the first digit and restart at 0
        if ord(t[-1][1]) >= 70:
            t.append(chr(ord(t[-1][0]) + 1) + chr(48))
            continue

The resulting list runs A1, A2, ..., A9, AA, ..., AF, B0, ..., FE, 94 entries in all.
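The same sequence can also be produced without character arithmetic, by formatting the byte values directly. This is a short Python 3 sketch; the variable name codes is illustrative.

```python
# All byte values from 0xA1 to 0xFE inclusive, as upper-case hex strings.
codes = ['{:02X}'.format(value) for value in range(0xA1, 0xFF)]

print(len(codes))            # 94
print(codes[:3], codes[-1])  # ['A1', 'A2', 'A3'] FE
```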

With this code sequence, the B0-D7 section labels can be deleted from the gbk list.
Finally, check whether any spaces remain undeleted. The space here is the full-width (ideographic) space, Unicode \u3000.
    gbk.remove(u'\u3000')
Lastly, encode the list as UTF-8 and save it to the dictionary file.
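As an alternative to scraping the web page, the same set of characters can probably be generated from Python's built-in gb2312 codec. The sketch below assumes Python 3; build_level1_table is a hypothetical name, and this is a different technique from the one the article uses. It walks sections B0-D7 and character bytes A1-FE in order and skips byte pairs the codec cannot decode.

```python
def build_level1_table():
    """Collect the GB2312 characters in sections B0-D7, in code order."""
    table = []
    for first in range(0xB0, 0xD8):
        for second in range(0xA1, 0xFF):
            try:
                table.append(bytes([first, second]).decode('gb2312'))
            except UnicodeDecodeError:
                # Unassigned byte pairs are skipped.
                pass
    return table


table = build_level1_table()
# Number of characters found, and the GBK bytes of the first one.
print(len(table), table[0].encode('gbk').hex().upper())
```

Joining the list with spaces and writing it out as UTF-8 would give a file in the same layout as the dictionary described above.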


I uploaded this dictionary file to a network drive; external link: http://dl.dbank.com/c0m9selr6h

Step 2: Index Chinese Characters

Indexing is a simple algorithm. Because the Chinese characters in the dictionary are stored in their original order, and the 3755 Chinese characters in GBK/2 strictly follow the rule of 94 characters per section, integer division of a character's position by 94 locates the section code. Subtracting section index * 94 from the position then gives the character's index within its section, and the second code is looked up in the A1-FE list generated above.
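One boundary case worth checking by hand is the last character of a section (position 93 when counting from zero). The sketch below, in Python 3 with illustrative function names, compares a section/offset computation that first shifts the position by one against plain divmod on the zero-based position.

```python
PER_SECTION = 94


def one_based(index):
    """Section/offset computed from a position that was shifted by +1."""
    i = index + 1
    return i // PER_SECTION, i - (i // PER_SECTION) * PER_SECTION - 1


def zero_based(index):
    """Section/offset computed directly from the zero-based position."""
    return divmod(index, PER_SECTION)


print(one_based(93))   # (1, -1): the section is off by one
print(zero_based(93))  # (0, 93): last slot of the first section
```

With the shifted position, the offset of -1 still selects FE through Python's negative indexing, but the section lands one too far, so working from the zero-based position is the safer choice.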
With the algorithm worked out, what remains is coding and debugging.
The Python code, with comments, follows:

    def getGBKCode(gbkFile='e:/GBK1.1.txt', s=''):
        # gbkFile is the dictionary file containing the 3755 Chinese characters
        # s is the string to convert; it is currently gb2312-encoded, that is,
        # the encoding of Chinese text entered from IDLE
        # Note: this is Python 2 code (byte strings with .decode())

        # Read the dictionary
        with open(gbkFile) as f:
            gbk = f.read().split()

        # Generate the A1-FE index codes
        t = ['A1']
        while True:
            if t[-1] == 'FE':
                break
            if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
                t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
                continue
            if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
                t.append(t[-1][0] + chr(65))
                continue
            if ord(t[-1][1]) >= 70:
                t.append(chr(ord(t[-1][0]) + 1) + chr(48))
                continue
        # Index each Chinese character in turn
        l = list()
        for st in s.decode('gb2312'):
            st = st.encode('utf-8')
            # Zero-based position, so the last character of a section stays in that section
            i = gbk.index(st)
            # Section codes start from B0; integer division selects the section code
            t1 = '%' + t[t.index('B0'):][i // 94]
            # Index of the character within its section
            i = i % 94
            t2 = '%' + t[i]
            l.append(t1 + t2)
        # Finally, join the codes with spaces
        return ' '.join(l)
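For readers on Python 3, the lookup can be sketched as follows and cross-checked against the built-in gbk codec. This is an illustrative port, not the author's code: get_gbk_code takes the dictionary as an in-memory list, and the table used here is a stand-in holding only the 94 characters of section B0, generated from the gb2312 codec rather than read from the dictionary file.

```python
def get_gbk_code(text, table):
    """Look up each character of text in table and return its %XX%XX codes.

    table is the list of dictionary characters in GB2312 order, starting at
    section B0 (an assumption carried over from the article).
    """
    parts = []
    for ch in text:
        section, offset = divmod(table.index(ch), 94)
        parts.append('%{:02X}%{:02X}'.format(0xB0 + section, 0xA1 + offset))
    return ' '.join(parts)


# Stand-in for the dictionary file: the 94 characters of section B0.
table = [bytes([0xB0, b]).decode('gb2312') for b in range(0xA1, 0xFF)]

text = table[0] + table[93]
print(get_gbk_code(text, table))  # %B0%A1 %B0%FE

# Cross-check against the built-in codec.
expected = ' '.join('%{:02X}%{:02X}'.format(*ch.encode('gbk')) for ch in text)
print(get_gbk_code(text, table) == expected)  # True
```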

I must admit that my Python code is not very neat.
My Weibo ID: James Cooper
