Converting Chinese characters to GBK codes in Python

Source: Internet
Author: User

Note:

Today I wanted to use Python to fetch the search results computed by the Baidu search box. The Chinese characters in the URL are encoded in GBK. Chinese characters can be added to the URL directly, but I had previously written a Python function that converts simplified Chinese characters to GBK codes. It was a little cumbersome, so I reworked it today.

For example, 广 ("wide") is encoded as %B9%E3. %B9 is called the section code (the first byte), and %E3 is the character code (the second byte).
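As a quick cross-check of this example, the standard library can produce the same percent-encoded GBK bytes. The sketch below assumes Python 3 (the article's own code is Python 2), and the helper name gbk_percent_encode is made up for illustration.

```python
from urllib.parse import quote

def gbk_percent_encode(text):
    """Percent-encode text using its GBK byte representation."""
    # safe='' makes quote() escape reserved characters such as '/' as well
    return quote(text.encode('gbk'), safe='')

print(gbk_percent_encode('广'))  # %B9%E3
```

This is useful mainly as a reference value when debugging the table-based approach described in the rest of the article.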

Ideas:
Collect the Chinese characters from the GBK encoding table at http://ff.163.com/newflyff/gbk-list/
For practical purposes, select only the commonly used characters in the section "● GBK/2: GB2312 Chinese characters", 3755 characters in total.
Look at the pattern: the section code runs from B0 to D7 and the character code runs from A1 to FE, that is, 16*6-2 = 94 characters per section. Very regular.
Step 1: extract the commonly used Chinese characters with Python and store them in order in a dictionary file, separated by spaces.
Step 2: since the character codes run from A1 to FE with 94 characters per section, first locate the section code, then use the position of the character within its section to locate the character code.
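The counts in the list above can be checked with a few lines of arithmetic. This is only a sanity check in Python 3; the variable names are illustrative, and the five unused positions at the end of section D7 are a property of GB2312 rather than something stated in the steps above.

```python
# Second byte runs from A1 to FE inclusive
chars_per_section = 0xFE - 0xA1 + 1      # 94, the same as 16*6 - 2
# First byte (section code) runs from B0 to D7 inclusive
sections = 0xD7 - 0xB0 + 1               # 40
slots = sections * chars_per_section     # 3760 code positions
# GB2312 leaves the last five positions of section D7 (D7FA-D7FE) unassigned
assigned = slots - 5                     # 3755

print(chars_per_section, sections, slots, assigned)
```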

Implementation:
Step 1: extract Chinese Characters

    with open('E:/GBK.txt') as f:
        # split() with no arguments splits on any whitespace, newlines included
        s = f.read().split()

The split list still contains the repeated section labels (B0, B1, ...) and the full-width characters 0-9 and A-F from the table headers, and these need to be removed.

Decode every item in the split list first (the decoded list is called gbk below), then delete these characters, for example:

    gbk.remove(u'\uff10')

To delete all of these characters, I generated the series of strings with range and then tidied it up in Notepad++. I did not find a simpler method.

    for t in [u'\uff10',u'\uff11',u'\uff12',u'\uff13',u'\uff14',u'\uff15',u'\uff16',u'\uff17',u'\uff18',u'\uff19',u'\uff21',u'\uff22',u'\uff23',u'\uff24',u'\uff25',u'\uff26']:
        gbk.remove(t)
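The sixteen full-width characters do not have to be typed out by hand, because full-width 0-9 occupy U+FF10 to U+FF19 and full-width A-F occupy U+FF21 to U+FF26. A minimal Python 3 sketch, in which the sample list gbk is invented for illustration:

```python
# Full-width digits 0-9 (U+FF10..U+FF19) and letters A-F (U+FF21..U+FF26)
fullwidth = {chr(cp) for cp in range(0xFF10, 0xFF1A)}
fullwidth |= {chr(cp) for cp in range(0xFF21, 0xFF27)}

# Hypothetical sample of the decoded list, mixing table labels and characters
gbk = ['\uff10', '啊', '阿', '\uff21', '埃', '\uff10']

# A filter drops every occurrence, whereas list.remove() drops only the first
gbk = [ch for ch in gbk if ch not in fullwidth]
print(gbk)  # ['啊', '阿', '埃']
```

Filtering may also be the safer choice here because the table repeats its header labels, so a single remove() call per character can leave duplicates behind.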

Next, remove the section labels B0-D7. Extracting the character code later also needs codes of the form A1-FE, so I want to generate such a list, which is convenient for both the deletion and the indexing.

Generate the code series:
The low (second) hex digit runs through 0-9 and A-F, and the high (first) hex digit runs through A-F.
Start from A1 and increment. The boundaries (for example A9 to AA) have to be handled explicitly, using the ord() and chr() functions to convert between characters and their ASCII codes.

    t = ['A1']
    while True:
        if t[-1] == 'FE':
            break
        # Low digit is 0-8 or A-E: just increment it
        if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
            t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
            continue
        # Low digit is 9: the next low digit is A (ASCII 65)
        if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
            t.append(t[-1][0] + chr(65))
            continue
        # Low digit is F: carry into the high digit and restart the low digit at 0
        if ord(t[-1][1]) >= 70:
            t.append(chr(ord(t[-1][0]) + 1) + chr(48))
            continue
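Because these codes are simply the hexadecimal numbers 0xA1 through 0xFE, the same list can also be produced by formatting integers, which avoids the manual boundary handling. A Python 3 sketch (the name codes is illustrative):

```python
# Format every integer from 0xA1 to 0xFE as two uppercase hex digits
codes = ['%02X' % n for n in range(0xA1, 0xFF)]

print(codes[:3], codes[-1], len(codes))  # ['A1', 'A2', 'A3'] FE 94
```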

The resulting list runs from 'A1' through 'FE'.

With this code sequence, the section labels B0-D7 can be deleted from the gbk list.
Finally, check whether any spaces are left. The Unicode code point of the full-width space is \u3000.

    gbk.remove(u'\u3000')

Finally, encode the characters as UTF-8 and save them to the dictionary file.


I uploaded this dictionary file to a network drive. External link: http://dl.dbank.com/c0m9selr6h
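If the web page or the download is not available, an equivalent character list can probably be rebuilt from Python's own codec tables by decoding every byte pair in the B0-D7 / A1-FE range. This is an alternative to the scraping approach described above, not the author's method; it assumes Python 3, and the function name build_level1_table is hypothetical.

```python
def build_level1_table():
    """Return the GB2312 characters in sections B0-D7, in code order."""
    chars = []
    for section in range(0xB0, 0xD8):        # first byte, B0..D7
        for position in range(0xA1, 0xFF):   # second byte, A1..FE
            try:
                chars.append(bytes([section, position]).decode('gb2312'))
            except UnicodeDecodeError:
                # Unassigned code positions are skipped
                pass
    return chars

table = build_level1_table()
print(len(table), table[0])

# Space-separated text, the same layout as the dictionary file in the article
dictionary_text = ' '.join(table)
```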

Step 2: Index Chinese Characters

Indexing is a simple algorithm. The Chinese characters in the dictionary are stored in their original order, and the 3755 Chinese characters in GBK table 2 strictly follow the rule of 94 characters per section. So integer division of a character's index by 94 locates the section code, and the character index minus section index * 94 gives the position of the character within that section. The second code is then looked up with that position in the A1-FE list generated above.
With the algorithm worked out, what remains is coding and debugging.
The Python code, with comments, is attached:
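The index arithmetic described here can be written down directly. The sketch below is Python 3 and uses a 0-based index, with divmod giving the section number and the position inside the section in one step; the function name and the sample indexes are illustrative.

```python
def code_from_index(index):
    """Map a 0-based position in the 3755-character list to a %XX%XX code."""
    section, position = divmod(index, 94)   # 94 characters per section
    first = 0xB0 + section                  # section codes start at B0
    second = 0xA1 + position                # character codes start at A1
    return '%%%02X%%%02X' % (first, second)

print(code_from_index(0))    # %B0%A1
print(code_from_index(93))   # %B0%FE
print(code_from_index(94))   # %B1%A1
```

Indexes 93 and 94 are the interesting cases, since they sit on either side of a section boundary.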

    def getGBKCode(gbkFile='e:/GBK1.1.txt', s=''):
        # gbkFile: dictionary file containing the 3755 Chinese characters
        # s: the Chinese characters to convert, currently gb2312-encoded,
        #    that is, the encoding of Chinese text entered from IDLE (Python 2)

        # Read the dictionary
        with open(gbkFile) as f:
            gbk = f.read().split()

        # Generate the index codes A1-FE
        t = ['A1']
        while True:
            if t[-1] == 'FE':
                break
            if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
                t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
                continue
            if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
                t.append(t[-1][0] + chr(65))
                continue
            if ord(t[-1][1]) >= 70:
                t.append(chr(ord(t[-1][0]) + 1) + chr(48))
                continue

        # Index each Chinese character in sequence
        l = list()
        for st in s.decode('gb2312'):
            st = st.encode('utf-8')
            # 0-based index, so that i // 94 and i % 94 also hold for the
            # last character of each section
            i = gbk.index(st)
            # The section codes start from B0; get the section code of the character
            t1 = '%' + t[t.index('B0'):][i // 94]
            # Index of the character within its section
            i = i % 94
            t2 = '%' + t[i]
            l.append(t1 + t2)
        # The output codes are separated by spaces
        return ' '.join(l)
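Since the function relies entirely on the dictionary file being in table order, it may be worth checking the list before trusting the results. The following Python 3 sketch compares each character's position with the bytes the built-in gbk codec produces; the function name find_mismatches and the sample lists are assumptions made for illustration.

```python
def find_mismatches(table):
    """Return the characters whose position in table disagrees with the GBK codec."""
    bad = []
    for index, ch in enumerate(table):
        section, position = divmod(index, 94)
        expected = bytes([0xB0 + section, 0xA1 + position])
        # errors='replace' keeps the check from raising on characters
        # that GBK cannot encode; they simply count as mismatches
        if ch.encode('gbk', errors='replace') != expected:
            bad.append(ch)
    return bad

print(find_mismatches(['啊', '阿', '埃']))   # []
print(find_mismatches(['啊', '埃', '阿']))   # ['埃', '阿']
```

An empty result suggests the list is in the expected order; a leftover label or space would shift every later character and show up as a long run of mismatches.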

I must admit that my python code is not so neat.
Attach my weibo ID: James Cooper
