Sorting Chinese characters in Python
This article describes how to sort Chinese characters in Python. The details are as follows:
When Python compares strings, it uses the encoding values returned by the ord function. The sort function built on that comparison easily sorts digits and English letters, because they are already in order in the encoding table.
    >>> print ',' < '1' < 'A'
    True
However, Chinese characters are not that easy to handle. There are two common ways to order Chinese: by pinyin and by strokes. In GB2312, the most commonly used Chinese standard character set, the 3755 first-level characters are encoded in pinyin order, while the 3008 second-level characters are arranged by radical and stroke count. Every second-level character therefore sorts after every first-level character, whatever its pinyin:
    >>> print 'output' < 'Salmon', 'zeng' < 'yi'
    True True
In each pair the first character (for example 'zeng') is a first-level character, while the second (for example 'yi') is a second-level character. By both strokes and pinyin, however, the order of each pair should be reversed. The later extensions GBK and GB18030 kept the existing code positions for backward compatibility rather than reordering the characters, so sorting by encoded value gives a jumbled order.
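The level of a character can be read off the lead byte of its GB2312 encoding. The following is a minimal sketch, assuming Python 3 and the standard GB2312 layout in which first-level characters have lead bytes 0xB0 to 0xD7 and second-level characters 0xD8 to 0xF7. The function name gb2312_level is hypothetical, and the two sample characters (zeng and yi) are chosen here for illustration.

```python
def gb2312_level(ch):
    """Return 1 or 2 for a GB2312 first- or second-level hanzi, else None."""
    try:
        raw = ch.encode('gb2312')
    except UnicodeEncodeError:
        return None  # the character is not in GB2312 at all
    if len(raw) != 2:
        return None  # ASCII and other single-byte characters
    lead = raw[0]
    if 0xB0 <= lead <= 0xD7:
        return 1  # first level, encoded in pinyin order
    if 0xD8 <= lead <= 0xF7:
        return 2  # second level, encoded by radical and stroke count
    return None  # symbols, kana and other non-hanzi rows


for ch in '曾怡':  # zeng, yi
    print(ch, ch.encode('gb2312').hex(), gb2312_level(ch))
```

Because every second-level lead byte is larger than every first-level lead byte, a byte-wise comparison always puts the second-level character last.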
Unicode, on the other hand, arranges Chinese characters by the radicals and stroke counts of the Kangxi Dictionary, so its sorting result differs from that of the GB encodings.
    # encoding=utf8
    char = ['赵', '钱', '孙', '李', '佘']  # Zhao, Qian, Sun, Li, She
    char.sort()
    for item in char:
        print item.decode('utf-8').encode('gb2312')

The output is "佘孙李赵钱" (She, Sun, Li, Zhao, Qian). If the file is saved in gb2312 encoding instead:

    # encoding=gb2312
    char = ['赵', '钱', '孙', '李', '佘']  # Zhao, Qian, Sun, Li, She
    char.sort()
    for item in char:
        print item

The output is "李钱孙赵佘" (Li, Qian, Sun, Zhao, She). Obviously, neither of these two results is what we want. So how can we sort Chinese correctly?
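For readers on Python 3, where str is Unicode and the source-file encoding no longer affects comparisons, both orders can be reproduced in one script. This is a sketch under that assumption, using the five surnames Zhao, Qian, Sun, Li and She as sample data. The variable names are illustrative.

```python
surnames = ['赵', '钱', '孙', '李', '佘']  # Zhao, Qian, Sun, Li, She

# Python 3 compares str values by Unicode code point.
by_unicode = sorted(surnames)

# Sorting on the GB2312-encoded bytes reproduces the byte-string order
# that Python 2 gives for a gb2312-encoded source file.
by_gb2312 = sorted(surnames, key=lambda s: s.encode('gb2312'))

print(''.join(by_unicode))
print(''.join(by_gb2312))
```

Neither list is in pinyin order (Li, Qian, She, Sun, Zhao), which is why a separate lookup table is needed.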
First, we need to pin down the ordering rules of a Chinese dictionary. Characters are sorted by pinyin first, with the four tones distinguished. If the pinyin is the same, the number of strokes decides. If the number of strokes is also the same, the order is decided by the type of each stroke, taken in stroke order; the Xinhua Dictionary ranks stroke types as horizontal, vertical, left-falling, dot, turning. Chinese sorting therefore needs not only a pinyin table with tones but also stroke-order data for each character.
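The three-level rule maps naturally onto a tuple key. The following is a minimal sketch in Python 3, not the implementation used later in this article. TABLE is a tiny hand-entered sample, and the digit coding of strokes (1 horizontal, 2 vertical, 3 left-falling, 4 dot, 5 turning) is an assumption made for illustration.

```python
# Hypothetical sample rows: character -> (pinyin with tone digit, stroke code)
TABLE = {
    '八': ('ba1', '34'),
    '巴': ('ba1', '5215'),
    '扒': ('ba1', '12134'),
    '叭': ('ba1', '25134'),
    '把': ('ba3', '1215215'),
    '爸': ('ba4', '34345215'),
}


def char_key(ch):
    """Sort key: pinyin with tone, then stroke count, then stroke sequence."""
    pinyin, strokes = TABLE[ch]
    return (pinyin, len(strokes), strokes)


chars = ['爸', '叭', '把', '扒', '巴', '八']
print(sorted(chars, key=char_key))
```

The four ba1 characters are separated first by stroke count and, for the two five-stroke characters, by the stroke sequence.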
I expected to find a ready-made module, but the few I tried were not ideal. Pyzh's conversion code covers only a limited set of characters and has no tones. roy's code from shuimu covers more than 20 thousand characters, but it requires pysqlite. So I decided to do it myself.
The most comprehensive data I have found is the Unicode Chinese character encoding table uploaded to csdn by slowwind9999. It includes, for all 20902 Chinese characters, the full pinyin, Wubi, Zhengma, Unicode, GBK, stroke count, radical, and stroke-order code (the pinyin has no tones, and some readings are incorrect). I extracted the stroke data and used the "Practical Chinese character to PinYin" program from Jiangzhi key to create a toned edition of the Unicode Chinese character table. The characters are marked with the four tones, 319 Japanese and Korean characters are not distinguished, and I made slight corrections based on other data (errors may still remain). With these two lookup tables, the rest of the work is simple.
    # Build the pinyin dictionary
    dic_py = dict()
    f_py = open('py.txt', 'r')
    content_py = f_py.read()
    lines_py = content_py.split('\n')
    n = len(lines_py)
    for i in range(0, n - 1):
        # Each line is "character<TAB>pinyin"
        word_py, mean_py = lines_py[i].split('\t', 1)
        dic_py[word_py] = mean_py
    f_py.close()
The stroke-order dictionary is built the same way. Although each file has about 20 thousand lines, loading is fast, about 0.5 seconds. Merging the two files and processing them together could be faster still.
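The loading step, and the merge suggested above, might look like the following in Python 3. This is a sketch only: load_table is a hypothetical helper, the in-memory io.StringIO objects stand in for py.txt and the stroke file, and the tab-separated "character, value" line layout is taken from the loading code in this article.

```python
import io


def load_table(fileobj):
    """Read 'character<TAB>value' lines into a dict, skipping blank lines."""
    table = {}
    for line in fileobj:
        line = line.rstrip('\r\n')
        if not line:
            continue
        char, value = line.split('\t', 1)
        table[char] = value
    return table


# Stand-ins for the two data files; real code would pass
# open('py.txt', encoding='utf-8') and the stroke file instead.
py_file = io.StringIO('八\tba1\n巴\tba1\n爸\tba4\n')
bh_file = io.StringIO('八\t34\n巴\t5215\n爸\t34345215\n')

dic_py = load_table(py_file)
dic_bh = load_table(bh_file)

# One merged table: character -> (pinyin, stroke code)
merged = {ch: (py, dic_bh.get(ch)) for ch, py in dic_py.items()}
print(merged['巴'])
```

With a merged table, each character needs a single dictionary lookup instead of two.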
    # Dictionary lookup function
    def searchdict(dic, uchar):
        if isinstance(uchar, str):
            uchar = unicode(uchar, 'utf-8')
        if uchar >= u'\u4e00' and uchar <= u'\u9fa5':
            # Chinese character: look it up by its UTF-8 form
            value = dic.get(uchar.encode('utf-8'))
            if value is None:
                value = '*'
        else:
            # Anything else is returned unchanged
            value = uchar
        return value
Chinese characters are converted to UTF-8 strings for the lookup; all other characters are returned as they are. If you only need the initial, output just the first letter of the pinyin. As long as the data is accurate, comparison is easy. Because digits sort before letters, love (ai4) comes before ang2. In the stroke-order code, the number of digits is the number of strokes and each digit's value is the weight of that stroke, so the codes can be compared directly as numbers to get the correct order. The code is as follows:
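Both comparisons can be checked in a few lines. This is a small Python 3 sketch; stroke_value is a hypothetical helper, and the stroke codes assume one digit from 1 to 5 per stroke.

```python
def stroke_value(code):
    """Turn a stroke-order code such as '5215' into a comparable integer."""
    return int(code)


# The tone digit '4' sorts before the letter 'n', so ai4 precedes ang2.
print('ai4' < 'ang2')

# More strokes means more digits, so the longer code is the larger number.
print(stroke_value('5215') < stroke_value('12134'))

# With equal stroke counts, the first differing stroke decides.
print(stroke_value('12134') < stroke_value('25134'))
```

The numeric comparison works because no digit is zero, so a code with more digits is always the larger number.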
    # Compare two single characters.
    # Returns -1 if they rank equal, 1 if A sorts after B, 0 if A sorts before B.
    def comp_char_PY(A, B):
        if A == B:
            return -1
        pyA = searchdict(dic_py, A)
        pyB = searchdict(dic_py, B)
        if pyA > pyB:
            return 1
        elif pyA < pyB:
            return 0
        else:
            # Same pinyin and tone: fall back to the stroke-order code
            bhA = searchdict(dic_bh, A)
            bhB = searchdict(dic_bh, B)
            if bhA == '*' or bhB == '*':
                return -1  # no stroke data: treat as equal
            bhA = int(bhA)
            bhB = int(bhB)
            if bhA > bhB:
                return 1
            elif bhA < bhB:
                return 0
            else:
                return -1  # identical pinyin and strokes: treat as equal

    # Compare two strings; a true result means A sorts after B
    def comp_char(A, B):
        charA = A.decode('utf-8')
        charB = B.decode('utf-8')
        n = min(len(charA), len(charB))
        i = 0
        dd = len(charA) > len(charB)  # result when one string is empty
        while i < n:
            dd = comp_char_PY(charA[i], charB[i])
            if dd == -1:
                i = i + 1
                if i == n:
                    # One string is a prefix of the other: shorter first
                    dd = len(charA) > len(charB)
            else:
                break
        return dd

    # Sorting function (insertion sort, in place)
    def cnsort(nline):
        n = len(nline)
        for i in range(1, n):
            tmp = nline[i]
            j = i
            while j > 0 and comp_char(nline[j - 1], tmp):
                nline[j] = nline[j - 1]
                j -= 1
            nline[j] = tmp
        return nline
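In Python 3 the same ordering can be expressed as a key function for the built-in sorted, which avoids the hand-written insertion sort. The following is a sketch under assumptions, not a port of the code above. TABLE holds a few hand-entered sample rows, the stroke codes use the digits 1 to 5, and char_key and cn_key are hypothetical names. Characters missing from the table fall back to the character itself, mirroring how searchdict returns non-Chinese characters unchanged.

```python
# Hypothetical sample rows: character -> (pinyin with tone digit, stroke code)
TABLE = {
    '赵': ('zhao4', '121213434'),
    '钱': ('qian2', '3111511534'),
    '孙': ('sun1', '521234'),
    '李': ('li3', '1234521'),
    '佘': ('she2', '3411234'),
}


def char_key(ch):
    """Key for one character: pinyin, stroke count, stroke sequence."""
    entry = TABLE.get(ch)
    if entry is None:
        return (ch, 0, '')  # not in the table: compare by the character itself
    pinyin, strokes = entry
    return (pinyin, len(strokes), strokes)


def cn_key(text):
    """Key for a whole string: the list of its per-character keys."""
    return [char_key(ch) for ch in text]


names = ['赵', '钱', '孙', '李', '佘', '李赵', 'abc']
print(sorted(names, key=cn_key))
```

Lists compare element by element and a shorter list sorts before a longer one that it prefixes, which gives the same "shorter string first" rule as comp_char.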
Now we can sort Chinese according to the dictionary rules.
    char = ['赵', '钱', '孙', '李', '佘']  # Zhao, Qian, Sun, Li, She
    char = cnsort(char)
    for item in char:
        print item.decode('utf-8').encode('gb2312')

This finally gives "李钱佘孙赵" (Li, Qian, She, Sun, Zhao).
I have not handled polyphonic characters, which have more than one reading. If you want the program to identify the reading automatically, you can add a lookup table of polyphonic phrases and use the context to decide. I don't know where such data can be found. If there are not many polyphonic characters, manual adjustment is enough.
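One possible shape for such a phrase table is a longest-match lookup that overrides the default reading. This is only a sketch in Python 3. DEFAULT and PHRASES are tiny hand-made samples, readings is a hypothetical function, and a real table would need far more entries.

```python
# Hypothetical sample data: default reading per character, plus phrases
# whose reading differs from the defaults.
DEFAULT = {'银': 'yin2', '行': 'xing2', '人': 'ren2', '重': 'zhong4', '庆': 'qing4'}
PHRASES = {'银行': ['yin2', 'hang2'], '重庆': ['chong2', 'qing4']}


def readings(text):
    """Return one pinyin per character, preferring the longest known phrase."""
    result = []
    max_len = max(len(p) for p in PHRASES)
    i = 0
    while i < len(text):
        # Try the longest candidate phrase first, down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in PHRASES:
                result.extend(PHRASES[chunk])
                i += size
                break
        else:
            # No phrase matched here: use the default reading, or the
            # character itself when it is not in the table.
            result.append(DEFAULT.get(text[i], text[i]))
            i += 1
    return result


print(readings('银行'))
print(readings('行人'))
```

The resulting readings could then replace the single-character pinyin lookup when building sort keys.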