[Open Source] pinyin sorting function library released
If you reprint this article, please credit the source and the author: http://blog.csdn.net/absurd
Contact information: Li xianjing <xianjimli at Hotmail dot com>
Last Updated: 2006-12-19
Recently we have been working on the design of a resource manager, and the spec requires Chinese file names to be sorted by pinyin. So I spent some time studying the pinyin sorting problem, and then another two hours writing a function library. As it turns out, sorting by pinyin is very simple to implement. I am posting it here for your reference.
We know that every character in a computer has an internal code. By default, when the computer sorts characters, comparing two characters means comparing the values of their internal codes. This is no problem for English, because the internal codes of English letters increase in alphabetical order. Chinese is more troublesome. First, there are several ways to sort Chinese, such as by internal code, by pinyin, and by stroke count. You must specify the sorting method through a parameter; otherwise the computer sorts by internal code. Second, the internal code order of Chinese characters differs from both the pinyin order and the stroke order. In GB2312, Chinese characters are basically arranged by pinyin (it is said that there are exceptions, but I am not sure of the details). GBK extends GB2312 and stays compatible with every character in GB2312, so taken as a whole it is not in pinyin order. In Unicode, the arrangement of Chinese characters appears to be even less regular.
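To make the default behaviour concrete, here is a minimal sketch in C of sorting by internal code only. The function names compare_by_code and sort_by_code are made up for this illustration, and the characters are assumed to be stored as Unicode code points in wchar_t.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* qsort callback: order two characters by internal code only. */
static int compare_by_code(const void *pa, const void *pb)
{
    wchar_t a = *(const wchar_t *)pa;
    wchar_t b = *(const wchar_t *)pb;
    return (a > b) - (a < b);
}

/* Sort an array of characters the way a default sort would
 * (hypothetical helper, for illustration only). */
void sort_by_code(wchar_t *chars, size_t count)
{
    assert(chars != NULL);
    qsort(chars, count, sizeof chars[0], compare_by_code);
}
```

With this ordering, U+4E00 (一, yi) comes before U+554A (啊, a) even though "a" precedes "yi" in pinyin, which is exactly the mismatch described above.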
To resolve the conflict between the internal code order and the order users expect (such as pinyin order), the glibc locale data includes collation rules (LC_COLLATE). I took a look at the locale data shipped with glibc-2.3.5. In the locale definition for Simplified Chinese (zh_CN), the sorting method is described as follows:
% ISO 14651 collation sequence
LC_COLLATE
copy "iso14651_t1"
END LC_COLLATE
That is to say, the sorting rules of the ISO 14651 T1 template (the file iso14651_t1) are copied. Opening that file shows that it contains no special handling for Chinese, so we can infer that glibc by default sorts Chinese characters by Unicode code point. In other words, glibc does not provide pinyin sorting, and we have to implement it ourselves.
Sorting Chinese characters is actually very simple. In another article I described how to obtain the pinyin of a Chinese character. We can look up the pinyin of the characters and then compare the pinyin strings. Will this perform poorly? Not really. The pinyin lookup may look a little slow, but one call to the comparison function needs essentially only one pinyin lookup per string, because the lookup is performed only at the position where the internal codes of the two characters differ.
If all we want is to compare Chinese characters in pinyin order, there is a simpler method that does not need to store the pinyin data at all. We only need to sort all the Chinese characters by pinyin in advance; the offset of each character in the sorted list can then serve as its comparison value.
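The idea of looking up pinyin only where the internal codes differ might be sketched as follows. This is an illustration under assumptions, not the code of the released library: lookup_pinyin, pinyin_compare and the four-entry demo_table are hypothetical names and data, and a real table would have to cover every Chinese character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical demo data; a real table would cover all Chinese characters. */
struct pinyin_entry { wchar_t ch; const char *pinyin; };

static const struct pinyin_entry demo_table[] = {
    { 0x4E00, "yi"    },  /* 一 */
    { 0x4E2D, "zhong" },  /* 中 */
    { 0x554A, "a"     },  /* 啊 */
    { 0x6587, "wen"   },  /* 文 */
};

/* Return the pinyin of ch, or NULL if ch is not in the demo table. */
static const char *lookup_pinyin(wchar_t ch)
{
    size_t i;
    for (i = 0; i < sizeof demo_table / sizeof demo_table[0]; i++) {
        if (demo_table[i].ch == ch)
            return demo_table[i].pinyin;
    }
    return NULL;
}

/* Compare two strings; pinyin is consulted only at the first difference. */
int pinyin_compare(const wchar_t *a, const wchar_t *b)
{
    const char *pa, *pb;

    assert(a != NULL && b != NULL);
    /* Skip the common prefix: equal codes need no pinyin lookup. */
    while (*a != L'\0' && *a == *b) {
        a++;
        b++;
    }
    if (*a == *b)
        return 0;  /* both strings ended together */

    pa = lookup_pinyin(*a);
    pb = lookup_pinyin(*b);
    if (pa != NULL && pb != NULL) {
        int r = strcmp(pa, pb);
        if (r != 0)
            return r;
    }
    /* Fall back to internal code for non-Chinese characters or homophones. */
    return (*a > *b) - (*a < *b);
}
```

Falling back to the internal code when the pinyin is equal or unknown is a design choice made for this sketch; it keeps the ordering total, so the function can be used safely as a sort comparator.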
How can we find all the Chinese characters? If you only need the characters in GB2312, you can follow the method described in my other article. If you need the characters in GBK/Unicode, note that the Chinese characters in GBK and in Unicode correspond one to one, and that in Unicode the range of Chinese characters is 0x4e00-0x9fa5, so a simple loop can print all of them.
How do we sort them by pinyin? That is easy. Many tools can do it, such as WPS, Word, PageMaker, and Excel. Sorting in Word is too slow and limits the number of rows, so use Excel.
How should the data be organized? Simply build a mapping table from Unicode code point to sort offset. To keep comparisons between Chinese and non-Chinese characters working, 0x4e00 must be added to the sort offset. To save space, note that nothing below 0x4e00 is a Chinese character, so the table only needs 0x9fa5 - 0x4e00 + 1 entries, and 0x4e00 must be subtracted from the Unicode code point before indexing the table.
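The loop over the Unicode range could look roughly like the sketch below. It assumes the output should be UTF-8 text with one character per line (convenient for pasting into a spreadsheet); encode_utf8_3 and dump_chinese_chars are names invented for this example.

```c
#include <assert.h>
#include <stdio.h>

#define CJK_FIRST 0x4E00u
#define CJK_LAST  0x9FA5u

/* Encode a code point in the range 0x0800-0xFFFF as three UTF-8 bytes. */
void encode_utf8_3(unsigned int cp, char out[4])
{
    assert(cp >= 0x0800u && cp <= 0xFFFFu);
    out[0] = (char)(0xE0u | (cp >> 12));
    out[1] = (char)(0x80u | ((cp >> 6) & 0x3Fu));
    out[2] = (char)(0x80u | (cp & 0x3Fu));
    out[3] = '\0';
}

/* Write every character in CJK_FIRST..CJK_LAST to fp, one per line.
 * Returns the number of characters written. */
unsigned int dump_chinese_chars(FILE *fp)
{
    unsigned int cp, count = 0;
    char buf[4];

    assert(fp != NULL);
    for (cp = CJK_FIRST; cp <= CJK_LAST; cp++) {
        encode_utf8_3(cp, buf);
        fprintf(fp, "%s\n", buf);
        count++;
    }
    return count;
}
```

Calling dump_chinese_chars(stdout) and redirecting the output to a file yields 20902 lines (0x9fa5 - 0x4e00 + 1), which can then be loaded into a tool that sorts by pinyin.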
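One possible layout for such a table is sketched below. The names pinyin_rank, pinyin_sort_key and pinyin_compare_char are hypothetical, and the four ranks in the table are made-up demo values that only preserve the relative pinyin order of those four characters; a real table would be generated from the full sorted list.

```c
#include <assert.h>
#include <wchar.h>

#define CJK_FIRST 0x4E00
#define CJK_LAST  0x9FA5
#define CJK_COUNT (CJK_LAST - CJK_FIRST + 1)

/* pinyin_rank[code - CJK_FIRST] is the position of the character in the
 * pinyin-sorted list. Only four demo entries are filled in here; every
 * other entry defaults to 0. */
static const unsigned short pinyin_rank[CJK_COUNT] = {
    [0x554A - CJK_FIRST] = 1,  /* 啊 a     */
    [0x6587 - CJK_FIRST] = 2,  /* 文 wen   */
    [0x4E00 - CJK_FIRST] = 3,  /* 一 yi    */
    [0x4E2D - CJK_FIRST] = 4,  /* 中 zhong */
};

/* Map a character to its comparison value. Chinese characters are mapped
 * back into the 0x4e00.. range so they still compare sensibly against
 * non-Chinese characters, which keep their own code. */
unsigned long pinyin_sort_key(wchar_t ch)
{
    unsigned long code = (unsigned long)ch;

    if (code >= CJK_FIRST && code <= CJK_LAST) {
        unsigned short rank = pinyin_rank[code - CJK_FIRST];
        assert(rank < CJK_COUNT);
        return CJK_FIRST + rank;
    }
    return code;
}

/* Compare two characters by sort key. */
int pinyin_compare_char(wchar_t a, wchar_t b)
{
    unsigned long ka = pinyin_sort_key(a);
    unsigned long kb = pinyin_sort_key(b);
    return (ka > kb) - (ka < kb);
}
```

Because every rank is smaller than the table size, the keys of Chinese characters stay inside 0x4e00-0x9fa5, so a Latin letter such as 'z' still sorts before any Chinese character, as it does under plain Unicode order.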
If you are interested, you can download it here.
(If your program needs both to obtain pinyin and to sort by it, I recommend using the pinyin-based comparison, so that the two functions can share the same data.)
~~ End ~~