Converting Chinese characters to pinyin in 300 lines of JavaScript
1. The current state of converting Chinese characters to pinyin
First of all, converting Chinese characters to pinyin is a genuinely common requirement: sorting or filtering contacts by pinyin initial, for example, or grouping destinations by first letter in a ticket-purchase flow. Yet a clever implementation seems to be unheard of, especially on the browser side; the known approaches all require a huge dictionary.
Search GitHub and npm for JavaScript solutions and you will find some excellent libraries for converting Chinese characters to pinyin, such as pinyin and pinyinjs. As you can see, both of them ship with a huge dictionary.
These dictionaries typically weigh in at tens or even hundreds of KB (some reach several MB), so it takes some courage to use them in the browser. It is no surprise, then, that our first reaction to a Chinese-to-pinyin requirement is to reject it (or to implement it on the server side).
Now, what if I told you that Chinese characters can be converted to pinyin in the browser with 300 lines of code? Hard to believe, isn't it?
2. Starting from the Android 4.2.2 Contacts code
Special mention goes to this blog post: "Easily convert Chinese characters to pinyin using the Android source code."
Today, I will share with you a solution, extracted from the Android system source code, for converting Chinese characters to pinyin. It takes just one class of a little over 560 lines to convert Chinese characters to pinyin with ease, and no third-party dependencies are required.
Doesn't that break your assumptions? Is there some powerful algorithm that makes the dictionary unnecessary?
After reading the blog post for the first time I was a little disappointed: it contained no analysis of the algorithm and merely presented the few hundred lines of code found in the Android source. The second time, I read the code with the goal of porting it to JavaScript, understood the principle, and began the port.
3. Converting Chinese characters to pinyin in 300 lines of JavaScript, step by step
First, let's go straight to the core question: why do we assume that converting Chinese characters to pinyin requires a massive dictionary?
What is the relationship between a Chinese character and its pinyin? Within the Chinese character range \u4E00-\u9FFF, for example, the first character might be read ha and the last one ze. A character's Unicode code point tells us nothing about its pinyin, so it seems the only option is a huge dictionary recording the pinyin of every Chinese character (or every commonly used one).
However, suppose we sort all Chinese characters in pinyin order, that is, by 'a', 'ai', 'an', 'ang', 'ao', 'ba', ..., 'zui', 'zun', 'zuo'. Then we only need to remember the first character of each run of characters that share the same pinyin, and the required dictionary becomes very small (it has one entry per pinyin syllable, and there are not many syllables).
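The lookup this implies can be sketched as a binary search over the boundary table. The helper name findSyllable, the { first, pinyin } entry layout and the injected compare function are illustrative assumptions, not names from the Android source. The demo deliberately uses Latin letters and code-point order so that it runs without any Chinese collation data.

```javascript
// Hypothetical helper: return the label of the last boundary that is <= ch.
// `boundaries` must already be sorted in the order that `compare` defines.
function findSyllable(ch, boundaries, compare) {
  let lo = 0;
  let hi = boundaries.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (compare(boundaries[mid].first, ch) <= 0) {
      found = mid;   // boundary is not after ch: remember it, search right
      lo = mid + 1;
    } else {
      hi = mid - 1;  // boundary is after ch: search left
    }
  }
  return found === -1 ? null : boundaries[found].pinyin;
}

// Toy data: each entry records the first item of a run sharing one label.
const toyBoundaries = [
  { first: 'a', pinyin: 'group-1' },
  { first: 'h', pinyin: 'group-2' },
  { first: 'q', pinyin: 'group-3' },
];

// Stand-in comparator; a real table would need a pinyin-aware comparison.
const byCodePoint = (x, y) => (x < y ? -1 : x > y ? 1 : 0);

const demoResults = ['c', 'h', 'z'].map((c) =>
  findSyllable(c, toyBoundaries, byCodePoint)
);
console.log(demoResults);
```

With one entry per syllable, each lookup costs only a handful of comparisons, which is why the table can stay small without making conversion slow.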
The difficulty now is sorting Chinese characters by pinyin. Fortunately, the ICU/localization APIs provide exactly this kind of sorting (without a convenient sorting/comparison method, this article might not exist).
This, then, is why 300 lines are enough to convert Chinese characters to pinyin:
Intl.Collator API: Intl.Collator implements locale-aware string sorting internally. We can use Intl.Collator.prototype.compare to sort all Chinese characters by pinyin.
Boundary character table: records the boundary points of the sorted order. Each character in this table is the first character with a given pinyin in the sorted character set (each unihan is the first one within the same pinyin when the collator is zh_CN).
Some points may still be unclear at this stage, so it is best to go straight to the code.
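A minimal sketch of what such a sorting script could look like. The code-point range U+4E00 to U+9FA5 and the helper name sortedHanzi are assumptions made for illustration; only the COLLATOR line follows the text of this article.

```javascript
// Collator for Simplified Chinese; pinyin ordering needs Chinese ICU data.
const COLLATOR = new Intl.Collator(['zh-Hans-CN']);

// Hypothetical helper: collect every character in [start, end] and sort
// the list with the collator's comparison function.
function sortedHanzi(start, end, collator) {
  const chars = [];
  for (let code = start; code <= end; code++) {
    chars.push(String.fromCharCode(code));
  }
  // `compare` is a bound function, so it can be passed to sort() directly.
  return chars.sort(collator.compare);
}

// Assumed range: the basic CJK Unified Ideographs, U+4E00..U+9FA5.
const table = sortedHanzi(0x4e00, 0x9fa5, COLLATOR);

console.log(table.length);                 // number of characters sorted
console.log(table.slice(0, 20).join(''));  // start of the sorted table
```

Whether the output actually follows pinyin order depends on Chinese collation data being available to the runtime.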
If you are interested, run node --icu-data-dir=node_modules/full-icu script.js and check whether the resulting character table is basically sorted by pinyin.
Note the following points:
I say "basically" because the character list we obtain is not perfectly sorted by pinyin; occasionally other characters are inserted in the middle. Keep this in mind when building the boundary table.
The table produced by the script above is the sorted order of all Chinese characters. In places it differs from the table in HanziToPinyin.java in the Android code, so the HanziToPinyin.java table needs to be updated. (This is the biggest pitfall, and the bulk of the work, in porting from Java to JavaScript: correcting the boundary table.)
You have no doubt spotted the core line: const COLLATOR = new Intl.Collator(['zh-Hans-CN']). Intl.Collator (with the locale set to zh-Hans-CN, that is, Simplified Chinese as used in China) is the key to sorting Chinese characters by pinyin. It is the Internationalization API object that compares strings in locale-specific order.
Before executing the script, first run npm i full-icu. This dependency installs the missing Chinese support and tells you how to specify the ICU data file when running the script.
1. ICU
ICU stands for International Components for Unicode; it provides Unicode and internationalization support for applications.
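ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.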
In addition, ICU provides a localized string comparison service (the Unicode Collation Algorithm plus locale-specific comparison rules):
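Collation: Compare strings according to the conventions and standards of a particular language, region or country. ICU's collation is based on the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository, a comprehensive source for this type of data.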
In modern browsers, ICU is generally built in with support for the user's local language, so we can use it directly.
In Node.js, however, ICU usually includes only a subset of locales (typically just English), so we need to add support for Chinese. In most cases, running npm install full-icu installs the missing Chinese support. (See node --icu-data-dir=node_modules/full-icu above.)
2. Intl API
The previous section should have given us a clear picture of internationalization/localization. Here we look at how to use the built-in API.
How do we check whether the runtime supports the user's language?
Intl.Collator.supportedLocalesOf(array|string)
Returns an array containing those of the given locales that are supported without having to fall back to the default locale. The argument can be an array or a string holding the locales (BCP 47 language tags) to test.
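As a small illustration of this check, the sketch below asks the runtime about two sample tags. The tags and the variable names are arbitrary choices for the example, and the result depends on the ICU data the runtime ships with.

```javascript
// Ask the runtime which of the requested locales it can collate natively.
const requested = ['zh-Hans-CN', 'en-US'];
const supported = Intl.Collator.supportedLocalesOf(requested);

// True when at least one Chinese locale is available without fallback.
const hasChinese = supported.some((tag) => tag.toLowerCase().startsWith('zh'));
console.log(supported, hasChinese);

// Structurally invalid tags are rejected with a RangeError, not ignored.
let rejected = false;
try {
  Intl.Collator.supportedLocalesOf('not_a_valid_tag!');
} catch (err) {
  rejected = err instanceof RangeError;
}
console.log(rejected);
```

If hasChinese comes back false in Node.js, that is usually the sign that the full ICU data described above still needs to be installed.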
Constructing a Collator object and sorting strings
Using Intl.Collator.prototype.compare, we can sort strings in the order a language prescribes. For Chinese, that order is mostly pinyin order: 'a', 'ai', 'an', 'ang', 'ao', 'ba', 'bai', 'ban', 'bang', 'bao', 'bei', 'ben', 'beng', 'bi', 'bian', 'biao', 'bie', 'bin', 'bing', 'bo', 'bu', 'ca', 'can', ...
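A short sketch of the difference this makes, using four sample characters chosen for this example (啊 a, 八 ba, 擦 ca, 中 zhong). With Chinese collation data available, the first printed line is expected to follow pinyin order, while the second follows raw code-unit order.

```javascript
// Collator for Simplified Chinese; falls back silently if data is missing.
const collator = new Intl.Collator(['zh-Hans-CN']);

// Sample characters, deliberately listed out of pinyin order.
const words = ['中', '啊', '八', '擦'];

// Copy before sorting so the original array is left untouched.
const byPinyin = [...words].sort(collator.compare);
const byCodeUnit = [...words].sort(); // default sort: UTF-16 code units

console.log(byPinyin.join(' '));
console.log(byCodeUnit.join(' '));
```

The default sort knows nothing about pronunciation, which is exactly why the collator is needed before a boundary table can be built.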
This is the key to converting Chinese characters into pinyin.
4. Correcting the boundary table
Clearly, this boundary table is faulty and needs correcting.
Most of the wrongly converted characters come out as qing, which points to a problem with the boundary character recorded for qing.
Locate this character: it is '\u72c5'/'狅'. Together with the entries immediately before and after it, the table reads ['\u4eb2', '\u72c5', '\u828e']/['亲', '狅', '芎'].
A lookup shows that '\u72c5'/'狅' can be read as qing but is now more commonly read as kuang, which is most likely the cause of the error.
According to the sorted table of all Chinese characters obtained at the start, the first character for qing is '\u9751'/'靑'.
After this change, only 104 characters still fail to convert.
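The correction described above amounts to swapping a single boundary character. A minimal sketch, assuming a hypothetical table layout of { first, pinyin } entries and a hypothetical patchBoundary helper; the three code points come from the text, while the labels qin and qiong for the neighbouring entries are assumed from alphabetical pinyin order.

```javascript
// Excerpt of a boundary table around qing (entry layout is hypothetical).
const excerpt = [
  { first: '\u4eb2', pinyin: 'qin' },
  { first: '\u72c5', pinyin: 'qing' },  // also read kuang, per the text
  { first: '\u828e', pinyin: 'qiong' },
];

// Hypothetical helper: return a new table in which the boundary character
// of one syllable is replaced, leaving the input table unchanged.
function patchBoundary(table, pinyin, newFirst) {
  return table.map((entry) =>
    entry.pinyin === pinyin ? { ...entry, first: newFirst } : entry
  );
}

// Use the first qing character taken from the freshly sorted table.
const patched = patchBoundary(excerpt, 'qing', '\u9751');
console.log(patched.map((e) => e.first + ':' + e.pinyin).join(' '));
```

The same kind of patch would be repeated for each remaining boundary that disagrees with the sorted table.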