JavaScript conversion between Chinese characters and Pinyin: the ultimate solution, with a JS Pinyin input method



Preface


There are many articles online about converting between Chinese characters and Pinyin in JS, but they are messy and mostly copied from one another. Some do not support polyphones (characters with more than one reading), some do not support tones, and some ship a dictionary file that is too large. For example, sometimes I only need the first letters of the pinyin but still have to include a 200KB dictionary file. None of them lets you load only what the task at hand requires.



So I carefully collated and corrected several common dictionary files found online and wrapped them in a simple utility library that can be used directly.


Code and demo


GitHub project: https://github.com/liuxianan/pinyinjs



Full demo: http://demo.liuxianan.com/pinyinjs/



[Screenshot: Chinese characters to Pinyin]


[Screenshot: Pinyin to Chinese characters]


Background on Chinese characters and Pinyin


Chinese character range


The range of Chinese characters in Unicode is generally taken to be /^[\u2E80-\u9FFF]+$/ (11904-40959), but many code points in that range are not Chinese characters, or at least not characters with a pronunciation. The range used in this article is /^[\u4E00-\u9FA5]+$/ (19968-40869), plus the standalone character 〇, whose Unicode position is 12295.
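
As a minimal sketch of the range check described above: the function name `isHanzi` is hypothetical and not part of the library, and folding the standalone character 〇 (U+3007, position 12295) into the character class is an assumption made here for convenience.

```javascript
// Range used in this article: U+4E00..U+9FA5 (19968-40869).
// Assumption: the standalone 〇 (U+3007) is added to the class as well.
var HANZI_PATTERN = /^[\u4E00-\u9FA5\u3007]+$/;

// Hypothetical helper: true only if every character is in the range.
function isHanzi(str) {
  return HANZI_PATTERN.test(str);
}

console.log(isHanzi("小茗同学"));      // all four characters are in range
console.log(isHanzi("pinyin"));        // Latin letters are not
console.log("一".charCodeAt(0));       // first code point of the range
```
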


Pinyin combinations


Pinyin has 21 initials: b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s. It has 24 finals, of which 6 are simple finals (a, o, e, i, u, v) and 18 are compound finals (ai, ei, ui, ao, ou, iu, ie, ve, er, an, en, in, un, vn, ang, eng, ing, ong). If every initial were paired with every final there would be 24x21=504 combinations, but in practice some combinations are meaningless, such as bv, gie, ve, etc. After removing these, 412 remain.
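
The 504 figure can be reproduced with a short sketch. The variable names below are illustrative only; the lists are the initials and finals enumerated above.

```javascript
// The 21 initials and 24 finals listed in this article.
var INITIALS = "b p m f d t n l g k h j q x zh ch sh r z c s".split(" ");
var FINALS = ("a o e i u v " +
  "ai ei ui ao ou iu ie ve er an en in un vn ang eng ing ong").split(" ");

// Naive pairing of every initial with every final.
var allPairs = [];
INITIALS.forEach(function (initial) {
  FINALS.forEach(function (final) {
    allPairs.push(initial + final);
  });
});

console.log(INITIALS.length, FINALS.length, allPairs.length);
```

The naive list still contains meaningless pairs such as bv and gie, which is why the number of real syllables is smaller.
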


Pinyin dictionary files


The dictionary files are introduced in order of size, from smallest to largest.


Dictionary one: Pinyin first letter


The contents of the dictionary file are as follows:


/**
 * Pinyin first-letter dictionary file
 */
var pinyin_dict_firstletter = {};
pinyin_dict_firstletter.all = "YDYQSXMWZSSXJBYMGCCZQPSSQBYCDSCDQLDYLYBSSJG ...";
pinyin_dict_firstletter.polyphone = {"19969": "DZ", "19975": "WM", "19988": "QJ", "20048": "YL", ...};


This dictionary concatenates the pinyin first letters of the 20,902 Chinese characters in the Unicode range 4E00 (19968) to 9FA5 (40869) into one very long string, and then lists the polyphonic characters (370 in total) separately. The dictionary file is 25KB.



The advantages of this dictionary file are its small size and polyphone support; the disadvantage is that it can only give the first letter of the pinyin.
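
A lookup against this structure can be sketched as follows. This is not the library's actual implementation: `sampleDict` is a hypothetical miniature holding only the first 21 letters of the real string plus two polyphone entries, and `lookupFirstLetters` is an illustrative name.

```javascript
// Hypothetical miniature of the first-letter dictionary (U+4E00..U+4E14 only).
var sampleDict = {
  all: "YDYQSXMWZSSXJBYMGCCZQ",
  polyphone: { "19969": "DZ", "19975": "WM" }
};

// Returns the candidate first letters of one character as an array.
// Characters outside the covered range are returned unchanged.
function lookupFirstLetters(ch, dict, withPolyphone) {
  var code = ch.charCodeAt(0);
  var index = code - 0x4E00; // offset into the long string
  if (index < 0 || index >= dict.all.length) return [ch];
  if (withPolyphone && dict.polyphone[code]) {
    return dict.polyphone[code].split("");
  }
  return [dict.all.charAt(index)];
}

console.log(lookupFirstLetters("一", sampleDict, false));
console.log(lookupFirstLetters("丁", sampleDict, true));
```

Because the string is indexed by code point offset, no per-character keys are needed, which is what keeps the file small.
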


Dictionary two: commonly used Chinese characters


This dictionary file groups Chinese characters by pinyin, with 401 combinations in total. It contains 6,763 commonly used characters and does not support polyphones. Because it was collected from the web and includes relatively few characters, the file is only 24KB; I may expand it later if there is room.



The dictionary file looks roughly like this (only a small part is shown as an example):


/**
 * Regular Pinyin data dictionary, which contains 6763 common Chinese characters, does not support polyphonic characters
 */
var pinyin_dict_notone =
{
    "a": "Ah A 锕",
    "ai": "Ai Ai Ai Ai Ai Ai Ai Ai Ai Ai Ai Ai Ai Ai Ai",
    "an": "Ammonia hydrazine, eucalyptus, quail ammonium",
    "ang": "Dirty Ang",
    "ao": "Ao Ao, Ao Ao, Ao Ao, Ao, Ao, Ao, Ao, Ao",
    "ba": "Ba Ba Ba Ba Ba Ba Ba Ba Ba Ba Ba Ba Ba Ba Ba 茇 菝 萆 捭 岜 灞 杷",
    "bai": "Baibai Baibai Baibai defeated Baibai",
    "ban": "Banban moved the board-like version of the plate to play with the companion flaps and half of it, said Ban Sakazaka",
    "bang": "Bang Bang Bang Bang Bang Bang Bang Bang Bang Bang Bang Biao Bang Crab",
    "bao": "Bao Bao Bao hail Bao Bao full Bao Bao Bao Bao Bao Bao Bao Bao spores",
    "bo": "Peeling thin glass bobblebobobobobobobobo neck neck bobobobobo",
    "bei": "The Beibei Beibei Beibei Beibei is more and more tired and exhausted, and is betrayed",
    "ben": "Ben Benben Stupid"
    // omit others
};


Later I gradually found many errors in this dictionary file. For example, the pinyin of 虐 was written as nue (the correct spelling is nve) and 躺 was written as thang. It also does not support polyphones, so I regenerated a dictionary file in the same format from other dictionary files:


    • A total of 404 pinyin combinations
    • 6,763 commonly used Chinese characters are included
    • Polyphones supported
    • Tones not supported
    • File size is 27kb


I also took a usage-frequency table for these 6,763 common characters that I found online and sorted the characters by frequency, which makes a passable JS input method possible.
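
To illustrate how a dictionary of this shape can serve both directions, here is a minimal sketch. `sampleNotone` is a hypothetical four-entry sample, and the helper names `candidatesFor`, `invert` and `toPinyin` are illustrative rather than the library's API.

```javascript
// Hypothetical four-entry sample in the same shape as pinyin_dict_notone.
var sampleNotone = {
  "ming": "明名命鸣铭茗",
  "tong": "同通统痛",
  "xiao": "小笑校",
  "xue": "学雪血"
};

// Pinyin -> characters is a direct key lookup.
function candidatesFor(pinyin, dict) {
  return dict[pinyin] || "";
}

// Characters -> pinyin needs the inverse map, built once up front.
// A character listed under several keys collects several readings.
function invert(dict) {
  var map = {};
  Object.keys(dict).forEach(function (pinyin) {
    dict[pinyin].split("").forEach(function (ch) {
      (map[ch] = map[ch] || []).push(pinyin);
    });
  });
  return map;
}

// Converts a string using the first reading of each character;
// characters missing from the map are passed through unchanged.
function toPinyin(str, map, splitter) {
  return str.split("").map(function (ch) {
    return map[ch] ? map[ch][0] : ch;
  }).join(splitter === undefined ? " " : splitter);
}

console.log(toPinyin("小茗同学", invert(sampleNotone)));
console.log(candidatesFor("ming", sampleNotone));
```

With the characters in each entry sorted by frequency, the first candidates returned for a syllable are the most common ones.
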



In addition, a more complete dictionary file shows that there are in fact 412 pinyin combinations. The 8 pronunciations that do not appear in the dictionary file above are:


chua, den, eng, fiao, m, kei, nun, shei


Dictionary three: the ultimate dictionary


First, I found a dictionary file with the following structure online (hereinafter dictionary A; I no longer remember exactly where it came from). It supports tones and polyphones and lists the pinyin of all 20,902 Chinese characters in the Unicode range 4E00 (19968) to 9FA5 (40869), or 20,903 if 〇 is counted. The dictionary file is 280KB:


3007 (ling2)
4E00 (yi1)
4E01 (ding1,zheng1)
4E02 (kao3)
4E03 (qi1)
4E04 (shang4,shang3)
4E05 (xia4)
4E06 (none0)
4E07 (wan4,mo4)
4E08 (zhang4)
4E09 (san1)
4E0A (shang4,shang3)
4E0B (xia4)
4E0C (ji1)
4E0D (bu4,bu2,fou3)
4E0E (yu3,yu4,yu2)
4E0F (mian3)
4E10 (gai4)
4E11 (chou3)
4E12 (chou3)
4E13 (zhuan1)
4E14 (qie3,ju1)
...


Characters that have no pronunciation, or whose pronunciation could not be found, are uniformly marked as none0. I counted them: there are 525 such characters in total.



To keep the dictionary file as small as possible, I noticed that apart from the first entry, 〇 (3007), the code points in the file above are consecutive, so I changed it to the following structure, which reduced the file from 280KB to 117KB:


var pinyin_dict_withtone = "yi1,ding1 zheng1,kao3,qi1,shang4 shang3,xia4,none0,wan4 mo4,zhang4,san1,shang4 shang3,xia4,ji1,bu4 bu2 fou3,yu3 yu4 yu2,mian3,gai4,chou3,chou3,zhuan1,qie3 ju1,...";


The disadvantage of this dictionary file is that the tones are marked with numbers. If you want pinyin with tone marks, such as xiǎo míng tóng xué, you need an algorithm that converts the letter at the appropriate position into one of āáǎàōóǒòēéěèīíǐìūúǔùüǖǘǚǜńň.
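
Such a conversion could look roughly like the sketch below. It is not taken from the library; `markTone` is a hypothetical helper that follows the usual placement convention (a or e takes the mark, then the o of ou, otherwise the last vowel). It does not handle vowel-less syllables such as m or n, and a tone digit of 0 or 5 simply returns the letters without a mark.

```javascript
// Tone-marked forms of each vowel, indexed by tone number minus one.
var TONE_MARKS = {
  "a": "ā á ǎ à".split(" "),
  "o": "ō ó ǒ ò".split(" "),
  "e": "ē é ě è".split(" "),
  "i": "ī í ǐ ì".split(" "),
  "u": "ū ú ǔ ù".split(" "),
  "ü": "ǖ ǘ ǚ ǜ".split(" ")
};

// Hypothetical helper: "xiao3" -> "xiǎo". Input that does not look like
// letters followed by one tone digit is returned unchanged.
function markTone(syllable) {
  var match = /^([a-zü]+)([0-5])$/.exec(
    syllable.toLowerCase().replace(/v/g, "ü"));
  if (!match) return syllable;
  var letters = match[1];
  var tone = Number(match[2]);
  if (tone < 1 || tone > 4) return letters; // neutral tone: no mark

  // Placement: a, then e, then the o of "ou", otherwise the last vowel.
  var pos = letters.indexOf("a");
  if (pos < 0) pos = letters.indexOf("e");
  if (pos < 0) pos = letters.indexOf("ou");
  if (pos < 0) {
    for (var i = letters.length - 1; i >= 0; i--) {
      if (TONE_MARKS[letters.charAt(i)]) { pos = i; break; }
    }
  }
  if (pos < 0) return letters; // no vowel found
  return letters.slice(0, pos) +
    TONE_MARKS[letters.charAt(pos)][tone - 1] +
    letters.slice(pos + 1);
}

console.log(["xiao3", "ming2", "tong2", "xue2"].map(markTone).join(" "));
```
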



I was originally going to write such a conversion method myself, but then I found the following dictionary file (hereinafter dictionary B). It contains 20,867 Chinese characters and also supports tones and polyphones, with the tone marked directly above the letter. Because it also lists the characters themselves, the file is larger, at 327KB. Its contents look roughly like this:


{
     "吖": "yā, ā",
     "阿": "ā, ē",
     "呵": "hē, a, kē",
     "嗄": "shà, á",
     "啊": "ā, á, ǎ, à, a",
     "腌": "ā, yān",
     "锕": "ā",
     "锕": "ā",
     "矮": "ǎi",
     "爱": "ài",
     "挨": "āi, ái",
     "哎": "āi",
     "碍": "ài",
     "癌": "ái",
     "艾": "ài",
     "唉": "āi, ài",
     "蔼": "ǎi"
     /* omit others */
}


After comparing the two, however, I found that 502 characters marked none in dictionary A do have a pronunciation in dictionary B, and that 21 characters are in dictionary A but not in B:


{
     "兙": "shí kè",
     "兛": "qiān",
     "兝": "fēn",
     "兞": "máo",
     "兡": "bǎi kè",
     "兣": "lǐ",
     "唞": "dǒu",
     "嗧": "jiā lún",
     "囍": "xǐ",
     "堎": "lèng líng",
     "猤": "hú",
     "瓩": "qián wǎ",
     "礽": "réng",
     "膶": "rùn",
     "芿": "rèng",
     "蟘": "tè",
     "貣": "tè",
     "酿": "niàng niàn niáng",
     "醸": "niàng",
     "鋱": "tè",
     "铽": "tè"
}


There are also 7 Chinese characters that are in B but not in A:


 
 
{ "?": "lēng", "?": "léng", "?": "léng", "?": "lèng", "?": "lèng,lì,lìn", "?": "réng", "?": "niàng" }


So I merged the two, using dictionary A as the base, and got the final dictionary file pinyin_dict_withtone.js, which is 122KB:


var pinyin_dict_withtone = "yī,dīng zhēng,kǎo qiǎo yú,qī,shàng,xià,hǎn,wàn mò,zhàng,sān,shàng shǎng,xià,qí jī ... ";
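
Reading this comma-separated string can be sketched as below. The sample holds only the first five entries of the final dictionary, and the helper names `readingsOf` and `stripTone` are hypothetical. Stripping tones through Unicode decomposition is an approach assumed here, not necessarily what the library does.

```javascript
// Hypothetical miniature: the first five entries (U+4E00..U+4E04).
var sampleWithTone = "yī,dīng zhēng,kǎo qiǎo yú,qī,shàng";
var entries = sampleWithTone.split(","); // split once, reuse for every lookup

// All readings of one character; entries are in code point order
// starting at U+4E00, and multiple readings are separated by spaces.
function readingsOf(ch) {
  var index = ch.charCodeAt(0) - 0x4E00;
  if (index < 0 || index >= entries.length) return [ch];
  return entries[index].split(" ");
}

// Removes tone marks: decompose, turn u + diaeresis into v,
// then drop the remaining combining marks.
function stripTone(pinyin) {
  return pinyin.normalize("NFD")
    .replace(/u\u0308/g, "v")
    .replace(/[\u0300-\u036f]/g, "");
}

console.log(readingsOf("丁"));
console.log(readingsOf("丁").map(stripTone));
```

The same string can therefore serve both the toned and the toneless output, at the cost of one normalization pass per lookup.
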


How to use


I put these dictionary files together and wrapped the parsing in a few simple methods, so you can include whichever dictionary file your actual needs call for.



Three methods are provided:


/**
 * Get the pinyin first letters of Chinese characters
 * @param str Chinese character string; non-Chinese characters are returned as is
 * @param polyphone whether to support polyphones, default false; if true, all possible combinations are returned
 */
pinyinUtil.getFirstLetter(str, polyphone);
/**
 * Get the pinyin of Chinese characters; non-Chinese characters are returned as is
 * @param str Chinese characters to convert
 * @param splitter separator, a space by default
 * @param withtone whether the result includes tones, default true
 * @param polyphone whether to support polyphones, default false
 */
pinyinUtil.getPinyin(str, splitter, withtone, polyphone);
/**
 * Pinyin to Chinese characters; only a single character's pinyin is supported, and all matching characters are returned
 * @param pinyin pinyin of a single Chinese character, without tone
 */
pinyinUtil.getHanzi(pinyin);


The following sections describe how to use it in different situations.


If you only need the pinyin first letters


<script type="text/javascript" src="pinyin_dict_firstletter.js"></script>
<script type="text/javascript" src="pinyinUtil.js"></script>

<script type="text/javascript">
pinyinUtil.getFirstLetter('小茗同学'); // outputs XMTX
pinyinUtil.getFirstLetter('大中国', true); // outputs ['DZG', 'TZG']
</script>


Note that if you include either of the other two dictionary files you can also get the first letters; this dictionary file is simply the better fit when first letters are all you need.
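
When polyphone support is on, the result is every combination of the per-character candidates. A minimal sketch of that combination step follows; `combine` is a hypothetical helper, not the library's internal function, and the candidate lists are supplied by hand.

```javascript
// Builds every combination from a list of per-character candidate lists,
// e.g. [["D", "T"], ["Z"], ["G"]] gives the two strings DZG and TZG.
function combine(candidates) {
  return candidates.reduce(function (prefixes, options) {
    var next = [];
    prefixes.forEach(function (prefix) {
      options.forEach(function (option) {
        next.push(prefix + option);
      });
    });
    return next;
  }, [""]);
}

console.log(combine([["D", "T"], ["Z"], ["G"]]));
```

The number of results is the product of the candidate counts, so long strings with many polyphonic characters can produce large arrays.
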


If you do not need tones


<script type="text/javascript" src="pinyin_dict_notone.js"></script>
<script type="text/javascript" src="pinyinUtil.js"></script>

<script type="text/javascript">
pinyinUtil.getPinyin('小茗同学'); // outputs 'xiao ming tong xue'
pinyinUtil.getHanzi('ming'); // outputs '明名命鸣铭茗溟酩瞑螟暝'
</script>


If you need tones or need to handle uncommon characters


<script type="text/javascript" src="pinyin_dict_withtone.js"></script>
<script type="text/javascript" src="pinyinUtil.js"></script>

<script type="text/javascript">
pinyinUtil.getPinyin('小茗同学'); // outputs 'xiǎo míng tóng xué'
pinyinUtil.getPinyin('小茗同学', '-', true, true); // outputs ['xiǎo-míng-tóng-xué', 'xiǎo-míng-tòng-xué']
</script>


About the simple Pinyin input method


A real input method has to take many things into account, such as a word lexicon and the user's input habits. This is only a simple input method with no word lexicon (one could be added, but the web is not a good place to load very large files).



I recommend using the second dictionary file, pinyin_dict_notone.js. Dictionary three contains more characters, but they are not sorted by usage frequency, so some rare characters come first.


<link rel="stylesheet" type="text/css" href="simple-input-method/simple-input-method.css">
<input type="text" class="test-input-method"/>
<script type="text/javascript" src="pinyin_dict_notone.js"></script>
<script type="text/javascript" src="pinyinUtil.js"></script>
<script type="text/javascript" src="simple-input-method/simple-input-method.js"></script>
<script type="text/javascript"> SimpleInputMethod.init('.test-input-method'); </script>
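
The core of such an input method is showing one page of candidates for the syllable typed so far. The sketch below is a hypothetical illustration, not the code of simple-input-method.js: `candidatePage` is an invented name and the sample entry reuses the characters returned for ming earlier in this article.

```javascript
// Hypothetical sample entry in the pinyin_dict_notone shape,
// assumed to be ordered by usage frequency.
var sampleDictByFrequency = { "ming": "明名命鸣铭茗溟酩瞑螟暝" };

// Returns one page of candidate characters for the typed pinyin.
// Pages are numbered from 0; unknown pinyin gives an empty page.
function candidatePage(pinyin, page, pageSize, dict) {
  var all = (dict[pinyin] || "").split("");
  var start = page * pageSize;
  return all.slice(start, start + pageSize);
}

console.log(candidatePage("ming", 0, 5, sampleDictByFrequency));
console.log(candidatePage("ming", 1, 5, sampleDictByFrequency));
```

Because the characters are frequency-ordered, the first page already holds the most likely choices, which is why the second dictionary suits this use.
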


Conclusion


Because this library targets the web, where files must stay small, it cannot ship a large word lexicon. Without a lexicon, polyphones cannot be resolved from context and the Pinyin input method cannot intelligently match suitable words. For lexicon support in a Node.js environment, see this project: https://github.com/hotoo/pinyin


