The ultimate solution for converting between Chinese characters and pinyin in JavaScript

Last Update:2017-01-19 Source: Internet

Author: User


Preface

Converting between Chinese characters and pinyin comes up in many places. For this article I collected and corrected several dictionary files commonly found online, and wrapped them in a small utility library that can be used directly.

[Figure: converting Chinese characters to pinyin]

[Figure: converting pinyin to Chinese characters]

Background on Chinese characters and pinyin

Chinese Character Range

It is generally said that the range of Chinese characters in Unicode is /^[\u2E80-\u9FFF]+$/ (11904-40959), but many code points in that range are not Chinese characters, or are not characters with a reading. The range covered by the dictionary files used in this article is /^[\u4E00-\u9FA5]+$/ (19968-40869). There is also one isolated Chinese character, 〇, whose Unicode position is 12295 (U+3007).
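
As a rough illustration of the two ranges, the following sketch tests strings against both regular expressions. The names `DICT_RANGE`, `BROAD_RANGE` and `isDictHanzi` are made up for this example and are not part of the library described later.

```javascript
// Range covered by the dictionary files in this article.
var DICT_RANGE = /^[\u4E00-\u9FA5]+$/;
// Broader range that is often quoted, which also contains non-Chinese characters.
var BROAD_RANGE = /^[\u2E80-\u9FFF]+$/;

// Hypothetical helper: true if every character is inside the dictionary range.
function isDictHanzi(str) {
  return DICT_RANGE.test(str);
}

console.log(isDictHanzi('\u4E2D\u6587'));   // 中文 is inside the dictionary range
console.log(isDictHanzi('\u3007'));         // 〇 (U+3007) is outside it
console.log(BROAD_RANGE.test('\u3042'));    // hiragana あ matches the broad range
```

The last line shows why the broad range is a poor test: a Japanese kana passes it even though it is not a Chinese character.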

Pinyin combinations

Pinyin has 21 initials: b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s. It has 24 finals, of which 6 are simple finals (a, o, e, i, u, v) and 18 are compound finals (ai, ei, ui, ao, ou, iu, ie, ve, er, an, en, in, un, vn, ang, eng, ing, ong). If every initial were paired with every final there would be 24 x 21 = 504 combinations. In practice some combinations are meaningless, such as bv, gie, ve and so on; after removing those, 412 remain.
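
The 504 figure can be checked with a few lines of code. This is only a sketch of the arithmetic: it pairs the initials and finals exactly as listed in the text and makes no attempt to filter out the meaningless combinations.

```javascript
// Initials and finals as listed in the text ("v" stands for "ü").
var initials = 'b p m f d t n l g k h j q x zh ch sh r z c s'.split(' ');
var finals = ('a o e i u v ' +
  'ai ei ui ao ou iu ie ve er an en in un vn ang eng ing ong').split(' ');

// Pair every initial with every final.
var combos = [];
initials.forEach(function (initial) {
  finals.forEach(function (fin) {
    combos.push(initial + fin);
  });
});

console.log(initials.length, finals.length, combos.length); // 21 24 504
console.log(combos.indexOf('bv') >= 0);  // true: a meaningless pairing is still counted
```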

Pinyin dictionary files

The dictionary files are described below in order of size, from smallest to largest.

Dictionary one: Pinyin first letter

The contents of the dictionary file are roughly as follows:

/**
 * Pinyin first-letter dictionary file
 */
var pinyin_dict_firstletter = {};
pinyin_dict_firstletter.all = "YDYQSXMWZSSXJBYMGCCZQPSSQBYCDSCDQLDYLYBSSJG...";
pinyin_dict_firstletter.polyphone = {"19969": "DZ", "19975": "WM", "19988": "QJ", "20048": "YL", ...};

This dictionary concatenates the pinyin first letters of the 20,902 Chinese characters in the Unicode range 4E00 (19968) to 9FA5 (40869) into one very long string, and then lists the polyphonic characters (370 in total) separately. The dictionary file is 25 KB.

The advantages of this dictionary file are its small size and its support for polyphonic characters; the disadvantage is that it can only give the first letter of the pinyin.
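
A lookup against this kind of dictionary could look roughly like the sketch below. It assumes the character at index (code point - 19968) in the `all` string is the first letter, and that the `polyphone` table overrides it. The three-letter `dict` object is a truncated stand-in built from the first letters shown above, and `firstLetters` is a hypothetical name, not the library's own function.

```javascript
// Truncated stand-in for pinyin_dict_firstletter: covers U+4E00..U+4E02 only.
var dict = {
  all: 'YDY',
  polyphone: { '19969': 'DZ' }   // U+4E01 can start with D or Z
};

// Hypothetical helper: returns the possible first letters of one character.
function firstLetters(ch) {
  var code = ch.charCodeAt(0);
  if (code < 19968 || code > 40869) {
    return [ch];                          // not in the dictionary range: return as is
  }
  var poly = dict.polyphone[code];
  if (poly) {
    return poly.split('');                // polyphonic: one entry per possible letter
  }
  var letter = dict.all.charAt(code - 19968);
  return letter ? [letter] : [ch];        // missing because the stand-in is truncated
}

console.log(firstLetters('\u4E00'));  // [ 'Y' ]
console.log(firstLetters('\u4E01'));  // [ 'D', 'Z' ]
```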

Dictionary two: commonly used Chinese characters

This dictionary file groups Chinese characters by pinyin. It has 401 pinyin combinations covering 6,763 common Chinese characters and does not support polyphonic characters. Because it was collected from the internet and contains relatively few characters, the file is only 24 KB; it can be extended later when time permits.

The dictionary file is roughly as follows (this is just an example, so only a small part is shown):

/**
 * Conventional pinyin data dictionary, including 6763 common Chinese characters; does not support polyphonic characters
 */
var pinyin_dict_notone =
{
  "a": "Ah actinium",
  "ai": "Ai ah, ah, mourn, cancer, Ai, Ai, hinder love, pass, hang, 嗒, 痷, 皷, 甹 霭",
  "an": "An'anan is based on the case of the dark bank amine, and it is known as the eucalyptus quaternium",
  "ang": "Dirty",
  "ao": "Ao Ao Boaoaoaoaoaoaoaoaoaoaoaoao ao 媪 骜 骜 鏊 鳌 鏖",
  "ba": "Ba Ba Ba Pa Ba Bar Eight Scar Ba Ba Rao Target Rake Dam Pa Da Da Da Suan Shua Shua Suan Suan Shuang Pian Pun Dian Xiu",
  "bai": "Bai Bai Bai Bai Bai Bai Bai Bai Barn Barley",
  "ban": "Banban is like a slap board, and it's a mixed version, and it's half a stumbling block.",
  "bang": "Bang Bang Bang Bang slings tied up with pounds, pounds, clams and pounds against burdock crabs",
  "bao": "Bao Bao Bao Hao Bao Bao Bao Bao Bao Bao Bao Bao Bao Bao Bao Bao Bao Bao Sui Bao Shui Bao Busan Baolu",
  "bo": "Bolt thin bobo bobble bobobo platinum foil bobobo neck bobo bubo bofan cymbals cymbals rumbling",
  "bei": "The Beibei Beibei Beibei Beibei Beibei Beibei Beibei Beibei Beibei Beibei Beibei Beibei 呂 傤 呗 湫 湚 餎 餙 鐾",
  "ben": "Ben Ben Ben Duan Duo"
  // omit others
};

I later found that this dictionary file contains many errors. For example, the pinyin of 虐 is written as nue (the correct form is nve) and 躺 is written as thang. It also does not support polyphonic characters. So I regenerated a dictionary file in the same format from other dictionary files:

A total of 404 phonetic combinations

6,763 commonly used Chinese characters are included

Supports polyphonic characters

Does not support tones

File size is 27 KB

In addition, using a usage-frequency table for the 6,763 common Chinese characters found online, I sorted the characters by frequency, which makes it possible to build a passable JavaScript input method.

Furthermore, a more complete dictionary file shows that there are in fact 412 pinyin combinations. The 8 that do not appear in the dictionary file above are:

chua, den, eng, fiao, m, kei, nun, shei
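
A dictionary grouped by pinyin answers "which characters are read as X" directly, but converting a character to pinyin needs the reverse mapping. One way to derive it is sketched below. The `grouped` object is a tiny hypothetical sample rather than the real dictionary, and `buildReverseIndex` is a name chosen for this example.

```javascript
// Hypothetical three-entry sample in the same shape as pinyin_dict_notone.
var grouped = { zhong: '中重', chong: '重虫', xiao: '小' };

// Build a character -> [pinyin, ...] index from a pinyin -> characters map.
function buildReverseIndex(dict) {
  var index = {};
  Object.keys(dict).forEach(function (pinyin) {
    dict[pinyin].split('').forEach(function (ch) {
      if (!index[ch]) {
        index[ch] = [];
      }
      index[ch].push(pinyin);   // a polyphonic character collects several readings
    });
  });
  return index;
}

var reverse = buildReverseIndex(grouped);
console.log(reverse['重']);  // [ 'zhong', 'chong' ]
console.log(reverse['小']);  // [ 'xiao' ]
```

Building the index once at load time keeps each later lookup to a single property access, at the cost of holding a second copy of the data in memory.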

Dictionary three: The Ultimate dictionary

First, there is a dictionary file with the following structure (hereinafter dictionary A; I no longer remember where it came from). It supports tones and polyphonic characters and lists the pinyin of all 20,902 characters in the Unicode range 4E00 (19968) to 9FA5 (40869), or 20,903 if 〇 is counted. The dictionary file is 280 KB:

3007 (ling2)
4E00 (yi1)
4E01 (ding1,zheng1)
4E02 (kao3)
4E03 (qi1)
4E04 (shang4,shang3)
4E05 (xia4)
4E06 (none0)
4E07 (wan4,mo4)
4E08 (zhang4)
4E09 (san1)
4E0A (shang4,shang3)
4E0B (xia4)
4E0C (ji1)
4E0D (bu4,bu2,fou3)
4E0E (yu3,yu4,yu2)
4E0F (mian3)
4E10 (gai4)
4E11 (chou3)
4E12 (chou3)
4E13 (zhuan1)
4E14 (qie3,ju1)
...

Characters that have no pronunciation, or whose pronunciation could not be found, are uniformly marked none0. I counted 525 such characters in total.

To keep the dictionary file as small as possible, I took advantage of the fact that, apart from the first entry 〇 (3007), all the code points above are contiguous. I changed the file to the following structure, which reduced its size from 280 KB to 117 KB:

var pinyin_dict_withtone = "yi1,ding1 zheng1,kao3,qi1,shang4 shang3,xia4,none0,wan4 mo4,zhang4,san1,shang4 shang3,xia4,ji1,bu4 bu2 fou3,yu3 yu4 yu2,mian3,gai4,chou3,chou3,zhuan1,qie3 ju1,...";
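
Because the entries are contiguous, a character's readings can be found by position alone. The sketch below uses only the first eight entries of the string above as `sample`; `lookup` is a hypothetical helper name, and the isolated character 〇 (U+3007) would need separate handling that is not shown.

```javascript
// First eight entries of the dictionary string, covering U+4E00..U+4E07.
var sample = 'yi1,ding1 zheng1,kao3,qi1,shang4 shang3,xia4,none0,wan4 mo4';
var readings = sample.split(',');

// Hypothetical helper: readings of one character, or [] if unknown.
function lookup(ch) {
  var index = ch.charCodeAt(0) - 0x4E00;   // position relative to U+4E00
  var entry = readings[index];
  if (entry === undefined || entry === 'none0') {
    return [];                             // out of range, or no known reading
  }
  return entry.split(' ');                 // several readings are space-separated
}

console.log(lookup('\u4E01'));  // [ 'ding1', 'zheng1' ]
console.log(lookup('\u4E06'));  // []
```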

The disadvantage of this dictionary file is that the tones are marked with numbers. If you want pinyin such as xiǎo míng tóng xué, you need an algorithm that replaces the letter in the appropriate position with the matching one from āáǎàōóǒòēéěèīíǐìūúǔùüǖǘǚǜńň.
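
One possible conversion is sketched below, assuming the usual pinyin placement convention: mark a or e if present, otherwise the o of ou, otherwise the last vowel. The function name `numberToMark` and the `TONE_MARKS` table are inventions for this example, and syllables without a vowel (such as m) are returned unmarked rather than handled properly.

```javascript
// Tone-marked forms for tones 1-4 of each vowel ("v" stands for "ü").
var TONE_MARKS = {
  a: 'āáǎà', o: 'ōóǒò', e: 'ēéěè',
  i: 'īíǐì', u: 'ūúǔù', v: 'ǖǘǚǜ'
};

// Hypothetical helper: 'xiao3' -> 'xiǎo'.
function numberToMark(syllable) {
  var m = /^([a-z]+)([0-5])$/.exec(syllable);
  if (!m) {
    return syllable;                        // not in letters+digit form
  }
  var letters = m[1];
  var tone = Number(m[2]);
  if (tone < 1 || tone > 4) {
    return letters.replace(/v/g, 'ü');      // neutral tone: no mark
  }
  var pos = letters.indexOf('a');
  if (pos < 0) pos = letters.indexOf('e');
  if (pos < 0) pos = letters.indexOf('ou'); // index of the "o" in "ou"
  if (pos < 0) {
    for (var i = letters.length - 1; i >= 0; i--) {
      if ('iouv'.indexOf(letters.charAt(i)) >= 0) { pos = i; break; }
    }
  }
  if (pos < 0) {
    return letters;                         // no vowel to carry the mark
  }
  var marked = TONE_MARKS[letters.charAt(pos)].charAt(tone - 1);
  return (letters.slice(0, pos) + marked + letters.slice(pos + 1)).replace(/v/g, 'ü');
}

console.log(['xiao3', 'ming2', 'tong2', 'xue2'].map(numberToMark).join(' '));
// xiǎo míng tóng xué
```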

I was going to write the conversion method myself, but then I found the following dictionary file (hereinafter dictionary B). It contains 20,867 characters and also supports tones and polyphonic characters, but the tone marks are placed directly on the letters. Because it lists every Chinese character explicitly as a key, the file is relatively large at 327 KB. Its contents look roughly like this:

{
 "acridine": "yā,ā",
 "Ah": "ā,ē",
 "ah": "hē,a,kē",
 "Crunch": "shà,á",
 "ah": "ā,á,ǎ,à,a",
 "pickled": "ā,yān",
 "Tong": "Do",
 "actinium": "Do",
 "short": "ǎi",
 "Love": "ài",
 "to": "āi,ái",
 "Hey": "āi",
 "obstruction": "ài",
 "Cancer": "ái",
 "AI": "ài",
 "Alas": "āi,ài",
 "io": "ǎi"
 /* omit others */
}

After comparing the two, however, I found that 502 characters whose pronunciation is none in dictionary A do have a pronunciation in dictionary B, and that 21 characters are in dictionary A but not in B:

{
 "兙": "shíkè",
 "兛": "qiān",
 "兝": "fēn",
 "兞": "máo",
 "兡": "bǎi kè",
 "兣": "lǐ",
 "唞": "dǒu",
 "嗧": "jiālún",
 "Double Happiness": "xǐ",
 "Leng": "lèng líng",
 "猤": "hú",
 "kw": "qián wǎ",
 "礽": "réng",
 "膶": "rùn",
 "芿": "rèng",
 "蟘": "tè",
 "貣": "tè",
 "Brew": "niàng niàn niáng",
 "醸": "niàng",
 "鋱": "tè",
 "Terbium": "tè"
}

There are 7 Chinese characters that are in B but not in A:

{
 "㘄": "lēng",
 "䉄": "léng",
 "䬋": "léng",
 "䮚": "lèng",
 "䚏": "lèng,lì,lìn",
 "㭁": "réng",
 "䖆": "niàng"
}

So I merged the two, using dictionary A as the base, and got the final dictionary file pinyin_dict_withtone.js, which is 122 KB:

var pinyin_dict_withtone = "yī,dīng zhēng,kǎo qiǎo yú,qī,shàng,xià,hǎn,wàn mò,zhàng,sān,shàng shǎng,xià,qí jī,...";
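
With tone marks stored directly in the dictionary, toneless pinyin has to be derived by stripping the marks. A minimal sketch, assuming the dictionary uses precomposed characters and that ü should map to v as elsewhere in this article; `removeTone` is a hypothetical name, and the approach relies on the standard `String.prototype.normalize`.

```javascript
// Hypothetical helper: 'xiǎo' -> 'xiao', 'lǜ' -> 'lv'.
function removeTone(pinyin) {
  return pinyin
    .replace(/[üǖǘǚǜ]/g, 'v')          // keep the "v for ü" convention
    .normalize('NFD')                   // split letters from their combining marks
    .replace(/[\u0300-\u036f]/g, '');   // drop the combining marks
}

console.log(removeTone('xiǎo míng tóng xué'));  // xiao ming tong xue
```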

How to use

I put these dictionary files together and wrapped the parsing in a few simple methods. Include whichever dictionary file fits your actual needs.

The 3 wrapped methods:

/**
 * Get the pinyin first letters of Chinese characters
 * @param str a string of Chinese characters; non-Chinese characters are returned as is
 * @param polyphone whether to support polyphonic characters, default false; if true, an array of all possible combinations is returned
 */
pinyinUtil.getFirstLetter(str, polyphone);
/**
 * Get the pinyin of Chinese characters; non-Chinese characters are returned as is
 * @param str the Chinese characters to convert
 * @param splitter the separator, a space by default
 * @param withtone whether the result includes tones, default true
 * @param polyphone whether to support polyphonic characters, default false
 */
pinyinUtil.getPinyin(str, splitter, withtone, polyphone);
/**
 * Convert pinyin to Chinese characters; only the pinyin of a single character is supported, and all matching characters are returned
 * @param pinyin the pinyin of a single Chinese character, without tones
 */
pinyinUtil.getHanzi(pinyin);

The following shows how to use them in different situations.

If you only need the pinyin first letters

<script type="text/javascript" src="pinyin_dict_firstletter.js"></script>
<script type="text/javascript" src="pinyinUtil.js"></script>

<script type="text/javascript">
pinyinUtil.getFirstLetter('小明同学'); // Output: XMTX
pinyinUtil.getFirstLetter('大中国', true); // Output: ['DZG', 'TZG']
</script>

Note that if you include either of the other 2 dictionary files you can also get the pinyin first letters; this dictionary file is simply the better fit when first letters are all you need.
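
When polyphonic support is on, the result is every combination of the per-character possibilities. How the library does this internally is not shown in the article; the sketch below is one straightforward way to expand the combinations, with `combine` as a hypothetical name.

```javascript
// Expand per-character alternatives into every possible joined string.
function combine(options) {
  return options.reduce(function (prefixes, letters) {
    var next = [];
    prefixes.forEach(function (prefix) {
      letters.forEach(function (letter) {
        next.push(prefix + letter);
      });
    });
    return next;
  }, ['']);
}

// First character has two possible first letters, the others have one each.
console.log(combine([['D', 'T'], ['Z'], ['G']]));  // [ 'DZG', 'TZG' ]
```

The number of results is the product of the alternatives per character, so long strings with many polyphonic characters can produce large arrays.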

If the pinyin does not need tones

<script type="text/javascript" src="pinyin_dict_notone.js"></script>
<script type="text/javascript" src="pinyinUtil.js"></script>

<script type="text/javascript">
pinyinUtil.getPinyin('小明同学'); // Output: 'xiao ming tong xue'
pinyinUtil.getHanzi('ming'); // Output: a string of all characters in the dictionary whose pinyin is 'ming'
</script>

If you need tones or need to handle rare characters

<script type="text/javascript" src="pinyin_dict_withtone.js"></script>
<script type="text/javascript" src="pinyinUtil.js"></script>

<script type="text/javascript">
pinyinUtil.getPinyin('小明同学'); // Output: 'xiǎo míng tóng xué'
pinyinUtil.getPinyin('小明同学', '-', true, true); // Output: ['xiǎo-míng-tóng-xué', 'xiǎo-míng-tòng-xué']
</script>

About the simple pinyin input method

A full-featured input method has to take many things into account, such as a word lexicon and the user's input habits. What is implemented here is only a simple input method with no word lexicon (one could be added, but loading very large files is not suitable in a web environment).

It is recommended to use the second dictionary file, pinyin_dict_notone.js. Dictionary three contains more characters, but they are not sorted by usage frequency, so some rare characters come first.

<link rel="stylesheet" type="text/css" href="simple-input-method/simple-input-method.css">
<input type="text" class="test-input-method"/>
<script type="text/javascript" src="pinyin_dict_notone.js"></script>
<script type="text/javascript" src="pinyinUtil.js"></script>
<script type="text/javascript" src="simple-input-method/simple-input-method.js"></script>
<script type="text/javascript">
	SimpleInputMethod.init('.test-input-method');
</script>
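
The core of such an input method is turning the typed pinyin into a page of candidate characters. The sketch below is not the code of simple-input-method.js; it only illustrates the idea on a hypothetical two-entry dictionary whose characters are assumed to be already sorted by frequency.

```javascript
// Hypothetical sample in the same shape as the pinyin-grouped dictionary.
var imeDict = { ming: '明名命鸣', min: '民敏' };

// Return one page of candidate characters for the typed pinyin.
function candidates(input, page, pageSize) {
  var chars = imeDict[input] || '';       // unknown pinyin: no candidates
  var start = page * pageSize;
  return chars.slice(start, start + pageSize).split('');
}

console.log(candidates('ming', 0, 3));  // [ '明', '名', '命' ]
console.log(candidates('ming', 1, 3));  // [ '鸣' ]
```

Because the dictionary order is the display order, sorting the characters by frequency ahead of time is what puts the most likely candidate first.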

Summary

Because the target environment of this utility is the web, file size has to stay small, so a large word lexicon cannot be included. Without lexicon support, polyphonic characters cannot be disambiguated, and the pinyin input method cannot intelligently match suitable words. That is all for this article; I hope it is of some help in your study or work.

